[Netarchivesuite-curator] BnF NAS update for March
peter.stirling at bnf.fr
Fri Mar 24 16:30:51 CET 2017
Hello all,
After completing our final tests on NetarchiveSuite 5.3 and Heritrix 3, we
went into production and started our first crawls this week! We will give
more details in our next update.
The beginning of the year is also the time for writing our annual report.
In 2016, we crawled 125.47 TB of data, including the largest broad crawl in
our collection (90.5 TB). This year we chose to study the top-level
domains (TLDs) in the broad crawl to measure the impact of including new
regional TLDs in the seed list. The use of a TLD varies from one region
to another (commercial purposes, public purposes, personal websites...),
and the number of active websites is not proportional to the size of the
geographical area. We also analysed EPUB files, as we did last year, to see
whether anything has changed: their number is much the same, but the number
of domains hosting them is growing. Overall, we exceeded our predictions
because the average size of the harvested files increased.
Best regards,
The BnF digital legal deposit team