[Netarchivesuite-curator] BnF NAS update for March

peter.stirling at bnf.fr peter.stirling at bnf.fr
Fri Mar 24 16:30:51 CET 2017

Hello all,

After completing our final tests of NetarchiveSuite 5.3 and Heritrix 3, we 
went into production and started our first crawls this week! We will give 
more details in our next update.

The beginning of the year is also the time for writing our annual report. 
In 2016, we crawled 125.47 TB of data, including the largest broad crawl in 
our collection (90.5 TB). This year we chose to study the top-level 
domains (TLDs) in the broad crawl to measure the impact of including new 
regional TLDs in the seed list. The use of a TLD varies from one region 
to another (commercial purposes, public purposes, personal websites...), 
and the number of active websites is not proportional to the geographical 
area. We also analysed EPUB files, as we did last year, to see whether 
anything has changed: their number is quite similar, but the number of 
domains hosting them is growing. Overall, we exceeded our predictions 
because the average size of the harvested files increased.

Best regards,
The BnF digital legal deposit team
