[Netarchivesuite-curator] BnF NAS update for January

peter.stirling at bnf.fr
Tue Jan 13 10:15:16 CET 2015


Hello all,

Our 2014 broad crawl finished at the beginning of December after 50 days. 
We gathered a total of 1.7 billion URLs for 67 TB. The volume was much 
bigger than expected because of the increase in the average size of each 
URL (42 KB in 2014, 35 KB in 2013). Our explanation is that the greater 
bandwidth (867 KB in 2014, 529 KB in 2013) made it possible to harvest 
complete video files; we suspect that many of them were truncated in 2013.
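
As a rough cross-check of the figures above, the average size per URL can 
be recomputed from the totals. This is only a sketch: the function name is 
made up for illustration, and it assumes that "TB" and "KB" are binary 
units (1 TB = 1024^4 bytes, 1 KB = 1024 bytes), which the message does not 
state. Under that assumption the totals agree with the reported 42 KB.

```python
def average_size_kb(total_tb, url_count, base=1024):
    """Average size per URL in KB, given a total volume in TB.

    `base` is an assumption: 1024 for binary units, 1000 for decimal units.
    """
    total_bytes = total_tb * base ** 4
    return total_bytes / url_count / base


# Totals reported for the 2014 broad crawl: 67 TB for 1.7 billion URLs.
binary_avg = average_size_kb(67, 1.7e9)               # binary units
decimal_avg = average_size_kb(67, 1.7e9, base=1000)   # decimal units

print(f"binary units:  {binary_avg:.1f} KB per URL")
print(f"decimal units: {decimal_avg:.1f} KB per URL")
```

With decimal units the same totals give a lower average, so the reported 
figure is better matched by the binary interpretation.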

We had some difficulties with workspaces used to generate the different 
kinds of indexes. As the time taken to generate each deduplication index 
increased too much, the schedule for focused crawls was disrupted. At the 
same time, indexing the harvested URLs required resources used by the 
crawlers, so we decided to wait until the end of the broad crawl to finish 
this process.

The seed list had 4.1 million domains. We paid attention to new geographic 
TLDs as France had two new extensions in 2014 (.paris for the capital and 
.bzh for Brittany). In addition, we took particular care to analyse and 
check 23,600 active websites from the French overseas departments.

Since last week we have also been collecting sites related to the 
terrorist attacks in France, with URLs being selected by both librarians 
at the BnF and IIPC members. We would like to thank you for the sites you 
have sent us for these crawls.

Best regards,
The BnF digital legal deposit team

