[Netarchivesuite-curator] BnF NAS Update for November

peter.stirling at bnf.fr
Thu Nov 8 13:36:02 CET 2012


Hello all,

The first step of our 2012 broad crawl is finished. We harvested about 2.8 
million domains with a budget of 1,000 URLs each. For the first time, the 
BnF did not obey robots.txt for its broad crawl.

One consequence of this choice is an increase in the number of complaints, 
although their number remains very reasonable.
The second consequence is an increase in the number of domain-parking 
websites still being crawled at the end of the jobs. We added generic 
crawler traps to stop them.
And we hope that the third consequence will be an improvement in the 
quality of the collection, especially regarding stylesheets and pictures.

Finally, this first step collected 17 TB and 600 million URLs, compared 
with 13.7 TB and 413 million URLs in 2011. We still have to decide on the 
budget for the second step, but it will be between 5,000 and 10,000 URLs 
per domain, and the second step should start later this week.
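
For a rough sense of scale, the figures above can be compared with a short
Python sketch. The variable and function names are illustrative only, and
the inputs are the rounded figures quoted in this message, so the results
are approximate.

```python
# Rounded figures quoted in this message; names are illustrative only.
STEP1_2012 = {"terabytes": 17.0, "urls": 600e6}
STEP1_2011 = {"terabytes": 13.7, "urls": 413e6}
DOMAINS_2012 = 2.8e6  # "about 2.8 million domains"


def percent_increase(new, old):
    """Relative growth from old to new, in percent."""
    return (new - old) / old * 100.0


size_growth = percent_increase(STEP1_2012["terabytes"], STEP1_2011["terabytes"])
url_growth = percent_increase(STEP1_2012["urls"], STEP1_2011["urls"])

# Average number of URLs collected per domain in the 2012 first step,
# to compare against the per-domain budget of 1,000 URLs.
urls_per_domain = STEP1_2012["urls"] / DOMAINS_2012

print(f"size: +{size_growth:.0f}%, URLs: +{url_growth:.0f}%")
print(f"average URLs per domain: {urls_per_domain:.0f}")
```

With these inputs, the first step grew by roughly 24% in size and 45% in
URL count over 2011, and collected about 214 URLs per domain on average,
well under the 1,000-URL budget.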

Best regards,
The BnF web archiving team 

