[Netarchivesuite-curator] BnF NAS update for September

peter.stirling at bnf.fr peter.stirling at bnf.fr
Wed Sep 4 15:42:05 CEST 2013

Hello all,

Last summer, BnF tried a new type of harvest for blog platforms. We were
satisfied with the result, except that we ended up with only a small sample
of blogs: the volume of images on free.fr was very large and we had to stop
the harvest after 15 days. So in 2013 we decided not to collect free.fr and
to reduce the budget to 800 URLs per host. We had a list of 225,000 seeds,
which we harvested over a period of 50 days. The problem this year is
that, with a depth of "host", Heritrix generated an exponentially growing
list of inactive queues: it seemed the crawl would never finish! So we will
have to think about yet another choice of parameters.
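
To give a rough sense of scale for these parameters, here is a minimal
sketch of the arithmetic. It assumes (this is not stated above) that every
seed sits on its own host, so it only bounds the URLs fetched from seed
hosts; hosts discovered during the crawl would add their own queues and
budgets on top. The function name is purely illustrative.

```python
def seed_host_url_ceiling(seed_hosts, budget_per_host):
    """Upper bound on URLs fetched from seed hosts alone.

    Assumption: one host per seed; hosts discovered later are not counted.
    """
    return seed_hosts * budget_per_host


ceiling = seed_host_url_ceiling(225_000, 800)  # figures from the message
per_day = ceiling / 50                         # 50-day harvest window
print(f"{ceiling:,} URLs at most, about {per_day:,.0f} per day")
```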

We are also working on a specific QA for large domains. From the host
reports generated by Heritrix, we can regularly analyze the "Top domains"
of each run. This summer, we reviewed the "Top domains" for 2013 as a
whole, with the objective of finding new filters and thus eliminating
"noise" in the crawls. It showed that many of the top domains were not
chosen as seeds: there is a very large number of image databases that we
need to keep, but also social networks which could be filtered out (for
example, facebook.com in all the languages of the world!). For big domains
from our seed lists, we found that we can sometimes exclude some hosts
(e.g. betadev.cnrs.fr) or some URLs (for example, URLs that return HTTP 404
because Heritrix generates false URLs from JavaScript).
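
As an illustration of this kind of "Top domains" review, here is a minimal
sketch, not our actual tooling. It assumes the host report is
whitespace-separated text whose first three columns are URL count, byte
count and host, with a header line starting with "[". The function names,
the sample data and the crude "last two labels" domain grouping are all
assumptions made for the example; a real analysis would probably need a
public-suffix list.

```python
from collections import defaultdict


def registered_domain(host):
    """Naive grouping: keep the last two labels of the host name."""
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def top_domains(report_lines, seed_domains, limit=10):
    """Sum URLs and bytes per domain and flag domains that are seeds."""
    totals = defaultdict(lambda: [0, 0])
    for line in report_lines:
        line = line.strip()
        if not line or line.startswith("["):  # skip blanks and the header
            continue
        fields = line.split()
        if len(fields) < 3:
            continue
        try:
            urls, size = int(fields[0]), int(fields[1])
        except ValueError:
            continue  # not a data line in the assumed format
        domain = registered_domain(fields[2])
        totals[domain][0] += urls
        totals[domain][1] += size
    ranked = sorted(totals.items(), key=lambda kv: kv[1][0], reverse=True)
    return [(d, u, b, d in seed_domains) for d, (u, b) in ranked[:limit]]


# Invented sample lines in the assumed report layout.
sample = [
    "[#urls] [#bytes] [host]",
    "1200 52000000 www.cnrs.fr",
    "300 900000 betadev.cnrs.fr",
    "900 4000000 fr-fr.facebook.com",
    "700 3000000 www.facebook.com",
]

for domain, urls, size, is_seed in top_domains(sample, {"cnrs.fr"}):
    note = "seed" if is_seed else "not a seed: candidate filter?"
    print(domain, urls, size, note)
```

On this invented sample, facebook.com comes out on top without being a seed
domain, which is the kind of entry that would then be reviewed by hand
before any filter is added.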

Best regards,
The BnF digital legal deposit team
