[Netarchivesuite-curator] BnF NAS update for September
peter.stirling at bnf.fr
Wed Sep 4 15:42:05 CEST 2013
Hello all,
Last summer, BnF tried a new type of harvest for blog platforms. We were
satisfied with the result, except that we obtained only a small sample of
blogs: the volume of images on free.fr was so large that we had to stop
the harvest after 15 days. So in 2013 we decided not to collect free.fr
and to reduce the budget to 800 URLs per host. We had a list of 225,000
seeds, which we harvested over a period of 50 days. The problem this year
is that, with a depth of "host", Heritrix generated an exponentially
growing list of inactive queues: it seemed we would never finish the
crawl! So we will have to think of yet another choice of parameters.
We are also working on a specific QA for large domains. From the host
reports generated by Heritrix, we can regularly analyze the "Top domains"
of each run. This summer, we reviewed the "Top domains" for the whole of
2013, with the objective of finding new filters and thus eliminating
"noise" in the crawls. It showed that many of the domains were not chosen
as seeds: there is a very large number of image databases that we need to
keep, but also social networks which could be filtered out (for example,
facebook.com in all the languages of the world!). For big domains from
our seed lists, we found that we can sometimes exclude some hosts (e.g.
betadev.cnrs.fr) or some URLs (for example, URLs that return an HTTP 404
response code because Heritrix generates false URLs from JavaScript).
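
As a rough illustration of the "Top domains" analysis described above, here
is a minimal Python sketch. It is not the BnF tooling. It assumes each line
of the host report starts with three whitespace-separated columns (URL
count, byte count, host name), and it approximates a "domain" as the last
two labels of the host name, which is wrong for suffixes such as co.uk. The
function names and the sample data are hypothetical.

```python
from collections import Counter


def domain_of(host):
    """Naive domain: last two labels of the host name (assumption)."""
    host = host.split(":")[0].lower()  # drop a trailing port or colon
    return ".".join(host.split(".")[-2:])


def top_domains(report_lines, n=10):
    """Sum URL counts per domain and return the n largest."""
    urls = Counter()
    for line in report_lines:
        fields = line.split()
        # Skip the header line, blank lines and anything malformed.
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        urls[domain_of(fields[2])] += int(fields[0])
    return urls.most_common(n)


def unseeded(top, seed_domains):
    """Keep only the top domains that were not chosen as seeds."""
    return [(dom, count) for dom, count in top if dom not in seed_domains]


# Hypothetical sample in the assumed "[#urls] [#bytes] [host]" layout.
sample = [
    "[#urls] [#bytes] [host]",
    "1200 5000000 www.facebook.com",
    "800 900000 fr-fr.facebook.com",
    "300 120000 betadev.cnrs.fr",
    "700 450000 www.cnrs.fr",
    "50 1000 dns:",
]

top = top_domains(sample, n=2)
candidates = unseeded(top, seed_domains={"cnrs.fr"})
```

With this sample, the facebook.com hosts are grouped together and reported
as a candidate filter, since that domain is not in the seed set, while
cnrs.fr is left for a host-level review.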
Best regards,
The BnF digital legal deposit team