[Netarchivesuite-curator] BnF NAS update for May
peter.stirling at bnf.fr
Mon May 22 10:41:25 CEST 2017
Hello all,
As of the middle of March, the BnF is using NetarchiveSuite 5 and Heritrix
3 for its selective crawls.
The first conclusion is that the quality of the crawl is better than with
Heritrix 1:
- The percentage of URLs with a 2XX HTTP response code is above 80 %,
whereas with Heritrix 1 it was around 74 %.
- The number of 4XX responses is lower than with Heritrix 1.
- The duration of the crawls is shorter.
- Heritrix 3 crawls less content on domains outside the seed list than
Heritrix 1; as a consequence, the percentage of images in the crawls has
decreased.
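As an illustration only, percentages like the ones above could be derived
from a Heritrix crawl.log along the following lines. This is a minimal
sketch, not the BnF's actual tooling: it assumes the usual crawl.log layout
in which the second whitespace-separated field is the fetch status code,
the function name `status_class_shares` is hypothetical, and the sample
lines are invented for the example.

```python
from collections import Counter


def status_class_shares(lines):
    """Return the percentage of log lines per HTTP status class.

    Assumes each line is whitespace-separated with the fetch status code
    in the second field (the usual Heritrix crawl.log layout).
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or truncated lines
        try:
            code = int(fields[1])
        except ValueError:
            continue  # skip lines whose second field is not a number
        if 100 <= code <= 599:
            counts[f"{code // 100}XX"] += 1
        else:
            # Heritrix also logs non-HTTP codes (e.g. negative values
            # for fetch failures); group them separately.
            counts["other"] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in counts.items()}


# Invented sample lines, for illustration only.
sample = [
    "2017-05-22T08:00:00.000Z 200 1234 http://example.org/ - - text/html",
    "2017-05-22T08:00:01.000Z 200 5678 http://example.org/a LL http://example.org/ text/html",
    "2017-05-22T08:00:02.000Z 404 321 http://example.org/b L http://example.org/ text/html",
    "2017-05-22T08:00:03.000Z -9998 - http://example.net/c L http://example.org/ unknown",
]
shares = status_class_shares(sample)
for label in sorted(shares):
    print(f"{label}: {shares[label]:.1f} %")
```

Whether non-HTTP codes are counted in the denominator changes the 2XX
share, so the same convention would need to be applied to both the
Heritrix 1 and the Heritrix 3 logs before comparing them.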
Despite the lack of deduplication for the first selective crawls, the
volume of the crawls is less than we had estimated. For example, the news
crawl was previously between 0.25 and 0.3 TB per month whereas now it is
0.14 TB. To avoid going over our storage budget, we had reduced all our
budgets (in terms of URLs collected) by 10 % with the change to Heritrix
3.
For the moment, the most significant improvement is in the crawling of
HTTPS URLs. For example, more than 30 % of the seeds in the news crawl are
HTTPS, and with Heritrix 3 more than 80 % of them are harvested, against
69 % with Heritrix 1.
Heritrix 3 has also allowed us to simplify the harvest of subscription
press sites. For this specific crawl, thanks to the new features of
Heritrix 3, the engineers were able to merge our 9 harvest templates into
only 2: one for the HTTP and HTML authentication and one for the FTP
crawl. Monitoring and QA are much more efficient as a result.
We are continuing to analyse Heritrix 3 to better prepare the broad crawl.
Best regards,
The BnF digital legal deposit team