[Netarchivesuite-curator] BnF NAS update for May

peter.stirling at bnf.fr peter.stirling at bnf.fr
Mon May 22 10:41:25 CEST 2017


Hello all,

Since the middle of March, the BnF has been using NetarchiveSuite 5 and 
Heritrix 3 for its selective crawls.

The first conclusion is that the quality of the crawl is better than with 
Heritrix 1:
- The percentage of URLs with an HTTP 2XX response code is higher than 
80 %, whereas with Heritrix 1 it was around 74 %.
- The number of 4XX responses is lower than with Heritrix 1.
- The duration of the crawls is shorter.
- Heritrix 3 crawls less content on domains outside the seed list than 
Heritrix 1; as a consequence, the percentage of images in the crawls has 
decreased.

Despite the lack of deduplication for the first selective crawls, the 
volume of the crawls is lower than we had estimated. For example, the news 
crawl was previously between 0.25 and 0.3 TB per month, whereas it is now 
0.14 TB. To avoid going over our storage budget, we had reduced all our 
budgets (in terms of URLs collected) by 10 % with the change to Heritrix 
3.
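
As a rough illustration of what the news crawl figures above imply, the 
relative reduction in monthly volume can be worked out as in the sketch 
below. The function name is hypothetical and this is not part of the BnF's 
tooling; only the three volumes come from the figures quoted above.

```python
def relative_decrease(before: float, after: float) -> float:
    """Return the relative decrease from `before` to `after`, in percent."""
    return (before - after) / before * 100


# Monthly news crawl volumes in TB, as quoted in the message.
previous_low, previous_high = 0.25, 0.30
current = 0.14

low = relative_decrease(previous_low, current)
high = relative_decrease(previous_high, current)
print(f"Reduction in monthly volume: about {low:.0f} % to {high:.0f} %")
```

On these figures the monthly volume has fallen by roughly 44 % to 53 %, 
which is well beyond the 10 % cut applied to the URL budgets.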

For the moment, the most significant improvement is in the crawling of 
HTTPS URLs. For example, more than 30 % of the seeds in the news crawl are 
in HTTPS, and with Heritrix 3 more than 80 % are harvested, compared with 
69 % with Heritrix 1.

Heritrix 3 has also allowed us to simplify the harvest of subscription 
press sites. For this specific crawl, thanks to the new functionality in 
Heritrix 3, the engineers were able to merge our 9 harvest templates into 
only 2: one for HTTP and HTML authentication and one for the FTP crawl. 
Monitoring and QA are much more efficient as a result.

We are continuing to analyse Heritrix 3 to better prepare the broad crawl.

Best regards,
The BnF digital legal deposit team