[Netarchivesuite-curator] Netarchive NAS update for May
sas at kb.dk
Tue Jun 6 10:58:20 CEST 2017
Hereby an update from Denmark ☺
These weeks we are focusing on getting familiar with BCWeb and on adapting it to our needs. There will be local and regional elections at the end of the year, and we would very much like to have a “Netarchive-BCweb” by then, because we want to involve researchers and experts (for instance, in the use of social media) in helping us find URLs.
One crucial issue is that we need bulk upload of URLs to be implemented.
From: Netarchivesuite-curator [mailto:netarchivesuite-curator-bounces at ml.sbforge.org] On Behalf Of peter.stirling at bnf.fr
Sent: Monday, May 22, 2017 10:41 AM
To: netarchivesuite-curator at ml.sbforge.org
Subject: [Netarchivesuite-curator] BnF NAS update for May
As of the middle of March, the BnF is using NetarchiveSuite 5 and Heritrix 3 for its selective crawls.
The first conclusion is that the quality of the crawl is better than with Heritrix 1:
- The percentage of URLs with an HTTP response code 2XX is higher than 80 %, whereas with Heritrix 1 it was around 74 %.
- The number of 4XX responses is lower than with Heritrix 1. The duration of the crawls is shorter.
- Heritrix 3 crawls less content on domains outside the seed list than Heritrix 1; as a consequence, there is a decrease in the percentage of images in the crawls.
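For anyone who wants to compute comparable figures on their own crawls, here is a minimal sketch of how the share of each response class could be derived from a Heritrix crawl.log. It assumes the usual whitespace-separated layout in which the second field is the fetch status code. The function name and the sample lines are invented for illustration, and this is not necessarily how the BnF figures above were produced.

```python
from collections import Counter


def status_class_shares(lines):
    """Return the percentage of log lines per status class, e.g. '2XX'.

    Assumes the second whitespace-separated field is the fetch status
    code. Codes outside 100-599 (Heritrix-internal codes such as
    negative values) are grouped under 'other'.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or truncated lines
        try:
            code = int(fields[1])
        except ValueError:
            continue  # skip lines whose second field is not a number
        if 100 <= code <= 599:
            counts[f"{code // 100}XX"] += 1
        else:
            counts["other"] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cls: 100.0 * n / total for cls, n in counts.items()}


# Invented sample lines, shortened to the first few fields only.
sample = [
    "2017-05-22T08:41:00.000Z 200 1234 http://example.org/ - -",
    "2017-05-22T08:41:01.000Z 404 512 http://example.org/missing L http://example.org/",
    "2017-05-22T08:41:02.000Z 200 2048 https://example.org/news L http://example.org/",
    "2017-05-22T08:41:03.000Z -9998 - http://example.net/blocked L http://example.org/",
]

for status_class, share in sorted(status_class_shares(sample).items()):
    print(f"{status_class}: {share:.1f} %")
```

On a real log, the lines could be read with `open(path, encoding="utf-8", errors="replace")` and passed directly to the function.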
Despite the lack of deduplication for the first selective crawls, the volume of the crawls is lower than we had estimated. For example, the news crawl was previously between 0.25 and 0.3 TB per month, whereas now it is 0.14 TB. To avoid going over our storage budget, we had decreased all our budgets (in terms of URLs collected) by 10 % with the change to Heritrix 3.
For the moment, the most significant improvement is the crawl of HTTPS URLs. For example, more than 30 % of the seeds in the news crawl are in HTTPS, and with Heritrix 3 more than 80 % of them are harvested, compared with 69 % with Heritrix 1.
Heritrix 3 has also allowed us to simplify the harvest of subscription press sites. For this specific crawl, thanks to the new functionalities of Heritrix 3, the engineers were able to merge our 9 harvest templates into only 2: one for HTTP and HTML authentication and one for the FTP crawl. Monitoring and QA are now much more efficient.
We are continuing to analyse Heritrix 3 to better prepare the broad crawl.
The BnF digital legal deposit team