[Netarchivesuite-curator] BnF NAS update for November

peter.stirling at bnf.fr peter.stirling at bnf.fr
Thu Nov 6 12:15:03 CET 2014

Hello all,

Our 2014 broad crawl was launched on the 20th October. For this crawl we 
have installed NAS version 4.4 (with a patch to handle prefixing of 
metadata files), and we have also started crawling in WARC - this has been 
the main focus of our developments this year, as we have had to adapt all 
parts of our production (mainly indexing, access, quality control and 
preservation) to handle the new format. With the exception of SPAR, all 
our applications now handle WARC, and SPAR should follow next year.

The 2014 budget for the broad crawl is limited to 55 TB, with the same 
number of seed domains (4 million) as in 2013. So we face the same 
problem as last year, namely how to cover the French perimeter with a 
limited storage volume:
- No contact has been made with a new registrar, so we are not extending 
the seed list. The harvest therefore covers only 60% of the legal scope 
defined by the French Heritage Code.
- An analysis has been made of the 1,000 largest domains collected in 
2013. It showed that 26 domains could be excluded through global crawler 
traps in NetarchiveSuite. These were sites in English (trade, pornography, 
parking) representing approximately 0.61 TB compressed.
- An analysis was done on HTTP 4XX errors. Excluding all these files would 
get rid of errors caused by Heritrix, but also of some real web pages. The 
volume of 4XX errors was calculated for the 2013 broad crawl and 
represented at most 1 TB uncompressed. This did not justify further 
studies to reduce this type of URL.
- We are crawling in a single stage only, to avoid crawling certain files 
twice.
- The error penalty parameter was set at 100 to exclude bad URLs faster.
- The budget for each domain is limited to 3,000 URLs (around 100 MB).
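
As a rough illustration of why the per-domain cap alone does not keep the 
crawl within budget, the figures above can be put side by side. This is 
only a back-of-the-envelope sketch: it assumes decimal units (1 TB = 
1,000,000 MB), treats "around 100 MB" as exactly 100 MB, and ignores 
compression and deduplication. The function names are illustrative and 
are not part of NetarchiveSuite.

```python
# Figures quoted in the message above.
TOTAL_BUDGET_TB = 55        # storage budget for the 2014 broad crawl
SEED_DOMAINS = 4_000_000    # number of seed domains
PER_DOMAIN_CAP_MB = 100     # approximate size of a 3,000-URL domain budget

MB_PER_TB = 1_000_000       # assumption: decimal units


def worst_case_tb(domains, cap_mb):
    """Volume in TB if every domain filled its per-domain cap."""
    return domains * cap_mb / MB_PER_TB


def mean_budget_mb(total_tb, domains):
    """Average volume in MB available per domain within the total budget."""
    return total_tb * MB_PER_TB / domains


print(worst_case_tb(SEED_DOMAINS, PER_DOMAIN_CAP_MB))   # 400.0
print(mean_budget_mb(TOTAL_BUDGET_TB, SEED_DOMAINS))    # 13.75
```

Under these assumptions the cap would allow up to 400 TB if every domain 
reached it, while the 55 TB budget leaves an average of 13.75 MB per 
domain. The budget therefore holds only because most domains stay well 
below their cap.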

The second challenge is to boost the speed. The crawlers are powerful and 
now we have to work on processing space and web connections. At the 
beginning of the broad crawl, we launched only 10 crawlers at a time and 
progressively built up to 40 crawlers. But in the Heritrix logs, we found 
that too many URLs had a "time trunc" or "time out" code because the 
connection was lost. We have a limited bandwidth for digital legal deposit 
of 200 MB/s during the day and 850 MB/s during the night. The lower limit 
is too small: we are trying to get it increased, but failing that, we will 
change the number of threads in Heritrix (from 150 threads, down to 100 
threads during the day and up to 200 threads during the night).
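
The fallback thread schedule described above could be sketched as follows. 
This is only an illustration: the message does not say when "day" and 
"night" begin, so the 08:00 to 20:00 window is a hypothetical placeholder, 
and the function is not a Heritrix or NetarchiveSuite API.

```python
DAY_THREADS = 100     # proposed daytime thread count (from the message)
NIGHT_THREADS = 200   # proposed night-time thread count (from the message)


def threads_for_hour(hour, day_start=8, day_end=20):
    """Return the thread count for a given hour of the day (0-23).

    day_start and day_end are hypothetical; the message does not
    specify when the daytime bandwidth limit applies.
    """
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    if day_start <= hour < day_end:
        return DAY_THREADS
    return NIGHT_THREADS


for h in (2, 12, 20):
    print(h, threads_for_hour(h))
```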

Best regards,
The BnF digital legal deposit team
