[Netarchivesuite-curator] BnF NAS update for November
peter.stirling at bnf.fr
Thu Nov 6 12:15:03 CET 2014
Our 2014 broad crawl was launched on the 20th October. For this crawl we
have installed NAS version 4.4 (with a patch to handle prefixing of
metadata files), and we have also started crawling in WARC - this has been
the main focus of our developments this year, as we have had to adapt all
parts of our production (mainly indexing, access, quality control and
preservation) to handle the new format. With the exception of SPAR, all
our applications now handle WARC, and SPAR should follow next year.
The 2014 budget for the broad crawl is limited to 55 TB, with the same
number of seed domains (4 million) as in 2013. We therefore face the same
problem as last year, namely how to cover the French perimeter with a
limited storage volume:
- No contact has been made with a new registrar, so we are not extending
the seed list. The harvest therefore covers only 60% of the legal scope
defined by the French Heritage Code.
- An analysis was made of the 1,000 largest domains collected in 2013.
It showed that 26 domains could be excluded through global crawler
traps in NetarchiveSuite. These were sites in English (trade, pornography,
parking) representing approximately 0.61 TB compressed.
- An analysis was done on HTTP 4XX errors. Excluding all of these files
would remove errors caused by Heritrix, but also some real web pages. The
volume of 4XX errors was calculated for the 2013 broad crawl and came to
at most 1 TB uncompressed. This did not justify further studies to reduce
this type of URL.
- We are crawling in a single stage only, to avoid crawling certain files
twice.
- The error penalty parameter was set at 100 to exclude bad URLs faster.
- The budget for each domain is limited to 3,000 URLs (around 100 MB).
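
As a rough check on the figures above, the short Python sketch below
compares the overall budget with the per-domain cap. The numbers come from
this message; the decimal unit conversion (1 TB = 1,000,000 MB) and the
function names are assumptions made for illustration only.

```python
# Back-of-the-envelope check of the 2014 broad crawl budget.
# Figures are taken from the message; decimal units are an assumption.
TOTAL_BUDGET_TB = 55          # overall storage budget
SEED_DOMAINS = 4_000_000      # number of seed domains
PER_DOMAIN_CAP_MB = 100       # "around 100 MB" for 3,000 URLs

MB_PER_TB = 1_000_000         # assumed decimal conversion


def average_mb_per_domain(budget_tb: float, domains: int) -> float:
    """Average storage available per domain if the budget were spread evenly."""
    return budget_tb * MB_PER_TB / domains


def worst_case_tb(domains: int, cap_mb: float) -> float:
    """Total volume if every domain reached its per-domain cap."""
    return domains * cap_mb / MB_PER_TB


average = average_mb_per_domain(TOTAL_BUDGET_TB, SEED_DOMAINS)
worst = worst_case_tb(SEED_DOMAINS, PER_DOMAIN_CAP_MB)

print(f"Average per domain: {average} MB")
print(f"Volume if all domains hit the cap: {worst} TB")
```

Under these assumptions the budget averages out to well under the
per-domain cap, so the cap alone would not keep the crawl within 55 TB if
every domain reached it.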
The second challenge is to boost the speed. The crawlers are powerful and
we now have to work on processing space and web connections. At the
beginning of the broad crawl, we launched only 10 crawlers at a time and
progressively built up to 40 crawlers. But in the Heritrix logs, we found
that too many URLs had a "time trunc" or "time out" code because the
connection was lost. Our bandwidth for digital legal deposit is limited to
200 MB/s during the day and 850 MB/s during the night. The lower limit
is too small: we are trying to get it increased, but failing that, we will
change the number of threads in Heritrix (from 150 threads, down to 100
threads during the day and up to 200 threads during the night).
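
The day/night thread plan described above could be sketched as follows.
This is only an illustration: the thread counts come from this message,
but the hours at which "day" starts and ends are hypothetical, and how the
value would actually be applied to Heritrix is not shown here.

```python
# Sketch of the planned day/night thread counts.
# Thread counts are from the message; the day window is a hypothetical choice.
DAY_THREADS = 100
NIGHT_THREADS = 200
DAY_START_HOUR = 8    # hypothetical: start of the low-bandwidth period
DAY_END_HOUR = 20     # hypothetical: end of the low-bandwidth period


def threads_for_hour(hour: int) -> int:
    """Return the planned thread count for a given hour (0-23)."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    if DAY_START_HOUR <= hour < DAY_END_HOUR:
        return DAY_THREADS
    return NIGHT_THREADS


for hour in (2, 12, 22):
    print(f"{hour:02d}:00 -> {threads_for_hour(hour)} threads")
```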
The BnF digital legal deposit team