[Netarchivesuite-curator] BnF NAS update for November

peter.stirling at bnf.fr peter.stirling at bnf.fr
Fri Nov 8 17:35:22 CET 2013


Hello,

We started our 2013 broad crawl on 21 October, with a list of just 
over 4 million domains. We have had a few problems:

- New management hardware in the storage rack had some unpredictable side 
effects. Under heavy load, writes to ARC files were delayed for so long 
that Heritrix ran into timeout errors. We therefore had to reduce the 
number of threads per job. This has worked, although we still have to be 
careful when launching our selective crawls.
- The broad crawl also caused problems for the creation of the 
deduplication index for our daily crawl of subscription press. We now put 
the jobs from the broad crawl on pause each day while the index is 
created, which means we lose about 40 minutes per day.
- We discovered a bug in the new version of NAS, which affected the number 
of configurations per job in the broad crawl - when a selective crawl was 
launched before all the jobs of the broad crawl were created, this caused 
the number of configurations per job to be set to the number used in the 
selective crawls (500 instead of 3,500). The bug seems to occur at random, 
so we had not seen it during our tests. We suspended the launch of selective 
crawls until all the jobs for the broad crawl had been created, and will 
need to do the same for the second stage of the broad crawl.
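
As a rough illustration of why the configurations-per-job setting matters, 
the sketch below estimates how many jobs the broad crawl would be split 
into at each limit. It assumes exactly 4,000,000 domains (the actual list 
was "just over" that) and one configuration per domain; the function name 
jobs_needed is hypothetical and is not part of NAS.

```python
def jobs_needed(total_configs: int, configs_per_job: int) -> int:
    """Number of jobs needed to hold all configurations (ceiling division)."""
    if configs_per_job <= 0:
        raise ValueError("configs_per_job must be positive")
    return -(-total_configs // configs_per_job)


# Assumption: roughly 4 million domains, one configuration per domain.
APPROX_DOMAINS = 4_000_000

intended = jobs_needed(APPROX_DOMAINS, 3500)  # limit meant for the broad crawl
affected = jobs_needed(APPROX_DOMAINS, 500)   # limit taken from selective crawls

print(f"Jobs at 3,500 configurations per job: {intended}")
print(f"Jobs at 500 configurations per job:   {affected}")
```

Under these assumptions the broad crawl needs 1,143 jobs at the intended 
limit, against 8,000 if every job were created with the selective-crawl 
limit. Since the bug only affected jobs created after a selective crawl was 
launched, the real figure would lie somewhere between the two.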

We have managed to work round these problems and the crawl is continuing, 
but the first stage will take slightly longer than planned.

Also during November we will be starting a crawl for the centenary of 
World War I. The first crawl will be based on the official commemoration 
sites, and we will be expanding the scope for subsequent crawls during the 
period 2014-2018.

Best regards,
The BnF digital legal deposit team









