[Netarchivesuite-curator] BnF NAS update for November
peter.stirling at bnf.fr
Fri Nov 8 17:35:22 CET 2013
Hello,
We started our 2013 broad crawl on 21 October, with a list of just
over 4 million domains. We have had a few problems:
- New management hardware in the storage rack had some unpredictable side
effects. Under heavy load, writes to ARC files were delayed for so long
that Heritrix ran into timeout errors. We therefore had to
reduce the number of threads per job. This has worked, although we still
have to be careful when launching our selective crawls.
- The broad crawl also caused problems for the creation of the
deduplication index for our daily crawl of subscription press. We now put
the jobs from the broad crawl on pause each day while the index is
created, which means we lose about 40 minutes per day.
- We discovered a bug in the new version of NAS, which affected the number
of configurations per job in the broad crawl - when a selective crawl was
launched before all the jobs of the broad crawl were created, this caused
the number of configurations per job to be set to the number used in the
selective crawls (500 instead of 3,500). The bug appears to be intermittent,
so we had not seen it during our tests. We suspended the launch of selective
crawls until all the jobs for the broad crawl had been created, and will
need to do the same for the second stage of the broad crawl.
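To give a rough idea of why the configurations-per-job limit matters, here is a minimal sketch of the arithmetic. It assumes one configuration per domain and a round figure of 4,000,000 domains (the real list is slightly larger), and it looks at the hypothetical worst case in which every job is created with the wrong limit. The function name is ours, for illustration only, and is not part of NAS.

```python
def jobs_needed(total_configs: int, configs_per_job: int) -> int:
    """Number of jobs needed to hold all configurations (ceiling division)."""
    return -(-total_configs // configs_per_job)

# Assumption: one configuration per domain, rounded down to 4 million.
TOTAL_CONFIGS = 4_000_000

intended = jobs_needed(TOTAL_CONFIGS, 3500)  # limit meant for the broad crawl
affected = jobs_needed(TOTAL_CONFIGS, 500)   # limit taken from selective crawls

print(f"Jobs at 3,500 configurations per job: {intended}")
print(f"Jobs at 500 configurations per job:   {affected}")
```

Under these assumptions the smaller limit would multiply the number of jobs by about seven. In practice only the jobs created after a selective crawl was launched were affected.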
We have managed to work round these problems and the crawl is continuing,
but the first stage will take slightly longer than planned.
Also during November we will be starting a crawl for the centenary of
World War I. The first crawl will be based on the official commemoration
sites, and we will be expanding the scope for subsequent crawls during the
period 2014-2018.
Best regards,
The BnF digital legal deposit team