[Netarchivesuite-curator] BnF NAS update for March

peter.stirling at bnf.fr
Tue Mar 6 09:30:54 CET 2018


Hello all,

In the middle of February, we launched our bi-annual crawl, which should
collect around 2.25 TB. At the launch, we encountered two problems. The
first was that the server storage used to create the deduplication index
filled up: we need to rethink all our server workspaces with the new
infrastructure. The second was that a few crawlers lost the connection to
the NFS server when we restarted the crawl, and some jobs failed. We
didn't restart the failed jobs individually because in that case some
information is missing from the warcinfo record in the WARCs.

When we relaunched the whole crawl, we again encountered the problem of
two identical jobs being created with the same ID: the harvest definition
was paused automatically before all the jobs were created. So we decided
to stop the crawl and relaunch it once again.

As a result, there is almost no deduplication for the bi-annual crawl, and
the amount of data crawled will therefore be larger than expected.

Since then, Lam has fixed the problem with resubmitted jobs: the
harvestInfo.xml fields are now correctly added to the warcinfo records for
these jobs. We must therefore move to a NAS version that includes this
fix before launching our annual crawl.

Best regards,
The BnF digital legal deposit team