[Netarchivesuite-curator] BnF NAS update for March
peter.stirling at bnf.fr
Tue Mar 6 09:30:54 CET 2018
Hello all,
In the middle of February, we launched our bi-annual crawl, which should
collect around 2.25 TB. At launch, we encountered two problems. The
first was the saturation of the server storage used to create the
deduplication index: we need to rethink all our server
workspaces with the new infrastructure. The second was that a few crawlers
lost their connection to the NFS server when we restarted the crawl, and some
jobs failed. We didn't restart the failed jobs individually because in that
case some information is missing from the warcinfo record in the WARCs.
When we relaunched the whole crawl, we again encountered the problem of
two identical jobs being created with the same ID: the harvest definition
was paused automatically before all the jobs were created. So we decided
to stop the crawl and relaunch it once again.
As a result, there is almost no deduplication for the bi-annual crawl, and
the amount of data crawled will therefore be larger than expected.
Since then, Lam has fixed the problem with resubmitted jobs: the
harvestInfo.xml fields are now correctly added to the warcinfo records for
these jobs. We must therefore upgrade to a NAS version that includes this
fix before launching our annual crawl.
Best regards,
The BnF digital legal deposit team