<font size=2 face="sans-serif">Hello all,</font><br><br><font size=2 face="sans-serif">In the middle of February, we launched

our bi-annual crawl, which should collect around 2.25 TB. At the launch,

we encountered two problems. The first one concerned the saturation of

the server storage used for the creation of the deduplication index: we

need to rethink all our server workspaces with the new infrastructure.

Second, a few crawlers lost the connection to the NFS server when we restarted

the crawl and some jobs failed. We didn't restart the failed jobs individually

because in that case some information would be missing from the warcinfo record

in the WARCs.</font><br><br><font size=2 face="sans-serif">When we relaunched the whole crawl,

we again encountered the problem of two identical jobs being created with

the same ID: the harvest definition was paused automatically before

all the jobs were created. So we decided to stop the crawl and relaunch

it once again.</font><br><br><font size=2 face="sans-serif">As a result, there is almost no deduplication

for the bi-annual crawl and the amount of data crawled will therefore be

larger than expected.</font><br><br><font size=2 face="sans-serif">Since then, Lam has fixed the problem

of the resubmitted jobs: the harvestInfo.xml fields are now correctly added

to the warcinfo records for these jobs. We must therefore change our NAS

version to include this correction before launching our annual crawl.</font><br><br><font size=2 face="sans-serif">Best regards,</font><br><font size=2 face="sans-serif">The BnF digital legal deposit team</font>