[Netarchivesuite-curator] BnF NAS update for November

alexandre.chautemps at bnf.fr
Wed Nov 4 14:22:23 CET 2020


Dear all,

BnF's annual broad crawl is currently running. Midway through the crawl, 
we noticed that the volume of data stored per job was significantly 
smaller than expected: the projection came to less than 100 TB for the 
full crawl, instead of the 110 to 115 TB of our initial estimate. We 
therefore decided to raise the maximum number of URLs per domain from 
2000 to 2600. The crawl is continuing and should finish next Thursday or 
Friday, with a total size of 112 to 114 TB. It will have lasted 30 days, 
which makes it our fastest broad crawl since 2012 (although the 2012 
crawl was only 33 TB).
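
For readers who want to reproduce this kind of mid-crawl check, here is a
minimal sketch of the arithmetic, assuming a simple linear projection from
the average size per finished job. The function names and the job figures
are hypothetical and are not taken from NetarchiveSuite or from BnF's
actual monitoring; only the 2000 and 2600 limits come from the message.

```python
def relative_increase(old_limit: int, new_limit: int) -> float:
    """Fractional increase when a per-domain URL limit is raised."""
    return (new_limit - old_limit) / old_limit


def project_total_tb(stored_tb: float, jobs_done: int, jobs_total: int) -> float:
    """Linear projection: average size per finished job times total job count."""
    if jobs_done <= 0:
        raise ValueError("need at least one finished job")
    return stored_tb / jobs_done * jobs_total


if __name__ == "__main__":
    # Limits quoted in the message: 2000 raised to 2600 URLs per domain.
    print(f"limit increase: {relative_increase(2000, 2600):.0%}")
    # Hypothetical mid-crawl figures, for illustration only.
    print(f"projected size: {project_total_tb(48.0, 500, 1000):.1f} TB")
```

Raising the limit by a given fraction should not be expected to grow the
crawl by the same fraction, since only domains that actually reach the
limit are affected.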

France has been in lockdown again since 30 October. We continue to select 
websites related to the pandemic and to work on promoting the coronavirus 
collection built during the first half of this year. As part of our 
ephemeral news collection, we are covering the terrorist attacks of 
October and November (Conflans-Sainte-Honorine, Nice, Lyon) by selecting 
online newspaper articles and Twitter hashtags.

Best regards,

The BnF digital legal deposit team