[Netarchivesuite-curator] BnF NAS update for November

peter.stirling at bnf.fr
Tue Nov 7 10:37:25 CET 2017


Hello all,

Our 2017 broad crawl was launched on 16 October. The settings are 
1500 URLs per domain, with a limit of 3 days per job. Our tests appear to 
have underestimated the overall volume: we had calculated around 77 TB 
with these settings, and after three weeks of crawling we now expect a 
final volume of around 97 TB (roughly 26% more). This is still within our 
overall storage budget, but we are keeping a close watch on the volume of 
data collected. So far we have encountered no major problems: both H3 and 
the new infrastructure are functioning correctly.

We are also continuing to update our full-text indexing process, with the 
aim of indexing our news crawls since 2016. We have been updating the 
indexing schema to follow recent developments in warc-indexer, and we 
will be working on the organisation of the index to improve query 
performance. The research project that will use this index to study 
neologisms starts this week, so we will be working closely with a 
research engineer over the next few weeks.

Best regards,
The BnF digital legal deposit team

