<font size=2 face="sans-serif">Hello all,</font><br><br><font size=2 face="sans-serif">Our 2017 broad crawl was launched on
16 October. The settings are 1500 URLs per domain, with a limit of
3 days per job. Our test-based estimate of the overall volume
appears to have been too low: we had calculated around 77 TB with
these settings and after three weeks of crawling we are now expecting a
final volume of around 97 TB. This is still within our overall storage
budget but we are keeping a close watch on the volume of data collected.
So far we have encountered no major problems: both H3 and the new infrastructure
are functioning correctly. </font><br><br><font size=2 face="sans-serif">We are also continuing to work on updating
our full-text indexing process with the aim of indexing our news crawls
from 2016 onwards. We have been updating the indexing schema to follow recent
developments in warc-indexer and we will be working on the organisation
of the index to improve query performance. The research project that will
use this index to study neologisms is starting this week, so we will be
working closely with a research engineer over the next few weeks.</font><br><br><font size=2 face="sans-serif">Best regards,</font><br><font size=2 face="sans-serif">The BnF digital legal deposit team</font>