<font size=2 face="sans-serif">Hello all,</font><br><br><font size=2 face="sans-serif">As mentioned in previous updates last

year, over the past months we have been working on our full-text indexing

process and the search interface. Last week we opened access to the new

version of our full-text search application "Archives de l'internet

Labs", which is now integrated with the main interface. It has been

updated with new graphics and new functions, in particular the grouping

of identical URLs in search results. We have to thank Toke for his help

with this, as we had serious performance problems when we applied

this grouping to searches with large numbers of results. Toke advised

us to disable Solr's count of the number of groups generated (which

we had used to calculate the number of pages of results), and this greatly

improved performance. We also tidied up the code of the application with a view

to allowing its use by other institutions who use warc-indexer.</font><br><br><font size=2 face="sans-serif">On the indexing side, the main objective

of the project was to index our daily news crawl since its creation at

the end of 2010; we originally aimed to index the period up to the end

of 2016 but we were able to extend this to the end of 2017. This means

we were also able to process WARCs containing revisit records, which we have

been producing since changing to Heritrix 3 last year. We worked with the

community on the latest version of warc-indexer, in particular to define

collection names based on W/ARC filenames, and therefore on the harvest

definitions in NAS. The news crawl represented an increase in the amount

of data indexed compared to the collections previously indexed (around

13 TB, compared to around 2.5 TB) and also in terms of the final index

size, which roughly doubled to around 2.4 TB. To handle this we also

put in place a new infrastructure, and performance is much better

than with our previous prototype, though we will aim to continue work on the

configuration in future developments.</font><br><br><font size=2 face="sans-serif">Work on the research project that is

using the news crawl to study neologisms is ongoing. We are working with

the research team to see if some of the analyses that they apply, such

as Named Entity Recognition and Topic Modelling, can be included in our

indexing and search systems.</font><br><br><font size=2 face="sans-serif">Best regards,</font><br><font size=2 face="sans-serif">The BnF digital legal deposit team</font><font face="sans-serif"><hr />

</font>
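<font size=2 face="sans-serif">For other warc-indexer users, the grouping change described above can be sketched as follows. This is only an illustration, not our application code: it assumes the group count in question is Solr's <code>group.ngroups</code> parameter, that grouping is done on a field named <code>url</code>, and that pagination then works by asking for one extra group per page. The helper names <code>build_grouped_query</code> and <code>has_next_page</code> are hypothetical.</font>

```python
from urllib.parse import urlencode


def build_grouped_query(q, page=0, rows=10, group_field="url"):
    """Build a Solr query string for a grouped search without group counting.

    Assumptions (not taken from the original message): the field name
    "url" and the "ask for one extra group" pagination trick.
    """
    params = {
        "q": q,
        "group": "true",             # enable result grouping
        "group.field": group_field,  # one group per identical value of this field
        "group.ngroups": "false",    # do not count the total number of groups
        "group.limit": 1,            # documents returned inside each group
        "start": page * rows,        # with grouping, start/rows apply to groups
        "rows": rows + 1,            # one extra group reveals whether a next page exists
        "wt": "json",
    }
    return urlencode(params)


def has_next_page(groups_returned, rows=10):
    """True if Solr returned the extra group requested beyond the page size."""
    return groups_returned > rows


query_string = build_grouped_query("presse", page=2, rows=10)
print(query_string)
```

<font size=2 face="sans-serif">Without the group count the total number of result pages is no longer known, so an interface built this way would offer "next" and "previous" links rather than a full page list.</font>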