[Netarchivesuite-curator] BnF NAS update for April

Fri Apr 6 13:48:13 CEST 2018

Hello all,

As mentioned in previous updates last year, over the past months we have 
been working on our full-text indexing process and the search interface. 
Last week we opened access to the new version of our full-text search 
application "Archives de l'internet Labs", which is now integrated with 
the main interface. It has been updated with new graphics and new 
functions, in particular the grouping of identical URLs in search results. 
We have to thank Toke for his help with this, as we had real performance 
problems when we applied this grouping to searches with large numbers of 
results. Toke advised us to deactivate in Solr the count of the number of 
groups generated (which we used to calculate the number of pages of 
results), and this made a huge improvement. We also tidied up the 
code of the application with a view to allowing its use by other 
institutions who use warc-indexer.
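
For anyone who wants to try the same setting, here is a minimal sketch of a grouped Solr query with the group count switched off. The `group`, `group.field` and `group.ngroups` parameters are standard Solr result-grouping parameters; the core URL, the `url` field name and the "ask for one extra group to detect a next page" trick are assumptions for illustration, not a description of our application's code.

```python
from urllib.parse import urlencode


def build_grouped_query(base_url, text, page=0, rows=10, group_field="url"):
    """Build a Solr select URL that groups results without counting groups.

    base_url and group_field are hypothetical values for this sketch.
    """
    params = {
        "q": text,
        "group": "true",
        "group.field": group_field,
        # Skipping the group count avoids the expensive computation on
        # searches with very large result sets.
        "group.ngroups": "false",
        # With grouping enabled, start and rows apply to groups.
        "start": page * rows,
        # Assumption: request one extra group so the caller can tell whether
        # a next page exists without knowing the total number of pages.
        "rows": rows + 1,
    }
    return f"{base_url}/select?{urlencode(params)}"


if __name__ == "__main__":
    print(build_grouped_query("http://localhost:8983/solr/example", "presse", page=2))
```

The trade-off is that the interface can no longer display a total number of pages, only "previous" and "next" style navigation.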

On the indexing side, the main objective of the project was to index our 
daily news crawl since its creation at the end of 2010; we originally 
aimed to index the period up to the end of 2016 but we were able to extend 
this to the end of 2017. This means we were also able to process WARCs 
containing revisit records, which we have been producing since changing to 
Heritrix 3 last year. We worked with the community on the latest version 
of warc-indexer, in particular to define collection names based on W/ARC 
filenames, and therefore on the harvest definitions in NAS. The news crawl 
represented an increase in the amount of data indexed compared to the 
collections previously indexed (around 13 TB, compared to around 2.5 TB) 
and also in terms of the final index size, which roughly doubled to around 
2.4 TB. To handle this we also put in place a new infrastructure, and 
the performance is much better than that of our previous prototype, though 
we will aim to continue work on the configuration in future developments.

Work on the research project that is using the news crawl to study 
neologisms is ongoing. We are working with the research team to see if 
some of the analyses that they apply, such as Named Entity Recognition and 
Topic Modelling, can be included in our indexing and search systems.

Best regards,
The BnF digital legal deposit team