[Netarchivesuite-curator] BnF NAS update for August

peter.stirling at bnf.fr
Tue Aug 8 11:06:36 CEST 2017

Hello all,

We are continuing work on full-text indexing of our news collection, as 
mentioned in our June update. We're pleased to say that the research 
project on the study of neologisms has been accepted, and we have started 
working with the research team to prepare their analysis by improving 
our indexing process, in particular the processing that is applied to 
the text.

As part of the work on full-text indexing we have been studying the 
question of how to define collections in the index. In the first instance 
this is to allow searching in the three specific collections we are 
currently indexing (1996-2000, the 2015 terrorist attacks and now the news 
crawl), but also in the longer term to allow users to limit searches to 
particular parts of the collections. The organisation of the crawls in NAS 
means that the subject organisation used in our selection tool BCWeb is 
not represented in the data, so we plan to use a high-level distinction 
between broad crawls, focussed crawls and historical collections 
(extractions from the Internet Archive), with sub-collections to be used for 
specific areas like the news crawl, election crawls, etc. In order to do 
this we have adapted the way collections are defined in WARCIndexer to 
allow it to use elements from W/ARC filenames.
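
To illustrate the idea of deriving a collection and sub-collection from a 
W/ARC filename, here is a minimal sketch in Python. It is not the 
WARCIndexer configuration or code: the filename prefixes (BROAD-, 
FOCUS-NEWS-, IA-, etc.), the collection labels and the classify function 
are all hypothetical and chosen only for the example; real NAS filenames 
follow a different pattern.

```python
import re
from typing import NamedTuple, Optional


class Collection(NamedTuple):
    collection: str               # high-level type: broad, focused, historical
    subcollection: Optional[str]  # e.g. "news", "elections", or None


# Hypothetical rules: each pattern is tested against the bare filename.
# Order matters: more specific prefixes come before the general ones.
RULES = [
    (re.compile(r"^BROAD-"), Collection("broad", None)),
    (re.compile(r"^FOCUS-NEWS-"), Collection("focused", "news")),
    (re.compile(r"^FOCUS-ELECTIONS-"), Collection("focused", "elections")),
    (re.compile(r"^FOCUS-"), Collection("focused", None)),
    (re.compile(r"^IA-"), Collection("historical", None)),
]


def classify(filename: str) -> Optional[Collection]:
    """Return the collection for a W/ARC file, or None if no rule matches."""
    name = filename.rsplit("/", 1)[-1]  # drop any directory part
    for pattern, result in RULES:
        if pattern.search(name):
            return result
    return None


if __name__ == "__main__":
    for f in ["FOCUS-NEWS-20170801-00001.warc.gz", "/data/IA-1996-0001.arc.gz"]:
        print(f, classify(f))
```

Keeping the rules as an ordered list means a new sub-collection can be 
added by inserting one line above the general rule for its crawl type.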

Best regards,
The BnF digital legal deposit team
