[Netarchivesuite-curator] BnF NAS update for August
peter.stirling at bnf.fr
Tue Aug 8 11:06:36 CEST 2017
Hello all,
We are continuing work on full-text indexing of our news collection, as
mentioned in our June update. We're pleased to say that the research
project on the study of neologisms has been accepted and we have started
to work with the research team to prepare their analysis by improving
our indexing process, in particular the processing that is applied to
the text.
As part of the work on full-text indexing we have been studying the
question of how to define collections in the index. In the first instance
this is to allow searching in the three specific collections we are
currently indexing (1996-2000, 2015 terrorist attacks and now the news
crawl), but also in the longer term to allow users to limit searches to
particular parts of the collections. The organisation of the crawls in NAS
means that the subject organisation used in our selection tool BCWeb is
not represented in the data, so we plan to use a high-level distinction
between broad crawls, focussed crawls and historical collections
(extractions from Internet Archive), with sub-collections to be used for
specific areas like the news crawl, election crawls, etc. In order to do
this we have adapted the way collections are defined in WARCIndexer to
allow it to use elements from W/ARC filenames.
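As a rough illustration of the idea only (this is not the WARCIndexer change itself), the sketch below derives a collection and sub-collection from a W/ARC filename. It assumes a filename layout of job id, harvest id, timestamp, serial number and host name separated by hyphens; the harvest ids, the mapping table and the function name collections_for are all hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumed filename layout (an assumption, not taken from the message):
#   <job id>-<harvest id>-<timestamp>-<serial>-<host>.warc[.gz] or .arc[.gz]
FILENAME_RE = re.compile(
    r"^(?P<job>\d+)-(?P<harvest>\d+)-(?P<timestamp>\d{14,17})"
    r"-(?P<serial>\d+)-(?P<host>.+?)\.w?arc(\.gz)?$"
)

# Hypothetical mapping from harvest id to (collection, sub-collection).
HARVEST_COLLECTIONS = {
    "101": ("broad", None),
    "202": ("focussed", "news"),
    "303": ("focussed", "elections"),
}


def collections_for(filename: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (collection, sub-collection) for a W/ARC filename.

    Returns (None, None) when the filename does not follow the assumed
    layout or when its harvest id is not in the mapping.
    """
    match = FILENAME_RE.match(filename)
    if match is None:
        return (None, None)
    return HARVEST_COLLECTIONS.get(match.group("harvest"), (None, None))


if __name__ == "__main__":
    name = "4521-202-20170808110636-00001-crawler01.example.org.warc.gz"
    print(collections_for(name))
```

Files that do not follow the assumed layout, which may be the case for the historical collections, would need a separate rule; here they simply fall through to (None, None).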
Best regards,
The BnF digital legal deposit team