[Netarchivesuite-curator] BnF NAS update for March

peter.stirling at bnf.fr peter.stirling at bnf.fr
Tue Mar 12 15:00:07 CET 2013


The big news here for March is that we have started transferring our web 
archives into the BnF digital repository, SPAR, which will ensure the 
long-term preservation of our collections. We have started with the 
current crawls, but we will be progressively loading the retrospective 
collections simultaneously with the ongoing crawls, starting with the most 
recent collections (those harvested with NAS) and working our way back to 
the historical collections from 1996. It will take at least several months 
and possibly up to a few years to complete the transfer of all our 
collections.

The ingest into SPAR is closely linked to the functioning of NAS: in 
addition to the crawled data produced by Heritrix, SPAR will also preserve 
the metadata ARC files produced by NAS, containing the configurations, 
reports and logs that describe the crawls. This allows SPAR to create 
coherent collections of data using three levels: the ARC, the crawl job 
(containing ARCs of both data and metadata) and the harvest definition 
(containing the jobs). The data model of SPAR is thus based on that of 
NAS, but will also be applied to earlier kinds of crawls (such as 
standalone Heritrix crawls performed by the BnF, broad crawls by Internet 
Archive and historical collections extracted by IA).
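
The three levels described above (ARC, crawl job, harvest definition) can 
be pictured with a minimal sketch. This is only an illustration of the 
nesting, not the actual SPAR or NAS data model: the class names, field 
names and file names below are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ArcFile:
    """Lowest level: a single ARC file (hypothetical fields)."""
    name: str
    is_metadata: bool  # True for a NAS metadata ARC, False for a Heritrix data ARC


@dataclass
class CrawlJob:
    """Middle level: one crawl job holding ARCs of both data and metadata."""
    job_id: int
    arcs: List[ArcFile] = field(default_factory=list)

    def data_arcs(self) -> List[ArcFile]:
        # ARCs containing the crawled data
        return [a for a in self.arcs if not a.is_metadata]

    def metadata_arcs(self) -> List[ArcFile]:
        # ARCs containing configurations, reports and logs
        return [a for a in self.arcs if a.is_metadata]


@dataclass
class HarvestDefinition:
    """Top level: a harvest definition grouping its jobs."""
    name: str
    jobs: List[CrawlJob] = field(default_factory=list)

    def all_arcs(self) -> List[ArcFile]:
        # Flatten every ARC of every job in this harvest definition
        return [a for job in self.jobs for a in job.arcs]


def build_example() -> HarvestDefinition:
    # Made-up names, used only to show how the levels nest
    job = CrawlJob(job_id=1, arcs=[
        ArcFile("example-data-1.arc", is_metadata=False),
        ArcFile("example-metadata-1.arc", is_metadata=True),
    ])
    return HarvestDefinition(name="example-harvest", jobs=[job])
```

Calling build_example() yields one harvest definition containing one job, 
which in turn contains one data ARC and one metadata ARC.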

As well as ingesting all our existing collections, work will continue on 
SPAR to allow it to handle WARC files, as this is a necessary step before 
we can transfer our harvesting workflow to the production of WARCs.

Best regards,
The BnF digital legal deposit team
