[Netarchivesuite-curator] Netarchive NAS update for March

Sabine Schostag sas at statsbiblioteket.dk
Mon Mar 25 15:26:28 CET 2013

Dear all,

Here is the brief March update from Netarchive:

We have deployed Release 4.0 in our production environment, but it caused us some trouble, so we had to postpone our first broad crawl for 2013. We are currently on the first step of the broad crawl: domains up to a size of 10 MB.

We have captured at least 15,000 YouTube videos. Jons son had created a special tool to capture the URLs of the videos, but this tool no longer works. A new tool is on its way, and as soon as our documentation is complete, I’ll translate it into English and put it on the NAS curator wiki.

We are doing some special “on demand harvests” for some of our researchers.

We participated in the IIPC papal election URL nomination.

Otherwise business as usual :)

On behalf of the Netarchive curators

DIREKTE 8946 2148
STATSBIBLIOTEKET
CVR/SE 1010 0682 – EAN 579800079108

From: netarchivesuite-curator-bounces at ml.sbforge.org [mailto:netarchivesuite-curator-bounces at ml.sbforge.org] On Behalf Of peter.stirling at bnf.fr
Sent: Tuesday, March 12, 2013 3:00 PM
To: netarchivesuite-curator at ml.sbforge.org
Subject: [Netarchivesuite-curator] BnF NAS update for March


The big news here for March is that we have started transferring our web archives into the BnF digital repository, SPAR, which will ensure the long-term preservation of our collections. We have started with the current crawls, but we will be progressively loading the retrospective collections simultaneously with the ongoing crawls, starting with the most recent collections (those harvested with NAS) and working our way back to the historical collections from 1996. It will take at least several months and possibly up to a few years to complete the transfer of all our collections.

The ingest into SPAR is closely linked to the functioning of NAS: in addition to the crawled data produced by Heritrix, SPAR will also preserve the metadata ARC files produced by NAS, which contain the configurations, reports and logs that describe the crawls. This allows SPAR to create coherent collections of data using three levels: the ARC, the crawl job (containing ARCs of both data and metadata) and the harvest definition (containing the jobs). The data model of SPAR is thus based on that of NAS, but it will also be applied to earlier kinds of crawls (such as standalone Heritrix crawls performed by the BnF, broad crawls by Internet Archive and historical collections extracted by IA).
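
For readers who find it easier to see the three levels as a structure, here is a minimal Python sketch of the hierarchy described above. It is only an illustration: the class and field names (ArcFile, CrawlJob, HarvestDefinition, is_metadata and so on) are hypothetical and do not reflect the actual SPAR or NAS schemas.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ArcFile:
    """Lowest level: a single ARC file (hypothetical model)."""
    name: str
    is_metadata: bool = False  # True for a metadata ARC produced by NAS


@dataclass
class CrawlJob:
    """Middle level: a crawl job holding both data and metadata ARCs."""
    job_id: int
    arcs: List[ArcFile] = field(default_factory=list)

    def data_arcs(self) -> List[ArcFile]:
        # ARCs holding the crawled data produced by Heritrix
        return [a for a in self.arcs if not a.is_metadata]

    def metadata_arcs(self) -> List[ArcFile]:
        # ARCs holding configurations, reports and logs
        return [a for a in self.arcs if a.is_metadata]


@dataclass
class HarvestDefinition:
    """Top level: a harvest definition containing its jobs."""
    name: str
    jobs: List[CrawlJob] = field(default_factory=list)

    def all_arcs(self) -> List[ArcFile]:
        # Flatten every ARC of every job in this harvest definition
        return [a for job in self.jobs for a in job.arcs]
```

A collection would then be built top-down: one HarvestDefinition holding several CrawlJob objects, each with its own data and metadata ARCs. Older crawls that were not produced by NAS could presumably be mapped onto the same three levels, as the paragraph above suggests.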

As well as ingesting all our existing collections, work will continue on SPAR to allow it to handle WARC files, as this is a necessary step before we can transfer our harvesting workflow to the production of WARCs.

Best regards,
The BnF digital legal deposit team
