[Netarchivesuite-curator] ONB update for March

Mayr Michaela michaela.mayr at onb.ac.at
Tue Mar 26 13:59:44 CET 2013

Dear all,

· We started our domain crawl with 1.4 million seeds in mid-February. Stage 1, with a budget of 10 MB per domain, has been completed. Before we start stage 2, we will get new hardware (crawler machines).

· In 2013 we have a strong focus on politics. We created a new collection including websites from government, administration, political parties, blogs etc. Parliamentary elections will take place in September, and we also have a couple of regional elections.

· We have our first case in which we have to delete data from our web archive. Last year we imported legacy data from 1997/98 into the web archive. Legal deposit for online media has been in place since 2009. A user requested deletion of his data (from 1997). We are now developing a workflow for the deletion process.

Best Regards


From: netarchivesuite-curator-bounces at ml.sbforge.org [mailto:netarchivesuite-curator-bounces at ml.sbforge.org] On Behalf Of Sabine Schostag
Sent: Monday, 25 March 2013 15:26
To: 'peter.stirling at bnf.fr'; 'netarchivesuite-curator at ml.sbforge.org'
Subject: Re: [Netarchivesuite-curator] Netarchive NAS update for March

Dear all,

Here is the brief March update from Netarchive:

We have implemented Release 4.0 in our production environment, but that caused us some trouble, so we had to postpone our first broad crawl for 2013. We are currently on the first step of the broad crawl: domains up to a size of 10 MB.

We have captured at least 15,000 YouTube videos. Jon's son had created a special tool to capture the URLs of the videos, but this tool no longer works. A new tool is on its way, and as soon as our documentation is complete, I'll translate it into English and put it on the NAS curator wiki.

We are doing some special «on demand harvests» for some of our researchers.

We participated in the IIPC papal election URL nomination.

Otherwise, business as usual :-)

On behalf of the Netarchive curators





From: netarchivesuite-curator-bounces at ml.sbforge.org [mailto:netarchivesuite-curator-bounces at ml.sbforge.org] On Behalf Of peter.stirling at bnf.fr
Sent: Tuesday, March 12, 2013 3:00 PM
To: netarchivesuite-curator at ml.sbforge.org
Subject: [Netarchivesuite-curator] BnF NAS update for March



The big news here for March is that we have started transferring our web archives into the BnF digital repository, SPAR, which will ensure the long-term preservation of our collections. We have started with the current crawls, but we will be progressively loading the retrospective collections simultaneously with the ongoing crawls, starting with the most recent collections (those harvested with NAS) and working our way back to the historical collections from 1996. It will take at least several months and possibly up to a few years to complete the transfer of all our collections.

The ingest into SPAR is closely linked to the functioning of NAS: in addition to the crawled data produced by Heritrix, SPAR will also preserve the metadata ARC files produced by NAS, containing the configurations, reports and logs that describe the crawls. This allows SPAR to create coherent collections of data using three levels: the ARC, the crawl job (containing ARCs of both data and metadata) and the harvest definition (containing the jobs). The data model of SPAR is thus based on that of NAS, but will also be applied to earlier kinds of crawls (such as standalone Heritrix crawls performed by the BnF, broad crawls by Internet Archive and historical collections extracted by IA).

As well as ingesting all our existing collections, work will continue on SPAR to allow it to handle WARC files, as this is a necessary step before we can transfer our harvesting workflow to the production of WARCs.

Best regards,
The BnF digital legal deposit team

