[Netarchivesuite-curator] BnF NAS update for January

Sabine Schostag sas at statsbiblioteket.dk
Mon Jan 14 15:43:03 CET 2013

Dear all.

The Netarchive team also wishes you all the best for 2013, and we look forward to continued fruitful cooperation with the BnF and ONB teams.

Here is our non-technical update for the last month:

We successfully finished our fourth broad crawl for 2012 at the beginning of January. The crawl lasted 57 days and harvested 29,139,586,007,781 bytes (about 29 TB) in 692,387,251 objects.

Our script for downloading videos from YouTube no longer works; we have to find a solution.

Another big challenge is the news sites, most of which are introducing paywalls. Netarchive prefers IP validation as the means of access to content behind a login, but most news sites do not want to grant it. They mostly offer us an HTML login instead. Frankly, we do not have the resources to maintain HTML logins.

Our archive now has 15 users, and our Wayback machine is up to date with the last broad crawl.
We have installed one stand-alone PC at SB for restricted Wayback access to Netarchive data.

Direct: 8946 2148
STATSBIBLIOTEKET
CVR/SE 1010 0682 – EAN 579800079108

From: netarchivesuite-curator-bounces at ml.sbforge.org [mailto:netarchivesuite-curator-bounces at ml.sbforge.org] On Behalf Of peter.stirling at bnf.fr
Sent: Friday, January 11, 2013 2:10 PM
To: netarchivesuite-curator at ml.sbforge.org
Cc: DBN_DLWEB at bnf.fr
Subject: [Netarchivesuite-curator] BnF NAS update for January

Hello all,

The BnF digital legal deposit team wishes you an excellent new year! And we are looking forward to continuing to work with you all in 2013.

It is time for our annual report: in 2012, we harvested 2.2 billion URLs and exceeded our forecast volume, reaching 90 TB (instead of the expected 80 TB). For the first time, the volume of selective crawls (57 TB) is greater than that of the broad crawl (33 TB), due to the harvests of elections, Dailymotion and blog platforms.

In terms of MIME type, video files came out on top (28%), ahead of text files (26%). The harvest of Dailymotion returned many more videos this year than in previous years. However, this success should not hide the fact that we collected three different qualities of each video, which may not be very useful, nor the fact that we cannot yet collect other platforms such as YouTube or Vimeo. This gives us new objectives for 2013.

And just for fun, one last number: on average, about 70 URLs were harvested per second over the course of the year (2.2 billion URLs in 366 days).

On the technical side:
- the Petaboxes were getting too old, so we transferred all the data onto new storage racks.
- the nomination tool BCWeb is now completely integrated into the production workflow.
- the migration of 30,000 seeds from BCWeb to NetarchiveSuite was highly efficient.
- the NetarchiveSuite database contained 3.2 million domains at the end of the year.
- the interactions between NetarchiveSuite and Heritrix were sometimes complicated for the Dailymotion harvest and for the new project of harvesting password-protected content (for regional newspapers).

Best regards,
The BnF digital legal deposit team
