[Netarchivesuite-curator] BnF NAS update for January

peter.stirling at bnf.fr peter.stirling at bnf.fr
Fri Jan 11 14:09:49 CET 2013


Hello all,

The BnF digital legal deposit team wishes you an excellent new year! We 
look forward to continuing to work with you all in 2013.

It is time for our annual report: in 2012, we harvested 2.2 billion URLs 
and exceeded our forecast volume, reaching 90 TB (instead of an expected 
80 TB). For the first time, the volume for selective crawls (57 TB) is 
greater than that for the broad crawl (33 TB), due to the harvests of 
elections, Dailymotion and blog platforms.

In terms of MIME type, video files came out on top (28%), ahead of text 
files (26%). Indeed, the Dailymotion harvest returned many more videos 
this year than in previous years. However, this success should not hide 
the fact that we collected three different qualities of each video, which 
may not be very useful, or the fact that we cannot yet collect other 
platforms such as YouTube or Vimeo. This gives us new objectives for 2013.

And just for fun, one last number: on average, about 70 URLs were 
harvested per second over the course of the year.
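
As a rough cross-check of the per-second average, the annual total can be 
divided by the number of seconds in 2012 (a leap year). This is only a 
sketch: the function name is illustrative, and it assumes harvesting was 
spread evenly over the whole year, which real crawls are not. With the 
2.2 billion URLs reported for 2012, it works out to roughly 70 URLs per 
second.

```python
import calendar


def average_urls_per_second(total_urls: float, year: int) -> float:
    """Average harvest rate, assuming URLs are spread evenly over the year."""
    # 2012 is a leap year, so it has 366 days rather than 365.
    days = 366 if calendar.isleap(year) else 365
    seconds_in_year = days * 24 * 60 * 60
    return total_urls / seconds_in_year


# 2.2 billion URLs harvested in 2012, as reported in this message.
rate = average_urls_per_second(2.2e9, 2012)
print(f"{rate:.1f} URLs per second")
```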

On the technical side:
- the Petaboxes were getting too old, so we transferred all the data onto 
new storage racks.
- the nomination tool BCWeb is now completely integrated into the 
production workflow.
- the migration of 30,000 seeds from BCWeb to NetarchiveSuite was highly 
efficient.
- the NetarchiveSuite database contained 3.2 million domains at the end of 
the year.
- the interactions between NetarchiveSuite and Heritrix were sometimes 
complicated, both for the Dailymotion harvest and for the new project of 
harvesting password-protected content (for regional newspapers).

Best regards,
The BnF digital legal deposit team