[Netarchivesuite-curator] BnF NAS update for January
peter.stirling at bnf.fr
Fri Jan 11 14:09:49 CET 2013
Hello all,
The BnF digital legal deposit team wishes you an excellent new year! We
are looking forward to continuing to work with you all in 2013.
It is time for our annual report: in 2012, we harvested 2.2 billion URLs
and exceeded our forecast volume, collecting 90 TB (instead of an
expected 80 TB). For the first time, the volume for selective crawls (57
TB) is greater than that for the broad crawl (33 TB), owing to the harvests
of elections, Dailymotion and blog platforms.
In terms of MIME type, video files came out on top (28%), ahead of text
files (26%). Indeed, the Dailymotion harvest returned many more videos
this year than in previous years. However, this success should not hide the
fact that we collected three different qualities of each video, which may
not be very useful, nor the fact that we cannot yet collect other platforms
such as YouTube or Vimeo. This gives us new objectives for 2013.
And just for fun, one last number: on average, roughly 70 URLs were
harvested per second over the course of the year (2.2 billion URLs spread
over the 366 days of 2012).
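The per-second average follows directly from the annual total. A minimal sketch in Python, assuming the 2.2 billion URLs are averaged evenly over every second of 2012 (a leap year) and that the two crawl volumes are in the same unit as reported above; the function and variable names are illustrative only and do not come from NetarchiveSuite or any BnF tool.

```python
import calendar


def urls_per_second(total_urls: int, year: int) -> float:
    """Average harvest rate, assuming harvesting is spread evenly over the whole year."""
    days = 366 if calendar.isleap(year) else 365
    seconds_in_year = days * 24 * 60 * 60
    return total_urls / seconds_in_year


# Figures taken from the report above.
TOTAL_URLS_2012 = 2_200_000_000
SELECTIVE_VOLUME = 57  # selective crawls
BROAD_VOLUME = 33      # broad crawl

rate = urls_per_second(TOTAL_URLS_2012, 2012)
total_volume = SELECTIVE_VOLUME + BROAD_VOLUME
selective_share = SELECTIVE_VOLUME / total_volume

print(f"Average harvest rate: {rate:.1f} URLs per second")
print(f"Total volume: {total_volume}")
print(f"Selective crawls: {selective_share:.0%} of total volume")
```

Under these assumptions the rate comes out just under 70 URLs per second, and the two crawl volumes sum to the reported total of 90.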
On the technical side:
- the Petaboxes were becoming too old, so we transferred all the data onto
new storage racks.
- the nomination tool BCWeb is now completely integrated into the
production workflow.
- the migration of 30,000 seeds from BCWeb to NetarchiveSuite went very
efficiently.
- the NetarchiveSuite database contained 3.2 million domains at the end of
the year.
- the interactions between NetarchiveSuite and Heritrix were sometimes
complicated, both for the Dailymotion harvest and for the new project of
harvesting password-protected content (for regional newspapers).
Best regards,
The BnF digital legal deposit team