[Netarchivesuite-curator] BnF NAS update for June
peter.stirling at bnf.fr
Fri Jun 7 16:27:14 CEST 2013
Hello all,
We have just finished our crawl of videos. We tried to harvest YouTube
without success, so we concentrated on Dailymotion, as in previous years.
Curators working on music, cinema or audiovisual companies, on
institutional communication, or on personalities first selected a total of
1,800 members' main pages. These URLs were registered in BCWeb and then
extracted from the database to form a single list. Then, using a script,
the IT team generated a complementary list of pages linked to each member's
main page and containing videos, and loaded it into NetarchiveSuite. This
process differs from other harvests, where URLs go directly from BCWeb to
NAS. Another script is then used to extract the URLs of the video files
from within the pages and add them to the job to be crawled, as Heritrix is
not able to identify these URLs by itself.
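To illustrate the kind of extraction step described above, here is a minimal sketch in Python. It is not the BnF script: the regular expression, the function name `extract_video_urls` and the assumption that video files appear in the page as absolute URLs ending in `.mp4` are all hypothetical, chosen only to show the idea.

```python
import re

# Hypothetical pattern (an assumption, not the real page structure):
# absolute URLs ending in .mp4, optionally followed by a query string.
VIDEO_URL_RE = re.compile(r'https?://[^\s"\'<>]+?\.mp4(?:\?[^\s"\'<>]*)?')


def extract_video_urls(html):
    """Return the distinct video URLs found in html, in order of first appearance."""
    seen = set()
    urls = []
    for match in VIDEO_URL_RE.finditer(html):
        url = match.group(0)
        if url not in seen:  # keep only the first occurrence of each URL
            seen.add(url)
            urls.append(url)
    return urls
```

The resulting list could then be appended to the seed list of the job, one URL per line.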
For the first time, we decided not to collect video pages that we had
already collected last year. To do this, the IT team put the list of URLs
in an external file called "exclude.txt", so that when Heritrix processes
its queues, it does not collect the URLs included in this file.
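The effect of such an exclusion list can be sketched as follows. This is only an illustration in Python of the filtering logic, not how Heritrix reads "exclude.txt" internally; the function names `load_excluded` and `filter_new` are hypothetical, and the sketch assumes one URL per line with exact string matching.

```python
def load_excluded(lines):
    """Build a set of URLs from an iterable of lines, ignoring blank lines."""
    return {line.strip() for line in lines if line.strip()}


def filter_new(candidates, excluded):
    """Keep only the candidate URLs that are not in the exclusion set."""
    return [url for url in candidates if url not in excluded]


# Usage (assuming the file exists in the working directory):
#   with open("exclude.txt", encoding="utf-8") as f:
#       excluded = load_excluded(f)
#   to_crawl = filter_new(candidate_urls, excluded)
```

A set is used so that each lookup stays fast even when the list from the previous year is large.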
We had one job running from April 2nd to May 17th (45 days). It then took
10 extra days to index the data, during which the indexing of the other
harvest definitions was put on hold. In total, we harvested 2.5 million
URLs for a volume of 11 TB, including 487,000 videos (MP4 format).
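For readers who want to check the figures, the following short Python calculation derives the crawl duration and a few averages from the numbers above. The averages are our own derived estimates, not measured values, and they assume decimal units (1 TB = 1,000,000 MB).

```python
from datetime import date

# Figures taken from the message above.
crawl_days = (date(2013, 5, 17) - date(2013, 4, 2)).days  # 45
urls = 2_500_000
volume_tb = 11
videos = 487_000

# Derived estimates (assumption: decimal units, 1 TB = 1,000,000 MB).
urls_per_day = urls / crawl_days
avg_mb_per_url = volume_tb * 1_000_000 / urls
video_share = videos / urls

print(f"{crawl_days} days, about {urls_per_day:,.0f} URLs per day")
print(f"about {avg_mb_per_url:.1f} MB per URL, {video_share:.1%} of URLs are videos")
```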
Best regards,
The BnF digital legal deposit team