[Netarchivesuite-curator] BnF NAS update for October

peter.stirling at bnf.fr peter.stirling at bnf.fr
Tue Oct 9 11:04:15 CEST 2018

Hello all,

After the two workshops on crawling YouTube (covered in our June update), 
we were able to launch a production crawl in July using the process 
previously outlined. This first crawl lasted 20 days. The curators 
selected 42 channels and we crawled all the videos from these channels: 
28 063 videos, with the exception of 10 videos that had been removed and 
one video excluded by our filters. The crawl represents 1.8 TB and more 
than 3 000 hours of video. A second crawl is planned for November.

We have also finished work on giving access to these videos, as well as 
those crawled during the elections last year. To replay the videos within 
YouTube pages, we built on the system already used for Dailymotion. A 
specific rule is applied to pages for which videos have been collected, 
allowing us to replace the YouTube player with another called FLV Player, 
which is present in our archives. We use the metadata collected during the 
crawl to establish the link between the web page and the correct video 
file. As the page listing all the videos on a channel is not fully 
collected by Heritrix, we created pages within our access application with 
the full list of videos collected for each channel, and inserted a button 
within the YouTube page to link to this list. Finally, we created a 
"guided tour", similar to the one that already exists for news sites, with 
a list of all the YouTube channels collected. This is also based on the 
metadata, with additional descriptions added by curators.
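
As a rough illustration of how the crawl metadata can link a page to its 
video file, here is a minimal sketch. It is not the code of our access 
application: the metadata layout, the file paths, the "/channels/" list URL 
and the function names are all assumptions made for the example.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical crawl metadata: video id -> archived file and channel.
# The real metadata format is not described here; this layout is assumed.
VIDEO_METADATA = {
    "abc123": {"file": "videos/abc123.mp4", "channel": "example-channel"},
}


def video_id_from_url(page_url):
    """Return the `v` query parameter of a YouTube watch URL, or None."""
    parsed = urlparse(page_url)
    if parsed.path != "/watch":
        return None
    values = parse_qs(parsed.query).get("v")
    return values[0] if values else None


def replay_plan(page_url, metadata=VIDEO_METADATA):
    """Decide whether the player on this page should be replaced.

    The rule only applies to pages whose video was actually collected.
    """
    video_id = video_id_from_url(page_url)
    record = metadata.get(video_id) if video_id else None
    if record is None:
        return {"replace_player": False}
    return {
        "replace_player": True,
        "video_file": record["file"],
        # Assumed URL of the per-channel list page in the access application.
        "channel_list": "/channels/" + record["channel"],
    }
```

A page whose video id is missing from the metadata is left untouched, which 
mirrors the idea that the rule applies only to pages for which a video has 
been collected.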

In other news, we have just started our broad crawl for 2018. It will be 
the biggest broad crawl we have performed so far, with a budget of 110 TB 
and 4.7 million domains in the seed list. The budget per domain is 2 500 
URLs (compared to 1 500 URLs last year). During this crawl, the total size 
of the BnF web archives is expected to exceed 1 petabyte.
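
To put these figures in perspective, the short calculation below derives 
two rough bounds from the numbers above. It assumes decimal units (1 TB = 
1 000 000 MB) and that every domain uses its full URL budget, which will 
not be the case in practice; the variable names are purely illustrative.

```python
DOMAINS = 4_700_000          # domains in the seed list
URLS_PER_DOMAIN = 2_500      # budget per domain this year
URLS_PER_DOMAIN_2017 = 1_500 # budget per domain last year
BUDGET_TB = 110              # overall crawl budget

# Upper bound on URLs if every domain reached its budget.
max_urls = DOMAINS * URLS_PER_DOMAIN

# Average volume available per domain, in MB (decimal units assumed).
avg_mb_per_domain = BUDGET_TB * 1_000_000 / DOMAINS

# Relative increase of the per-domain URL budget over last year.
budget_increase = URLS_PER_DOMAIN / URLS_PER_DOMAIN_2017

print(max_urls, round(avg_mb_per_domain, 1), round(budget_increase, 2))
```

These are ceilings and averages only; many domains are far smaller than 
their budget allows.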

Best regards,
The BnF digital legal deposit team
