[Netarchivesuite-curator] BnF NAS update for June

Tue Jun 12 17:39:37 CEST 2018

Hello all,

In March and May, we studied and tested tools, and finally designed a 
process for crawling YouTube videos that is integrated into NetarchiveSuite 
and Heritrix. We ran a "large"-scale test and managed to crawl 21 channels, 
including the navigation tabs, the video pages and the video files, within 
the same job. The scope was about 18 000 videos for a total duration of 
97 hours. It took around 1 hour to crawl the navigation tabs and 13 days 
to crawl the videos. 

We created a new harvest template set up to crawl only the specific pages 
of the channels that we want (white list). We used youtube-dl to collect 
metadata about the channels and structure it in a JSON file, and then 
extracted the video page URLs. In the first phase, the job crawls only the 
channel pages. It then focuses on the video pages and the associated video 
files, thanks to a BeanShell script which uses youtube-dl to extract the 
video file URL and adds it to the Heritrix frontier. Because the video file 
URL is valid for only 6 hours, we split the process into groups of 50 
videos. To document the process, we keep in the job's WARC metadata file 
the initial metadata file, the video page seed list and a video report 
which matches each video page URL with the corresponding video file URL. 
These files are used for crawl, access and preservation purposes. The 
attached diagram shows the workflow.
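
As a rough illustration of the batching step described above, here is a 
minimal Python sketch. It is not the BeanShell script used in the job. It 
assumes the channel metadata looks like a youtube-dl playlist dump (a dict 
with an "entries" list whose items carry a "webpage_url" field); the actual 
layout of our JSON file may differ, and the function names are hypothetical.

```python
from typing import Iterator, List

BATCH_SIZE = 50  # group size mentioned in the message


def video_page_urls(channel_metadata: dict) -> List[str]:
    """Collect video page URLs from one channel's metadata.

    Assumption: the dict has an "entries" list and each entry has a
    "webpage_url" key, as in a youtube-dl playlist JSON dump.
    """
    urls = []
    for entry in channel_metadata.get("entries") or []:
        # Skip missing entries and entries without a page URL.
        if entry and entry.get("webpage_url"):
            urls.append(entry["webpage_url"])
    return urls


def batches(urls: List[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = {
        "entries": [
            {"webpage_url": "https://www.youtube.com/watch?v=id%d" % i}
            for i in range(120)
        ]
    }
    for number, group in enumerate(batches(video_page_urls(sample)), 1):
        print("batch %d: %d video pages" % (number, len(group)))
```

Keeping each group small means the video file URLs extracted for a group 
can be fetched before they expire; the right group size depends on how 
fast the files download.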

We are also working on how to access these videos with OpenWayback, both 
to allow us to replay the videos within the web pages, and to provide a 
way for users to access a list of all the channels and videos that have 
been collected. This work is underway.

Finally, we have also been working on the preservation of video crawls in 
SPAR (the BnF digital repository). In the first instance this concerns the 
results of the crawl of election videos on YouTube that was performed by 
Internet Memory Research last year. These crawls do not match the existing 
preservation workflow, as they did not use NAS and they include metadata 
produced by the YouTube API in the form of JSON files (archived as WARC 
records). The data model is being defined to document the role of these 
files in the crawl process. We are also taking into account the work on 
video crawls described above: the approach used is different again, but 
these crawls will join the existing preservation workflow for NAS crawls.

Best regards,
The BnF digital legal deposit team

-------------- next part --------------
A non-text attachment was scrubbed...
Name: BnF-ADM-2018-056979-01 (p2).pptx
Type: application/octet-stream
Size: 528591 bytes
Desc: not available
URL: <https://ml.sbforge.org/pipermail/netarchivesuite-curator/attachments/20180612/8c14be51/attachment-0001.obj>