<font size=2 face="sans-serif">Hello all,</font><br><br><font size=2 face="sans-serif">In March and May, we studied, tested
tools and finally designed a process for crawling Youtube videos which
is integrated into NetarchiveSuite and Heritrix. We ran a "large"
scale test and finally managed to crawl 21 channels with the navigation
tabs, the videos pages and the videos files within the same job. The scope
was about 18 000 videos for a total duration of 97 hours. It took around
1 hour to crawl the navigation tabs and 13 days to crawl the videos. </font><br><br><font size=2 face="sans-serif">We created a new harvest template set
up to crawl only the specific pages of the channels that we want (white
list). We used youtube-dl to collect metadata about the channels and structure
it in a json file, and then extracted the video page URLs. In the first
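
As an illustration, channel metadata can be dumped with youtube-dl's playlist options; the command below is a sketch, not our exact production invocation, and the channel URL is a placeholder:

  youtube-dl --dump-single-json --flat-playlist "https://www.youtube.com/channel/CHANNEL_ID" > channel-metadata.json

Each entry in the resulting JSON file describes one video, from which the video page URL (https://www.youtube.com/watch?v=...) can be derived.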
In the first phase, the job crawls only the channel pages; it then focuses on the video pages and the associated video files, thanks to a BeanShell script that uses youtube-dl to extract each video file URL and add it to the Heritrix frontier.
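
A minimal sketch of such a script, as it could be run from the Heritrix 3 scripting console, is shown below; the youtube-dl invocation, the "frontier" bean lookup, the placeholder URL, and the absence of error handling are simplifying assumptions, not our production code:

  // BeanShell sketch: resolve the direct video file URL with youtube-dl
  // and schedule it in the frontier of the running crawl.
  import org.archive.modules.CrawlURI;
  import org.archive.net.UURIFactory;

  videoPageUrl = "https://www.youtube.com/watch?v=VIDEO_ID"; // placeholder

  // "youtube-dl -g" prints the direct, time-limited video file URL
  proc = Runtime.getRuntime().exec(new String[] { "youtube-dl", "-g", videoPageUrl });
  reader = new java.io.BufferedReader(new java.io.InputStreamReader(proc.getInputStream()));
  videoFileUrl = reader.readLine();

  // schedule the resolved URL; forceFetch bypasses the already-seen check
  curi = new CrawlURI(UURIFactory.getInstance(videoFileUrl));
  curi.setForceFetch(true);
  appCtx.getBean("frontier").schedule(curi);
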
Because the video file URL is valid for only 6 hours, we split the process into batches of 50 videos. To document the process, we keep in the job's WARC metadata file the initial metadata file, the video page seed list, and a videos report that matches each video page URL to the corresponding video file URL.
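
Purely as an illustration of that matching (the actual report layout and the URLs below are invented), one line of the report could pair:

  https://www.youtube.com/watch?v=VIDEO_ID -> https://r4---sn-xxxx.googlevideo.com/videoplayback?expire=...
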
These files are used for crawl, access, and preservation purposes. The attached diagram shows the workflow.

We are also working on how to access
these videos with OpenWayback, both to allow us to replay the videos within
the web pages, and to provide a way for users to access a list of all the
channels and videos that have been collected. This work is underway.

Finally, we have also been working on
the preservation of video crawls in SPAR (the BnF digital repository).
In the first instance, this concerns the results of the crawl of election videos on YouTube that was performed by Internet Memory Research last year. These crawls do not match the existing preservation workflow, as they did not use NAS, and they include metadata produced by the YouTube API in the form of JSON files (archived as WARC records).
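
For reference, such a record could look like the following sketch, based on the WARC 1.0 format; the record type, target URI, date, identifiers, and payload are placeholders, not an excerpt from the actual crawl:

  WARC/1.0
  WARC-Type: resource
  WARC-Target-URI: https://www.googleapis.com/youtube/v3/videos?id=VIDEO_ID
  WARC-Date: 2017-06-01T00:00:00Z
  WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>
  Content-Type: application/json
  Content-Length: 53

  {"kind": "youtube#videoListResponse", "items": [...]}
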
The data model is being defined to document the role of these files in the crawl process. We are also taking into account the work that has been done on video crawls as described above, since that approach is different again; however, these crawls will join the existing preservation workflow for NAS crawls.

Best regards,
The BnF digital legal deposit team