<font size=2 face="sans-serif">Hello all,</font><br><br><font size=2 face="sans-serif">In March and May, we studied, tested
tools and finally designed a process for crawling Youtube videos which
is integrated into NetarchiveSuite and Heritrix. We ran a "large"
scale test and finally managed to crawl 21 channels with the navigation
tabs, the videos pages and the videos files within the same job. The scope
was about 18 000 videos for a total duration of 97 hours. It took around
1 hour to crawl the navigation tabs and 13 days to crawl the videos. </font><br><br><font size=2 face="sans-serif">We created a new harvest template set
up to crawl only the specific pages of the channels that we want (white
list). We used youtube-dl to collect metadata about the channels and structure
it in a json file, and then extracted the video page URLs. In the first
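
As an illustration, channel metadata can be dumped with youtube-dl's playlist options; the command below is a sketch, not our exact production invocation, and the channel URL is a placeholder:

  youtube-dl --dump-single-json --flat-playlist "https://www.youtube.com/channel/CHANNEL_ID" > channel-metadata.json

Each entry in the resulting JSON file describes one video, from which the video page URL (https://www.youtube.com/watch?v=...) can be derived.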
In the first phase, the job crawls only the channel pages; it then focuses on the video pages and the associated video files, thanks to a BeanShell script that uses youtube-dl to extract each video file URL and add it to the Heritrix frontier.
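
A minimal sketch of such a script, as it could be run from the Heritrix 3 scripting console, is shown below; the youtube-dl invocation, the "frontier" bean lookup, the placeholder URL, and the absence of error handling are simplifying assumptions, not our production code:

  // BeanShell sketch: resolve the direct video file URL with youtube-dl
  // and schedule it in the frontier of the running crawl.
  import org.archive.modules.CrawlURI;
  import org.archive.net.UURIFactory;

  videoPageUrl = "https://www.youtube.com/watch?v=VIDEO_ID"; // placeholder

  // "youtube-dl -g" prints the direct, time-limited video file URL
  proc = Runtime.getRuntime().exec(new String[] { "youtube-dl", "-g", videoPageUrl });
  reader = new java.io.BufferedReader(new java.io.InputStreamReader(proc.getInputStream()));
  videoFileUrl = reader.readLine();

  // schedule the resolved URL; forceFetch bypasses the already-seen check
  curi = new CrawlURI(UURIFactory.getInstance(videoFileUrl));
  curi.setForceFetch(true);
  appCtx.getBean("frontier").schedule(curi);
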
Because the video file URL is valid for only 6 hours, we split the process into batches of 50 videos. To document the process, we keep in the job's WARC metadata file the initial metadata file, the video page seed list, and a videos report that matches each video page URL to the corresponding video file URL.
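
Purely as an illustration of that matching (the actual report layout and the URLs below are invented), one line of the report could pair:

  https://www.youtube.com/watch?v=VIDEO_ID -> https://r4---sn-xxxx.googlevideo.com/videoplayback?expire=...
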
These files are used for crawl, access, and preservation purposes. The attached diagram shows the workflow.

We are also working on how to access
these videos with OpenWayback, both to allow us to replay the videos within
the web pages, and to provide a way for users to access a list of all the
channels and videos that have been collected. This work is underway.

Finally, we have also been working on
the preservation of video crawls in SPAR (the BnF digital repository).
In the first instance, this concerns the results of the crawl of election videos on YouTube that was performed by Internet Memory Research last year. These crawls do not match the existing preservation workflow, as they did not use NAS, and they include metadata produced by the YouTube API in the form of JSON files (archived as WARC records).
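
For reference, such a record could look like the following sketch, based on the WARC 1.0 format; the record type, target URI, date, identifiers, and payload are placeholders, not an excerpt from the actual crawl:

  WARC/1.0
  WARC-Type: resource
  WARC-Target-URI: https://www.googleapis.com/youtube/v3/videos?id=VIDEO_ID
  WARC-Date: 2017-06-01T00:00:00Z
  WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>
  Content-Type: application/json
  Content-Length: 53

  {"kind": "youtube#videoListResponse", "items": [...]}
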
The data model is being defined to document the role of these files in the crawl process. We are also taking into account the work that has been done on video crawls as described above, since that approach is different again; however, these crawls will join the existing preservation workflow for NAS crawls.

Best regards,
The BnF digital legal deposit team