[Netarchivesuite-curator] BnF NAS update for June
peter.stirling at bnf.fr
Tue Jun 12 17:39:37 CEST 2018
Hello all,
In March and May, we studied and tested tools, and then designed a process
for crawling YouTube videos that is integrated into NetarchiveSuite and
Heritrix. We ran a "large"-scale test and managed to crawl 21 channels,
including the navigation tabs, the video pages and the video files, within
the same job. The scope was about 18 000 videos for a total duration of
97 hours. It took around 1 hour to crawl the navigation tabs and 13 days
to crawl the videos.
We created a new harvest template set up to crawl only the specific channel
pages that we want (whitelist). We used youtube-dl to collect metadata
about the channels and structure it in a JSON file, and then extracted the
video page URLs. In the first phase, the job crawls only the channel
pages. It then focuses on the video pages and the associated video files,
using a BeanShell script which calls youtube-dl to extract the video file
URL and adds it to the Heritrix frontier. Because the video file URL is
valid for only 6 hours, we split the process into groups of 50 videos. To
document the process, we keep in the job's WARC metadata file the initial
metadata file, the video page seed list and a video report which maps each
video page URL to the corresponding video file URL. These files are used
for crawl, access and preservation purposes. The attached diagram shows
the workflow.
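
As a rough illustration of the grouping step only, the following Python
sketch splits a seed list of video page URLs into groups of 50 and checks
whether a resolved video file URL is still inside the 6-hour window. It is
not the BeanShell script used in the job: the function names, the sample
URLs and the way the timestamp is handled are all assumptions made for the
example, and the actual calls to youtube-dl and to the Heritrix frontier
are left out.

```python
from datetime import datetime, timedelta
from typing import Iterator, List, Sequence

BATCH_SIZE = 50                      # group size stated in the message
URL_VALIDITY = timedelta(hours=6)    # validity window stated in the message


def batches(urls: Sequence[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` video page URLs."""
    for start in range(0, len(urls), size):
        yield list(urls[start:start + size])


def still_valid(resolved_at: datetime, now: datetime) -> bool:
    """Return True while a video file URL resolved at `resolved_at` is usable."""
    return now - resolved_at < URL_VALIDITY


# Hypothetical seed list: 120 made-up video page URLs.
pages = [f"https://www.youtube.com/watch?v=video{i:03d}" for i in range(120)]
groups = list(batches(pages))
print([len(g) for g in groups])  # prints [50, 50, 20]
```

With the figure of about 18 000 videos given above, this grouping would
produce roughly 360 groups of 50.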
We are also working on how to access these videos with OpenWayback, both
to allow us to replay the videos within the web pages, and to provide a
way for users to access a list of all the channels and videos that have
been collected. This work is underway.
Finally, we have also been working on the preservation of video crawls in
SPAR (the BnF digital repository). In the first instance this concerns the
results of the crawl of election videos on YouTube that was performed by
Internet Memory Research last year. These crawls do not match the existing
preservation workflow, as they did not use NAS and they include metadata
produced by the YouTube API in the form of JSON files (archived as WARC
records). The data model is being defined to document the role of these
files in the crawl process. We are also taking into account the work on
video crawls described above: the approach used there is different again,
but those crawls will join the existing preservation workflow for NAS
crawls.
Best regards,
The BnF digital legal deposit team
-------------- next part --------------
A non-text attachment was scrubbed...
Name: BnF-ADM-2018-056979-01 (p2).pptx
Type: application/octet-stream
Size: 528591 bytes
Desc: not available
URL: <https://ml.sbforge.org/pipermail/netarchivesuite-curator/attachments/20180612/8c14be51/attachment-0001.obj>