[Netarchivesuite-users] video content
Kåre Fiedler Christiansen
kfc at statsbiblioteket.dk
Fri Apr 24 15:00:47 CEST 2009
On Thu, 2009-04-09 at 12:24 +0200, Dariusz Paradowski wrote:
> Does NetArchiveSuite archives video content such as video news or youtube
> Thank you
Hi. Sorry about the late reply.
The answer is a little complex. The short version is that videos are not
harvested "out of the box" for most sites, but you can configure the
system so that they are.
The actual handling of the harvesting in NetarchiveSuite is done by
Heritrix, see http://crawler.archive.org. Heritrix, with its default
configuration, will usually not harvest videos. It is possible to
configure it to do so, though.
You can control Heritrix in a quite fine-grained manner. In this case,
the difficulty in harvesting the video files is that the actual URL of
the video is usually not written anywhere in the HTML page in which the
video is embedded. Thus, when Heritrix is harvesting the page, it will
not be able to know which URL to get the video file from.
However, it is usually possible to calculate the URL of the video file
from the URLs Heritrix actually has access to. You can then configure
Heritrix to harvest the video files, by automatically adding these
calculated URLs to the crawl.
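To make the idea of calculating a video URL a little more concrete, here
is a minimal sketch in plain Java. It is not the contributed script and it
does not use any Heritrix classes. The class name VideoUrlSketch, the
watch-page pattern and the media URL template (video.example.org) are all
made up for illustration; the real rule depends on how the site in
question builds its video URLs.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VideoUrlSketch {
    // Assumed shape of a page URL that carries the video id in a "v" parameter.
    private static final Pattern WATCH_PAGE =
        Pattern.compile("^https?://[^/]+/watch\\?(?:.*&)?v=([A-Za-z0-9_-]+)");

    // Hypothetical template for the media file URL; not a real endpoint.
    private static final String MEDIA_TEMPLATE =
        "http://video.example.org/media?id=%s";

    // Returns the derived media URL, or empty if the page URL does not match.
    static Optional<String> mediaUrlFor(String pageUrl) {
        Matcher m = WATCH_PAGE.matcher(pageUrl);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(String.format(MEDIA_TEMPLATE, m.group(1)));
    }

    public static void main(String[] args) {
        // A page URL with a video id yields a candidate media URL.
        System.out.println(mediaUrlFor("http://www.youtube.com/watch?v=abc123"));
        // A page URL without a video id yields nothing to add.
        System.out.println(mediaUrlFor("http://www.youtube.com/about"));
    }
}
```

In a real setup the derived URL would not be printed but handed to the
crawler as an extra link to fetch, which is the job of the script.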
On the page:
there is a contributed piece of code that automatically calculates these
URLs where YouTube is involved.
(Note that the code on the page is for Heritrix 2 - NetarchiveSuite
still uses the Heritrix 1 branch, which has slightly different names of
some classes and methods. I have attached a file with an updated
version.)
In this specific case, you would need to add the following to the
order.xml-file, under 'extract-processors':
(You would need to replace '/tmp/youtube.script' with wherever you place
the script.)
To set up NetarchiveSuite to use Heritrix with this configuration you
need to add an order.xml template with this behaviour. You can read more
about order.xml templates in NetarchiveSuite here:
I hope this helps.
Kaare Fiedler Christiansen - NetarchiveSuite developer
THE STATE AND UNIVERSITY LIBRARY,
Universitetsparken 1, 8000 Aarhus C, Denmark.
Phone: +45 89462036
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 5773 bytes
Desc: not available