[Netarchivesuite-users] video content

Kåre Fiedler Christiansen kfc at statsbiblioteket.dk
Fri Apr 24 15:00:47 CEST 2009


On Thu, 2009-04-09 at 12:24 +0200, Dariusz Paradowski wrote:
> Hi
> Does NetArchiveSuite archives video content such as video news or youtube 
> films?
> Thank you

Hi. Sorry about the late reply.

The answer is a little complex. The short version is that videos are not
harvested "out of the box" for most sites, but you can configure the
system so that they are.


The actual harvesting in NetarchiveSuite is done by Heritrix, see
http://crawler.archive.org. With its default configuration, Heritrix
will usually not harvest videos. It is possible to configure it to do
so, though.


You can control Heritrix in a quite fine-grained manner. In this case,
the difficulty in harvesting the video files is that the actual URL of
the video is usually not written anywhere in the HTML page it is
embedded in. Rather, it is generated by JavaScript or Flash before the
video is streamed to the end user. Thus, when Heritrix harvests the
page, it cannot know which URL to get the video file from.

However, it is usually possible to calculate the URL of the video file
from the URLs Heritrix actually has access to. You can then configure
Heritrix to harvest the video files, by automatically adding these
calculated URLs.
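
To illustrate what "calculating" a URL means here, below is a minimal
sketch in Python. It is not the attached script and it is not how
YouTube lays out its URLs: the site video.example.org, the paths
/watch and /get_video, and the function names are all invented for the
example. It only shows the general idea of turning a page URL the
crawler has already seen into a candidate video URL to queue.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

# Hypothetical URL layout, invented for this sketch:
#   page:  http://video.example.org/watch?v=<id>
#   file:  http://video.example.org/get_video?video_id=<id>
WATCH_PATH = "/watch"
VIDEO_PATH = "/get_video"


def derive_video_url(page_url):
    """Return the guessed video-file URL for a watch page, or None."""
    parts = urlsplit(page_url)
    if parts.path != WATCH_PATH:
        return None  # not a page this sketch knows how to handle
    ids = parse_qs(parts.query).get("v")
    if not ids:
        return None  # no video id in the query string
    query = urlencode({"video_id": ids[0]})
    return f"{parts.scheme}://{parts.netloc}{VIDEO_PATH}?{query}"


def extra_urls(harvested_urls):
    """Collect the derived URLs a crawler could add to its queue."""
    derived = (derive_video_url(u) for u in harvested_urls)
    return [u for u in derived if u is not None]
```

In a real setup the equivalent logic runs inside Heritrix for each
harvested page, and the derived URL is added to the crawl rather than
returned in a list.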

On the page
http://webteam.archive.org/confluence/display/Heritrix/BeanShell+Script+For+Downloading+Video
there is a contributed piece of code that automatically calculates these
URLs where YouTube is involved.
(Note that the code on the page is for Heritrix 2. NetarchiveSuite
still uses the Heritrix 1 branch, which has slightly different names for
some classes and methods. I have attached a file with an updated
(downdated?) version.)

In this specific case, you would need to add the following to the
order.xml file, under 'extract-processors':

<newObject name="BeanShellProcessor" 
           class="org.archive.crawler.processor.BeanShellProcessor">
  <boolean name="enabled">true</boolean>
  <newObject name="BeanShellProcessor#decide-rules" 
             class="org.archive.crawler.deciderules.DecideRuleSequence">
    <map name="rules"/>
  </newObject>
  <string name="script-file">/tmp/youtube.script</string>
  <boolean name="isolate-threads">true</boolean>
</newObject>

(You would need to replace '/tmp/youtube.script' with the path where you
place the script.)



To set up NetarchiveSuite to use Heritrix with this configuration, you
need to add an order.xml template with this behaviour. You can read more
about order.xml templates in NetarchiveSuite here:
http://netarchive.dk/suite/Installation_Manual#Include_Installation_Manual/AppendixD
and here
http://netarchive.dk/suite/User_Manual#top_Include_User_Manual/Harvester_Templates


I hope this helps.

Best,
  Kåre
-- 
Kaare Fiedler Christiansen - NetarchiveSuite developer
THE STATE AND UNIVERSITY LIBRARY, 
Universitetsparken 1, 8000 Aarhus C, Denmark.
Phone: +45 89462036
-------------- next part --------------
A non-text attachment was scrubbed...
Name: youtube.script
Type: text/x-csrc
Size: 5773 bytes
Desc: not available
URL: <http://ml.sbforge.org/pipermail/netarchivesuite-users/attachments/20090424/86a61f33/attachment-0002.bin>

