[Netarchivesuite-users] Running Heritrix orders...
mgiles at visionstudio.com
Thu Oct 2 18:26:12 CEST 2008
I am currently doing periodic, very directed crawls of specific sites
(currently only 10 sites, eventually 50 or 100) with Heritrix (each site
has its own order.xml). NetArchive seems like a great way to
coordinate and schedule those crawls. I already have functioning
order.xml files for each site. Instead of archiving in WARC or ARC,
they just write out the relevant HTML pages into a mirror directory (the
HTML gets parsed by another system after the crawl completes).
My questions are:
* Can I just load these order.xml files into NetArchive (I've gotten
used to building and testing them through Heritrix)?
* Will NetArchive let me do the mirroring (instead of archiving)?
* And finally, since I'm downloading a relatively small number of files
(e.g. 20-50k per site) over a fixed number of sites, would it make sense
to run everything on a single machine?
Thanks! I've been trying to figure out how to get the whole distributed
crawl stuff to work in Heritrix (with CrawlJob JAR files) but there is
virtually no documentation.