[Netarchivesuite-users] Running Heritrix orders...

Michael Giles mgiles at visionstudio.com
Thu Oct 2 18:26:12 CEST 2008

I am currently doing periodic, very directed crawls of specific sites 
(only 10 sites for now, eventually 50 or 100) with Heritrix (each site 
has its own order.xml).  NetArchive seems like a great way to 
coordinate and schedule those crawls.  I already have functioning 
order.xml files for each site.  Instead of archiving in WARC or ARC, 
the crawls just write the relevant HTML pages into a mirror directory 
(the HTML gets parsed by another system after the crawl completes).

My questions are:
 * Can I just load these order.xml files into NetArchive (I've gotten 
used to building and testing them through Heritrix)?
 * Will NetArchive let me do the mirroring (instead of archiving)?
 * And finally, since I'm downloading a relatively small number of files 
(e.g. 20-50k per site) over a fixed number of sites, would it make sense 
to run everything on a single machine?

Thanks!  I've been trying to figure out how to get distributed crawling 
to work in Heritrix (with CrawlJob JAR files), but there is virtually 
no documentation.


