[Netarchivesuite-users] Running Heritrix orders...

Kaare Fiedler Christiansen kfc at statsbiblioteket.dk
Fri Oct 3 10:45:55 CEST 2008


On Thu, 2008-10-02 at 18:26 +0200, Michael Giles wrote:
> I am currently doing periodic, very directed crawls of specific sites
> (currently only 10 sites, eventually 50 or 100) with Heritrix (each site
> has its own order.xml).  NetArchive seems like a great way to
> coordinate and schedule those crawls.  I already have functioning
> order.xml files for each site.  Instead of archiving in WARC or ARC,
> they just write out the relevant HTML pages into a mirror directory (the
> HTML gets parsed by another system after the crawl completes).
> 
> My questions are:
>  * Can I just load these order.xml files into NetArchive (I've gotten
> used to building and testing them through Heritrix)?

Yes, you can. NetarchiveSuite will treat them as templates, though, and
fill out some values. This means that certain parts of the order.xml
must look a specific way.
You can read more here:
http://netarchive.dk/suite/Installation_Manual#head-9de8fbd401c20e81141ef5fbfee4228aa0eaa6ee
(Installation manual, Appendix D)

Read on for some issues specific to your use case...

>  * Will NetArchive let me do the mirroring (instead of archiving)?

Well, no, not as is.
Currently, the system expects the writer to generate ARC files.

I did a little testing, and it is possible to use a
MirrorWriterProcessor; HOWEVER, you also need an ARCWriterProcessor in
the order.xml, which you can then set to "disabled".
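
For illustration, the write-processors chain in the order.xml could look
roughly like this. The element layout follows Heritrix 1.x conventions, and
the "path" attribute on the mirror writer is an assumption; check it against
your own working templates:

```xml
<!-- Sketch of a write-processors chain: mirror writer enabled,
     ARC writer present but disabled (NetarchiveSuite expects it
     to exist in the template). -->
<map name="write-processors">
  <!-- Writes fetched pages into a directory tree on disk. -->
  <newObject name="MirrorWriter"
             class="org.archive.crawler.writer.MirrorWriterProcessor">
    <boolean name="enabled">true</boolean>
    <string name="path">mirror</string>
  </newObject>
  <!-- Required by NetarchiveSuite, but disabled so no ARC files
       are written for the harvested pages themselves. -->
  <newObject name="Archiver"
             class="org.archive.crawler.writer.ARCWriterProcessor">
    <boolean name="enabled">false</boolean>
  </newObject>
</map>
```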

This will make harvesting and scheduling work. Since it is not the
intended usage, though, some strange things will happen.

The final cycle of the HarvestControllers is to pack up metadata in an
ARC file, and upload all ARC files and metadata to the archive. Any
left-over data will be moved to a directory called "oldjobs".

My test shows that this will still happen, but there will of course only
be one ARC file: the packaged metadata. The rest of the data, including
the mirrored data, will be moved to oldjobs.

Viewing the data with our access module won't work, but it seems this is
not your use case anyway.

This is certainly not optimal, but if you play around with it a bit, you
can perhaps give us feedback on your requirements, and we can work with
you to see if we can adapt the system.

>  * And finally, since I'm downloading a relatively small number of files
> (e.g. 20-50k per site) over a fixed number of sites would it make sense
> to run everything on a single machine?

Yes, I definitely think so.

In your case, you only need the harvesting component of NetarchiveSuite.

You will have to update the settings to:
 * Use a different plugin for "arcrepositoryClient"
 * Use a different plugin for "indexClient"
 * Remove some SiteSections from the GUI.

For the first, change the setting 
"settings.common.arcrepositoryClient.class" to use the class
dk.netarkivet.archive.arcrepository.distribute.LocalArcRepository

This will disable the distributed archive.
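
Assuming your settings file follows the usual layout, where the dotted
setting name maps onto nested XML elements, the change would look roughly
like this:

```xml
<!-- Sketch: settings.common.arcrepositoryClient.class expressed as
     nested elements; merge into your existing settings file. -->
<settings>
  <common>
    <arcrepositoryClient>
      <class>dk.netarkivet.archive.arcrepository.distribute.LocalArcRepository</class>
    </arcrepositoryClient>
  </common>
</settings>
```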


For the second, change the setting
"settings.common.indexClient.class".
The indexClient is used to generate indexes over previously harvested
material for use to not store duplicates. In your case, you probably
don't want this.
Unfortunately, this is a little poor at the moment. There are only two
implemented plugins for indexClient, and neither works really well for
disabling the archive.

You can implement a class that implements the interface
"dk.netarkivet.common.distribute.indexserver.JobIndexCache", that always
returns an Index with an empty set, and an empty file. Please ask if you
need help with this.
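
As a starting point, such a class could look like the sketch below. To keep
the sketch self-contained it uses stand-in definitions of the interface and
the Index class; in your code, implement the real
dk.netarkivet.common.distribute.indexserver.JobIndexCache instead, and check
the exact method signatures in your version of the suite:

```java
import java.io.File;
import java.io.IOException;
import java.util.Collections;
import java.util.Set;

// Stand-ins for the real NetarchiveSuite types, so this sketch compiles
// on its own. Replace them with the actual JobIndexCache interface and
// Index class from dk.netarkivet.common.distribute.indexserver.
interface JobIndexCache {
    Index getIndex(Set<Long> jobIDs);
}

class Index {
    final File indexFile;       // file holding the (empty) index
    final Set<Long> coveredJobs; // jobs actually covered by the index

    Index(File indexFile, Set<Long> coveredJobs) {
        this.indexFile = indexFile;
        this.coveredJobs = coveredJobs;
    }
}

/** An index cache that always returns an empty file covering no jobs. */
public class EmptyJobIndexCache implements JobIndexCache {
    public Index getIndex(Set<Long> jobIDs) {
        try {
            // An empty file stands in for the index the harvester expects.
            File emptyFile = File.createTempFile("empty-index", ".cache");
            emptyFile.deleteOnExit();
            return new Index(emptyFile, Collections.<Long>emptySet());
        } catch (IOException e) {
            throw new RuntimeException("Could not create empty index file", e);
        }
    }
}
```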


For the third point, simply edit the settings file, so only
"DefinitionsSiteSection", "HistorySiteSection" and "StatusSiteSection"
are available under the setting
"settings.common.webinterface.siteSection".


You can then start applications. You need to start:

 * A JMSBroker
 * One instance of dk.netarkivet.common.webinterface.GUIApplication
 * As many instances of
dk.netarkivet.harvester.harvesting.HarvestControllerApplication as you
need concurrent harvests.

Please make sure that all applications use different settings-files,
where all TCP-ports are changed, so they do not overlap.
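
Startup could then look something like the sketch below. The classpath
setup, the settings-file names, the system property name, and the way you
start the JMS broker all depend on your installation, so treat every detail
here as an assumption to verify:

```shell
# Sketch: one settings file per application, each with non-overlapping
# TCP ports. Classpath and broker start command depend on your install.

# 1. Start the JMS broker first (command depends on your broker setup).

# 2. One GUI application.
java -Ddk.netarkivet.settings.file=conf/settings_gui.xml \
     dk.netarkivet.common.webinterface.GUIApplication &

# 3. Two harvest controllers = up to two concurrent harvests.
java -Ddk.netarkivet.settings.file=conf/settings_hc1.xml \
     dk.netarkivet.harvester.harvesting.HarvestControllerApplication &
java -Ddk.netarkivet.settings.file=conf/settings_hc2.xml \
     dk.netarkivet.harvester.harvesting.HarvestControllerApplication &
```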

> Thanks!  I've been trying to figure out how to get the whole distributed
> crawl stuff to work in Heritrix (with CrawlJob JAR files) but there is
> virtually no documentation.

I hope you can get things running. As mentioned, the system was not
intended for this usage, and some tweaking is probably needed, but we
would be happy to work with you on that.

Please do not hesitate to ask further questions!

Best,
  Kåre Fiedler Christiansen
  NetarchiveSuite developer
-- 
Kaare Fiedler Christiansen <kfc at statsbiblioteket.dk>