[Netarchivesuite-users] Running Heritrix orders...

Søren Vejrup Carlsen svc at kb.dk
Mon Oct 6 14:55:10 CEST 2008


Hi Michael,
A single harvest is defined as one Heritrix job with a single order.xml and a corresponding seed list.
One instance (run) of a selective harvest can result in one or more Heritrix jobs, depending on the domains you have added to the selective harvest.
Each domain in the selective harvest uses a specific order.xml, and the selective harvest will generate multiple Heritrix jobs, where a single Heritrix job handles all the domains that use the same order.xml.
 
Does that answer your question?
Otherwise feel free to ask again.
 
Regards
Søren Vejrup Carlsen (The Royal Library, Copenhagen, DK)
NetarchiveSuite developer

-----Original Message-----
From: netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk [mailto:netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk] On Behalf Of Michael Giles
Sent: Friday, October 03, 2008 8:49 PM
To: netarchivesuite-users at lists.gforge.statsbiblioteket.dk
Subject: Re: [Netarchivesuite-users] Running Heritrix orders...


Wow!  Thank you so much for such a thorough response.  I will try these steps out and let you know the results.  The only immediate question I have from the email is with regard to this statement:

    As many instances of
    dk.netarkivet.harvester.harvesting.HarvestControllerApplication as you
    need concurrent harvests.

What exactly is entailed by a single "harvest"?  As stated, each website in my crawl has a unique set of rules (in its order.xml).  Do I need a separate instance of the HarvestController for each site (hopefully not)?  Or can I configure a single "harvest" that lists a bunch of sites (each with their own order.xml) and have them run in parallel?  I'm assuming it is the latter, using the Selective Harvest configuration (http://netarchive.dk/suite/User_Manual#head-9ba4e7b8eed0422b8f8931992ce47ad73e9c84cc), but just wanted to make sure.

It will all probably be more obvious once I get down into the work of running it.  Thanks again, and have a great weekend!

-Mike

Kaare Fiedler Christiansen wrote: 

On Thu, 2008-10-02 at 18:26 +0200, Michael Giles wrote:

I am currently doing periodic, very directed crawls of specific sites
(currently only 10 sites, eventually 50 or 100) with Heritrix (each site
has its own order.xml).  NetArchive seems like a great way to
coordinate and schedule those crawls.  I already have functioning
order.xml files for each site.  Instead of archiving in WARC or ARC,
they just write out the relevant HTML pages into a mirror directory (the
HTML gets parsed by another system after the crawl completes).

My questions are:
 * Can I just load these order.xml files into NetArchive (I've gotten
used to building and testing them through Heritrix)?

Yes, you can. NetarchiveSuite will treat them as templates, though, and
fill out some values. This means that certain parts of the order.xml
must be structured in specific ways.
You can read more here:
http://netarchive.dk/suite/Installation_Manual#head-9de8fbd401c20e81141ef5fbfee4228aa0eaa6ee
(Installation manual, Appendix D)

Read on for some trouble spots for your use case...

 * Will NetArchive let me do the mirroring (instead of archiving)?

Well, no, not as is.
Currently, it is expected that the writer will generate ARC files.

I did a little testing, and it is possible to use a
MirrorWriterProcessor; HOWEVER, you also need to have an
ARCWriterProcessor in the order.xml, which you can then set to "disabled".
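
To make that concrete, here is a rough sketch of what the write-processors
part of such an order.xml could look like. The class and attribute names
follow the stock Heritrix 1.x writer processors as far as I recall them, so
please verify against one of your own working order.xml files:

    <map name="write-processors">
      <!-- Keep an ARC writer present so template processing and
           scheduling are satisfied, but switch it off. -->
      <newObject name="Archiver"
                 class="org.archive.crawler.writer.ARCWriterProcessor">
        <boolean name="enabled">false</boolean>
      </newObject>
      <!-- The mirror writer does the actual work, writing each page
           into a directory tree under "path". -->
      <newObject name="Mirror"
                 class="org.archive.crawler.writer.MirrorWriterProcessor">
        <boolean name="enabled">true</boolean>
        <string name="path">mirror</string>
      </newObject>
    </map>
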
This will make harvesting and scheduling work. Since it is not the
intended usage, some strange things will happen, though.

The final cycle of the HarvestControllers is to pack up metadata in an
ARC file and upload all ARC files and metadata to the archive. Any
left-over data will be moved to a directory called "oldjobs".

My test shows that this will still happen, but there will of course only
be one ARC file: the packaged metadata. The rest of the data will be
moved to oldjobs, including the mirrored data.

Viewing the data with our access module won't work, but it seems this is
not your use case anyway.

This is certainly not optimal, but if you play around with it a bit, you
can perhaps give us feedback on your requirements, and we can work with
you to see if we can adapt the system.

 * And finally, since I'm downloading a relatively small number of files
(e.g. 20-50k per site) over a fixed number of sites, would it make sense
to run everything on a single machine?

Yes, I definitely think so.

In your case, you only need the harvesting component of NetarchiveSuite.

You will have to update the settings to:
 * Use a different plugin for "arcrepositoryClient"
 * Use a different plugin for "indexClient"
 * Remove some SiteSections from the GUI.

For the first, change the setting
"settings.common.arcrepositoryClient.class" to use the class
dk.netarkivet.archive.arcrepository.distribute.LocalArcRepository

This will disable the distributed archive.
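
In the settings file, assuming the dotted setting name maps onto nested
XML elements in the usual way, that change would look something like this:

    <settings>
      <common>
        <arcrepositoryClient>
          <!-- Local, non-distributed archive client -->
          <class>dk.netarkivet.archive.arcrepository.distribute.LocalArcRepository</class>
        </arcrepositoryClient>
      </common>
    </settings>
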
For the second, change the setting
"settings.common.indexClient.class".
The indexClient is used to generate indexes over previously harvested
material, so that duplicates are not stored again. In your case, you
probably don't want this.
Unfortunately, this part is a little poor at the moment. There are only
two implemented plugins for indexClient, and neither works really well
with the archive disabled.

You can implement a class that implements the interface
"dk.netarkivet.common.distribute.indexserver.JobIndexCache" and always
returns an Index with an empty set and an empty file. Please ask if you
need help with this.
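
As a starting point, here is a rough sketch of such a class. The single
getIndex method and the Index constructor arguments are assumptions on my
part, so check them against the interface in your NetarchiveSuite version:

    import java.io.File;
    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;

    import dk.netarkivet.common.distribute.indexserver.Index;
    import dk.netarkivet.common.distribute.indexserver.JobIndexCache;

    /**
     * A JobIndexCache that always returns an empty index, effectively
     * disabling deduplication against previously harvested material.
     */
    public class EmptyJobIndexCache implements JobIndexCache {
        public Index<Set<Long>> getIndex(Set<Long> jobIDs) {
            try {
                // An empty temporary file stands in for the Lucene index.
                File emptyFile = File.createTempFile("empty", ".index");
                emptyFile.deleteOnExit();
                // Assumed constructor: Index(indexFile, coveredJobIDs).
                return new Index<Set<Long>>(emptyFile, new HashSet<Long>());
            } catch (IOException e) {
                throw new RuntimeException("Could not create empty index file", e);
            }
        }
    }
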
For the third point, simply edit the settings file so that only
"DefinitionsSiteSection", "HistorySiteSection" and "StatusSiteSection"
are available under the setting
"settings.common.webinterface.siteSection".
You can then start applications. You need to start:

 * A JMSBroker
 * One instance of dk.netarkivet.common.webinterface.GUIApplication
 * As many instances of
   dk.netarkivet.harvester.harvesting.HarvestControllerApplication as you
   need concurrent harvests.

Please make sure that all applications use different settings files,
where all TCP ports are changed so they do not overlap. A rough
command-line sketch follows below.
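
For what it's worth, starting the applications might then look roughly
like this, assuming the standard dk.netarkivet.settings.file system
property is how each application finds its settings (the settings file
names are placeholders):

    # (the JMS broker is started separately, per its own documentation)

    # The GUI, with its own settings file
    java -Ddk.netarkivet.settings.file=conf/settings_gui.xml \
         dk.netarkivet.common.webinterface.GUIApplication

    # One harvest controller per concurrent harvest, each with its own
    # settings file and non-overlapping TCP ports
    java -Ddk.netarkivet.settings.file=conf/settings_harvester1.xml \
         dk.netarkivet.harvester.harvesting.HarvestControllerApplication
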
Thanks!  I've been trying to figure out how to get the whole distributed
crawl stuff to work in Heritrix (with CrawlJob JAR files) but there is
virtually no documentation.

I hope you can get things running. As said, the system was not intended
for this usage, and some tweaking is probably needed, but we would be
happy to work with you on that.

Please do not hesitate to ask further questions!

Best,
  Kåre Fiedler Christiansen
  NetarchiveSuite developer
