[Netarchivesuite-users] Running Heritrix orders...

Michael Giles mgiles at visionstudio.com
Fri Oct 3 20:49:11 CEST 2008

Wow!  Thank you so much for such a thorough response.  I will try these 
steps out and let you know the results.  The only immediate question I 
have from the email is with regards to this statement:

As many instances of
dk.netarkivet.harvester.harvesting.HarvestControllerApplication as you
need concurrent harvests.

What exactly is entailed by a single "harvest"?  As stated, each website 
in my crawl has a unique set of rules (in its order.xml).  Do I need a 
separate instance of the HarvestController for each site (hopefully 
not)?  Or can I configure a single "harvest" that lists a bunch of sites 
(each with their own order.xml) and have them run in parallel?  I'm 
assuming it is the latter, using the Selective Harvest configuration, 
but just wanted to make sure.

It will all probably be more obvious once I get down into the work of 
running it.  Thanks again, and have a great weekend!


Kaare Fiedler Christiansen wrote:
> On Thu, 2008-10-02 at 18:26 +0200, Michael Giles wrote:
>> I am currently doing periodic, very directed crawls of specific sites
>> (currently only 10 sites, eventually 50 or 100) with Heritrix (each site
>> has its own order.xml).  NetArchive seems like a great way to
>> coordinate and schedule those crawls.  I already have functioning
>> order.xml files for each site.  Instead of archiving in WARC or ARC,
>> they just write out the relevant HTML pages into a mirror directory (the
>> HTML gets parsed by another system after the crawl completes).
>> My questions are:
>>  * Can I just load these order.xml files into NetArchive (I've gotten
>> used to building and testing them through Heritrix)?
> Yes, you can. NetarchiveSuite will treat them as templates, though, and
> fill out some values. This means that certain parts of the order.xml
> must look a specific way... 
> You can read more here: 
> http://netarchive.dk/suite/Installation_Manual#head-9de8fbd401c20e81141ef5fbfee4228aa0eaa6ee
> (Installation manual, Appendix D)
> Read on for some issues specific to your use case...
>>  * Will NetArchive let me do the mirroring (instead of archiving)?
> Well, no, not as is.
> Currently, it is expected that the writer will generate ARC files.
> I did a little testing, and it is possible to use a
> MirrorWriterProcessor; HOWEVER, you need to also have an
> ARCWriterProcessor in the order.xml, you can then set it to "disabled".
> This will make harvesting and scheduling work. Since it is not the
> intended usage, some strange things will happen though.
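A sketch of what that combination could look like in the write-processors chain of an order.xml (the element layout and processor names follow common Heritrix 1.x conventions and are assumptions, not taken from a verified NetarchiveSuite template; check against your own files):

```xml
<!-- Sketch only: MirrorWriterProcessor active, ARCWriterProcessor
     present (as NetarchiveSuite expects) but disabled. -->
<map name="write-processors">
  <newObject name="Mirror" class="org.archive.crawler.writer.MirrorWriterProcessor">
    <boolean name="enabled">true</boolean>
  </newObject>
  <newObject name="Archiver" class="org.archive.crawler.writer.ARCWriterProcessor">
    <boolean name="enabled">false</boolean>
  </newObject>
</map>
```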
> The final cycle of the HarvestControllers is to pack up metadata in an
> ARC file, and upload all ARC files and metadata to the archive. Any
> left-over data will be moved to a directory called "oldjobs".
> My test shows that this will still happen, but there will of course only
> be one ARC file: the packaged metadata. The rest of the data, including
> the mirrored data, will be moved to oldjobs.
> Viewing the data with our access module won't work, but it seems this is
> not your use case anyway.
> This is certainly not optimal, but if you play around with it a bit, you
> can perhaps give us feedback on your requirements, and we can work with
> you to see if we can adapt the system.
>>  * And finally, since I'm downloading a relatively small number of files
>> (e.g. 20-50k per site) over a fixed number of sites would it make sense
>> to run everything on a single machine?
> Yes, I definitely think so.
> In your case, you only need the harvesting component of NetarchiveSuite.
> You will have to update the settings to:
>  * Use a different plugin for "arcrepositoryClient"
>  * Use a different plugin for "indexClient"
>  * Remove some SiteSections from the GUI.
> For the first, change the setting 
> "settings.common.arcrepositoryClient.class" to use the class
> dk.netarkivet.archive.arcrepository.distribute.LocalArcRepository
> This will disable the distributed archive.
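A sketch of how that setting change might look in the settings file (the element nesting is assumed from the dotted setting name "settings.common.arcrepositoryClient.class"; only the class name itself comes from the mail, so adjust to your actual settings.xml):

```xml
<settings>
  <common>
    <arcrepositoryClient>
      <!-- disables the distributed archive -->
      <class>dk.netarkivet.archive.arcrepository.distribute.LocalArcRepository</class>
    </arcrepositoryClient>
  </common>
</settings>
```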
> For the second, change the setting
> "settings.common.indexClient.class".
> The indexClient is used to generate indexes over previously harvested
> material, so that duplicates are not stored again. In your case, you
> probably don't want this.
> Unfortunately, it is a little poor at the moment. There are only two
> implemented plugins for indexClient, and neither work really well for
> disabling the archive. 
> You can implement a class that implements the interface
> "dk.netarkivet.common.distribute.indexserver.JobIndexCache", that always
> returns an Index with an empty set, and an empty file. Please ask if you
> need help with this.
> For the third point, simply edit the settings file, so only
> "DefinitionsSiteSection", "HistorySiteSection" and "StatusSiteSection"
> are available under the setting
> "settings.common.webinterface.siteSection".
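A sketch of the trimmed siteSection setting (the element nesting is again assumed from the dotted setting name, and the package names of the SiteSection classes are assumptions; only the three short class names come from the mail, so verify against your install):

```xml
<settings>
  <common>
    <webinterface>
      <!-- only the site sections that make sense without an archive -->
      <siteSection><class>dk.netarkivet.harvester.webinterface.DefinitionsSiteSection</class></siteSection>
      <siteSection><class>dk.netarkivet.harvester.webinterface.HistorySiteSection</class></siteSection>
      <siteSection><class>dk.netarkivet.common.webinterface.StatusSiteSection</class></siteSection>
    </webinterface>
  </common>
</settings>
```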
> You can then start applications. You need to start:
>  * A JMSBroker
>  * One instance of dk.netarkivet.common.webinterface.GUIApplication
>  * As many instances of
> dk.netarkivet.harvester.harvesting.HarvestControllerApplication as you
> need concurrent harvests.
> Please make sure that all applications use different settings-files,
> where all TCP-ports are changed, so they do not overlap.
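As a rough sketch, starting the applications could look like the commands below (the settings-file system property, the file paths, and the broker start step are assumptions; only the two application class names come from the list above):

```sh
# 1. Start the JMS broker first (command depends on the broker you
#    deploy; it is not part of NetarchiveSuite itself)

# 2. One GUI application, with its own settings file
java -Ddk.netarkivet.settings.file=conf/settings_gui.xml \
     dk.netarkivet.common.webinterface.GUIApplication &

# 3. One HarvestControllerApplication per concurrent harvest, each with
#    its own settings file in which all TCP ports differ
java -Ddk.netarkivet.settings.file=conf/settings_harvester1.xml \
     dk.netarkivet.harvester.harvesting.HarvestControllerApplication &
java -Ddk.netarkivet.settings.file=conf/settings_harvester2.xml \
     dk.netarkivet.harvester.harvesting.HarvestControllerApplication &
```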
>> Thanks!  I've been trying to figure out how to get the whole distributed
>> crawl stuff to work in Heritrix (with CrawlJob JAR files) but there is
>> virtually no documentation.
> I hope you can get things running. As said, the system was not intended
> for this usage, and some tweaking is probably needed, but we would be
> happy to work with you on that.
> Please do not hesitate to ask further questions!
> Best,
>   Kåre Fiedler Christiansen
>   NetarchiveSuite developer