[Netarchivesuite-users] Running Heritrix orders...

Michael Giles mgiles at visionstudio.com
Fri Oct 3 20:49:11 CEST 2008

Wow!  Thank you so much for such a thorough response.  I will try these 
steps out and let you know the results.  The only immediate question I 
have from the email is with regards to this statement:

As many instances of
dk.netarkivet.harvester.harvesting.HarvestControllerApplication as you
need concurrent harvests.

What exactly is entailed by a single "harvest"?  As stated, each website 
in my crawl has a unique set of rules (in its order.xml).  Do I need a 
separate instance of the HarvestController for each site (hopefully 
not)?  Or can I configure a single "harvest" that lists a bunch of sites 
(each with their own order.xml) and have them run in parallel?  I'm 
assuming it is the latter, using the Selective Harvest configuration, 
but just wanted to make sure.

It will all probably be more obvious once I get down into the work of 
running it.  Thanks again, and have a great weekend!


Kaare Fiedler Christiansen wrote:
> On Thu, 2008-10-02 at 18:26 +0200, Michael Giles wrote:
>> I am currently doing periodic, very directed crawls of specific sites
>> (currently only 10 sites, eventually 50 or 100) with Heritrix (each site
>> has its own order.xml).  NetArchive seems like a great way to
>> coordinate and schedule those crawls.  I already have functioning
>> order.xml files for each site.  Instead of archiving in WARC or ARC,
>> they just write out the relevant HTML pages into a mirror directory (the
>> HTML gets parsed by another system after the crawl completes).
>> My questions are:
>>  * Can I just load these order.xml files into NetArchive (I've gotten
>> used to building and testing them through Heritrix)?
> Yes, you can. NetarchiveSuite will treat them as templates, though, and
> fill out some values. This means that certain parts of the order.xml
> must look a specific way... 
> You can read more here: 
> http://netarchive.dk/suite/Installation_Manual#head-9de8fbd401c20e81141ef5fbfee4228aa0eaa6ee
> (Installation manual, Appendix D)
> Read on for some issues specific to your use case...
>>  * Will NetArchive let me do the mirroring (instead of archiving)?
> Well, no, not as is.
> Currently, it is expected that the writer will generate ARC files.
> I did a little testing, and it is possible to use a
> MirrorWriterProcessor; HOWEVER, you need to also have an
> ARCWriterProcessor in the order.xml, you can then set it to "disabled".
> This will make harvesting and scheduling work. Since it is not the
> intended usage, some strange things will happen though.
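A sketch of what that combination could look like in the write-processors chain of an order.xml (the element layout and processor names follow common Heritrix 1.x conventions and are assumptions, not taken from a verified NetarchiveSuite template; check against your own files):

```xml
<!-- Sketch only: MirrorWriterProcessor active, ARCWriterProcessor
     present (as NetarchiveSuite expects) but disabled. -->
<map name="write-processors">
  <newObject name="Mirror" class="org.archive.crawler.writer.MirrorWriterProcessor">
    <boolean name="enabled">true</boolean>
  </newObject>
  <newObject name="Archiver" class="org.archive.crawler.writer.ARCWriterProcessor">
    <boolean name="enabled">false</boolean>
  </newObject>
</map>
```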
> The final cycle of the HarvestControllers is to pack up metadata in an
> ARC file, and upload all ARC files and metadata to the archive. Any
> left-over data will be moved to a directory called "oldjobs".
> My test shows that this will still happen, but there will of course only
> be one ARC file: the packaged metadata. The rest of the data, including
> the mirrored data, will be moved to oldjobs.
> Viewing the data with our access module won't work, but it seems this is
> not your use case anyway.
> This is certainly not optimal, but if you play around with it a bit, you
> can perhaps give us feedback on your requirements, and we can work with
> you to see if we can adapt the system.
>>  * And finally, since I'm downloading a relatively small number of files
>> (e.g. 20-50k per site) over a fixed number of sites would it make sense
>> to run everything on a single machine?
> Yes, I definitely think so.
> In your case, you only need the harvesting component of NetarchiveSuite.
> You will have to update the settings to:
>  * Use a different plugin for "arcrepositoryClient"
>  * Use a different plugin for "indexClient"
>  * Remove some SiteSections from the GUI.
> For the first, change the setting 
> "settings.common.arcrepositoryClient.class" to use the class
> dk.netarkivet.archive.arcrepository.distribute.LocalArcRepository
> This will disable the distributed archive.
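A sketch of how that setting change might look in the settings file (the element nesting is assumed from the dotted setting name "settings.common.arcrepositoryClient.class"; only the class name itself comes from the mail, so adjust to your actual settings.xml):

```xml
<settings>
  <common>
    <arcrepositoryClient>
      <!-- disables the distributed archive -->
      <class>dk.netarkivet.archive.arcrepository.distribute.LocalArcRepository</class>
    </arcrepositoryClient>
  </common>
</settings>
```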
> For the second, change the setting
> "settings.common.indexClient.class".
> The indexClient is used to generate indexes over previously harvested
> material, so that duplicates are not stored again. In your case, you
> probably don't want this.
> Unfortunately, it is a little poor at the moment. There are only two
> implemented plugins for indexClient, and neither work really well for
> disabling the archive. 
> You can implement a class that implements the interface
> "dk.netarkivet.common.distribute.indexserver.JobIndexCache", that always
> returns an Index with an empty set, and an empty file. Please ask if you
> need help with this.
> For the third point, simply edit the settings file, so only
> "DefinitionsSiteSection", "HistorySiteSection" and "StatusSiteSection"
> are available under the setting
> "settings.common.webinterface.siteSection".
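A sketch of the trimmed siteSection setting (the element nesting is again assumed from the dotted setting name, and the package names of the SiteSection classes are assumptions; only the three short class names come from the mail, so verify against your install):

```xml
<settings>
  <common>
    <webinterface>
      <!-- only the site sections that make sense without an archive -->
      <siteSection><class>dk.netarkivet.harvester.webinterface.DefinitionsSiteSection</class></siteSection>
      <siteSection><class>dk.netarkivet.harvester.webinterface.HistorySiteSection</class></siteSection>
      <siteSection><class>dk.netarkivet.common.webinterface.StatusSiteSection</class></siteSection>
    </webinterface>
  </common>
</settings>
```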
> You can then start applications. You need to start:
>  * A JMSBroker
>  * One instance of dk.netarkivet.common.webinterface.GUIApplication
>  * As many instances of
> dk.netarkivet.harvester.harvesting.HarvestControllerApplication as you
> need concurrent harvests.
> Please make sure that all applications use different settings-files,
> where all TCP-ports are changed, so they do not overlap.
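As a rough sketch, starting the applications could look like the commands below (the settings-file system property, the file paths, and the broker start step are assumptions; only the two application class names come from the list above):

```sh
# 1. Start the JMS broker first (command depends on the broker you
#    deploy; it is not part of NetarchiveSuite itself)

# 2. One GUI application, with its own settings file
java -Ddk.netarkivet.settings.file=conf/settings_gui.xml \
     dk.netarkivet.common.webinterface.GUIApplication &

# 3. One HarvestControllerApplication per concurrent harvest, each with
#    its own settings file in which all TCP ports differ
java -Ddk.netarkivet.settings.file=conf/settings_harvester1.xml \
     dk.netarkivet.harvester.harvesting.HarvestControllerApplication &
java -Ddk.netarkivet.settings.file=conf/settings_harvester2.xml \
     dk.netarkivet.harvester.harvesting.HarvestControllerApplication &
```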
>> Thanks!  I've been trying to figure out how to get the whole distributed
>> crawl stuff to work in Heritrix (with CrawlJob JAR files) but there is
>> virtually no documentation.
> I hope you can get things running. As said, the system was not intended
> for this usage, and some tweaking is probably needed, but we would be
> happy to work with you on that.
> Please do not hesitate to ask further questions!
> Best,
>   Kåre Fiedler Christiansen
>   NetarchiveSuite developer