[Netarchivesuite-users] Snapshot harvest failure / job generation configuration

Søren Vejrup Carlsen svc at kb.dk
Mon Apr 27 15:09:53 CEST 2009


Hi Nicolas.

The output from Heritrix is subsequently put into the metadata arc-file for the job.

This is where you can find out what information Heritrix has logged.

It may be that bug 1336 (https://gforge.statsbiblioteket.dk/tracker/?func=detail&group_id=7&aid=1336&atid=105 ) has not yet been fixed.

 

About your job generation: the scheduler has no idea whatsoever how many harvesters are available.

In a situation with about 1,000,000 domains, it does not seem unreasonable to have one harvester take care of 3000 seeds.

But I grant you that it may not be what any user would want. Job generation is currently a sore point in NetarchiveSuite.
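To illustrate why a single huge job can come out of the generator: the scheduler packs domain configurations into jobs by their total expected size, independent of how many harvesters happen to be idle. Below is a simplified sketch of that kind of size-based grouping — not NetarchiveSuite's actual algorithm, and the per-seed size of 2000 is purely illustrative:

```python
# Simplified sketch (NOT NetarchiveSuite's real job generator): pack
# configurations into a job until adding one more would push the job's
# total expected size past max_total_size, then start a new job.
def group_into_jobs(expected_sizes, max_total_size):
    jobs, current, total = [], [], 0
    for size in expected_sizes:
        if current and total + size > max_total_size:
            jobs.append(current)
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        jobs.append(current)
    return jobs

# 2543 configurations with an illustrative expected size of 2000 each:
# with a limit of 500000 they split into several jobs instead of one.
jobs = group_into_jobs([2000] * 2543, 500_000)
print(len(jobs))  # → 11
```

With a high limit (e.g. the default 2000000 against a small snapshot) everything fits into one job, which matches the behavior Nicolas observed.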

 

My suggestion is that you lower the setting settings.harvester.scheduler.jobs.maxTotalSize to 500000.

The default value is 2000000.
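For reference, in the deploy settings file that setting would sit under the nesting implied by its dotted name — a sketch only, assuming the standard settings.xml layout, with the values mentioned in this thread:

```xml
<settings>
  <harvester>
    <scheduler>
      <jobs>
        <!-- Upper bound on a job's total expected size; lowering it from
             the default 2000000 makes the generator split a snapshot into
             more, smaller jobs that can run on harvesters in parallel. -->
        <maxTotalSize>500000</maxTotalSize>
      </jobs>
    </scheduler>
  </harvester>
</settings>
```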

 

I hope this helps.

/Søren

 

From: netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk [mailto:netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk] On behalf of nicolas.giraud at bnf.fr

Sent: 27 April 2009 09:39

To: netarchivesuite-users at lists.gforge.statsbiblioteket.dk

Subject: [Netarchivesuite-users] Snapshot harvest failure / job generation configuration

 


Hi,

I've been loading about 2500 domains (for ~3000 seeds) into NAS to perform an initial test of the snapshot harvest feature. I've launched this test twice, and on both occasions the crawl failed after some time (roughly 12 hours for both launches). The error that occurred is attached to this message. If I understand correctly, this means that the Heritrix JVM somehow crashed. What could be the cause? I am still trying to find a clue in the logs bundled in the metadata arc, with no luck so far.

Additionally, job generation does not proceed as we expected. I have 3 low-priority crawlers available in the setup, and only one huge job was created with the ~3000 seeds in it, so 2 crawlers were completely idle. My expectation was that NAS would divide these 3000 seeds, generate a number of smaller jobs, and submit them in turn to all available crawlers. I used the default job generation parameters; is this behavior correct? How should I configure job generation to obtain the desired behavior?

All the best,
Nicolas

----- Forwarded by Nicolas GIRAUD/ETS/BnF on 27/04/2009 08:53 -----






From: nicolas.giraud at bnf.fr
Date: 25/04/2009 01:41
To: nicolas.giraud at bnf.fr
Subject: Netarkivet error: Fatal error while operating job 'Job 3 (state = SUBMITTED, HD = 1, priority = LOWPRIORITY, forcemaxcount = -1, forcemaxbytes = 20000000, orderxml = default_obeyrobots, numconfigs = 2543)'




acheron2.bnf.fr
dk.netarkivet.harvester.harvesting.distribute.HarvestControllerServer$HarvesterThread.run(HarvestControllerServer.java:670)
Fatal error while operating job 'Job 3 (state = SUBMITTED, HD = 1, priority = LOWPRIORITY, forcemaxcount = -1, forcemaxbytes = 20000000, orderxml = default_obeyrobots, numconfigs = 2543)'
dk.netarkivet.common.exceptions.IOFailure: Error during crawling. The crawl may have been only partially completed.
                at dk.netarkivet.harvester.harvesting.distribute.HarvestControllerServer$HarvesterThread.run(HarvestControllerServer.java:657)
Caused by: dk.netarkivet.common.exceptions.IOFailure: Failed to connect to URL service:jmx:rmi:///jndi/rmi://localhost:8170/jmxrmi after 0 attempts
                at dk.netarkivet.common.utils.JMXUtils.getJMXConnector(JMXUtils.java:383)
                at dk.netarkivet.harvester.harvesting.JMXHeritrixController.getHeritrixJMXConnector(JMXHeritrixController.java:928)
                at dk.netarkivet.harvester.harvesting.JMXHeritrixController.getCrawlJobAttribute(JMXHeritrixController.java:889)
                at dk.netarkivet.harvester.harvesting.JMXHeritrixController.crawlIsEnded(JMXHeritrixController.java:471)
                at dk.netarkivet.harvester.harvesting.HeritrixLauncher.doCrawlLoop(HeritrixLauncher.java:214)
                at dk.netarkivet.harvester.harvesting.HeritrixLauncher.doCrawl(HeritrixLauncher.java:196)
                at dk.netarkivet.harvester.harvesting.HarvestController.runHarvest(HarvestController.java:221)
                at dk.netarkivet.harvester.harvesting.distribute.HarvestControllerServer$HarvesterThread.run(HarvestControllerServer.java:650)
Caused by: java.rmi.NoSuchObjectException: no such object in table
                at sun.rmi.transport.StreamRemoteCall.exceptionReceivedFromServer(StreamRemoteCall.java:255)
                at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:233)
                at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:142)
                at com.sun.jmx.remote.internal.PRef.invoke(Unknown Source)
                at javax.management.remote.rmi.RMIConnectionImpl_Stub.getConnectionId(Unknown Source)
                at javax.management.remote.rmi.RMIConnector.getConnectionId(RMIConnector.java:353)
                at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:315)
                at javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:248)
                at dk.netarkivet.common.utils.JMXUtils.getJMXConnector(JMXUtils.java:369)
                ... 7 more





Consider the environment before printing this mail.


