[Netarchivesuite-users] Snapshot harvest failure / job generation configuration

nicolas.giraud at bnf.fr nicolas.giraud at bnf.fr
Mon Apr 27 09:38:31 CEST 2009


Hi,

I have loaded about 2500 domains (~3000 seeds) into NAS to perform an 
initial test of the snapshot harvest feature. I launched this test twice, 
and on both occasions the crawl failed after roughly 12 hours. The error 
that occurred is attached to this message. If I understand correctly, it 
means that the Heritrix JVM crashed somehow. What could be the cause? I am 
still looking for clues in the logs bundled in the metadata ARC, with no 
luck so far.
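
To check by hand whether the Heritrix JMX endpoint is still alive, a small 
probe along the following lines could be used. This is only a sketch: the 
class and method names are hypothetical, it uses the standard 
javax.management.remote API rather than anything from NAS, and it passes no 
credentials, so a secured endpoint will simply be reported as unreachable.

```java
import java.io.IOException;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical helper, not part of NetarchiveSuite.
class JmxProbeSketch {

    /** Returns true if a JMX connection to the given service URL can be opened. */
    public static boolean isReachable(String serviceUrl) {
        try {
            JMXServiceURL url = new JMXServiceURL(serviceUrl);
            // No credentials are supplied here (assumption: unauthenticated probe).
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                connector.getMBeanServerConnection();
                return true;
            }
        } catch (IOException | SecurityException e) {
            // Connection refused, stale RMI stub, malformed URL or access denied.
            return false;
        }
    }

    public static void main(String[] args) {
        // URL taken from the error message; adjust host and port as needed.
        String url = "service:jmx:rmi:///jndi/rmi://localhost:8170/jmxrmi";
        System.out.println(url + " reachable: " + isReachable(url));
    }
}
```

Running it on the crawler machine while a job is active would at least show 
whether the endpoint disappears before NAS reports the failure.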

Additionally, job generation does not proceed as we expected. I have 3 
low-priority crawlers available in the setup, but only one huge job was 
created with all ~3000 seeds in it, so 2 crawlers sat completely idle. 
My expectation was that NAS would divide these 3000 seeds into a number of 
smaller jobs and submit them in turn to all available crawlers. 
I used the default job generation parameters. Is this behavior correct? 
How should I configure job generation to obtain the desired behavior?
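
To make the expectation concrete, here is a minimal sketch of the splitting 
I had in mind. It is not NAS code and says nothing about how NAS actually 
generates jobs: the class name, the limit of 1000 configurations per job 
and the round-robin assignment to 3 crawlers are all hypothetical values 
chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration of the expected behavior only, not NetarchiveSuite's algorithm.
class JobSplitSketch {

    /** Splits configs into consecutive jobs holding at most maxPerJob entries each. */
    public static <T> List<List<T>> split(List<T> configs, int maxPerJob) {
        if (maxPerJob <= 0) {
            throw new IllegalArgumentException("maxPerJob must be positive");
        }
        List<List<T>> jobs = new ArrayList<>();
        for (int start = 0; start < configs.size(); start += maxPerJob) {
            int end = Math.min(start + maxPerJob, configs.size());
            jobs.add(new ArrayList<>(configs.subList(start, end)));
        }
        return jobs;
    }

    public static void main(String[] args) {
        // 2543 configurations, as in the failed job; the integers stand in for domains.
        List<Integer> configs = new ArrayList<>();
        for (int i = 0; i < 2543; i++) {
            configs.add(i);
        }
        int crawlers = 3;        // low-priority crawlers in my setup
        int maxPerJob = 1000;    // hypothetical limit
        List<List<Integer>> jobs = split(configs, maxPerJob);
        for (int j = 0; j < jobs.size(); j++) {
            // Round-robin assignment so that no crawler stays idle.
            System.out.println("job " + j + " -> crawler " + (j % crawlers)
                    + ", " + jobs.get(j).size() + " configs");
        }
    }
}
```

With a limit like this, the 2543 configurations would be spread over three 
jobs instead of one, which is the outcome I was expecting.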

All the best,
Nicolas

----- Forwarded by Nicolas GIRAUD/ETS/BnF on 27/04/2009 08:53 -----

From: nicolas.giraud at bnf.fr
Date: 25/04/2009 01:41
To: nicolas.giraud at bnf.fr
Cc:
Subject:
Netarkivet error: Fatal error while operating job 'Job 3 (state = 
SUBMITTED, HD = 1, priority = LOWPRIORITY, forcemaxcount = -1, 
forcemaxbytes = 20000000, orderxml = default_obeyrobots, numconfigs = 
2543)'



acheron2.bnf.fr
dk.netarkivet.harvester.harvesting.distribute.HarvestControllerServer$HarvesterThread.run(HarvestControllerServer.java:670)
Fatal error while operating job 'Job 3 (state = SUBMITTED, HD = 1, priority = LOWPRIORITY, forcemaxcount = -1, forcemaxbytes = 20000000, orderxml = default_obeyrobots, numconfigs = 2543)'
dk.netarkivet.common.exceptions.IOFailure: Error during crawling. The crawl may have been only partially completed.
        at dk.netarkivet.harvester.harvesting.distribute.HarvestControllerServer$HarvesterThread.run(HarvestControllerServer.java:657)
Caused by: dk.netarkivet.common.exceptions.IOFailure: Failed to connect to URL service:jmx:rmi:///jndi/rmi://localhost:8170/jmxrmi after 0 attempts
        at dk.netarkivet.common.utils.JMXUtils.getJMXConnector(JMXUtils.java:383)
        at dk.netarkivet.harvester.harvesting.JMXHeritrixController.getHeritrixJMXConnector(JMXHeritrixController.java:928)
        at dk.netarkivet.harvester.harvesting.JMXHeritrixController.getCrawlJobAttribute(JMXHeritrixController.java:889)
        at dk.netarkivet.harvester.harvesting.JMXHeritrixController.crawlIsEnded(JMXHeritrixController.java:471)
        at dk.netarkivet.harvester.harvesting.HeritrixLauncher.doCrawlLoop(HeritrixLauncher.java:214)
        at dk.netarkivet.harvester.harvesting.HeritrixLauncher.doCrawl(HeritrixLauncher.java:196)
        at dk.netarkivet.harvester.harvesting.HarvestController.runHarvest(HarvestController.java:221)
        at dk.netarkivet.harvester.harvesting.distribute.HarvestControllerServer$HarvesterThread.run(HarvestControllerServer.java:650)
Caused by: java.rmi.NoSuchObjectException: no such object in table
        at sun.rmi.transport.StreamRemoteCall.exceptionReceivedFromServer(StreamRemoteCall.java:255)
        at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:233)
        at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:142)
        at com.sun.jmx.remote.internal.PRef.invoke(Unknown Source)
        at javax.management.remote.rmi.RMIConnectionImpl_Stub.getConnectionId(Unknown Source)
        at javax.management.remote.rmi.RMIConnector.getConnectionId(RMIConnector.java:353)
        at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:315)
        at javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:248)
        at dk.netarkivet.common.utils.JMXUtils.getJMXConnector(JMXUtils.java:369)
        ... 7 more






