[Netarchivesuite-users] IOException while querying Heritrix status via JMX

Mon May 18 14:05:08 CEST 2009

It sounds very reasonable that the problem lies in the fact that Heritrix sometimes (e.g. during startup of a large job, but also later, in the middle or at the end of large crawls) does not respond to JMX calls. We never see it in our selective crawling, so it must have something to do with the size of the jobs.

So a configurable number of retries to work around this problem seems like a good idea. Of course, the simple workaround right now is to set the timeout value to something bigger than 120 seconds, as Soeren suggests.

best
Bjarne Andersen

From: netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk [mailto:netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk] On Behalf Of Søren Vejrup Carlsen
Sent: Monday, May 18, 2009 1:54 PM
To: netarchivesuite-users at lists.gforge.statsbiblioteket.dk
Subject: Re: [Netarchivesuite-users] IOException while querying Heritrix status via JMX

Hi Nicolas.
Is there anything to see in the Heritrix logs that might indicate what the problem is?
First, you could try to increase the value of the setting “settings.common.jmx.timeout”.
The default value is 120 seconds. You could try increasing it to 300 seconds (5 minutes).

/Søren

From: netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk [mailto:netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk] On Behalf Of nicolas.giraud at bnf.fr
Sent: May 18, 2009 10:34
To: netarchivesuite-users at lists.gforge.statsbiblioteket.dk
Subject: [Netarchivesuite-users] IOException while querying Heritrix status via JMX

Hi,

I think the bug that existed in 1.12 is not fully fixed in 1.14. I'm now experiencing trouble right at the start of broad crawl jobs, with a new harvest template I've been using. My goal was to take the production configuration of Heritrix used by Internet Archive when they performed their last broad crawl in 2008. I started from the default_obeyrobots config, and then started adding the IA settings. This is the harvest template I obtained:

I don't know which configuration parameter causes this, but Heritrix takes a long time to compute the initial frontier and uses a lot of CPU doing so. As a result, even though I can connect to the admin console, I get the following error after a couple of minutes:

GRAVE: Mailing netarkivet error: Fatal error while operating job 'Job 13 (state = SUBMITTED, HD = 1, priority = LOWPRIORITY, forcemaxcount = -1, forcemaxbytes = 20000000, orderxml = ia_large_2008, numconfigs = 1000)'
dk.netarkivet.common.exceptions.IOFailure: Error during crawling. The crawl may have been only partially completed.
        at dk.netarkivet.harvester.harvesting.distribute.HarvestControllerServer$HarvesterThread.run(HarvestControllerServer.java:657)
Caused by: dk.netarkivet.common.exceptions.IOFailure: Failed to find MBean org.archive.crawler:name=13-1-20090515153903599,type=CrawlService.Job,jmxport=8170,mother=Heritrix,host=acheron2.bnf.fr for getting attribute Status after 17 attempts
        at dk.netarkivet.common.utils.JMXUtils.getAttribute(JMXUtils.java:317)
        at dk.netarkivet.common.utils.JMXUtils.getAttribute(JMXUtils.java:459)
        at dk.netarkivet.harvester.harvesting.JMXHeritrixController.getCrawlJobAttribute(JMXHeritrixController.java:905)
        at dk.netarkivet.harvester.harvesting.JMXHeritrixController.crawlIsEnded(JMXHeritrixController.java:487)
        at dk.netarkivet.harvester.harvesting.JMXHeritrixController.atFinish(JMXHeritrixController.java:354)
        at dk.netarkivet.harvester.harvesting.HeritrixLauncher.doCrawl(HeritrixLauncher.java:193)
        at dk.netarkivet.harvester.harvesting.HarvestController.runHarvest(HarvestController.java:221)
        at dk.netarkivet.harvester.harvesting.distribute.HarvestControllerServer$HarvesterThread.run(HarvestControllerServer.java:650)
Caused by: javax.management.InstanceNotFoundException: org.archive.crawler:name=13-1-20090515153903599,type=CrawlService.Job,jmxport=8170,mother=Heritrix,host=acheron2.bnf.fr
        at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getMBean(DefaultMBeanServerInterceptor.java:1094)
        at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:662)
        at com.sun.jmx.mbeanserver.JmxMBeanServer.getAttribute(JmxMBeanServer.java:638)
        at com.sun.jmx.remote.security.MBeanServerAccessController.getAttribute(MBeanServerAccessController.java:299)
        at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1403)
        at javax.management.remote.rmi.RMIConnectionImpl.access$200(RMIConnectionImpl.java:72)
        at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1264)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1366)
        at javax.management.remote.rmi.RMIConnectionImpl.getAttribute(RMIConnectionImpl.java:600)
        at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:305)
        at sun.rmi.transport.Transport$1.run(Transport.java:159)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.rmi.transport.Transport.serviceCall(Transport.java:155)
        at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:535)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:790)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:649)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
        at sun.rmi.transport.StreamRemoteCall.exceptionReceivedFromServer(StreamRemoteCall.java:255)
        at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:233)
        at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:142)
        at com.sun.jmx.remote.internal.PRef.invoke(Unknown Source)
        at javax.management.remote.rmi.RMIConnectionImpl_Stub.getAttribute(Unknown Source)
        at javax.management.remote.rmi.RMIConnector$RemoteMBeanServerConnection.getAttribute(RMIConnector.java:878)
        at dk.netarkivet.common.utils.JMXUtils.getAttribute(JMXUtils.java:296)
        ... 7 more

I've been thinking of modifying JMXHeritrixController.getCrawlJobAttribute to allow several retries, each after a configurable wait period, to work around this problem. What do you think?
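
To make the retry idea concrete, here is a minimal sketch of such a loop, written against a current JDK. It is not NetarchiveSuite code: the class RetrySketch, the method callWithRetries and the parameters maxTries and waitMillis are hypothetical names used only for illustration, and the real getCrawlJobAttribute may need to treat different exceptions differently.

```java
import java.util.concurrent.Callable;

// Hypothetical illustration only; not part of NetarchiveSuite.
final class RetrySketch {

    /**
     * Runs the call up to maxTries times, sleeping waitMillis between
     * failed attempts. Returns the first successful result, or rethrows
     * the exception from the last attempt if all attempts fail.
     */
    public static <T> T callWithRetries(Callable<T> call, int maxTries, long waitMillis)
            throws Exception {
        if (maxTries < 1) {
            throw new IllegalArgumentException("maxTries must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxTries; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;
                // Only wait if another attempt will follow.
                if (attempt < maxTries) {
                    Thread.sleep(waitMillis);
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Simulated JMX query: the MBean is "missing" for the first two calls.
        String status = callWithRetries(() -> {
            if (++calls[0] < 3) {
                throw new javax.management.InstanceNotFoundException("MBean not registered yet");
            }
            return "RUNNING";
        }, 5, 10L);
        System.out.println(status + " after " + calls[0] + " calls");
    }
}
```

In a real change, both the number of tries and the wait period would presumably be read from settings, so that large broad crawls can be given more slack than selective ones.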

Best regards,
Nicolas
