[Netarchivesuite-users] IOException while querying Heritrix status via JMX

nicolas.giraud at bnf.fr nicolas.giraud at bnf.fr
Mon May 18 10:33:31 CEST 2009


Hi,

I think the bug that existed in 1.12 is not fully fixed in 1.14. I'm now 
experiencing trouble right at the start of broad crawl jobs, with a new 
harvest template. My goal was to reuse the production configuration of 
Heritrix that Internet Archive used when they performed our last broad 
crawl in 2008. I started from the default_obeyrobots config and then 
added the IA settings. The resulting harvest template is attached 
(ia_large_2008.xml).

I don't know which configuration parameter causes this, but Heritrix takes 
a long time to compute the initial frontier and uses a lot of CPU while 
doing so. As a result, even though I can connect to the admin console, I 
get the following error after a couple of minutes:

GRAVE: Mailing netarkivet error: Fatal error while operating job 'Job 13 
(state = SUBMITTED, HD = 1, priority = LOWPRIORITY, forcemaxcount = -1, 
forcemaxbytes = 20000000, orderxml = ia_large_2008, numconfigs = 1000)'
dk.netarkivet.common.exceptions.IOFailure: Error during crawling. The 
crawl may have been only partially completed.
        at 
dk.netarkivet.harvester.harvesting.distribute.HarvestControllerServer$HarvesterThread.run(HarvestControllerServer.java:657)
Caused by: dk.netarkivet.common.exceptions.IOFailure: Failed to find MBean 
org.archive.crawler:name=13-1-20090515153903599,type=CrawlService.Job,jmxport=8170,mother=Heritrix,host=acheron2.bnf.fr 
for getting attribute Status after 17 attempts
        at 
dk.netarkivet.common.utils.JMXUtils.getAttribute(JMXUtils.java:317)
        at 
dk.netarkivet.common.utils.JMXUtils.getAttribute(JMXUtils.java:459)
        at 
dk.netarkivet.harvester.harvesting.JMXHeritrixController.getCrawlJobAttribute(JMXHeritrixController.java:905)
        at 
dk.netarkivet.harvester.harvesting.JMXHeritrixController.crawlIsEnded(JMXHeritrixController.java:487)
        at 
dk.netarkivet.harvester.harvesting.JMXHeritrixController.atFinish(JMXHeritrixController.java:354)
        at 
dk.netarkivet.harvester.harvesting.HeritrixLauncher.doCrawl(HeritrixLauncher.java:193)
        at 
dk.netarkivet.harvester.harvesting.HarvestController.runHarvest(HarvestController.java:221)
        at 
dk.netarkivet.harvester.harvesting.distribute.HarvestControllerServer$HarvesterThread.run(HarvestControllerServer.java:650)
Caused by: javax.management.InstanceNotFoundException: 
org.archive.crawler:name=13-1-20090515153903599,type=CrawlService.Job,jmxport=8170,mother=Heritrix,host=acheron2.bnf.fr
        at 
com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getMBean(DefaultMBeanServerInterceptor.java:1094)
        at 
com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:662)
        at 
com.sun.jmx.mbeanserver.JmxMBeanServer.getAttribute(JmxMBeanServer.java:638)
        at 
com.sun.jmx.remote.security.MBeanServerAccessController.getAttribute(MBeanServerAccessController.java:299)
        at 
javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1403)
        at 
javax.management.remote.rmi.RMIConnectionImpl.access$200(RMIConnectionImpl.java:72)
        at 
javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1264)
        at java.security.AccessController.doPrivileged(Native Method)
        at 
javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1366)
        at 
javax.management.remote.rmi.RMIConnectionImpl.getAttribute(RMIConnectionImpl.java:600)
        at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at 
sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:305)
        at sun.rmi.transport.Transport$1.run(Transport.java:159)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.rmi.transport.Transport.serviceCall(Transport.java:155)
        at 
sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:535)
        at 
sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:790)
        at 
sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:649)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
        at 
sun.rmi.transport.StreamRemoteCall.exceptionReceivedFromServer(StreamRemoteCall.java:255)
        at 
sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:233)
        at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:142)
        at com.sun.jmx.remote.internal.PRef.invoke(Unknown Source)
        at 
javax.management.remote.rmi.RMIConnectionImpl_Stub.getAttribute(Unknown Source)
        at 
javax.management.remote.rmi.RMIConnector$RemoteMBeanServerConnection.getAttribute(RMIConnector.java:878)
        at 
dk.netarkivet.common.utils.JMXUtils.getAttribute(JMXUtils.java:296)
        ... 7 more

I'm thinking of modifying JMXHeritrixController.getCrawlJobAttribute to 
allow several retries, each after a configurable wait period, to work 
around this problem. What do you think?
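
To make the idea concrete, here is a minimal sketch of such a retry loop. 
It is not NetarchiveSuite code: the class RetrySketch, the method retry 
and the parameters maxAttempts and waitMillis are hypothetical names, and 
the "RUNNING" value is only a placeholder for whatever the status query 
returns. In the real controller the query would be the JMX call, and the 
two values would presumably come from the settings.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper, for illustration only; not part of NetarchiveSuite. */
final class RetrySketch {

    private RetrySketch() {}

    /**
     * Runs the query until it succeeds or maxAttempts is reached,
     * sleeping waitMillis between two failed attempts.
     * Assumption: every exception is treated as retryable; real code
     * would probably retry only on the "MBean not found" failure.
     */
    static <T> T retry(Callable<T> query, int maxAttempts, long waitMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return query.call();
            } catch (InterruptedException e) {
                // Do not retry when the thread is asked to stop.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    // Give Heritrix time to register the job MBean.
                    Thread.sleep(waitMillis);
                }
            }
        }
        // All attempts failed: rethrow the last failure.
        throw last;
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Stand-in for a status query that fails twice before the MBean exists.
        String status = retry(new Callable<String>() {
            public String call() {
                if (++calls[0] < 3) {
                    throw new IllegalStateException("MBean not registered yet");
                }
                return "RUNNING";
            }
        }, 5, 10L);
        // Prints "RUNNING after 3 attempts".
        System.out.println(status + " after " + calls[0] + " attempts");
    }
}
```

The wait is skipped after the last failed attempt, so the total delay is at 
most (maxAttempts - 1) * waitMillis plus the time spent in the queries.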

Best regards,
Nicolas




-------------- next part --------------
A non-text attachment was scrubbed...
Name: ia_large_2008.xml
Type: application/octet-stream
Size: 51557 bytes
Desc: not available
Url : http://lists.gforge.statsbiblioteket.dk/pipermail/netarchivesuite-users/attachments/20090518/d87edfb5/attachment-0001.obj 