[Netarchivesuite-users] IOException while querying Heritrix status via JMX
nicolas.giraud at bnf.fr
Mon May 18 10:33:31 CEST 2009
Hi,
I think the bug that existed in 1.12 is not fully fixed in 1.14. I'm now
running into trouble right at the start of broad crawl jobs, with a new
harvest template I've been using. My goal was to reuse the production
configuration of Heritrix that Internet Archive used for their last broad
crawl in 2008. I started from the default_obeyrobots config and then added
the IA settings. The resulting harvest template is attached
(ia_large_2008.xml).
I don't know which configuration parameter causes this, but Heritrix takes
a long time to compute the initial frontier and uses a lot of CPU while
doing so. As a result, even though I can connect to the admin console, I
get the following error after a couple of minutes:
GRAVE: Mailing netarkivet error: Fatal error while operating job 'Job 13 (state = SUBMITTED, HD = 1, priority = LOWPRIORITY, forcemaxcount = -1, forcemaxbytes = 20000000, orderxml = ia_large_2008, numconfigs = 1000)'
dk.netarkivet.common.exceptions.IOFailure: Error during crawling. The crawl may have been only partially completed.
    at dk.netarkivet.harvester.harvesting.distribute.HarvestControllerServer$HarvesterThread.run(HarvestControllerServer.java:657)
Caused by: dk.netarkivet.common.exceptions.IOFailure: Failed to find MBean org.archive.crawler:name=13-1-20090515153903599,type=CrawlService.Job,jmxport=8170,mother=Heritrix,host=acheron2.bnf.fr for getting attribute Status after 17 attempts
    at dk.netarkivet.common.utils.JMXUtils.getAttribute(JMXUtils.java:317)
    at dk.netarkivet.common.utils.JMXUtils.getAttribute(JMXUtils.java:459)
    at dk.netarkivet.harvester.harvesting.JMXHeritrixController.getCrawlJobAttribute(JMXHeritrixController.java:905)
    at dk.netarkivet.harvester.harvesting.JMXHeritrixController.crawlIsEnded(JMXHeritrixController.java:487)
    at dk.netarkivet.harvester.harvesting.JMXHeritrixController.atFinish(JMXHeritrixController.java:354)
    at dk.netarkivet.harvester.harvesting.HeritrixLauncher.doCrawl(HeritrixLauncher.java:193)
    at dk.netarkivet.harvester.harvesting.HarvestController.runHarvest(HarvestController.java:221)
    at dk.netarkivet.harvester.harvesting.distribute.HarvestControllerServer$HarvesterThread.run(HarvestControllerServer.java:650)
Caused by: javax.management.InstanceNotFoundException: org.archive.crawler:name=13-1-20090515153903599,type=CrawlService.Job,jmxport=8170,mother=Heritrix,host=acheron2.bnf.fr
    at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getMBean(DefaultMBeanServerInterceptor.java:1094)
    at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:662)
    at com.sun.jmx.mbeanserver.JmxMBeanServer.getAttribute(JmxMBeanServer.java:638)
    at com.sun.jmx.remote.security.MBeanServerAccessController.getAttribute(MBeanServerAccessController.java:299)
    at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1403)
    at javax.management.remote.rmi.RMIConnectionImpl.access$200(RMIConnectionImpl.java:72)
    at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1264)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1366)
    at javax.management.remote.rmi.RMIConnectionImpl.getAttribute(RMIConnectionImpl.java:600)
    at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:305)
    at sun.rmi.transport.Transport$1.run(Transport.java:159)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.rmi.transport.Transport.serviceCall(Transport.java:155)
    at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:535)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:790)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:649)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)
    at sun.rmi.transport.StreamRemoteCall.exceptionReceivedFromServer(StreamRemoteCall.java:255)
    at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:233)
    at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:142)
    at com.sun.jmx.remote.internal.PRef.invoke(Unknown Source)
    at javax.management.remote.rmi.RMIConnectionImpl_Stub.getAttribute(Unknown Source)
    at javax.management.remote.rmi.RMIConnector$RemoteMBeanServerConnection.getAttribute(RMIConnector.java:878)
    at dk.netarkivet.common.utils.JMXUtils.getAttribute(JMXUtils.java:296)
    ... 7 more
I've been thinking of modifying JMXHeritrixController.getCrawlJobAttribute
to allow several retries, each after a configurable wait period, to work
around this problem. What do you think?
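
The retry idea could look roughly like the following. This is only a
minimal sketch, not NetarchiveSuite code: the class RetrySketch, the
method retry and the parameters maxAttempts and waitMillis are
hypothetical names, and the query is passed in as a plain Callable
instead of going through JMXUtils. It retries on any exception, whereas a
real change would probably retry only on the "MBean not found" case.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper illustrating retries with a configurable wait. */
final class RetrySketch {

    private RetrySketch() {
    }

    /**
     * Runs the query up to maxAttempts times, sleeping waitMillis after
     * each failed attempt except the last one. Returns the first
     * successful result, or rethrows the last failure.
     */
    static <T> T retry(Callable<T> query, int maxAttempts, long waitMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return query.call();
            } catch (InterruptedException e) {
                // Do not retry when the calling thread is being stopped.
                throw e;
            } catch (Exception e) {
                lastFailure = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(waitMillis);
                }
            }
        }
        throw lastFailure;
    }
}
```

In getCrawlJobAttribute, the Callable would wrap the existing attribute
lookup, with maxAttempts and waitMillis read from settings (assumed here,
not existing setting names).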
Best regards,
Nicolas
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ia_large_2008.xml
Type: application/octet-stream
Size: 51557 bytes
Desc: not available
Url : http://lists.gforge.statsbiblioteket.dk/pipermail/netarchivesuite-users/attachments/20090518/d87edfb5/attachment-0001.obj