[Netarchivesuite-users] Job already finished, but still started in NAS

Kåre Fiedler Christiansen kfc at statsbiblioteket.dk
Thu Oct 8 09:31:35 CEST 2009


Hi,

I'll comment on your mail below:

On Tue, 2009-10-06 at 17:46 +0200, aponb at gmx.at wrote:
> Hi!
> 
> I have the following problem. Some days ago we put a broken server back 
> into NAS. The server is used as a webcrawler. After starting the 
> application on this server, I waited for the applications of that 
> machine to appear in the Systemstate, but they did not. Due to a 
> configuration error in the hosts file on this machine, the following 
> error appeared in the GUI:
> 
> Unable to proxy JMX beans on host 
> 'webcrawler05.onb.ac.at.onb.ac.at:8400', last seen active at 'Tue Oct 06 
> 08:12:14 CEST 2009'
> dk.netarkivet.common.exceptions.IOFailure: Could not connect to 
> service:jmx:rmi://webcrawler05.onb.ac.at.onb.ac.at:8600/jndi/rmi://webcrawler05.onb.ac.at.onb.ac.at:8400/jmxrmi
> at 
> dk.netarkivet.common.utils.JMXUtils.getMBeanServerConnection(JMXUtils.java:189)
> at 
> dk.netarkivet.common.utils.JMXUtils.getMBeanServerConnection(JMXUtils.java:164)
> at 
> dk.netarkivet.monitor.jmx.RmiProxyConnectionFactory.getConnection(RmiProxyConnectionFactory.java:67)

This only means that the JMX monitoring of the host did not work. That
should be harmless for all functionality, although annoying for
monitoring the application.
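To illustrate what failed here: the service URL in the error message is built from the (misconfigured) host name, so a doubled domain like `webcrawler05.onb.ac.at.onb.ac.at` makes the connect step fail. A minimal sketch using only the plain JDK JMX remote classes (not NetarchiveSuite's JMXUtils; host and port values are just the ones from the error message):

```java
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxProbe {
    /** Build the same style of service URL as in the error message:
     *  service:jmx:rmi://HOST:RMIPORT/jndi/rmi://HOST:JMXPORT/jmxrmi */
    public static JMXServiceURL serviceUrl(String host, int jmxPort, int rmiPort)
            throws Exception {
        return new JMXServiceURL("service:jmx:rmi://" + host + ":" + rmiPort
                + "/jndi/rmi://" + host + ":" + jmxPort + "/jmxrmi");
    }

    public static void main(String[] args) throws Exception {
        // With the doubled domain from the broken hosts file, the host does
        // not resolve and connect() fails just like in the GUI error.
        JMXServiceURL url = serviceUrl("webcrawler05.onb.ac.at", 8400, 8600);
        try (JMXConnector c = JMXConnectorFactory.connect(url)) {
            System.out.println("Connected: " + c.getConnectionId());
        } catch (java.io.IOException e) {
            System.out.println("Could not connect: " + e.getMessage());
        }
    }
}
```

Running something like this from the monitoring host is a quick way to check whether the JMX side is reachable again after fixing the hosts file.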

> But the server did get a new job assigned and started to crawl. I 
> corrected the hosts file to the right value. I thought that after 
> finishing that job the application would reconnect in the right way, but 
> that did not happen. The job finished crawling and all files were 
> uploaded to the storage, but NAS did not get any information about that. 

That sounds very strange; would you mind double-checking that this is
actually the case?

JMX is in no way involved in NetarchiveSuite getting information about
the crawl, that is purely handled by the exchange of JMS messages. Since
the files could be uploaded, JMS seems to work. So I would imagine that
the scheduler also got the message about the job being crawled.

You can check this by looking at the history for the harvest definition
that generated the crawled job.

> But the harvester on that server has already been assigned a new job, 
> which is already running. The problem is now that I have one job without 
> any statistical information, and there will be more such jobs unless I 
> stop that machine. (How can I stop a HarvesterControllerInstance right 
> after it uploads the files from the last job - is there a possibility to 
> stop assigning jobs to that instance?

Unfortunately not. The best way to stop the machine is to kill it during
upload if you can manage it. Otherwise, you might want to kill it in the
early stages of a crawl and resubmit that job later.

>  - and if I can stop that 
> HarvesterController, a restart of that controller should bring the 
> applications of that server back into the Systemstate, right?)

It should, assuming it is on the same host with the same ports and
running the same kind of harvester. Otherwise, you might want to give it
a different JMX port in your settings; that will enable it to be
monitored as if it were an entirely new application. The old entry on
the Status page will remain there, though, showing a warning for the
application that is no longer running.
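If you do change the port, it would look something like the fragment below. This is a sketch under the assumption that your deployment uses the standard NetarchiveSuite settings layout with `settings.common.jmx.port` and `settings.common.jmx.rmiPort`; the port numbers are hypothetical examples:

```xml
<settings>
  <common>
    <jmx>
      <!-- Hypothetical values: pick ports not used by the old entry -->
      <port>8401</port>
      <rmiPort>8601</rmiPort>
    </jmx>
  </common>
</settings>
```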

> Without the statistical information, NAS will again crawl many of the 
> domains of that job, although these may already be completed. So how 
> can I recalculate the statistical data from the arc files, which have 
> already been uploaded to the storage?

If you really haven't got the statistical data (please double check),
and all the files were uploaded and deleted from the server, it will
probably require quite a bit of manual work to get the right data. 

You will have to reconstruct the log files in the crawldir from the
metadata files. Then move that crawldir into the jobs directory of a
harvester, and restart it. 
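The first step, getting the crawl.log back out of the metadata file, could be sketched roughly like this. This is not a NetarchiveSuite tool; it assumes an uncompressed ARC 1.x metadata file where each record starts with a header line "URL IP date mimetype length", and where the crawl.log record can be recognized by "crawl.log" appearing in its record URL (the exact record URLs in real metadata files may differ):

```java
import java.io.*;

/** Rough sketch: pull the crawl.log record out of an uncompressed
 *  ARC 1.x metadata file and write its body to outLog. */
public class ExtractCrawlLog {
    public static void extract(File metadataArc, File outLog) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(metadataArc)))) {
            String line;
            while ((line = readLine(in)) != null) {
                if (line.isEmpty()) continue;        // blank record separator
                String[] fields = line.split(" ");
                if (fields.length != 5) continue;    // not a record header
                byte[] body = new byte[Integer.parseInt(fields[4])];
                in.readFully(body);                  // record body of given length
                if (fields[0].contains("crawl.log")) {
                    try (OutputStream out = new FileOutputStream(outLog)) {
                        out.write(body);
                    }
                    return;
                }
            }
        }
    }

    /** Read one newline-terminated line as bytes; null at end of stream. */
    private static String readLine(DataInputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1 && b != '\n') buf.write(b);
        if (b == -1 && buf.size() == 0) return null;
        return buf.toString("UTF-8").trim();
    }
}
```

The extracted file would then go into the logs directory of the reconstructed crawldir before moving it under the harvester's jobs directory.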

We should probably create a tool for doing what a harvester does with
old jobs, to make this sort of recovery possible without restarting the
harvesters.

Best,
  Kåre

-- 
Kaare Fiedler Christiansen - NetarchiveSuite developer
THE STATE AND UNIVERSITY LIBRARY, 
Universitetsparken 1, 8000 Aarhus C, Denmark.
Phone: +45 89462036



