[Netarchivesuite-users] Oldjobs directory growing too big

Bjarne Andersen bja at statsbiblioteket.dk
Wed Apr 29 13:52:15 CEST 2009

It is still a manual process here, covering:
1) identify jobs that are not reported FINISHED (or FAILED)
2) locate the harvest machine the job was originally run on
3) find a harvester instance on that machine that is available (doing nothing)
4) move the job directory from /oldjobs/ into the harvester directory
5) restart that harvester

After these five steps the jobs should be marked FAILED in the database, and all statistics (on the domain level) should be generated in the database as well.
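For what it's worth, steps 2)-5) boil down to a directory move plus a restart. A minimal shell sketch of that, assuming hypothetical paths, a hypothetical job-directory name, and a hypothetical start script (the real layout depends on your NAS installation):

```shell
#!/bin/sh
set -e

# All of these values are assumptions for illustration only.
OLDJOBS=/tmp/nas-demo/oldjobs          # assumed location of the oldjobs directory
IDLE_HARVESTER=/tmp/nas-demo/harvester # assumed directory of an idle harvester instance
JOB_DIR=42_1240998000000               # assumed name of a job directory not reported FINISHED

# Simulate the layout so this sketch is runnable on its own.
mkdir -p "$OLDJOBS/$JOB_DIR" "$IDLE_HARVESTER"

# Step 4: move the job directory back into the idle harvester's directory.
mv "$OLDJOBS/$JOB_DIR" "$IDLE_HARVESTER/"

# Step 5: restart that harvester so it finds the old job directory and
# reports it FAILED to the scheduler. The actual restart command depends
# on your installation's start scripts, e.g.:
# ./conf/start_harvester.sh
```

On restart, the harvester should then generate the domain-level statistics and mark the job FAILED, as described above.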

On very rare occasions the crawl.log is deleted from the job directory. The HarvesterController needs exactly that file to finish correctly by generating the domain-level statistics. In those cases you will not get any statistics in the database for that job, unless you perform a further manual process: extracting the crawl.log again from the archive, out of the job-metadata-1.arc file.


From: netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk [mailto:netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk] On Behalf Of Søren Vejrup Carlsen
Sent: Wednesday, April 29, 2009 1:20 PM
To: netarchivesuite-users at lists.gforge.statsbiblioteket.dk
Subject: Re: [Netarchivesuite-users] Oldjobs directory growing too big

Hi Nicolas.
Have you by any chance reset (emptied) the JMS queues used by your NAS installation?
If any Heritrix harvests fails, our code should send a message back to the scheduler, that this job has failed.
But if you have emptied the JMS queues used by your NAS installation, this message may have been lost.


From: netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk [mailto:netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk] On Behalf Of nicolas.giraud at bnf.fr
Sent: Wednesday, April 29, 2009 11:56 AM
To: netarchivesuite-users at lists.gforge.statsbiblioteket.dk
Subject: Re: [Netarchivesuite-users] Oldjobs directory growing too big

Hi Bjarne,

Thanks for the info. By "The recovering of jobs not reported as FINISHED is still an all manual process here" do you mean that there is a manual procedure to restart a job that is stored in the oldjobs directory?

I have a job that appears as "started" in the Harvest status section, but I had two errors during the crawl: loss of the JMX connection to Heritrix (I don't yet understand what causes this), so the job got moved to oldjobs. Then the disk filled up. I moved the oldjobs directories to NFS mounts to solve the disk-space problem. But now, after restarting NAS, the job still shows as "started" and does not restart; no Heritrix instance is created. I'm a bit lost there.


Consider the environment before printing this mail.
