[Netarchivesuite-users] Oldjobs directory growing too big

Bjarne Andersen bja at statsbiblioteket.dk
Tue Apr 28 12:03:56 CEST 2009


The finalization of a crawljob consists of at least these steps:
 - packaging things (e.g. crawler-logs) into X-metadata-1.arc
 - uploading all ARC-files + the metadata-ARC-file to the arc-repository
 - sending a job-finished message to the system (through JMS)

If any of these steps fails, you will have important information left in /oldjobs - e.g. not-yet-uploaded ARC-files.

So before deleting anything, you have to be sure that the directories have been emptied of "real" archive content - for most jobs this will of course be the case - but for failed jobs you may still have things left in /oldjobs. A quick check is sketched below.
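
A minimal sketch of such a check (the /oldjobs path is an assumption - use the oldjobs directory configured for your own harvester):

#!/usr/bin/env python
# List job directories under /oldjobs that still contain ARC files,
# so they can be inspected before anything is deleted.
import os

OLDJOBS = "/oldjobs"   # assumption: adjust to your installation

for jobdir in sorted(os.listdir(OLDJOBS)):
    path = os.path.join(OLDJOBS, jobdir)
    if not os.path.isdir(path):
        continue
    # Collect any ARC files left anywhere under the job directory.
    leftovers = []
    for root, _, files in os.walk(path):
        leftovers += [f for f in files if f.endswith((".arc", ".arc.gz"))]
    if leftovers:
        print("KEEP  %s (%d ARC file(s), possibly not uploaded)" % (jobdir, len(leftovers)))
    else:
        print("clear %s (no ARC files found)" % jobdir)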

In Netarchive.dk we have set up a small script that takes all ARC files in /oldjobs - both crawler-ARC-files and metadata-ARC-files - and uploads them using the command-line upload tool supplied with NetarchiveSuite.
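
In outline the script does something like the following. The java invocation is an assumption - check the tools documentation of your NetarchiveSuite version for the exact class name, classpath and settings property:

#!/usr/bin/env python
# Find every ARC file under /oldjobs (crawler ARCs and metadata ARCs alike)
# and hand each one to the NetarchiveSuite command-line upload tool.
import os
import subprocess

OLDJOBS = "/oldjobs"                                   # assumption
UPLOAD_CMD = ["java",
              "-cp", "/path/to/netarchivesuite.jar",   # hypothetical classpath
              "-Dsettings.file=/path/to/settings.xml", # hypothetical property
              "dk.netarkivet.archive.tools.Upload"]    # verify the class name

arcfiles = []
for root, _, files in os.walk(OLDJOBS):
    arcfiles += [os.path.join(root, f)
                 for f in files if f.endswith((".arc", ".arc.gz"))]

for arc in sorted(arcfiles):
    # Upload one file at a time so a single failure does not stop the batch.
    ret = subprocess.call(UPLOAD_CMD + [arc])
    print("%s %s" % ("uploaded" if ret == 0 else "FAILED", arc))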

On rare occasions (e.g. when a crawler loses the JMS connection during a crawl) the third of the steps above will fail (usually the second as well), because the harvester application can send neither the upload messages nor the job-finished message. In these cases the job will not be reported as finished in the database and will remain in status STARTED. The only way to fix this currently is to copy the entire contents of the job directory back to a harvester instance (one not running other jobs) and restart that instance. That will make the harvester find the old data and do what is necessary - actually all three steps, if required.
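
In essence the recovery step is just the following (both paths are hypothetical - use the stuck job directory and the crawl directory configured for the idle harvester instance):

#!/usr/bin/env python
# Copy the contents of a stuck job directory from /oldjobs back into the
# crawl directory of an idle harvester instance, then restart that instance.
import os
import shutil

STUCK_JOB = "/oldjobs/1234_1170859308860"   # hypothetical job directory
HARVESTER_DIR = "/harvester/crawldir"       # hypothetical crawl directory

# Copy the whole job directory; on restart the harvester should detect the
# old crawl data and redo the finalization steps that failed.
dest = os.path.join(HARVESTER_DIR, os.path.basename(STUCK_JOB))
shutil.copytree(STUCK_JOB, dest)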

All this error handling is currently a manual process - but luckily it does not happen that often.

best
Bjarne Andersen
________________________________________
From: netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk [netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk] On behalf of Søren Vejrup Carlsen [svc at kb.dk]
Sent: 28 April 2009 11:38
To: netarchivesuite-users at lists.gforge.statsbiblioteket.dk
Subject: Re: [Netarchivesuite-users] Oldjobs directory growing too big

Hi Nicolas.
The oldjobs directory is the final resting place for the crawl directories after a crawl has finished (or been aborted)
and the files have been uploaded to the archive.

All the files that are not (yet) stored in the metadata-1.arc remain in these directories indefinitely (or until deleted manually).

This directory is not cleared automatically!
But it is safe to clear it at almost any time (except during the short window when a crawl directory is being moved into it, after the post-processing of a job has finished). A minimal age-based cleanup is sketched below.
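
Something along these lines would do (the path and the safety margin are assumptions - skipping recently modified directories avoids the move window mentioned above):

#!/usr/bin/env python
# Delete job directories under /oldjobs that have not been touched for a
# while, so a directory currently being moved into place is left alone.
import os
import shutil
import time

OLDJOBS = "/oldjobs"    # assumption: adjust to your installation
MIN_AGE_DAYS = 7        # arbitrary safety margin

now = time.time()
for jobdir in os.listdir(OLDJOBS):
    path = os.path.join(OLDJOBS, jobdir)
    if not os.path.isdir(path):
        continue
    age_days = (now - os.path.getmtime(path)) / 86400.0
    if age_days >= MIN_AGE_DAYS:
        print("removing %s (idle for %.1f days)" % (jobdir, age_days))
        shutil.rmtree(path)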

About clearing the broker's queue: the contents of oldjobs have nothing to do with the contents of the broker's queue.
In the broker queue you (may) have outstanding crawljobs waiting for an available harvester.

I hope this answers your question.
Søren

From: netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk [mailto:netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk] On behalf of nicolas.giraud at bnf.fr
Sent: 28 April 2009 10:55
To: netarchivesuite-users at lists.gforge.statsbiblioteket.dk
Subject: [Netarchivesuite-users] Oldjobs directory growing too big


Hi,

The oldjobs directory on my crawler machine tends to get very big. What exactly is stored in this directory - is it Heritrix instances for failed jobs? When is that folder cleared? Is it safe to clear it and then clear the broker's queue?

Cheers,
Nicolas





