[Netarchivesuite-users] Oldjobs directory growing too big

Søren Vejrup Carlsen svc at kb.dk
Tue Apr 28 11:38:45 CEST 2009


Hi Nicolas.

The oldjobs directory is the final resting place for the crawl directories after the crawl has finished or been aborted and the files have been uploaded to the archive.

 

All the files that are not (yet) stored in the metadata-1.arc remain in these directories indefinitely (or until they are deleted manually).

 

This directory is not cleared automatically!

It is, however, safe to clear it at almost any time. The one exception is while a crawl directory is being moved into oldjobs, which happens once the post-processing of the job has finished.
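
As a rough illustration of such a manual cleanup, here is a minimal Python sketch. It is not part of NetarchiveSuite. It assumes that oldjobs holds one subdirectory per finished job, and the function name clear_old_job_dirs, the min_age_days threshold and the dry_run switch are all made up for this example. Skipping recently modified directories is only a heuristic for staying clear of a crawl directory that is just being moved in.

```python
import shutil
import time
from pathlib import Path
from typing import List


def clear_old_job_dirs(oldjobs: Path, min_age_days: float = 7.0,
                       dry_run: bool = True) -> List[Path]:
    """Select (and optionally delete) job directories directly under `oldjobs`.

    Hypothetical helper, not a NetarchiveSuite API. A directory is selected
    when its own modification time is older than `min_age_days`. Note that a
    directory's mtime only changes when entries are added to or removed from
    it, so this is a heuristic, not a guarantee that nothing is in progress.
    """
    cutoff = time.time() - min_age_days * 86400
    selected = []
    for entry in sorted(oldjobs.iterdir()):
        if not entry.is_dir():
            continue  # leave stray files alone
        if entry.stat().st_mtime >= cutoff:
            continue  # too recent, possibly still being moved in
        selected.append(entry)
        if not dry_run:
            shutil.rmtree(entry)
    return selected
```

With the default dry_run=True the function only returns the directories it would remove, so the list can be checked first. Calling it again with dry_run=False deletes them.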

 

About clearing the broker's queue: the contents of the oldjobs directory have nothing to do with the contents of the broker's queue.

In the broker queue, you (may) have outstanding crawl jobs waiting for an available harvester. 

 

I hope this answers your question.

Søren

 

From: netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk [mailto:netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk] On behalf of nicolas.giraud at bnf.fr
Sent: 28 April 2009 10:55
To: netarchivesuite-users at lists.gforge.statsbiblioteket.dk
Subject: [Netarchivesuite-users] Oldjobs directory growing too big

 


Hi,

The oldjobs directory on my crawler machine tends to get very big. What exactly is stored in this directory? Is it Heritrix instances for failed jobs? When is that folder cleared? Is it safe to clear it and then clear the broker's queue?

Cheers,
Nicolas


