[Netarchivesuite-users] RE Pause and inactivity of crawls

sara.aubry at bnf.fr
Wed Feb 3 18:20:30 CET 2010


Hi Eleonora,

Short answer is no.
Here is a detailed answer from Søren, who explained to us how the 
inactivity check works.

Sara

-------

The inactivity checks are done in the HeritrixLauncher.doCrawlLoop() 
method.

Step 1: Ask Heritrix how many KB per second it is currently processing 
and whether it is paused
processedKBPerSec = heritrixController.getCurrentProcessedKBPerSec();
paused = heritrixController.isPaused();

Step 2: If Heritrix is doing anything right now (processedKBPerSec > 0) or 
it is paused (paused == true), we set the value lastTimeReceivedData to 
the current time, thereby effectively saying we are still active in step 5

if (processedKBPerSec > 0 || paused) {
    lastTimeReceivedData = System.currentTimeMillis();
}

Step 3: Fetch number of active heritrix Threads (ToeThreads) and 
information about paused status
activeToeCount = heritrixController.getActiveToeCount();
paused = heritrixController.isPaused();

Step 4: If the number of active ToeThreads is > 0 or Heritrix is paused 
(paused == true), the time we last saw Heritrix having active ToeThreads 
(lastNonZeroActiveQueuesTime) is set to the current time, thereby 
effectively saying we are still active in step 5

if (activeToeCount > 0 || paused) {
    lastNonZeroActiveQueuesTime = System.currentTimeMillis();
}

Step 5: Determine whether or not we should request Heritrix to stop the 
crawl.
This is determined by the following if-clause, which - if true - sends a 
request to Heritrix to stop the crawl:

if ((lastNonZeroActiveQueuesTime + timeOutInMillis
        < System.currentTimeMillis())
    || (lastTimeReceivedData + timeOutInMillisReceivedData
        < System.currentTimeMillis())) {
    // request Heritrix to stop the crawl
}

If we are paused, this will always be false, because both timestamps 
have just been set to the current time in steps 2 and 4.

If we have active ToeThreads, the first part will be false.
If Heritrix has fetched data since the last time around the loop, the 
second part will be false.

That means that the check is guaranteed to fail (no inactivity abort) 
when we have active ToeThreads and processedKBPerSec > 0.
The check also fails if not enough time has elapsed since we last saw 
active ToeThreads (defined by the setting 
HarvesterSettings.INACTIVITY_TIMEOUT_IN_SECS) and not enough time has 
elapsed since we last received data (defined by the setting 
HarvesterSettings.CRAWLER_TIMEOUT_NON_RESPONDING). Both conditions must 
hold, since either part of the if-clause being true triggers the stop.

Note:
timeOutInMillis corresponds to setting 
HarvesterSettings.INACTIVITY_TIMEOUT_IN_SECS
timeOutInMillisReceivedData  corresponds to setting 
HarvesterSettings.CRAWLER_TIMEOUT_NON_RESPONDING
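
The five steps above can be put together in a small stand-alone sketch. 
This is not the actual HeritrixLauncher code: the class name 
InactivityCheckSketch and the method update() are hypothetical, the type 
of processedKBPerSec is assumed to be long, and the current time is passed 
in as a parameter (instead of calling System.currentTimeMillis()) so that 
the behaviour can be tried out deterministically.

```java
/** Illustrative sketch of the inactivity decision; not NetarchiveSuite code. */
final class InactivityCheckSketch {
    // Assumed to be derived from HarvesterSettings.INACTIVITY_TIMEOUT_IN_SECS.
    private final long timeOutInMillis;
    // Assumed to be derived from HarvesterSettings.CRAWLER_TIMEOUT_NON_RESPONDING.
    private final long timeOutInMillisReceivedData;

    private long lastNonZeroActiveQueuesTime;
    private long lastTimeReceivedData;

    InactivityCheckSketch(long timeOutInMillis,
                          long timeOutInMillisReceivedData,
                          long startMillis) {
        this.timeOutInMillis = timeOutInMillis;
        this.timeOutInMillisReceivedData = timeOutInMillisReceivedData;
        // Both timestamps start at the time the crawl loop begins.
        this.lastNonZeroActiveQueuesTime = startMillis;
        this.lastTimeReceivedData = startMillis;
    }

    /**
     * One pass of the loop (steps 1 to 5). The three status values stand in
     * for what the controller reports in steps 1 and 3.
     *
     * @return true if a stop request would be sent to Heritrix
     */
    boolean update(long processedKBPerSec, int activeToeCount,
                   boolean paused, long nowMillis) {
        // Step 2: data is flowing, or the crawl is paused.
        if (processedKBPerSec > 0 || paused) {
            lastTimeReceivedData = nowMillis;
        }
        // Step 4: ToeThreads are active, or the crawl is paused.
        if (activeToeCount > 0 || paused) {
            lastNonZeroActiveQueuesTime = nowMillis;
        }
        // Step 5: either timeout having expired triggers the stop.
        return (lastNonZeroActiveQueuesTime + timeOutInMillis < nowMillis)
            || (lastTimeReceivedData + timeOutInMillisReceivedData < nowMillis);
    }
}
```

With both timeouts set to 1000 ms, calling update(0, 0, true, 5000) 
returns false however long the pause has lasted, because both timestamps 
are refreshed before the comparison. Once the crawl is resumed and stays 
idle, update(0, 0, false, 7000) returns true.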

I hope this doesn't add to the confusion.
One suggestion would be to bypass the inactivity abort by raising these 
timeout values significantly (e.g. to hours instead of minutes; using 
180000 instead of 1800)
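
To make the two example values concrete, here is a small conversion 
sketch. It assumes that both settings are given in seconds and are 
multiplied by 1000 to obtain timeOutInMillis and 
timeOutInMillisReceivedData; the name INACTIVITY_TIMEOUT_IN_SECS suggests 
this, but for CRAWLER_TIMEOUT_NON_RESPONDING it is only an assumption. 
The class TimeoutValues is hypothetical.

```java
import java.time.Duration;

/** Illustrative conversion of timeout settings; not NetarchiveSuite code. */
final class TimeoutValues {
    /** Converts a timeout given in seconds (assumed unit) to milliseconds. */
    static long toMillis(long seconds) {
        return Duration.ofSeconds(seconds).toMillis();
    }

    public static void main(String[] args) {
        for (long secs : new long[] {1800L, 180000L}) {
            Duration d = Duration.ofSeconds(secs);
            System.out.println(secs + " s = " + d.toMinutes() + " min = "
                    + d.toHours() + " h = " + toMillis(secs) + " ms");
        }
    }
}
```

Under that assumption, 1800 corresponds to 30 minutes and 180000 to 50 
hours.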
 
Regards 
Søren









From: Nicchiarelli Eleonora <eleonora.nicchiarelli at onb.ac.at>
Date: 03/02/2010 17:46
Sent by: <netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk>
Reply to: <netarchivesuite-users at lists.gforge.statsbiblioteket.dk>
To: <netarchivesuite-users at lists.gforge.statsbiblioteket.dk>
Cc:
Subject: [Netarchivesuite-users] Pause and inactivity of crawls



Dear all, 

a question about job pausing in NAS 3.10.

Is a job that has been paused through the Heritrix interface liable to 
be killed (that is, set to failed status) once the pause has lasted 
longer than the timeout? 

Our experience suggests that this is not the case, that is, that timeouts 
are disregarded for paused jobs (as it is reasonable to think that paused 
jobs are not "really" inactive). But we would like to know the "official" 
answer. 

Many thanks in advance, 

Eleonora

Eleonora Nicchiarelli Bettelli
Digital Preservation
Austrian National Library
Josefsplatz 1, 1015 Wien

Tel:  +43 1 53 410 686
Fax: +43 1 53 410 610
Web: http://www.onb.ac.at/
Mail: eleonora.nicchiarelli at onb.ac.at




