[Netarchivesuite-users] RE Pause and inactivity of crawls
sara.aubry at bnf.fr
Wed Feb 3 18:20:30 CET 2010
Hi Eleonora,
The short answer is no.
Here is a detailed answer from Søren, who explained to us how the inactivity
check works.
Sara
-------
The inactivity checks are done in the HeritrixLauncher.doCrawlLoop()
method.
Step 1: Request how many KB per second Heritrix is fetching and whether
Heritrix is paused
processedKBPerSec = heritrixController.getCurrentProcessedKBPerSec();
paused = heritrixController.isPaused();
Step 2: If Heritrix is doing anything right now (processedKBPerSec > 0) or
it is paused (paused == true), we set the value lastTimeReceivedData to the
current time, thereby effectively saying we are still active in step 5
if (processedKBPerSec > 0 || paused) {
lastTimeReceivedData = System.currentTimeMillis();
}
Step 3: Fetch the number of active Heritrix threads (ToeThreads) and
the paused status
activeToeCount = heritrixController.getActiveToeCount();
paused = heritrixController.isPaused();
Step 4: If the number of active ToeThreads is > 0 or Heritrix is paused
(paused == true), the time we last saw Heritrix having active ToeThreads
(lastNonZeroActiveQueuesTime) is set to the current time, thereby effectively
saying we are still active in step 5
if (activeToeCount > 0 || paused) {
lastNonZeroActiveQueuesTime = System.currentTimeMillis();
}
Step 5: Determine whether or not we should request Heritrix to stop the
crawl.
This is determined by the following if-clause, which - if true - sends a
request to Heritrix to stop the crawl:
if ((lastNonZeroActiveQueuesTime + timeOutInMillis
        < System.currentTimeMillis())
    || (lastTimeReceivedData + timeOutInMillisReceivedData
        < System.currentTimeMillis())) {
    // request Heritrix to stop the crawl
}
If we are paused, this will always be false.
If we have active ToeThreads, the first part will be false.
If Heritrix has fetched data since the last time around the loop, the second
part will be false.
That means that, for this check to fail on a crawl that is not paused, we must
have both active ToeThreads and processedKBPerSec > 0.
The check can also fail (no inactivity abort) if not enough time has
elapsed since we last saw active ToeThreads
(defined by the setting HarvesterSettings.INACTIVITY_TIMEOUT_IN_SECS) or
not enough time has elapsed since we last received data (defined by the setting
HarvesterSettings.CRAWLER_TIMEOUT_NON_RESPONDING).
Note:
timeOutInMillis corresponds to setting
HarvesterSettings.INACTIVITY_TIMEOUT_IN_SECS
timeOutInMillisReceivedData corresponds to setting
HarvesterSettings.CRAWLER_TIMEOUT_NON_RESPONDING
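
To make the interaction of steps 1 to 5 easier to follow, here is a minimal stand-alone sketch of the logic described above. It is not the NetarchiveSuite source: the class name InactivityCheckSketch and the method pollAndCheck are hypothetical, the values that the real loop reads from heritrixController and System.currentTimeMillis() are passed in as plain parameters, and the type of processedKBPerSec is assumed to be long.

```java
/** Hypothetical model of the inactivity check; not actual NetarchiveSuite code. */
final class InactivityCheckSketch {
    private long lastNonZeroActiveQueuesTime;
    private long lastTimeReceivedData;
    private final long timeOutInMillis;
    private final long timeOutInMillisReceivedData;

    InactivityCheckSketch(long now, long timeOutInMillis,
                          long timeOutInMillisReceivedData) {
        // Assumption: both timestamps start at the moment the loop starts.
        this.lastNonZeroActiveQueuesTime = now;
        this.lastTimeReceivedData = now;
        this.timeOutInMillis = timeOutInMillis;
        this.timeOutInMillisReceivedData = timeOutInMillisReceivedData;
    }

    /** One pass of the loop; returns true if a stop would be requested. */
    boolean pollAndCheck(long now, long processedKBPerSec,
                         int activeToeCount, boolean paused) {
        // Step 2: data is flowing, or the crawl is paused.
        if (processedKBPerSec > 0 || paused) {
            lastTimeReceivedData = now;
        }
        // Step 4: ToeThreads are active, or the crawl is paused.
        if (activeToeCount > 0 || paused) {
            lastNonZeroActiveQueuesTime = now;
        }
        // Step 5: stop if either timestamp is older than its timeout.
        return (lastNonZeroActiveQueuesTime + timeOutInMillis < now)
            || (lastTimeReceivedData + timeOutInMillisReceivedData < now);
    }

    public static void main(String[] args) {
        // Both timeouts set to 1000 ms purely for illustration.
        InactivityCheckSketch check = new InactivityCheckSketch(0L, 1000L, 1000L);
        // Paused long after start: both timestamps are refreshed.
        System.out.println(check.pollAndCheck(5000L, 0, 0, true));   // false
        // Idle for 500 ms after the pause: still within the timeout.
        System.out.println(check.pollAndCheck(5500L, 0, 0, false));  // false
        // Idle for 2000 ms after the pause: timeout exceeded.
        System.out.println(check.pollAndCheck(7000L, 0, 0, false));  // true
        // Data arrives but no ToeThreads are active: first part still true.
        System.out.println(check.pollAndCheck(7000L, 5, 0, false));  // true
    }
}
```

The first call shows why a paused job is never stopped by this check: every pass while paused resets both timestamps, so the timeouts only start counting again once the pause ends.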
I hope this doesn't confuse matters further.
One suggestion would be to bypass the inactivity abort by raising these
timeout values significantly (e.g. to hours instead of minutes, using
180000 instead of 1800).
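
As a quick check of what these two values mean, the arithmetic can be sketched as follows. This assumes that both settings are expressed in seconds and are converted to milliseconds before the comparison in step 5; the class name TimeoutArithmetic and the helper toMillis are made up for this illustration.

```java
import java.time.Duration;

/** Hypothetical helper showing what the two timeout values amount to. */
final class TimeoutArithmetic {
    /** Converts a timeout given in seconds to milliseconds. */
    static long toMillis(long timeoutInSecs) {
        return Duration.ofSeconds(timeoutInSecs).toMillis();
    }

    public static void main(String[] args) {
        Duration current = Duration.ofSeconds(1800);
        Duration raised = Duration.ofSeconds(180000);
        // prints "30 minutes = 1800000 ms"
        System.out.println(current.toMinutes() + " minutes = " + toMillis(1800) + " ms");
        // prints "50 hours = 180000000 ms"
        System.out.println(raised.toHours() + " hours = " + toMillis(180000) + " ms");
    }
}
```

Under that assumption, 1800 corresponds to 30 minutes and 180000 to 50 hours.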
Regards
Søren
Message from: Nicchiarelli Eleonora <eleonora.nicchiarelli at onb.ac.at>
03/02/2010 17:46
Sent by:
<netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk>
Please reply to
<netarchivesuite-users at lists.gforge.statsbiblioteket.dk>
To
<netarchivesuite-users at lists.gforge.statsbiblioteket.dk>
Cc
Subject
[Netarchivesuite-users] Pause and inactivity of crawls
Dear all,
a question about job pausing in NAS 3.10.
Is a job that has been paused through the Heritrix interface liable to be
killed (that is, put into failed status) once the pause has lasted longer
than the timeout?
Our experience suggests that this is not the case, that is, that timeouts
are disregarded for paused jobs (as it is reasonable to think that paused
jobs are not "really" inactive). But we would like to know the "official"
answer.
Many thanks in advance,
Eleonora
Eleonora Nicchiarelli Bettelli
Digital Preservation
Austrian National Library
Josefsplatz 1, 1015 Wien
Tel: +43 1 53 410 686
Fax: +43 1 53 410 610
Web: http://www.onb.ac.at/
Mail: eleonora.nicchiarelli at onb.ac.at