[Netarchivesuite-users] RE Pause and inactivity of crawls

Nicchiarelli Eleonora eleonora.nicchiarelli at onb.ac.at
Thu Feb 4 12:33:57 CET 2010


Hi Sara, 

many thanks! Sorry if the answer was already on the mailing list; I did not check.

Eleonora

Eleonora Nicchiarelli Bettelli
Digital Preservation
Austrian National Library
Josefsplatz 1, 1015 Wien

Tel:  +43 1 53 410 686
Fax: +43 1 53 410 610
Web: http://www.onb.ac.at/
Mail: eleonora.nicchiarelli at onb.ac.at


> -----Original Message-----
> From: sara.aubry at bnf.fr [mailto:netarchivesuite-users-bounces at
> lists.gforge.statsbiblioteket.dk] On behalf of sara.aubry at bnf.fr
> Sent: Wednesday, 3 February 2010 18:21
> To: netarchivesuite-users at lists.gforge.statsbiblioteket.dk
> Subject: [Netarchivesuite-users] RE Pause and inactivity of crawls
> 
> Hi Eleonora,
> 
> The short answer is no.
> Here is a detailed answer from Søren, who explained to us how the
> inactivity check works.
> 
> Sara
> 
> -------
> 
> The inactivity checks are done in the HeritrixLauncher.doCrawlLoop()
> method.
> 
> Step 1: Request how many KB per second Heritrix is fetching and whether
> Heritrix is paused
> processedKBPerSec = heritrixController.getCurrentProcessedKBPerSec();
> paused = heritrixController.isPaused();
> 
> Step 2: If Heritrix is doing anything right now (processedKBPerSec > 0) or
> it is paused (paused == true), we set lastTimeReceivedData to the
> current time, thereby effectively saying we are still active in step 5
> 
> if (processedKBPerSec > 0 || paused) {
>     lastTimeReceivedData = System.currentTimeMillis();
> }
> 
> Step 3: Fetch the number of active Heritrix threads (ToeThreads) and
> the paused status
> activeToeCount = heritrixController.getActiveToeCount();
> paused = heritrixController.isPaused();
> 
> Step 4: If the number of active ToeThreads > 0 or Heritrix is paused
> (paused == true), the time we last saw Heritrix with active ToeThreads
> (lastNonZeroActiveQueuesTime) is set to the current time, thereby
> effectively saying we are still active in step 5
> 
> if (activeToeCount > 0 || paused) {
>     lastNonZeroActiveQueuesTime = System.currentTimeMillis();
> }
> 
> Step 5: Determine whether or not we should request Heritrix to stop the
> crawl.
> This is determined by the following if-clause, which - if true - sends a
> request to Heritrix to stop the crawl:
> 
> if ((lastNonZeroActiveQueuesTime + timeOutInMillis
>                   < System.currentTimeMillis())
>                  || (lastTimeReceivedData + timeOutInMillisReceivedData
>                      < System.currentTimeMillis())) {
> 
> If we are paused, this will always be false.
> 
> If we have active ToeThreads, the first part will be false.
> If Heritrix has fetched data since the last time around the loop, the
> second part will be false.
> 
> That means that the check fails whenever we have active ToeThreads and
> processedKBPerSec > 0.
> More generally, the check fails (no inactivity abort) as long as neither
> timeout has expired: not enough time has elapsed since we last saw active
> ToeThreads (defined by the setting
> HarvesterSettings.INACTIVITY_TIMEOUT_IN_SECS) and not enough time has
> elapsed since we last received data (defined by the setting
> HarvesterSettings.CRAWLER_TIMEOUT_NON_RESPONDING).
> 
> Note:
> timeOutInMillis corresponds to setting
> HarvesterSettings.INACTIVITY_TIMEOUT_IN_SECS
> timeOutInMillisReceivedData corresponds to setting
> HarvesterSettings.CRAWLER_TIMEOUT_NON_RESPONDING
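
The five steps above can be condensed into a small, self-contained sketch. This is not the NetarchiveSuite source: the class name InactivityCheckSketch, the shouldStop method and the explicit nowMillis parameter are hypothetical, introduced only so the logic can be run without Heritrix. It also assumes that both settings are given in seconds, which the thread suggests but does not state for CRAWLER_TIMEOUT_NON_RESPONDING.

```java
import java.util.concurrent.TimeUnit;

/** Illustrative sketch of the inactivity bookkeeping in steps 1-5 (hypothetical class). */
final class InactivityCheckSketch {
    // Assumed to mirror INACTIVITY_TIMEOUT_IN_SECS, converted to milliseconds.
    private final long timeOutInMillis;
    // Assumed to mirror CRAWLER_TIMEOUT_NON_RESPONDING, converted to milliseconds.
    private final long timeOutInMillisReceivedData;
    private long lastNonZeroActiveQueuesTime;
    private long lastTimeReceivedData;

    InactivityCheckSketch(long inactivityTimeoutSecs, long nonRespondingTimeoutSecs,
                          long startMillis) {
        this.timeOutInMillis = TimeUnit.SECONDS.toMillis(inactivityTimeoutSecs);
        this.timeOutInMillisReceivedData = TimeUnit.SECONDS.toMillis(nonRespondingTimeoutSecs);
        this.lastNonZeroActiveQueuesTime = startMillis;
        this.lastTimeReceivedData = startMillis;
    }

    /** One pass of the loop; returns true if the crawl should be asked to stop. */
    boolean shouldStop(long nowMillis, long processedKBPerSec, int activeToeCount,
                       boolean paused) {
        // Step 2: data is flowing, or the crawl is paused.
        if (processedKBPerSec > 0 || paused) {
            lastTimeReceivedData = nowMillis;
        }
        // Step 4: threads are active, or the crawl is paused.
        if (activeToeCount > 0 || paused) {
            lastNonZeroActiveQueuesTime = nowMillis;
        }
        // Step 5: stop only if one of the two timeouts has expired.
        return (lastNonZeroActiveQueuesTime + timeOutInMillis < nowMillis)
                || (lastTimeReceivedData + timeOutInMillisReceivedData < nowMillis);
    }

    public static void main(String[] args) {
        InactivityCheckSketch check = new InactivityCheckSketch(1800, 1800, 0L);
        long threeHours = TimeUnit.HOURS.toMillis(3);
        // Paused for three hours: both timestamps are refreshed, so no stop request.
        System.out.println(check.shouldStop(threeHours, 0, 0, true));       // false
        // Three more hours, unpaused and idle: both timeouts have expired.
        System.out.println(check.shouldStop(2 * threeHours, 0, 0, false));  // true
    }
}
```

Because a pause refreshes both timestamps on every pass, a paused job never reaches the stop request in this sketch, however long the pause lasts. The timeouts only start counting again once the job is resumed.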
> 
> I hope this doesn't confuse matters further.
> One suggestion is to bypass the inactivity abort by raising these
> timeout values significantly (e.g. to hours instead of minutes; using
> 180000 instead of 1800)
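
To make the two example values concrete, here is a quick conversion, assuming both settings are expressed in seconds (the name INACTIVITY_TIMEOUT_IN_SECS says so for one; for the other it is an assumption). The class TimeoutUnits and its describe method are hypothetical helpers, not part of NetarchiveSuite.

```java
import java.time.Duration;

/** Hypothetical helper that shows what a timeout given in seconds amounts to. */
final class TimeoutUnits {
    static String describe(long seconds) {
        Duration d = Duration.ofSeconds(seconds);
        // toMinutes() and toHours() return the whole duration in that unit, rounded down.
        return seconds + " s = " + d.toMinutes() + " min = " + d.toHours() + " h";
    }

    public static void main(String[] args) {
        System.out.println(describe(1800));    // 1800 s = 30 min = 0 h
        System.out.println(describe(180000));  // 180000 s = 3000 min = 50 h
    }
}
```

Under that assumption, 1800 is half an hour and 180000 is 50 hours.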
> 
> Regards
> Søren
> 
> 
> Message from: Nicchiarelli Eleonora <eleonora.nicchiarelli at onb.ac.at>
>                       03/02/2010 17:46
> 
> Sent by:
> <netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk>
> 
> Please reply to
> <netarchivesuite-users at lists.gforge.statsbiblioteket.dk>
> 
> 
> 
> To
> <netarchivesuite-users at lists.gforge.statsbiblioteket.dk>
> Cc
> 
> Subject
> [Netarchivesuite-users] Pause and inactivity of crawls
> 
> 
> 
> Dear all,
> 
> a question about job pausing in NAS 3.10.
> 
> Is a job that has been paused through the Heritrix interface liable to
> be killed (that is, set to failed status) once the pause has lasted
> longer than the timeout?
> 
> Our experience suggests that this is not the case, that is, that timeouts
> are disregarded for paused jobs (as it is reasonable to think that paused
> jobs are not "really" inactive). But we would like to know the "official"
> answer.
> 
> Many thanks in advance,
> 
> Eleonora
> 
> Eleonora Nicchiarelli Bettelli
> Digital Preservation
> Austrian National Library
> Josefsplatz 1, 1015 Wien
> 
> Tel:  +43 1 53 410 686
> Fax: +43 1 53 410 610
> Web: http://www.onb.ac.at/
> Mail: eleonora.nicchiarelli at onb.ac.at
> 
> 
> 
> 
> 
> _______________________________________________
> NetarchiveSuite-users mailing list
> NetarchiveSuite-users at lists.gforge.statsbiblioteket.dk
> https://lists.gforge.statsbiblioteket.dk/mailman/listinfo/netarchivesuite-users





