[Netarchivesuite-users] CRAWL ENDING - Finished - Ended by operator
Kaare Fiedler Christiansen
kfc at statsbiblioteket.dk
Fri Jun 6 08:57:56 CEST 2008
On Fri, 2008-06-06 at 08:34 +0200, aponb at gmx.at wrote:
> A configuration, which will be started every four hours, brings
> sometimes in the crawler log the message
> "CRAWL ENDING - Finished - Ended by operator"
> instead of only
> "CRAWL ENDING - Finished"
>
> and in fact in these jobs, there are some pages missing, which should
> have been crawled.
>
> Do you know what's the reason for that behavior?
"Ended by operator" appears when the system itself requests that the
crawl be stopped.
This is done when a harvester has been inactive for a long period,
although there are still URLs in the queue. The amount of time before
the harvesters are stopped is defined by the two settings:
settings.harvester.harvesting.heritrix.inactivityTimeout
settings.harvester.harvesting.heritrix.noresponseTimeout
The feature was added because we often saw inactive harvesters blocking
up our queues, and thus receiving no new requests.
We run our harvesters with a timeout of half an hour (1800 seconds). If
you wish to turn the feature off, just insert a very large number.
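For illustration, the two timeouts could be set in settings.xml roughly
as follows. This is only a sketch: it assumes that the XML elements nest
in the same way as the dotted setting names, that both values are given
in seconds, and that both are set to 1800. Any namespace attributes on
the root element are omitted here, so please check it against the
settings.xml shipped with your installation.

```xml
<settings>
  <harvester>
    <harvesting>
      <heritrix>
        <!-- Assumed meaning: seconds of crawler inactivity before the job is stopped -->
        <inactivityTimeout>1800</inactivityTimeout>
        <!-- Assumed meaning: seconds without any response before the job is stopped -->
        <noresponseTimeout>1800</noresponseTimeout>
      </heritrix>
    </harvesting>
  </harvester>
</settings>
```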
Looking at our default distributed settings.xml, it seems the default
setting is as low as 100 seconds. That is a poor choice; we should
certainly raise this default!
Best,
Kåre