[Netarchivesuite-users] Some questions

aponb at gmx.at
Tue Nov 24 17:57:43 CET 2009


Thanks Bjarne for your answer!

We had to interact with Heritrix in many jobs. For that we pause the crawler, delete the unnecessary seeds, and then continue, so that the crawler ends normally. All these domains get correct status messages like "Domain completed" or "max byte limit reached". Only in 3 jobs (out of over 200), all domains under the limit have the status "Harvesting aborted".
I also think it was a problem with communication or something similar; I just believe the status is wrong. And I also want to prevent these domains from being crawled again in the second stage.
So I think it should help if I change the status in the DB; then these domains should not be included in the 2nd stage. Is there anything else to consider, or is only this status used for the evaluation?
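
To make it concrete, something like this is what I have in mind (only a sketch, via JDBC: the table and column names historyinfo, stopreason, bytecount, job_id and the numeric StopReason codes are guesses that I would verify against the real NAS schema and enum before running anything):

// Sketch only -- table/column names and numeric StopReason codes are
// assumptions to be checked against the actual NAS harvest DB first.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class FixStopReason {
    public static void main(String[] args) throws Exception {
        // placeholder JDBC URL and credentials -- adapt to your harvest DB
        Connection con = DriverManager.getConnection(
                "jdbc:derby://localhost:1527/fullhddb", "user", "password");
        PreparedStatement ps = con.prepareStatement(
                "UPDATE historyinfo SET stopreason = ? "
              + "WHERE job_id = ? AND stopreason = ? AND bytecount < ?");
        ps.setInt(1, 0);          // assumed code for DOWNLOAD_COMPLETE
        ps.setLong(2, 1234L);     // one of the 3 affected job IDs (placeholder)
        ps.setInt(3, 4);          // assumed code for DOWNLOAD_UNFINISHED
        ps.setLong(4, 10485760L); // only domains under the 10 MB limit
        System.out.println("Updated " + ps.executeUpdate() + " rows");
        con.close();
    }
}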

For the last 60 jobs we do not even have a start time. But they are all receiving one, slowly but surely :) I don't know why. Do you think that could be a result of manipulating the JMS MONITOR queue? (We still have to delete messages from this queue, otherwise it would overflow.)
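
For reference, this is roughly how we inspect and drain it, with plain JMS 1.1 (the connection factory lookup and the queue name "MONITOR" are placeholders for whatever the broker configuration really defines):

// Plain JMS sketch: count the backlog, then drain it. Consumed
// messages are gone for good, so browse and count before draining.
import java.util.Enumeration;
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageConsumer;
import javax.jms.Queue;
import javax.jms.QueueBrowser;
import javax.jms.Session;
import javax.naming.InitialContext;

public class DrainMonitorQueue {
    public static void main(String[] args) throws Exception {
        InitialContext ctx = new InitialContext(); // expects jndi.properties on the classpath
        ConnectionFactory factory =
                (ConnectionFactory) ctx.lookup("ConnectionFactory"); // placeholder JNDI name
        Connection con = factory.createConnection();
        con.start();
        Session session = con.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Queue queue = session.createQueue("MONITOR"); // placeholder queue name

        // First browse without consuming, to see the size of the backlog:
        QueueBrowser browser = session.createBrowser(queue);
        int count = 0;
        for (Enumeration<?> e = browser.getEnumeration(); e.hasMoreElements(); e.nextElement()) {
            count++;
        }
        System.out.println("Messages waiting: " + count);

        // Then consume with a short timeout until the queue is empty:
        MessageConsumer consumer = session.createConsumer(queue);
        while (consumer.receive(1000) != null) {
            // discard
        }
        con.close();
    }
}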

Well, we now have to wait for the last jobs. Then we can change the NAS version. And we are also thinking about changing back to MySQL, since the "lost connection" problem is solved.

Coming back to my 3rd question: what do you think about an additional status, let's say "corrected", for corrected jobs - for example when the upload failed but the upload script later manages to copy the files to the storage?

Regards
a.


> Did you perhaps use the Heritrix GUI to stop the jobs showing "harvesting aborted"? We had a similar error in our last snapshot, with many jobs acting this way. That led to way too many jobs turning up in the 2nd stage of the harvest - all domains with stop-reason "Harvesting aborted", no matter whether they reached the 10 MB limit or not.
>
> I hope (and expect) that bug to have been fixed - I know Colin has been working on exactly that problem.
>
> The start time should definitely update seconds after a harvester instance starts a job - otherwise the scheduler (part of the GUIApplication) is not receiving these messages correctly.
>
> Stop time is updated after the last row in the history table is inserted for each job. If no stop time ever appears, that could mean the message from the harvester to the GUIApplication was lost (e.g. because of a lost JMS connection) - that should also be much more stable in 3.10 (I heard an installation survived a JMS broker restart without trouble).
>
> One insert every 2 seconds sounds quite slow - is your admin machine overloaded?
>
> -
> Bjarne




