[Netarchivesuite-users] Some questions

Bjarne Andersen bja at statsbiblioteket.dk
Tue Nov 24 15:17:47 CET 2009


Did you perhaps use the Heritrix GUI to stop the jobs showing "Harvesting aborted"? We had a similar error in our last snapshot harvest, with many jobs acting this way. That led to far too many jobs turning up in the 2nd stage of the harvest: all domains with stop reason "Harvesting aborted" were included, whether or not they had reached the 10 MB limit.

I hope (and expect) that this bug has been fixed; I know Colin has been working on exactly that problem.

The start time should definitely be updated within seconds of a harvester instance starting a job; otherwise the scheduler (part of the GUIApplication) is not receiving these messages correctly.

The stop time is updated after the last row in the history table has been inserted for each job. If no stop time ever appears, that could mean that the message from the harvester to the GUIApplication was lost (e.g. because of a lost JMS connection). That should also be much more stable in 3.10 (I heard an installation survived a JMS broker restart without trouble).

One insert every 2 seconds sounds quite slow. Is your admin machine overloaded?
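
As a rough sanity check on the figures quoted below (one insert every two seconds, a little over 200000 inserts still pending), the backlog works out to several days. A minimal Python sketch of the arithmetic; the function name is made up for illustration, and it assumes the insert rate stays constant:

```python
def estimate_backlog(pending_inserts, seconds_per_insert):
    """Return the remaining time as (seconds, hours, days).

    Hypothetical helper: assumes a constant insert rate, which a busy
    database will not necessarily maintain.
    """
    total_seconds = pending_inserts * seconds_per_insert
    return total_seconds, total_seconds / 3600, total_seconds / 86400


# Figures taken from the quoted message: 200000 inserts at 2 s each.
seconds, hours, days = estimate_backlog(200000, 2)
print(f"{seconds} s = {hours:.1f} hours = {days:.1f} days")
# prints "400000 s = 111.1 hours = 4.6 days"
```

So at the current rate the history table would need more than four and a half days to catch up.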

-
Bjarne

> -----Original Message-----
> From: netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk
> [mailto:netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk]
> On Behalf Of aponb at gmx.at
> Sent: Tuesday, November 24, 2009 2:37 PM
> To: NetarchiveSuite-users
> Subject: [Netarchivesuite-users] Some questions
> 
> Hi!
> 
> I have some questions and am curious to hear what you think about them:
> 
> - After finishing the first stage (10 MB limit) of the domain crawl, I
> have 3 jobs which have the stop reason "Harvesting aborted" instead of
> "Domain completed" for all domains under 10 MB.
> In the metadata.arc file for these jobs there is the message "CRAWL -
> Finished", which shouldn't produce the stop reason "Harvesting aborted".
> Do you have any idea why this happened?
> I am quite sure that no harvesting was aborted. Even if I replay these
> pages, they are complete and work fine. I have no problem leaving this
> (actually wrong) stop reason in place, as long as it has no impact on
> the next stage of the domain crawl. Am I right in saying that all
> domains with "Harvesting aborted" and "Domain completed" will not be
> included in the next stage?
> 
> - As already mentioned, we finished the first stage and all files have
> already been uploaded to the storage. Still, there are over 60 jobs
> (each with 4000 domains) without any state. That means that the process
> of calculating how many bytes were harvested and so on is still running.
> If I query the historyinfo table for that harvest definition, I can see
> that the row count for this table is growing very slowly. Every two
> seconds I get one insert, and there are still over 200000 inserts to
> expect. That will take a long time. Another strange thing is that for
> these jobs the submit time was e.g. 23/09/2009 and the start time
> 22/11/2009 (although the crawl itself finished many weeks ago), and the
> end time is still pending.
> So have you experienced this problem before? This happens with NAS
> 3.8.2 and Derby, running as a network service.
> 
> - As you know, we had some upload problems with our storage. With the
> upload script and the procedures for handling this, we were able to
> upload all our files (due to the delayed DB operations, information
> about upload failures is still coming in). But how do you handle this in
> NAS? A job that had an upload failure still has the status "failed".
> What do you do in that case? Do you leave the state of that job as it
> is, or do you correct the state in the database? What do you think about
> introducing a new state for a job, like "corrected" or something
> similar?
> 
> 
> Thanks again for reading and I am looking forward to hearing from you!
> a.
> 
> 



