[Netarchivesuite-users] Some questions

Søren Vejrup Carlsen svc at kb.dk
Tue Nov 24 14:59:30 CET 2009


Hi Andreas.
"Harvesting aborted" could indicate that the harvesting process (not necessarily Heritrix itself, but perhaps the communication with Heritrix) was aborted somehow. There may be something in the harvester logs that tells you the cause of the problem.
  
Note that in NetarchiveSuite 3.10 the communication with Heritrix is much more stable.

>Still there are over 60 Jobs without any state.
Is the job state really undefined in the database? Could it be some weird problem with having Derby as a server?

regards
Søren

-----Original message-----
From: netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk [mailto:netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk] On behalf of aponb at gmx.at
Sent: 24 November 2009 14:37
To: NetarchiveSuite-users
Subject: [Netarchivesuite-users] Some questions

Hi!

I have some questions and am curious to hear what you think about them:

- After finishing the first stage (limit of 10 MB) of the domain crawl, I
have 3 jobs that have the stop reason "Harvesting aborted" instead of
"Domain completed" for all domains under 10 MB.
In the metadata.arc file for these jobs there is the message "CRAWL -
Finished", which shouldn't produce the stop reason "Harvesting aborted".
Do you have any idea why this happened?
I am quite sure that no harvesting was aborted. Even when I replay these
pages, they are complete and work fine. I have no problem leaving this
(actually wrong) stop reason in place, as long as it has no impact on
the next stage of the domain crawl. Am I right in saying that all
domains with "Harvesting aborted" and "Domain completed" will not be
included in the next stage?

- As already mentioned, we finished the first stage and all files have
been uploaded to the storage. Still, there are over 60 jobs (each with
4000 domains) without any state. That means that the process of
calculating how many bytes were harvested and so on is still running.
If I query the historyinfo table for that harvest definition, I can see
that the row count for this table is growing very slowly: I get one
insert every two seconds, and there are still over 200000 inserts to
expect. That will take a long time. Another strange thing is that for
these jobs the submit time was, e.g., 23/09/2009 and the start time
22/11/2009 (although the crawl itself finished many weeks ago), while
the end time is still pending.
Have you experienced this problem before? This happens with NAS 3.8.2
and Derby running as a network service.
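
As a rough check of the figures quoted above (one insert every two
seconds, about 200000 inserts outstanding), the remaining time can be
estimated as below. This is only a sketch: the function name is
illustrative, and the constant insert rate is an assumption, not
something measured.

```python
from datetime import timedelta


def estimate_remaining(pending_inserts: int, seconds_per_insert: float) -> timedelta:
    """Time left if inserts keep arriving at a constant rate (assumption)."""
    return timedelta(seconds=pending_inserts * seconds_per_insert)


# Figures taken from the message; the steady rate is assumed.
remaining = estimate_remaining(pending_inserts=200_000, seconds_per_insert=2.0)
print(remaining)  # 4 days, 15:06:40
```

At that rate the bookkeeping would need roughly four and a half more
days to complete.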

- As you know, we had some upload problems with our storage. With the
upload script and the procedures for handling this, we were able to
upload all our files (due to the delayed DB operations, information
about upload failures is still coming in). But how do you handle this
in NAS? A job that has had an upload failure still has the status
"failed". What do you do in that case? Do you leave the state of that
job as it is, or do you correct the state in the database? What do you
think about introducing a new state for a job, such as "corrected" or
something like that?


Thanks again for reading and I am looking forward to hearing from you!
a.





