[Netarchivesuite-users] Some questions
Bjarne Andersen
bja at statsbiblioteket.dk
Wed Nov 25 09:54:50 CET 2009
If all the reports are missing, my best guess would be that NetarchiveSuite kills the Heritrix process before it has actually finished (e.g. while Heritrix is generating reports). I know it kills the Heritrix process if it cannot communicate with Heritrix over JMX, but as far as I know it retries several times with periods of time in between, so this should not happen.
Are the 3 jobs in any way different from the others, e.g. larger?
-
Bjarne
> -----Original Message-----
> From: netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk
> [mailto:netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk]
> On Behalf Of aponb at gmx.at
> Sent: Wednesday, November 25, 2009 9:49 AM
> To: netarchivesuite-users at lists.gforge.statsbiblioteket.dk
> Subject: [Netarchivesuite-users] Some questions
>
> > > Hi!
> > >
> > > I have some questions and am curious to hear what you think
> > > about them:
> > >
> > > - After finishing the first stage (border 10 MB) of the domain crawl, I
> > > have 3 jobs which have the stopreason "Harvesting aborted" instead of
> > > "Domain completed" for all domains under 10 MB.
> > > In the metadata.arc file for these jobs there is the message "CRAWL -
> > > Finished", which shouldn't produce the stopreason Harvesting aborted. Do
> > > you have any idea why this happened?
> > >
> > To be precise, it is the string "CRAWL ENDED" in progress-statistics.log
> > that determines that a job has ended in a normal fashion. Jobs with
> > stop reason Harvesting Aborted are not included in further
> > harvests based on the aborted harvest AFTER NAS VERSION 3.10.
> > (This was a feature request
> > https://gforge.statsbiblioteket.dk/tracker/?func=detail&group_id=7&aid=1773&atid=108
> > although really it should have been classified as a bug-fix.)
> >
> >
> Thanks Colin for making this clear.
> Indeed there is exactly this difference between these 3 jobs and
> the others.
> Here is the behavior in the progress-statistics.log of these 3 jobs:
>
> 20090930103144 CRAWL WAITING - Pausing - Waiting for threads to finish
> 2009-09-30T10:31:44Z 619522 2168 397118 [...]
> 2009-09-30T10:32:01Z 619522 2167 397118 [...]
> 20090930103201 CRAWL PAUSED - Paused
> 20090930103653 CRAWL RESUMED - Running
> 20090930103653 CRAWL ENDING - Finished
>
> and here as it should be:
>
> 20090929143936 CRAWL WAITING - Pausing - Waiting for threads to finish
> 2009-09-29T14:39:37Z 633944 37 407828 [...]
> 20090929143937 CRAWL PAUSED - Paused
> 20090929144729 CRAWL RESUMED - Running
> 20090929144729 CRAWL ENDING - Finished
> 2009-09-29T14:47:30Z 633944 0 407828 [...]
> 20090929144730 CRAWL ENDED - Finished
>
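A rough sketch of how the two log excerpts above could be told apart automatically. It only looks for a "CRAWL ENDED" status line in the text of progress-statistics.log, as described in the quoted reply. This is not NetarchiveSuite's own check; the function names, the regular expression and the two sample strings (copied from the excerpts above, including the elided "[...]" columns) are assumptions made for illustration.

```python
import re

# Status lines in the excerpts look like
#   20090929144730 CRAWL ENDED - Finished
# i.e. a 14-digit timestamp, the word CRAWL, a state, " - " and a message.
# The periodic statistics lines (2009-09-29T14:47:30Z ...) do not match.
STATUS_LINE = re.compile(r"^(\d{14}) CRAWL (\w+) - (.*)$")


def crawl_states(log_text):
    """Return the CRAWL states in the order they appear in the log text."""
    states = []
    for line in log_text.splitlines():
        match = STATUS_LINE.match(line.strip())
        if match:
            states.append(match.group(2))
    return states


def ended_normally(log_text):
    """True if the log text contains a CRAWL ENDED status line."""
    return "ENDED" in crawl_states(log_text)


# Sample data: the two excerpts quoted in this thread.
ABORTED_LOG = """\
20090930103144 CRAWL WAITING - Pausing - Waiting for threads to finish
2009-09-30T10:31:44Z 619522 2168 397118 [...]
2009-09-30T10:32:01Z 619522 2167 397118 [...]
20090930103201 CRAWL PAUSED - Paused
20090930103653 CRAWL RESUMED - Running
20090930103653 CRAWL ENDING - Finished
"""

COMPLETED_LOG = """\
20090929143936 CRAWL WAITING - Pausing - Waiting for threads to finish
2009-09-29T14:39:37Z 633944 37 407828 [...]
20090929143937 CRAWL PAUSED - Paused
20090929144729 CRAWL RESUMED - Running
20090929144729 CRAWL ENDING - Finished
2009-09-29T14:47:30Z 633944 0 407828 [...]
20090929144730 CRAWL ENDED - Finished
"""

if __name__ == "__main__":
    print("aborted job ended normally:  ", ended_normally(ABORTED_LOG))
    print("completed job ended normally:", ended_normally(COMPLETED_LOG))
```

Run against the two excerpts, the first log stops at ENDING and the second reaches ENDED, which is the only difference the check relies on.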
>
> If there is no CRAWL ENDED, then it seems that the following
> sections will also not be created in the metadata.arc file:
> crawl-report.txt
> frontier-report.txt
> hosts-report.txt
> mimetype-report.txt
> processors-report.txt
> responsecode-report.txt
> seeds-report.txt
>
> and these files are also no longer in the oldjobs directory of
> that harvester machine.
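A small sketch for checking which of the seven report files listed above are absent from a job directory. It assumes the reports sit directly inside the directory passed in; the actual layout under oldjobs may differ, and the function name is made up for this example.

```python
from pathlib import Path

# Report file names as listed in this thread.
REPORT_NAMES = [
    "crawl-report.txt",
    "frontier-report.txt",
    "hosts-report.txt",
    "mimetype-report.txt",
    "processors-report.txt",
    "responsecode-report.txt",
    "seeds-report.txt",
]


def missing_reports(job_dir):
    """Return the report names that are not regular files in job_dir.

    Assumption: the reports are stored directly in job_dir, not in a
    subdirectory.
    """
    job_dir = Path(job_dir)
    return [name for name in REPORT_NAMES if not (job_dir / name).is_file()]
```

For one of the three affected jobs, the expectation from the description above would be that all seven names come back as missing.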
>
> Due to Colin's fix it is no longer necessary to change the stopreason
> for these domains to have them excluded from a further stage of a
> full harvest.
> It is still unknown why the behavior above happens.
>
> Thanks and best regards
> a.
>
> _______________________________________________
> NetarchiveSuite-users mailing list
> NetarchiveSuite-users at lists.gforge.statsbiblioteket.dk
> https://lists.gforge.statsbiblioteket.dk/mailman/listinfo/netarchivesuite-users