[Netarchivesuite-users] Some questions

aponb at gmx.at aponb at gmx.at
Wed Nov 25 10:54:57 CET 2009


They are looking very similar to all the other jobs:

-Status Done
-60-70 arc files with a size of 100MB
-metadata arc files of around 300MB
-The only difference is that these report sections are missing.
-2 of these 3 jobs were paused before ending in order to remove unnecessary urls.
-The other one also has CRAWL ENDING as its last message, without having been paused before.
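
To spot such jobs quickly, here is a minimal sketch that scans the text of a progress-statistics.log for status lines and reports whether "CRAWL ENDED" was ever written. It is only a standalone check and not how NetarchiveSuite itself decides the stop reason; the function names and the regular expression are my own assumptions, based on the log lines quoted below.

```python
import re

# Assumed shape of a status line, taken from the quoted log excerpts:
# "20090929144730 CRAWL ENDED - Finished"
STATUS_LINE = re.compile(r"^(\d{14}) CRAWL (\w+) - (.*)$")


def crawl_states(log_text):
    """Return the CRAWL states in the order they appear in the log text."""
    states = []
    for line in log_text.splitlines():
        match = STATUS_LINE.match(line.strip())
        if match:
            states.append(match.group(2))
    return states


def ended_normally(log_text):
    """True if the log contains a CRAWL ENDED status line."""
    return "ENDED" in crawl_states(log_text)


# Tail of one of the 3 affected jobs (no CRAWL ENDED line).
aborted_log = """\
20090930103144 CRAWL WAITING - Pausing - Waiting for threads to finish
2009-09-30T10:31:44Z      619522        2168       397118 [...]
20090930103201 CRAWL PAUSED - Paused
20090930103653 CRAWL RESUMED - Running
20090930103653 CRAWL ENDING - Finished
"""

# Tail of a job that finished as expected.
complete_log = """\
20090929143937 CRAWL PAUSED - Paused
20090929144729 CRAWL RESUMED - Running
20090929144729 CRAWL ENDING - Finished
2009-09-29T14:47:30Z      633944           0       407828 [...]
20090929144730 CRAWL ENDED - Finished
"""

print(ended_normally(aborted_log))
print(ended_normally(complete_log))
```

The first call should report False and the second True. The progress lines starting with an ISO timestamp are ignored, since only the status lines carry the CRAWL state.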

Regards
a.

> If all the reports are missing, my best guess would be that NetarchiveSuite kills the Heritrix process before it has actually finished (e.g. while Heritrix is generating reports). I know it kills the Heritrix process if it cannot communicate with Heritrix over JMX, but as far as I know it tries several times with periods of time in between, so this should not happen.
>
> Are the 3 jobs in any way different from the others, e.g. larger?
>
> -
> Bjarne
>
> > -----Original Message-----
> > From: netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk
> > [mailto:netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk]
> > On Behalf Of aponb at gmx.at
> > Sent: Wednesday, November 25, 2009 9:49 AM
> > To: netarchivesuite-users at lists.gforge.statsbiblioteket.dk
> > Subject: [Netarchivesuite-users] Some questions
> >
> > > > Hi!
> > > >
> > > > I have some questions and am curious to hear what you think
> > > > about them:
> > > >
> > > > - After finishing the first stage (border 10 MB) of the domain crawl, I
> > > > have 3 jobs which have the stopreason "Harvesting aborted" instead of
> > > > "Domain completed" for all domains under 10 MB.
> > > > In the metadata.arc file for these jobs there is the message
> > > > "CRAWL - Finished", which shouldn't produce the stopreason
> > > > "Harvesting aborted". Do you have any idea why this happened?
> > > >
> > > To be precise, it is the string "CRAWL ENDED" in progress-statistics.log
> > > that determines that a job has ended in a normal fashion. Jobs with
> > > stop reason Harvesting Aborted are not included in further
> > > harvests based on the aborted harvest AFTER NAS VERSION 3.10.
> > > (This was a feature request
> > > https://gforge.statsbiblioteket.dk/tracker/?func=detail&group_id=7&aid=1773&atid=108
> > > although really it should have been classified as a bug-fix.)
> > >
> > Thanks Colin for making this clear.
> > Indeed there is exactly this difference between these 3 jobs and
> > the others.
> > Here is the behavior in the progress-statistics.log of these 3 jobs:
> >
> > 20090930103144 CRAWL WAITING - Pausing - Waiting for threads to finish
> > 2009-09-30T10:31:44Z      619522        2168       397118 [...]
> > 2009-09-30T10:32:01Z      619522        2167       397118 [...]
> > 20090930103201 CRAWL PAUSED - Paused
> > 20090930103653 CRAWL RESUMED - Running
> > 20090930103653 CRAWL ENDING - Finished
> >
> > and here as it should be:
> >
> > 20090929143936 CRAWL WAITING - Pausing - Waiting for threads to finish
> > 2009-09-29T14:39:37Z      633944          37       407828 [...]
> > 20090929143937 CRAWL PAUSED - Paused
> > 20090929144729 CRAWL RESUMED - Running
> > 20090929144729 CRAWL ENDING - Finished
> > 2009-09-29T14:47:30Z      633944           0       407828 [...]
> > 20090929144730 CRAWL ENDED - Finished
> >
> >
> > If there is no CRAWL ENDED, then it seems that the following sections
> > are also not created in the metadata.arc file:
> > crawl-report.txt
> > frontier-report.txt
> > hosts-report.txt
> > mimetype-report.txt
> > processors-report.txt
> > responsecode-report.txt
> > seeds-report.txt
> >
> > and these files are also no longer in the oldjobs directory of that
> > harvester machine.
> >
> > Due to Colin's fix it is no longer necessary to change the stopreason
> > for these domains to have them excluded in a further stage of a full
> > harvest.
> > And it is unknown why the behavior above happens.
> >
> > Thanks and best regards
> > a.




