[Netarchivesuite-users] Some questions

Wed Nov 25 09:48:42 CET 2009

>> Hi!
>>
>> I have some questions and am curious to hear what you think about them:
>>
>> - After finishing the first stage (limit of 10 MB) of the domain crawl, I
>> have 3 jobs which have the stop reason "Harvesting aborted" instead of
>> "Domain completed" for all domains under 10 MB.
>> In the metadata.arc file for these jobs there is the message "CRAWL -
>> Finished", which shouldn't produce the stop reason "Harvesting aborted". Do
>> you have any idea why this happened?
>>
> To be precise, it is the string "CRAWL ENDED" in progress-statistics.log
> that determines that a job has ended in a normal fashion. Jobs with
> stop reason Harvesting Aborted are not included in further
> harvests based on the aborted harvest AFTER NAS VERSION 3.10.
> (This was a feature request,
> https://gforge.statsbiblioteket.dk/tracker/?func=detail&group_id=7&aid=1773&atid=108
> although really it should have been classified as a bug-fix.)
>
Thanks, Colin, for making this clear.
Indeed, this is exactly the difference between these 3 jobs and the others.
Here is the behavior in the progress-statistics.log of these 3 jobs:

20090930103144 CRAWL WAITING - Pausing - Waiting for threads to finish
2009-09-30T10:31:44Z      619522        2168       397118 [...]
2009-09-30T10:32:01Z      619522        2167       397118 [...]
20090930103201 CRAWL PAUSED - Paused
20090930103653 CRAWL RESUMED - Running
20090930103653 CRAWL ENDING - Finished

and here is how it should look:

20090929143936 CRAWL WAITING - Pausing - Waiting for threads to finish
2009-09-29T14:39:37Z      633944          37       407828 [...]
20090929143937 CRAWL PAUSED - Paused
20090929144729 CRAWL RESUMED - Running
20090929144729 CRAWL ENDING - Finished
2009-09-29T14:47:30Z      633944           0       407828 [...]
20090929144730 CRAWL ENDED - Finished
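
To compare the two cases mechanically, here is a minimal Python sketch that scans the text of a progress-statistics.log for the status lines. It assumes the status lines always have the form shown in the excerpts above (14-digit timestamp, then "CRAWL <STATE> - <message>"). The function names are hypothetical, and this is not how NetarchiveSuite itself performs the check.

```python
import re

# Assumed status line format, taken from the excerpts above, e.g.
# "20090929144730 CRAWL ENDED - Finished".
STATUS_LINE = re.compile(r"^(\d{14}) CRAWL (\w+) - (.*)$")


def crawl_states(log_text):
    """Return the CRAWL states in the order they appear in the log text."""
    states = []
    for line in log_text.splitlines():
        match = STATUS_LINE.match(line.strip())
        if match:
            states.append(match.group(2))
    return states


def ended_normally(log_text):
    """True if the log contains a 'CRAWL ENDED' status line."""
    return "ENDED" in crawl_states(log_text)
```

For the first excerpt, ended_normally() returns False because the last status line is CRAWL ENDING; for the second it returns True. The statistics lines starting with an ISO timestamp are ignored.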

If there is no CRAWL ENDED line, it seems that the following sections
are also not created in the metadata.arc file:
crawl-report.txt
frontier-report.txt
hosts-report.txt
mimetype-report.txt
processors-report.txt
responsecode-report.txt
seeds-report.txt

These files are also no longer present in the oldjobs directory of that
harvester machine.
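
A quick way to see which of these reports are missing for a given job is sketched below in Python. It assumes the report files would sit directly in the job's directory; the actual layout under oldjobs may differ, and the function name is hypothetical.

```python
from pathlib import Path

# Report names as listed above.
EXPECTED_REPORTS = [
    "crawl-report.txt",
    "frontier-report.txt",
    "hosts-report.txt",
    "mimetype-report.txt",
    "processors-report.txt",
    "responsecode-report.txt",
    "seeds-report.txt",
]


def missing_reports(job_dir):
    """Return the expected report names that are not files in job_dir.

    Assumption: the reports are stored directly inside job_dir.
    """
    job_dir = Path(job_dir)
    return [name for name in EXPECTED_REPORTS
            if not (job_dir / name).is_file()]
```

An empty result means all seven reports are present; for the 3 jobs described here, one would expect the full list back.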

Thanks to Colin's fix, it is no longer necessary to change the stop reason
for these domains in order to have them excluded from a further stage of a
full harvest.
It is still unknown why the behavior above happens.

Thanks and best regards
a.