[Netarchivesuite-users] Troubleshooting second stage and deduplication

Bjarne Andersen bja at statsbiblioteket.dk
Fri Feb 26 08:29:16 CET 2010

Hi Sara.

I will try and answer your questions.

1) I'm not sure whether "Harvest aborted" will be included - one of the developers will have to join me on this one. 3.10.0 had a problem where jobs stopped manually through the Heritrix GUI led to all domains getting "Harvesting aborted" - that way far too many domains got into our second stage, which suggests that "Harvest aborted" does get included - but I'm not sure whether the fix changed that logic or not.

2) For a 20 TB harvest the index was around 80 GB in our last domain crawl. Scaling that down, your 3.7 TB should end up with an index of roughly 15 GB. The IndexServer will need more space than that (up to 2x the index size), since it rearranges the files within the index once in a while for optimization. So if your 70 GB partition is already 90% full, I think you have a problem. The IndexServer generates the index centrally and caches it (in compressed form). Every harvester that needs the index fetches the compressed copy from the IndexServer and unpacks it - only once per physical server, so if you run multiple instances of HarvestController (Heritrix) on one server, it will fetch the index only once and share it locally.
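As a back-of-the-envelope check, assuming the index size scales roughly linearly with harvested data volume (my assumption - NetarchiveSuite makes no such guarantee), the estimate works out like this:

```python
# Rough deduplication-index size estimate by linear scaling from the
# reference figure above (80 GB index for a 20 TB domain crawl).
# Assumption (mine): index size grows roughly linearly with harvest volume.

def estimate_index_gb(harvest_tb, ref_harvest_tb=20.0, ref_index_gb=80.0):
    """Scale the reference index size to another crawl's data volume."""
    return harvest_tb * ref_index_gb / ref_harvest_tb

index_gb = estimate_index_gb(3.7)   # about 14.8 GB for a 3.7 TB crawl
peak_gb = 2 * index_gb              # IndexServer may need up to 2x while optimizing
```

With the 2x working-space factor, the ~15 GB index could briefly need close to 30 GB on the IndexServer machine, which is why a 90%-full 70 GB partition is cutting it fine.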

3) Are you using the Derby database? By "losing connection", do you mean the examples that follow in your question 3? Jobs never reported "Done" or "Failed" in the database can have several causes. If the HarvestController crashes for some reason during the upload-and-finish phase, it might never get to send the "Job Finished" message to the GUIApplication (and the database).
 - For jobs that are still in status "started" (but not actually running) you can use the "copy the job folder back to a harvester instance and restart the HarvestController" trick. Before doing that, check whether the metadata file for that job was created correctly and uploaded, because the restart will create yet another metadata file - so if you already have a correct metadata file, just delete the new one. You need the crawl.log in the job's ".../logs/" directory, because that is what the HarvestController uses to generate the per-domain statistics that go back into the database. If crawl.log is no longer in the job directory, you need to extract it from the metadata file in your archive.
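As a small illustration of the checks described above, a sketch like the following could inspect a copied-back job folder before restarting. The layout it assumes (a "logs/crawl.log" inside the job folder and a metadata ARC file next to it) is my own guess for illustration, not actual NetarchiveSuite code:

```python
import glob
import os

def check_job_folder(job_dir):
    """Report whether crawl.log and metadata ARC files are present in a job folder.

    Hypothetical layout: <job_dir>/logs/crawl.log and <job_dir>/*metadata*.arc.
    """
    has_crawl_log = os.path.isfile(os.path.join(job_dir, "logs", "crawl.log"))
    metadata_files = sorted(glob.glob(os.path.join(job_dir, "*metadata*.arc")))
    return has_crawl_log, metadata_files
```

If crawl.log is missing, extract it from the archived metadata file first; if more than one metadata file shows up after a restart, keep the correct one and delete the extra.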
 - There is sometimes a JMX problem in communicating with Heritrix, leading to the NetarchiveSuite HarvestController trying to kill Heritrix (often before it has completely finished a job). I think this could be the majority of your failed jobs? (There should be a fail reason in the database.)
 - I see at least one of your examples stating (in the stack trace) that e.g. "jobs/current/low/1390_1267026208952/arcs is not a directory". This means that Heritrix never actually got to crawl - such a job should be restarted (perhaps after examining why Heritrix never really started - look in the Heritrix log files).
 - You also have an example of "dk.netarkivet.common.exceptions.IOFailure: Timeout waiting for reply of request for Index". This is a HarvestController that would not wait for the IndexServer (and the index) any longer. This period is configurable. For our LowPriority crawlers we use 7 days (one week), and for HighPriority we use the default of 24 hours. I think the setting might be in seconds, so you need to calculate the value yourself. For the domain crawl I think the indexing currently takes 10-12 days in total, so we should probably raise the IndexTimeout value ourselves; otherwise jobs will fail with the timeout, take new jobs, and wait another 7 days. Jobs failing with the timeout should be restarted as well.
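If the setting is indeed given in seconds (an assumption on my part - check your settings documentation), the conversion is simply:

```python
# Converting the timeout periods mentioned above into seconds,
# assuming the index-timeout setting takes its value in seconds.
LOW_PRIORITY_TIMEOUT_S = 7 * 24 * 60 * 60   # one week
HIGH_PRIORITY_TIMEOUT_S = 24 * 60 * 60      # default of 24 hours
print(LOW_PRIORITY_TIMEOUT_S, HIGH_PRIORITY_TIMEOUT_S)  # 604800 86400
```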

From: netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk [netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk] On behalf of sara.aubry at bnf.fr [sara.aubry at bnf.fr]
Sent: 24 February 2010 18:38
To: netarchivesuite-users at lists.gforge.statsbiblioteket.dk
Cc: bert.wendland at bnf.fr; PAUL.FIEVRE at bnf.fr
Subject: [Netarchivesuite-users] Troubleshooting second stage and deduplication

Hello everyone,

We just launched the second stage of our broad crawl (still testing).

At 16:30, the IndexServer started to build the deduplication index.
It is based on:
1349 jobs,
3.7 TB of data (ARC files),
87 GB of metadata (metadata ARC files).

1.4 million domains were completed in the first stage,
176 000 domains reached the max object limit,
74 000 domains have "harvest aborted" as stop reason.

1)  Will all my 176 000 + 74 000 domains be in the second stage?
Just to be sure, I think the answer is yes.

2) How much disk space do we need to store the working cache files and the
target deduplication index?
For now, we have a 70Gb partition which is 90% full...
Could you re-explain the process of creating this index: from which jobs
and for which jobs it is created,
whether it is one index or several, and where it is stored (centrally or
locally on the crawlers)?
Do you have stats on the size of your index?

3) We are running through many different errors:
- NS lost connection to the database system,
- some jobs started and are still running,

- lots of jobs are failing with the following reason:

Tr : Netarkivet error: Trouble during postprocessing of files in
Errors accumulated during the postprocessing: IOFailure occurred, while
trying to upload files
Trouble during postprocessing of files in
Errors accumulated during the postprocessing: IOFailure occurred, while
trying to upload files

dk.netarkivet.common.exceptions.IOFailure: IOFailure occurred, while
trying to upload files
Caused by: dk.netarkivet.common.exceptions.IOFailure:
jobs/current/low/1390_1267026208952/arcs is not a directory
                 ... 4 more
 - other jobs are failing with another reason:

Tr : Netarkivet error: Fatal error while operating job 'Job 1390 (state =
SUBMITTED, HD = 4, priority = LOWPRIORITY, forcemaxcount = 10000,
forcemaxbytes = -1, orderxml = default, numconfigs = 98)'
Fatal error while operating job 'Job 1390 (state = SUBMITTED, HD = 4,
priority = LOWPRIORITY, forcemaxcount = 10000, forcemaxbytes = -1,
orderxml = default, numconfigs = 98)'
dk.netarkivet.common.exceptions.IOFailure: Timeout waiting for reply of
index request for jobs

Any help would be great!


Consider the environment before printing this mail.
