[Netarchivesuite-users] production and maintenance questions

Bjarne Andersen bja at statsbiblioteket.dk
Thu Feb 18 09:18:11 CET 2010

Hi Sara.

I'm sorry to hear that you are running into problems, although domain crawling can be troublesome.
I'm not aware that any of our jobs are missing Heritrix reports, although we don't formally use all of them, so we have never checked. We should definitely look into this; it is very important. I'm not aware of any way to set the time spent waiting for Heritrix to finish. I think the time was raised to 5 minutes in the last release (developers, please confirm!). This should maybe be a setting in the configuration.
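
Checking whether jobs are missing reports could be scripted. The following is only a minimal sketch in Python, not part of NetarchiveSuite. It assumes that each finished job has its own directory and that the reports use the usual Heritrix 1.x file names; both the directory layout and the function name `missing_reports` are hypothetical and should be adapted to the local installation.

```python
from pathlib import Path

# Assumption: report file names as written by Heritrix 1.x at the end of a crawl.
# Adjust this tuple if your Heritrix version or order file differs.
EXPECTED_REPORTS = (
    "crawl-report.txt",
    "hosts-report.txt",
    "mimetype-report.txt",
    "responsecode-report.txt",
    "seeds-report.txt",
    "processors-report.txt",
    "frontier-report.txt",
)


def missing_reports(job_dir, expected=EXPECTED_REPORTS):
    """Return the expected report names that are absent or empty in job_dir."""
    job_dir = Path(job_dir)
    missing = []
    for name in expected:
        report = job_dir / name
        # A zero-byte file is treated as missing: Heritrix was probably
        # stopped before it finished writing the report.
        if not report.is_file() or report.stat().st_size == 0:
            missing.append(name)
    return missing
```

Calling `missing_reports` on each job directory and printing the non-empty results would give a list of jobs whose reports need to be regenerated.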

Do you run very large jobs? (We normally don't exceed 2 million URIs in one job, and don't seem to have this problem, I hope.)

There is a deprecated module that does part of the task of creating a metadata file:

But I'm not sure it is usable out of the box. I think it was developed in the early days, when we had trouble with CDX files missing from the metadata files.

I think the workaround for (2) would be to do as you suggested (make_reports.pl), put the job back into a harvester directory and restart the harvester. That should actually create the metadata file again. So if you delete the "empty" metadata file in your archive (or rename it, or back it up), the new one (including all the reports) should be generated and uploaded.
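
Renaming the empty metadata file rather than deleting it keeps a copy in case the regeneration fails. A minimal sketch in Python, under the assumption that the file can simply be renamed in place in the archive directory; the function name `back_up_metadata_file` and the `.bak` suffix are hypothetical and not part of NetarchiveSuite.

```python
from pathlib import Path


def back_up_metadata_file(path, suffix=".bak"):
    """Rename the given file by appending suffix, and return the new path.

    Refuses to overwrite an existing backup, so an earlier copy is never lost.
    """
    src = Path(path)
    dst = src.with_name(src.name + suffix)
    if dst.exists():
        raise FileExistsError(f"backup already exists: {dst}")
    src.rename(dst)
    return dst
```

The path passed in would be the location of the "empty" metadata file in the local archive; after the rename, the harvester can be restarted so that the new file is generated under the original name.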

From: netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk [netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk] On behalf of sara.aubry at bnf.fr [sara.aubry at bnf.fr]
Sent: 17 February 2010 17:02
To: netarchivesuite-users at lists.gforge.statsbiblioteket.dk
Cc: bert.wendland at bnf.fr; PAUL.FIEVRE at bnf.fr
Subject: [Netarchivesuite-users] production and maintenance questions

Dear all,

We are still going through our large test crawl and facing other problems.

1) We noticed that for many jobs (we don't have the exact figure yet, but
it is still many),
we are missing Heritrix report files, which are important to us because we
are using them for stats.
Our crawl engineer saw that the HarvestController does not leave Heritrix
much time to compile its
reports after a job finishes, and shuts Heritrix down very rapidly. Is there
a way to stretch this time out?

2) One of our HarvestControllers went down just after a job finished. Once
again, Heritrix did not have
time to create the reports, so we used the make_reports.pl script
which comes with the
Heritrix package to create them from the crawl.log:
- Is there a way or a script in the NS package to create the metadata ARC?
- How should we transfer the ARC files? We are using the local ARC
implementation. We looked at the upload.sh script but it doesn't look like
it is going to update the database.
- Can we manually restart our HarvestController?

Thanks for your help!

