[Netarchivesuite-users] production and maintenance questions

Bjarne Andersen bja at statsbiblioteket.dk
Tue Mar 30 18:03:15 CEST 2010


Hi Sara.

Do you have figures for the number of missing Heritrix reports in your test crawl?

I just investigated our last full domain crawl. Out of 1154 jobs, 40 were also missing reports (except for hosts-report.txt); a couple of them had some of the reports but not all 7.
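
A count like this can be scripted. The following is only a minimal sketch: it assumes one directory per job under a common root, and the seven report file names listed are an assumption based on a typical Heritrix 1.x job (only hosts-report.txt is named in this thread). The function names are hypothetical.

```python
from pathlib import Path

# Assumed report names for a Heritrix 1.x job; adjust to your installation.
EXPECTED_REPORTS = (
    "crawl-report.txt",
    "frontier-report.txt",
    "hosts-report.txt",
    "mimetype-report.txt",
    "processors-report.txt",
    "responsecode-report.txt",
    "seeds-report.txt",
)


def missing_reports(job_dir, expected=EXPECTED_REPORTS):
    """Return the sorted list of expected report files absent from job_dir."""
    job_dir = Path(job_dir)
    return sorted(name for name in expected if not (job_dir / name).is_file())


def summarize(jobs_root, expected=EXPECTED_REPORTS):
    """Map each incomplete job directory name to its missing reports."""
    result = {}
    for job_dir in sorted(Path(jobs_root).iterdir()):
        if job_dir.is_dir():
            missing = missing_reports(job_dir, expected)
            if missing:
                result[job_dir.name] = missing
    return result
```

Calling summarize() on the (assumed) root of the job directories returns only the jobs with at least one report missing, so len() of the result gives the figure asked about.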

My suspicion, like yours, is that NAS kills Heritrix before it has actually finished generating the reports. This is a serious issue (bug). I noticed you asked some questions on the Heritrix list. Did you get any further with the problem, or find a possible solution?

-
Bjarne
________________________________________
From: netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk [netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk] On behalf of sara.aubry at bnf.fr [sara.aubry at bnf.fr]
Sent: 17 February 2010 17:02
To: netarchivesuite-users at lists.gforge.statsbiblioteket.dk
Cc: bert.wendland at bnf.fr; PAUL.FIEVRE at bnf.fr
Subject: [Netarchivesuite-users] production and maintenance questions

Dear all,

We are still going through our large test crawl and facing other problems.

1) We noticed that many jobs (we don't have the exact figure yet, but it
is a substantial number) are missing Heritrix report files, which are
important to us because we use them for statistics.
Our crawl engineer saw that the HarvestController does not leave Heritrix
much time to compile its reports after a job finishes, and shuts Heritrix
down very quickly. Is there a way to extend this timeout?
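
To illustrate the behaviour being asked for (waiting for the reports, up to a limit, before shutting the crawler down), here is a minimal sketch. It is not NetarchiveSuite code and says nothing about how the HarvestController is actually implemented; the function name, parameters and default values are all hypothetical.

```python
import time
from pathlib import Path


def wait_for_reports(job_dir, expected, timeout_s=600.0, poll_s=5.0):
    """Poll job_dir until every expected report exists and is non-empty.

    Returns the reports still missing when polling stops; an empty list
    means all were found. A non-empty file may still be in the middle of
    being written, so this is only a rough readiness check.
    """
    job_dir = Path(job_dir)
    deadline = time.monotonic() + timeout_s
    while True:
        missing = [
            name for name in expected
            if not (job_dir / name).is_file()
            or (job_dir / name).stat().st_size == 0
        ]
        if not missing or time.monotonic() >= deadline:
            return missing
        time.sleep(poll_s)
```

A caller would shut the crawler down only once the returned list is empty or the timeout has expired, and could log whatever is still missing in the latter case.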

2) One of our HarvestControllers went down just after a job finished. Once
again, Heritrix did not have time to create the reports, so we used the
make_reports.pl script that comes with the Heritrix package to create them
from the crawl.log:
- Is there a way or a script in the NetarchiveSuite package to create the
metadata ARC file?
- How should we transfer the ARC files? We are using the local ARC
repository implementation. We looked at the upload.sh script, but it does
not look like it will update the database.
- Can we restart our HarvestController manually?
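
As a rough illustration of rebuilding statistics from crawl.log, the sketch below tallies fetch status codes. It is not a replacement for make_reports.pl, and it assumes the Heritrix 1.x crawl.log layout, in which fields are whitespace-separated and the second field is the fetch status code. The function name is hypothetical.

```python
from collections import Counter


def tally_status_codes(lines):
    """Count fetch status codes found in crawl.log lines.

    Assumes the status code is the second whitespace-separated field.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        counts[fields[1]] += 1
    return counts
```

Passing an open file works directly, for example tally_status_codes(open("crawl.log", encoding="utf-8", errors="replace")), since the function only iterates over lines.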

Thanks for your help!

Sara




