[Netarchivesuite-users] production and maintenance questions

sara.aubry at bnf.fr sara.aubry at bnf.fr
Wed Mar 31 08:59:07 CEST 2010


Hi Bjarne,

We do not have precise figures yet (we will investigate further), but in
a sample of 30 of the 957 jobs, more than half were missing reports.
We also noticed that a host-report.txt file may exist and still be empty.

Without running a precise analysis, we found the problem big enough to
be worth fixing, and developed a patch that prevents NAS from shutting
down Heritrix before the report files have been created: after a crawl
finishes, we put each report name and size in a hash table and check
every 20 seconds whether anything has changed since the previous check,
until a configurable maximum time is reached.
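
For readers who want to see the idea, here is a minimal sketch of the
polling check described above. It is written in Python for brevity and is
not the actual patch. The function names, the "-report.txt" suffix
filter, the 600-second default maximum and the rule that an empty
snapshot never counts as stable are all assumptions made for
illustration.

```python
import os
import time


def snapshot_reports(report_dir, suffix="-report.txt"):
    """Map each report file name in report_dir to its current size in bytes."""
    sizes = {}
    for name in os.listdir(report_dir):
        path = os.path.join(report_dir, name)
        # Assumption: report files are recognised by their name suffix.
        if name.endswith(suffix) and os.path.isfile(path):
            sizes[name] = os.path.getsize(path)
    return sizes


def wait_for_stable_reports(report_dir, interval=20.0, max_wait=600.0,
                            sleep=time.sleep, clock=time.monotonic):
    """Poll until two consecutive snapshots match or max_wait elapses.

    Returns True if the reports stopped changing, False on timeout.
    The sleep and clock parameters exist only to make testing easy.
    """
    deadline = clock() + max_wait
    previous = snapshot_reports(report_dir)
    while clock() < deadline:
        sleep(interval)
        current = snapshot_reports(report_dir)
        # Assumption: an empty snapshot means nothing has been written
        # yet, so it never counts as "stable".
        if current and current == previous:
            return True
        previous = current
    return False
```

A caller would shut the crawler down only after
wait_for_stable_reports(...) returns, and would log a warning when it
returns False. Note that comparing sizes alone cannot tell a finished
empty report from one that has not been written yet.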

This is one of 12 features we have been working on, and we will commit
it as soon as we manage to press the Start button of our broad crawl.

Sara









From: Bjarne Andersen <bja at statsbiblioteket.dk>
Date: 30/03/2010 18:03
Sent by: <netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk>
Please reply to: <netarchivesuite-users at lists.gforge.statsbiblioteket.dk>
To: "netarchivesuite-users at lists.gforge.statsbiblioteket.dk"
<netarchivesuite-users at lists.gforge.statsbiblioteket.dk>
Cc: "bert.wendland at bnf.fr" <bert.wendland at bnf.fr>, "PAUL.FIEVRE at bnf.fr"
<PAUL.FIEVRE at bnf.fr>
Subject: Re: [Netarchivesuite-users] production and maintenance questions

Hi Sara.

Do you have figures for the number of missing Heritrix reports in your
test crawl?

I just investigated our last full domain crawl. Out of 1154 jobs, 40
were also missing reports (except for hosts-report.txt); a couple of
them had some of the reports but not all 7.

My suspicion, like yours, is that NAS kills Heritrix before it has
actually finished generating the reports. This is a serious issue (bug).
I noticed you asked some questions on the Heritrix list. Did you get any
further with the problem or find a possible solution?

-
Bjarne
________________________________________
From: netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk
[netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk] on
behalf of sara.aubry at bnf.fr [sara.aubry at bnf.fr]
Sent: 17 February 2010 17:02
To: netarchivesuite-users at lists.gforge.statsbiblioteket.dk
Cc: bert.wendland at bnf.fr; PAUL.FIEVRE at bnf.fr
Subject: [Netarchivesuite-users] production and maintenance questions

Dear all,

We are still going through our large test crawl and are running into
further problems.

1) For many jobs (we don't have the exact figure yet, but it is a large
number), we are missing Heritrix report files, which matter to us
because we use them for statistics. Our crawl engineer saw that the
HarvestController leaves Heritrix little time to compile its reports
after a job finishes and shuts Heritrix down very quickly. Is there a
way to extend this timeout?

2) One of our HarvestControllers went down just after a job finished.
Once again, Heritrix did not have time to create the reports, so we used
the make_reports.pl script that comes with the Heritrix package to
create them from the crawl.log:
- Is there a way or a script in the NetarchiveSuite package to create
the metadata ARC file?
- How should we transfer the ARC files? We are using the local ARC
repository implementation. We looked at the upload.sh script, but it
does not look like it updates the database.
- Can we restart our HarvestController manually?

Thanks for your help!

Sara





Consider the environment before printing this mail.

_______________________________________________
NetarchiveSuite-users mailing list
NetarchiveSuite-users at lists.gforge.statsbiblioteket.dk
https://lists.gforge.statsbiblioteket.dk/mailman/listinfo/netarchivesuite-users