[Netarchivesuite-users] production and maintenance questions
Bjarne Andersen
bja at statsbiblioteket.dk
Wed Mar 31 10:15:31 CEST 2010
Sara, that's great news. It sounds like you were hit harder by the bug than we were. Perhaps you run bigger jobs than we do, meaning Heritrix needs more time to generate the reports.
Can't wait to have all your good stuff back in the code base.
Best
bjarne
Sent from my HTC Touch Pro
----- Original message -----
From: sara.aubry at bnf.fr <sara.aubry at bnf.fr>
Sent: 31 March 2010 08:59
To: netarchivesuite-users at lists.gforge.statsbiblioteket.dk <netarchivesuite-users at lists.gforge.statsbiblioteket.dk>
Subject: Re: [Netarchivesuite-users] production and maintenance questions
Hi Bjarne,
We do not have precise figures yet (we will investigate further), but in
a sample of 30 of the 957 jobs, more than half were missing reports.
We also noticed that a hosts-report.txt may exist but be empty.
Even without a precise analysis, we found the problem big enough
to fix, and developed a patch to prevent NAS from shutting down Heritrix
before the report files have been created: after a crawl finishes, we
put each report name and size in a hash table, then check every 20 seconds
whether anything has changed since the previous check, up to a
configurable maximum time.
This is one of 12 features we have been working on, and we will
commit them as soon as we manage to press the Start button of our broad crawl.
Sara
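
A minimal sketch of the polling approach described above, not the actual
BnF patch. The class name, method names and the "-report.txt" file filter
are hypothetical, and the extra requirement that at least one report file
exists before the reports count as stable is an assumption added here.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper; not part of NetarchiveSuite or Heritrix. */
class ReportStabilityWaiter {

    /** Records the current size of every *-report.txt file in the directory. */
    static Map<String, Long> snapshot(File crawlDir) {
        Map<String, Long> sizes = new HashMap<>();
        File[] files = crawlDir.listFiles((dir, name) -> name.endsWith("-report.txt"));
        if (files != null) {
            for (File f : files) {
                sizes.put(f.getName(), f.length());
            }
        }
        return sizes;
    }

    /**
     * Polls the report files until two consecutive snapshots are identical
     * (and at least one report exists) or until maxWaitMillis has elapsed.
     * Returns true if the reports stabilised within the time limit.
     */
    static boolean waitForStableReports(File crawlDir, long pollMillis, long maxWaitMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + maxWaitMillis;
        Map<String, Long> previous = snapshot(crawlDir);
        while (System.currentTimeMillis() < deadline) {
            Thread.sleep(pollMillis);
            Map<String, Long> current = snapshot(crawlDir);
            if (!current.isEmpty() && current.equals(previous)) {
                return true;
            }
            previous = current;
        }
        return false;
    }
}
```

With a 20-second interval this would be called as
waitForStableReports(crawlDir, 20_000L, maxWaitMillis) before Heritrix is
shut down. Note that an unchanged size does not prove a report is complete:
an empty file is also stable.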
Message from: Bjarne Andersen <bja at statsbiblioteket.dk>
30/03/2010 18:03
Sent by:
<netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk>
Please reply to
<netarchivesuite-users at lists.gforge.statsbiblioteket.dk>
To
"netarchivesuite-users at lists.gforge.statsbiblioteket.dk"
<netarchivesuite-users at lists.gforge.statsbiblioteket.dk>
Cc
"bert.wendland at bnf.fr" <bert.wendland at bnf.fr>, "PAUL.FIEVRE at bnf.fr"
<PAUL.FIEVRE at bnf.fr>
Subject
Re: [Netarchivesuite-users] production and maintenance questions
Hi Sara.
Do you have figures for the number of missing Heritrix reports in your
test crawl?
I just investigated our last full domain crawl. Out of 1154 jobs, 40
were also missing reports (except for hosts-report.txt); a
couple of them had some of the reports but not all 7.
My suspicion, like yours, is that NAS kills Heritrix before it has
actually finished generating the reports. This is a serious issue (bug). I
noticed you asked some questions on the Heritrix list. Did you get any
further with the problem or a possible solution?
-
Bjarne
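
A check like the one described above (counting jobs with missing reports)
could be sketched as follows. The seven file names are an assumption based
on the report names Heritrix 1.x normally writes and should be verified
against your own crawls. The class name, and the idea that each job's
reports sit together in one directory, are hypothetical as well.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical audit helper; not part of NetarchiveSuite or Heritrix. */
class ReportAudit {

    // Assumed names of the 7 Heritrix 1.x reports; adjust to your setup.
    static final List<String> EXPECTED_REPORTS = Arrays.asList(
            "crawl-report.txt",
            "frontier-report.txt",
            "hosts-report.txt",
            "mimetype-report.txt",
            "processors-report.txt",
            "responsecode-report.txt",
            "seeds-report.txt");

    /** Returns the expected reports that are absent or zero bytes long. */
    static List<String> missingOrEmpty(File jobDir) {
        List<String> problems = new ArrayList<>();
        for (String name : EXPECTED_REPORTS) {
            File report = new File(jobDir, name);
            if (!report.isFile() || report.length() == 0) {
                problems.add(name);
            }
        }
        return problems;
    }
}
```

Running missingOrEmpty over every job directory and counting the non-empty
results gives the number of affected jobs. Treating zero-byte files as
missing also catches reports that were created but never filled.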
________________________________________
From: netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk
[netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk] On
behalf of sara.aubry at bnf.fr [sara.aubry at bnf.fr]
Sent: 17 February 2010 17:02
To: netarchivesuite-users at lists.gforge.statsbiblioteket.dk
Cc: bert.wendland at bnf.fr; PAUL.FIEVRE at bnf.fr
Subject: [Netarchivesuite-users] production and maintenance questions
Dear all,
We are still going through our large test crawl and facing other problems.
1) We noticed that for many jobs (we don't have the exact figure yet, but
it is still many),
we are missing Heritrix report files, which are important to us because we
use them for stats.
Our crawl engineer saw that the HarvestController does not leave Heritrix
much time to compile its
reports after a job finishes, and shuts Heritrix down very quickly. Is there
a way to extend this timeout?
2) One of our HarvestControllers went down just after a job finished. Once
again, Heritrix did not have
time to create the reports, so we used the make_reports.pl script
which comes with the
Heritrix package to create them from the crawl.log:
- Is there a way or a script in the NS package to create the metadata ARC
file?
- How should we transfer the ARC files? We are using the local ARC
repository
implementation. We looked at the upload.sh script, but it doesn't look like
it is going to update the database.
- Can we restart our HarvestController manually?
Thanks for your help!
Sara
Consider the environment before printing this mail.
_______________________________________________
NetarchiveSuite-users mailing list
NetarchiveSuite-users at lists.gforge.statsbiblioteket.dk
https://lists.gforge.statsbiblioteket.dk/mailman/listinfo/netarchivesuite-users