[Netarchivesuite-users] production and maintenance questions

sara.aubry at bnf.fr sara.aubry at bnf.fr
Thu Feb 18 16:54:57 CET 2010

Hi Bjarne,

Don't be sad! Operational problems are normal, welcome while you're still
testing, and here to give engineers some work :-)

We are running our first step with a maximum of 3,500 domains per job and a
budget of 200 URLs per domain, which makes at most 700,000 URLs per job, so
they are small jobs!
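
As a quick check of that arithmetic, here is a minimal Python sketch. The variable names are purely illustrative, and the 2 million figure is the upper bound mentioned in the quoted message below, used here only for comparison.

```python
# Per-job URL ceiling implied by the two limits quoted above.
max_domains_per_job = 3500
url_budget_per_domain = 200

max_urls_per_job = max_domains_per_job * url_budget_per_domain
print(max_urls_per_job)  # 700000

# Upper bound on job size mentioned in the quoted message (2 million URIs).
reference_job_size = 2_000_000
print(max_urls_per_job / reference_job_size)  # 0.35
```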

We will definitely check how many reports we are missing and let you know.

We also followed the second workaround (create the reports with
make_reports.pl and restart the
HarvestController): it created the metadata ARC file and uploaded the ARC

Thanks for your help!


From: Bjarne Andersen <bja at statsbiblioteket.dk>
Date: 18/02/2010 09:18
Sent by: <netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk>
Please reply to: <netarchivesuite-users at lists.gforge.statsbiblioteket.dk>
To: "netarchivesuite-users at lists.gforge.statsbiblioteket.dk"
<netarchivesuite-users at lists.gforge.statsbiblioteket.dk>
Cc: "bert.wendland at bnf.fr" <bert.wendland at bnf.fr>, "PAUL.FIEVRE at bnf.fr"
<PAUL.FIEVRE at bnf.fr>
Subject: Re: [Netarchivesuite-users] production and maintenance questions

Hi Sara.

I'm quite sad to hear that you are facing problems, although domain crawling
is potentially troublesome.
I'm not aware that any of our jobs are missing Heritrix reports, although
we don't formally use all of them, so we have never checked!! We should
definitely look into this. VERY IMPORTANT. I'm not aware of any way to set
the time spent waiting for Heritrix to finish. I think the time was raised
to 5 min in the last release (Developer, please confirm!). It should maybe
be a setting in the configuration.

Do you run very large jobs? (We normally don't exceed 2 million URIs in 1
job, and don't seem to have this problem, I hope.)

There is a deprecated module for doing part of the task creating a 

But I'm not sure it is usable out of the box. I think it was developed in
earlier days when we had trouble with missing CDX-files in the 

I think the workaround for (2) would be to do as you suggested
(make_reports.pl), put the job back into a harvester directory, and
restart the HarvestController. That should actually create the metadata
file again. So if you delete the "empty" metadata file in your archive (or
rename it or back it up), the new one (including all the reports) should be
generated and uploaded.
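
The file-handling part of that workaround could be sketched roughly as below. This is only an illustration in Python, not something shipped with NetarchiveSuite: both function names and all paths are hypothetical, and running make_reports.pl and restarting the HarvestController remain manual steps that are not covered here.

```python
import shutil
from pathlib import Path


def set_aside_metadata_file(metadata_file: Path) -> Path:
    """Rename the "empty" metadata file instead of deleting it.

    The ".bak" suffix is an arbitrary choice for this sketch.
    """
    backup = metadata_file.with_name(metadata_file.name + ".bak")
    if backup.exists():
        raise FileExistsError(f"backup already exists: {backup}")
    metadata_file.rename(backup)
    return backup


def return_job_to_harvester(job_dir: Path, harvester_dir: Path) -> Path:
    """Move a finished job directory back under a harvester directory.

    The directory layout is an assumption; adapt it to your installation.
    """
    target = harvester_dir / job_dir.name
    if target.exists():
        raise FileExistsError(f"job directory already present: {target}")
    shutil.move(str(job_dir), str(target))
    return target
```

Renaming rather than deleting keeps the old metadata file available in case the regenerated one turns out to be incomplete.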

From: netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk
[netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk] On
behalf of sara.aubry at bnf.fr [sara.aubry at bnf.fr]
Sent: 17 February 2010 17:02
To: netarchivesuite-users at lists.gforge.statsbiblioteket.dk
Cc: bert.wendland at bnf.fr; PAUL.FIEVRE at bnf.fr
Subject: [Netarchivesuite-users] production and maintenance questions

Dear all,

We are still going through our large test crawl and facing other problems.

1) We noticed that for many jobs (we don't have the exact figure yet, but
it is still many),
we are missing Heritrix report files, which are important to us because we
are using them for stats.
Our crawl engineer saw that the HarvestController does not leave Heritrix
much time to compile its
reports after a job finishes and shuts Heritrix down very rapidly. Is there
a way to extend this timeout?
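
To get an exact figure for the missing reports, something along these lines could be used. It is a minimal Python sketch under stated assumptions: the report file names listed are the ones we would expect from Heritrix 1.x and should be adjusted to what your version actually writes, the function names are hypothetical, and the sketch assumes one directory per job with the reports at its top level.

```python
from pathlib import Path

# Assumed report file names; adjust to match your Heritrix version.
EXPECTED_REPORTS = (
    "crawl-report.txt",
    "hosts-report.txt",
    "mimetype-report.txt",
    "responsecode-report.txt",
    "seeds-report.txt",
)


def missing_reports(job_dir: Path) -> list[str]:
    """Return the expected report names that are absent from one job directory."""
    return [name for name in EXPECTED_REPORTS if not (job_dir / name).is_file()]


def jobs_missing_reports(jobs_root: Path) -> dict[str, list[str]]:
    """Map each job directory name to its missing reports, skipping complete jobs."""
    result = {}
    for job_dir in sorted(p for p in jobs_root.iterdir() if p.is_dir()):
        missing = missing_reports(job_dir)
        if missing:
            result[job_dir.name] = missing
    return result
```

Calling `len(jobs_missing_reports(Path("/path/to/jobs")))` would then give the number of affected jobs, where the path is a placeholder for wherever the job directories are kept.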

2) One of our HarvestControllers went down just after a job finished. Once
again, Heritrix did not have
time to create the reports, so we used the make_reports.pl script,
which comes with the
Heritrix package, to create them from the crawl.log:
- Is there a way / a script in the NS package to create the metadata ARC
- How should we transfer the ARC files? We are using the local ARC
implementation. We looked at the upload.sh script, but it doesn't look like
it is going to update the database.
- Can we manually restart our HarvestController?

Thanks for your help!


Consider the environment before printing this mail.

NetarchiveSuite-users mailing list
NetarchiveSuite-users at lists.gforge.statsbiblioteket.dk
