[Netarchivesuite-devel] RE SV: Analysing first stage

sara.aubry at bnf.fr sara.aubry at bnf.fr
Mon May 17 10:07:12 CEST 2010


Hi Bjarne,

There was no such visible error on the harvester servers.
None of these domains was taken into account (they were not assigned to 
any job, although their names suggest they could have been grouped into 
one job), and we are not missing any IDs or statuses in the job list.

Sara








Message from: Bjarne Andersen <bja at statsbiblioteket.dk>
13/05/2010 14:30

To: "sara.aubry at bnf.fr" <sara.aubry at bnf.fr>,
"netarchivesuite-devel at lists.gforge.statsbiblioteket.dk"
<netarchivesuite-devel at lists.gforge.statsbiblioteket.dk>
Cc:
Subject: SV: Analysing first stage



My best guess would be an error on one of the harvester servers. When 
finishing a job, NAS creates the metadata file, generates statistics based 
on the crawl.log, and sends a return message to NAS. This return message 
must have failed, for a reason I don't quite know.

Can you see if all the domains were harvested in the same job? (You can 
select from the seeds column of the job table to find jobs with those 
domains.) Are all jobs in the entire harvest reported in NAS, either as 
DONE or FAILED, or are any jobs missing a status?

best
Bjarne
________________________________________
From: sara.aubry at bnf.fr [sara.aubry at bnf.fr]
Sent: 12 May 2010 15:47
To: netarchivesuite-devel at lists.gforge.statsbiblioteket.dk; Bjarne 
Andersen
Subject: Analysing first stage

Hello everyone,

Our first stage finally completed on Monday, May 10, and lasted longer than
expected (we started on April 14).
We collected up to 1000 URLs per domain on 1 673 094 domains, using
4 063 031 seeds.

Our main figures:
- Number of jobs: 497 (among which 17 failed and were resubmitted)
- Number of crawled URLs: 294 387 919
- Number of crawled hosts: 3 234 464
- Number of ARC files: 107 173
- Size of compressed ARC files: 10.44 TB

75 424 domains reached the max object limit and will be part of the second
stage we will launch on May 17.

While QAing our first stage and cross-checking figures, we noticed that
3 496 domains (all starting with an s) were missing from the historyinfo
table and were not harvested at all.

Looking quickly at the missing domain list, we first thought that NAS
didn't like porn sites :-) (many domain names with sex, sexy...) but the
list is larger than that.
There are no errors in the GUIApplication log file, nor in the DB log
file.

This is no big deal, only 0.2% of the entire seed list, and we will just
run a patch crawl.
But have you ever had this problem?

Best,

Sara




Before printing, think about the environment.


