[Netarchivesuite-devel] Analysing first stage

sara.aubry at bnf.fr sara.aubry at bnf.fr
Wed May 12 16:00:56 CEST 2010


Hi Soren,

No problem creating the domains. These 3 496 domains are in the domain 
table and contain the same information as the others.

Sara







Message from: Søren Vejrup Carlsen <svc at kb.dk> 
                      12/05/2010 15:54

Sent by: 
<netarchivesuite-devel-bounces at lists.gforge.statsbiblioteket.dk>

Please reply to 
<netarchivesuite-devel at lists.gforge.statsbiblioteket.dk>



To
"netarchivesuite-devel at lists.gforge.statsbiblioteket.dk" 
<netarchivesuite-devel at lists.gforge.statsbiblioteket.dk>
Cc

Subject
Re: [Netarchivesuite-devel] Analysing first stage



Hi Sara.
No, we haven't seen this problem here.
Did you have any problems creating the domains in NetarchiveSuite? Could it 
be that they weren't created in the first place?

Best regards
Søren

-----Original message-----
From: netarchivesuite-devel-bounces at lists.gforge.statsbiblioteket.dk 
[mailto:netarchivesuite-devel-bounces at lists.gforge.statsbiblioteket.dk] On 
behalf of sara.aubry at bnf.fr
Sent: 12 May 2010 15:48
To: netarchivesuite-devel at lists.gforge.statsbiblioteket.dk; 
bja at statsbiblioteket.dk
Subject: [Netarchivesuite-devel] Analysing first stage

Hello everyone,

Our first stage finally completed on Monday, May 10, and lasted longer than 
expected (we started on April 14).
We collected up to 1000 URLs per domain on 1 673 094 domains, using 4 063 031 seeds.

Our main figures:
- Number of jobs: 497 (among which 17 failed and were resubmitted)
- Number of crawled URLs: 294 387 919
- Number of crawled hosts: 3 234 464
- Number of ARC files: 107 173
- Size of compressed ARC files: 10.44 TB

75 424 domains reached the max object limit and will be part of the second 
stage we will launch on May 17.

While QAing our first stage and cross-checking figures, we noticed that 3 496 
domains (all starting with an s) were missing from the historyinfo table 
and had not been harvested at all.
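
A minimal sketch of this kind of cross-check, assuming the list of created 
domains and the list of domains with harvest history have each been exported 
as plain lists of names. The function name and the sample data are 
hypothetical and are not taken from NetarchiveSuite or from the actual crawl.

```python
from collections import Counter


def missing_domains(configured, harvested):
    """Return the configured domains that have no harvest history, sorted."""
    return sorted(set(configured) - set(harvested))


# Hypothetical sample data standing in for the two exported lists.
configured = ["example.fr", "sample.fr", "site.fr", "test.fr"]
harvested = ["example.fr", "test.fr"]

missing = missing_domains(configured, harvested)

# Group the missing names by first letter to spot a pattern such as
# "all starting with an s".
by_first_letter = Counter(name[0] for name in missing)

print(missing)
print(by_first_letter)
```

Grouping by first letter is only one way to look for a pattern; it does not 
explain why the domains were skipped.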

Looking quickly at the missing domain list, we first thought that NAS 
didn't like porn sites :-) (many domain names with sex, sexy...) but the 
list is larger than that.
There are no errors in the GUIApplication log file, nor in the DB log 
file.

This is no big deal, only 0.2% of the entire seed list, and we will just 
run a patch crawl.
But have you ever had this problem?
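
The 0.2% figure can be checked from the numbers quoted in this message. The 
sketch below assumes the percentage is taken relative to the 1 673 094 
domains of the first stage; the variable names are illustrative only.

```python
# Figures quoted in this message.
missing_count = 3_496
total_domains = 1_673_094

# Share of domains missing from the harvest history, in percent.
share_percent = missing_count / total_domains * 100

print(f"{share_percent:.2f}%")
```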

Best,

Sara




Before printing, think of the environment. 

_______________________________________________
Netarchivesuite-devel mailing list
Netarchivesuite-devel at lists.gforge.statsbiblioteket.dk
https://lists.gforge.statsbiblioteket.dk/mailman/listinfo/netarchivesuite-devel






