[Netarchivesuite-users] Inconsistency in the count of Bytes and Documents Harvested

sara.aubry at bnf.fr sara.aubry at bnf.fr
Wed Feb 16 14:31:38 CET 2011


Ok Bjarne, I just created bug #2114 (with priority 3) to keep track of
it.
We'll look at it later. The most important thing for us now is to know what
the figures actually mean.
Best,
Sara








Message from: Bjarne Andersen <bja at statsbiblioteket.dk>
                      16/02/2011 12:57

Sent by:
<netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk>

Please reply to:
<netarchivesuite-users at lists.gforge.statsbiblioteket.dk>

To:
"netarchivesuite-users at lists.gforge.statsbiblioteket.dk"
<netarchivesuite-users at lists.gforge.statsbiblioteket.dk>
Cc:

Subject:
Re: [Netarchivesuite-users] Inconsistency in the count of Bytes and
Documents Harvested



In my opinion this is a bug. Both figures should be counted the same way,
either with or without duplicates.

Or maybe a better solution would be to report both the object count and the
byte count twice, once including and once excluding duplicates. That would
give 4 figures:
 - number of bytes downloaded from the web-servers
 - number of objects downloaded from the web-servers
 - number of bytes stored in the archive
 - number of objects stored in the archive

That would also allow the interface to present these two figures
(calculated from the 4 above):
 - number of bytes discarded through deduplication
 - number of objects discarded through deduplication

If one wanted to do calculations on the latter two (deduplication) figures,
maybe they should be stored in the history database as well.
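
The four proposed figures and the two derived from them can be sketched as follows. This is only an illustration under stated assumptions: the class, field, and function names are hypothetical and not part of NetarchiveSuite, and each entry is assumed to be already reduced to a (size in bytes, is duplicate) pair with made-up values.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class HarvestFigures:
    """The four reported figures. All names here are hypothetical."""
    bytes_downloaded: int    # bytes downloaded from the web servers
    objects_downloaded: int  # objects downloaded from the web servers
    bytes_stored: int        # bytes stored in the archive
    objects_stored: int      # objects stored in the archive

    @property
    def bytes_deduplicated(self) -> int:
        # Derived: bytes discarded through deduplication.
        return self.bytes_downloaded - self.bytes_stored

    @property
    def objects_deduplicated(self) -> int:
        # Derived: objects discarded through deduplication.
        return self.objects_downloaded - self.objects_stored


def summarise(entries: Iterable[Tuple[int, bool]]) -> HarvestFigures:
    """Build the four figures from (size_in_bytes, is_duplicate) pairs."""
    entries = list(entries)
    stored_sizes = [size for size, is_duplicate in entries if not is_duplicate]
    return HarvestFigures(
        bytes_downloaded=sum(size for size, _ in entries),
        objects_downloaded=len(entries),
        bytes_stored=sum(stored_sizes),
        objects_stored=len(stored_sizes),
    )


# Made-up sample: three fetched objects, one of them a duplicate.
figures = summarise([(1000, False), (500, True), (300, False)])
print(figures)
print(figures.bytes_deduplicated, figures.objects_deduplicated)
```

Keeping the deduplication figures as derived properties means only four values need to be recorded per job. Storing them as well, as suggested above, would trade that for simpler queries.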

best
Bjarne

________________________________________
From: netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk
[netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk] On
behalf of sara.aubry at bnf.fr [sara.aubry at bnf.fr]
Sent: 16 February 2011 11:25
To: netarchivesuite-users at lists.gforge.statsbiblioteket.dk
Subject: [Netarchivesuite-users] Inconsistency in the count of Bytes and
Documents Harvested

Hello everyone,

While running tests on the latest features Nicolas has developed, we have
noticed that the counts of
Bytes Harvested and Documents Harvested from the crawl.log that appear on
the page "Details for Job XXX" are not consistent:

- Documents Harvested = the number of URLs (i.e. lines in the crawl.log)
matching the domain pattern, including duplicates,

- Bytes Harvested = the total content size of all URLs matching
the domain pattern, excluding duplicates.
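
A small sketch of the two counts as described above, to make the mismatch concrete. It does not parse a real crawl.log: the records, the duplicate flag, and the simple host-suffix domain match are all assumptions made for illustration, and the function names are hypothetical.

```python
from urllib.parse import urlsplit

# Made-up records: (url, content_size_in_bytes, is_duplicate).
RECORDS = [
    ("http://example.org/", 1000, False),
    ("http://www.example.org/logo.png", 500, True),  # duplicate of an earlier fetch
    ("http://example.org/about", 300, False),
    ("http://other.net/", 700, False),               # different domain
]


def matches_domain(url, domain):
    """Assumed domain pattern: the host is the domain or one of its subdomains."""
    host = urlsplit(url).hostname or ""
    return host == domain or host.endswith("." + domain)


def displayed_counts(records, domain):
    """Return (documents, bytes) counted the two different ways described above."""
    matching = [r for r in records if matches_domain(r[0], domain)]
    documents = len(matching)  # duplicates included
    harvested_bytes = sum(size for _, size, dup in matching if not dup)  # duplicates excluded
    return documents, harvested_bytes


documents, bytes_harvested = displayed_counts(RECORDS, "example.org")
print(documents, bytes_harvested)
```

With this sample the document count covers three URLs while the byte total covers only two of them, so the two figures describe different sets of objects.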

Has this been done on purpose? Or is it a bug?
If it is a bug, should we fix it?

Sara



Before printing, think of the environment.

_______________________________________________
NetarchiveSuite-users mailing list
NetarchiveSuite-users at lists.gforge.statsbiblioteket.dk
https://lists.gforge.statsbiblioteket.dk/mailman/listinfo/netarchivesuite-users




