[Netarchivesuite-users] Inconsistency in the count of Bytes and Documents Harvested

sara.aubry at bnf.fr sara.aubry at bnf.fr
Wed Feb 16 11:25:36 CET 2011


Hello everyone,

While running tests on the latest features Nicolas has developed, we have 
noticed that the counts of 
Bytes Harvested and Documents Harvested from the crawl.log that appear on 
the page "Details for Job XXX" are not consistent:

- Documents Harvested = the number of URLs (i.e. lines in the crawl.log) 
matching the domain pattern, including duplicates,

- Bytes Harvested = the total content size of all URLs matching 
the domain pattern, excluding duplicates.
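
To make the mismatch concrete, here is a minimal Python sketch (not the 
NetarchiveSuite code) that computes both counts each way. It assumes 
whitespace-separated crawl.log lines with the size in the 3rd field, the URL 
in the 4th, and duplicates flagged by a "duplicate:" annotation in the last 
field. The function name job_totals, the host-suffix domain test and the 
sample lines are invented for illustration.

```python
from urllib.parse import urlsplit


def job_totals(crawl_log_lines, domain):
    """Count documents and bytes for one domain, with and without duplicates.

    Assumed line layout (hypothetical, whitespace-separated):
    field 3 = content size, field 4 = URL, last field = annotations,
    where a duplicate carries a "duplicate:" annotation.
    """
    totals = {"docs_all": 0, "bytes_all": 0, "docs_unique": 0, "bytes_unique": 0}
    for line in crawl_log_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        size_field, url = fields[2], fields[3]
        host = urlsplit(url).hostname or ""
        # Simplified domain match: exact host or any subdomain of it.
        if host != domain and not host.endswith("." + domain):
            continue
        size = int(size_field) if size_field.isdigit() else 0
        is_duplicate = "duplicate:" in fields[-1]
        totals["docs_all"] += 1
        totals["bytes_all"] += size
        if not is_duplicate:
            totals["docs_unique"] += 1
            totals["bytes_unique"] += size
    return totals


# Invented sample: three URLs on example.org (one a duplicate), one elsewhere.
sample = [
    "2011-02-16T10:00:00.000Z 200 1000 http://www.example.org/ - - text/html #001 - sha1:AAA - -",
    "2011-02-16T10:00:01.000Z 200 2000 http://www.example.org/a - - text/html #001 - sha1:BBB - -",
    '2011-02-16T10:00:02.000Z 200 1000 http://www.example.org/b - - text/html #001 - sha1:AAA - duplicate:"x.arc,42"',
    "2011-02-16T10:00:03.000Z 200 500 http://other.net/ - - text/html #001 - sha1:CCC - -",
]
totals = job_totals(sample, "example.org")
print(totals)
```

With this sample, the behaviour described above would show 3 documents 
(docs_all) next to 3000 bytes (bytes_unique), whereas a consistent pair would 
be either 3 documents / 4000 bytes or 2 documents / 3000 bytes.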

Has this been done on purpose? Or is it a bug?
If it is a bug, should we fix it?

Sara


