[Netarchivesuite-users] Statistic data

Bjarne Andersen netarkivet at statsbiblioteket.dk
Fri Jul 4 09:03:40 CEST 2008


Good question.

Try looking at the statistics on the job page instead (I assume this is only one job). You will see that the database (for good reasons) 
only has statistics for domains known to the system (and included in the job), so objects (and bytes) harvested from other domains (e.g. 
inline material) are not counted in the database, precisely because they come from other domains that may be unknown to the system and are 
at least unknown to the job.
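
To see how large the gap is, one can sum the quoted mimetype report and subtract the figures from the job page. This is only a rough sketch: the function name sum_report is made up for illustration, and the numbers are copied by hand from the mail quoted below. If the explanation above accounts for the whole difference, what is left over should be the material harvested from domains outside the job.

```python
# Rows copied from the mimetype-report.txt excerpt quoted in this thread.
REPORT = """\
[#urls] [#bytes] [mime-types]
233 476507 text/html
117 222747 image/gif
32 761181 image/jpeg
28 12714 text/plain
10 1710 text/dns
5 9443796 application/x-shockwave-flash
3 12175 application/x-javascript
3 28404 application/xml
3 15087 text/css
2 424 no-type
1 1691 image/x-icon
1 1299 video/x-ms-asf
"""


def sum_report(text):
    """Return (total_urls, total_bytes) for a mimetype report (hypothetical helper)."""
    total_urls = 0
    total_bytes = 0
    for line in text.splitlines():
        parts = line.split()
        # Skip the header line and anything that is not "<urls> <bytes> <mime-type>".
        if len(parts) != 3 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue
        total_urls += int(parts[0])
        total_bytes += int(parts[1])
    return total_urls, total_bytes


report_urls, report_bytes = sum_report(REPORT)

# Figures shown on the job page for orf.at (from the quoted mail).
db_documents, db_bytes = 401, 10961205

extra_documents = report_urls - db_documents
extra_bytes = report_bytes - db_bytes
print(f"report: {report_urls} urls, {report_bytes} bytes")
print(f"not in database: {extra_documents} documents, {extra_bytes} bytes")
```

With the quoted numbers this leaves 37 documents and 16,530 bytes that the report counts but the database does not.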

We have talked about registering inline material on the domain it is inlined in, precisely to fix this "problem". It could go into the same 
figures per domain, or into a separate set of figures (e.g. called "inline material"), so that each domain has two sets of figures per job.
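
The second variant (two sets of figures per domain) could look roughly like the sketch below. Everything in it is an assumption made for illustration: the Figures class, the tally function, and the record layout (object domain, domain it was inlined in, size) are hypothetical and do not describe how NetarchiveSuite stores or would store its statistics. The sample records are made up.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Figures:
    """One set of per-domain figures for a job (hypothetical structure)."""
    objects: int = 0
    byte_count: int = 0

    def add(self, size):
        self.objects += 1
        self.byte_count += size


def tally(records, job_domains):
    """Split harvested objects into 'own' and 'inline material' figures.

    records: iterable of (object_domain, inlined_in_domain, size_in_bytes),
    where inlined_in_domain may be None. This layout is an assumption.
    """
    own = defaultdict(Figures)
    inline = defaultdict(Figures)
    for object_domain, inlined_in, size in records:
        if object_domain in job_domains:
            # Object belongs to a domain included in the job.
            own[object_domain].add(size)
        elif inlined_in in job_domains:
            # Object is from another domain; register it on the domain it is inlined in.
            inline[inlined_in].add(size)
        # Anything else stays uncounted, as it is today.
    return own, inline


# Made-up sample data, not real harvest output.
sample = [
    ("orf.at", None, 100),
    ("example.net", "orf.at", 40),
    ("example.org", None, 5),
]
own, inline = tally(sample, job_domains={"orf.at"})
print(own["orf.at"], inline["orf.at"])
```

Keeping the inline figures in a separate set means the existing per-domain numbers do not change meaning, while the sum of both sets would come closer to the totals in the crawl reports.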

best
-- 
Bjarne Andersen
Daily Manager - netarchive.dk

State & University Library
Universitetsparken
DK-8000 Aarhus C
T: +45 89462165 - C: +45 25662353
CVR/SE 10100682 - EAN 5798000791084
http://netarchive.dk

aponb at gmx.at wrote:
> I compared the statistics shown in the job details in the web GUI, which 
> come from the database (Bytes Harvested and Documents Harvested), with 
> the data listed after the line 
> metadata://netarkivet.dk/crawl/reports/mimetype-report.txt in the 
> metadata.arc file of this job. I thought the sum of bytes in this 
> section should be the same as what the web GUI shows under "Bytes 
> Harvested". The same for #urls vs. Documents Harvested.
> For example
> [#urls] [#bytes] [mime-types]
> 233 476507 text/html
> 117 222747 image/gif
> 32 761181 image/jpeg
> 28 12714 text/plain
> 10 1710 text/dns
> 5 9443796 application/x-shockwave-flash
> 3 12175 application/x-javascript
> 3 28404 application/xml
> 3 15087 text/css
> 2 424 no-type
> 1 1691 image/x-icon
> 1 1299 video/x-ms-asf
> 
> 438 	10977735
> 
> 
> 
> vs.
> 
> Domain   Configuration   Bytes Harvested   Documents Harvested   Stopped due to
> orf.at   alle4Stunden    10,961,205        401                   Domain Completed
> 
> 
> 
> So they are not far apart, but shouldn't the numbers be exactly the same?
> 
> 


