[Netarchivesuite-users] Heritrix does not stop at limit.

Bjarne Andersen netarkivet at statsbiblioteket.dk
Fri Apr 11 12:56:57 CEST 2008

The group-max-all-kb setting refers to the maximum for each queue. The way Heritrix distributes URLs into queues is set with
<string name="queue-assignment-policy"></string>

So in NetarchiveSuite this means that the limit is counted separately for each domain defined in the system. If you harvest only one domain you will 
typically go a little above the limit, because inline material is also harvested, and that can come from other domains that have their own limit of 
1.5 GB. We have talked about making it possible to also count inline material on the domain queue for the domain that the inline 
material belongs to, but that is currently only a feature request.

If you want your overall job to be limited to 1.5 GB, you should set 'max-bytes-download' in the 'crawl-order' section. In NetarchiveSuite 
that requires you to download the order template, edit it, and upload it again.
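
As a rough sketch of what the edited template could contain: the fragment below assumes a Heritrix 1.x style order.xml in which 'max-bytes-download' sits inside the <controller> element and is given in bytes. The exact placement and the surrounding elements (omitted here) are assumptions, so check them against the template you actually download.

```xml
<!-- Sketch only: placement inside <controller> is assumed from a typical
     Heritrix 1.x order.xml; attributes and sibling settings are omitted. -->
<crawl-order>
  <controller>
    <!-- Job-wide cap in bytes; 1500000000 is the 1.5 GB limit from this thread. -->
    <long name="max-bytes-download">1500000000</long>
  </controller>
</crawl-order>
```

Unlike group-max-all-kb, which is counted per queue, this limit would apply to the job as a whole, so inline material from other domains counts towards it as well.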

Bjarne Andersen
Daily Manager - netarchive.dk

State & University Library
DK-8000 Aarhus C
T: +45 89462165 - C: +45 25662353
CVR/SE 10100682 - EAN 5798000791084

Svein Yngvar Willassen wrote:
> Hi all,
> I configured a selective crawl with a limit of 1 500 000 000 bytes (1.5 
> Gb).  This limit shows up in Heritrix' admin console as group-max-all-kb 
> set to 1464844, which appears to be correct. (*1024 ~= 1.5 Gb)
> But the crawler has now run for about 24 hours, and in Heritrix admin 
> console, the amount of crawled content is reported to be 1.6 Gb. The 
> total number of bytes in the arc files is about 1 700 000 000.
> Why doesn't it stop at 1.5 Gb? Is there a difference of which content is 
> counted by the QuotaEnforcer and the size of the arc files?
> -- 
> Best Regards,
> Svein Y. Willassen
> http://willassen.blogspot.com/
