[Netarchivesuite-users] Heritrix does not stop at limit.

Svein Yngvar Willassen svein at willassen.no
Fri Apr 11 17:27:25 CEST 2008


Thank you.

It just finished with a little more than 2.1 GB worth of arc files. It
appears this site has an unusually large amount of inline material.

Regards,

Svein


2008/4/11, Bjarne Andersen <netarkivet at statsbiblioteket.dk>:
>
> The group-max-all-kb setting is the maximum for each queue. The way
> Heritrix distributes URLs into queues is set with
> <string name="queue-assignment-policy"></string>
>
> So in NetarchiveSuite this means that the limit is counted for each domain
> defined in the system. If you harvest only one domain you will typically go a
> little above it, because inline material is also harvested, and that could come
> from other domains that have their own limit of 1.5 GB. We have talked
> about making it possible to also count inline material against the
> domain queue of the domain that the inline material belongs to, but that is
> currently only a feature request.
>
> If you want your overall job to be limited to 1.5 GB you should set
> 'max-bytes-download' in the 'crawl-order' section. In NetarchiveSuite that
> requires you to download the order template, edit it and upload it again.
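
The download/edit/upload step could be scripted roughly as below. This is only a sketch under assumptions: the sample template is hypothetical and heavily trimmed, and the element type (`long`) and its position under `controller` should be checked against your own downloaded order template. The helper name `set_max_bytes_download` is made up for illustration and is not part of NetarchiveSuite or Heritrix.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed order template (assumption); a real
# template contains many more settings and should be used instead.
SAMPLE_ORDER = """<crawl-order>
  <controller>
    <long name="max-bytes-download">0</long>
  </controller>
</crawl-order>"""


def set_max_bytes_download(order_xml: str, limit_bytes: int) -> str:
    """Return the order XML with max-bytes-download set to limit_bytes."""
    root = ET.fromstring(order_xml)
    # Look the setting up by its name attribute, wherever it is nested.
    node = root.find(".//long[@name='max-bytes-download']")
    if node is None:
        raise ValueError("no max-bytes-download element found")
    node.text = str(limit_bytes)
    return ET.tostring(root, encoding="unicode")


edited = set_max_bytes_download(SAMPLE_ORDER, 1_500_000_000)
print(edited)
```

The edited text would then be uploaded again as the order template. The lookup assumes the template's elements carry no XML namespace; if yours do, the search path needs adjusting.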
>
> best
> --
> Bjarne Andersen
> Daily Manager - netarchive.dk
>
> State & University Library
> Universitetsparken
> DK-8000 Aarhus C
> T: +45 89462165 - C: +45 25662353
> CVR/SE 10100682 - EAN 5798000791084
> http://netarchive.dk
>
> Svein Yngvar Willassen wrote:
>
> > Hi all,
> >  I configured a selective crawl with a limit of 1 500 000 000 bytes (1.5
> > GB).  This limit shows up in Heritrix' admin console as group-max-all-kb set
> > to 1464844, which appears to be correct (1464844 * 1024 ~= 1.5 GB).
> >  But the crawler has now run for about 24 hours, and in the Heritrix admin
> > console, the amount of crawled content is reported to be 1.6 GB. The total
> > number of bytes in the arc files is about 1 700 000 000.
> >  Why doesn't it stop at 1.5 GB? Is there a difference between the content
> > counted by the QuotaEnforcer and the size of the arc files?
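
The conversion mentioned above can be checked with a few lines of arithmetic. This is a sketch that assumes 1 KB = 1024 bytes and that the byte limit is rounded up to whole kilobytes; the rounding rule is a guess that happens to reproduce the value shown in the admin console, and `bytes_to_quota_kb` is a name invented for the example.

```python
def bytes_to_quota_kb(limit_bytes: int) -> int:
    """Round a byte limit up to whole kilobytes (1 KB = 1024 bytes)."""
    return (limit_bytes + 1023) // 1024


limit_bytes = 1_500_000_000
quota_kb = bytes_to_quota_kb(limit_bytes)

print(quota_kb)         # 1464844
print(quota_kb * 1024)  # 1500000256
```

So the configured value is about 1.5 GB in decimal units, and the overshoot seen in the crawl is not explained by the unit conversion.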
> >
> > --
> > Best Regards,
> >
> > Svein Y. Willassen
> > http://willassen.blogspot.com/
> >
> >
> > ------------------------------------------------------------------------
> >
> > _______________________________________________
> > NetarchiveSuite-users mailing list
> > NetarchiveSuite-users at lists.gforge.statsbiblioteket.dk
> >
> > https://lists.gforge.statsbiblioteket.dk/mailman/listinfo/netarchivesuite-users
> >
>


-- 
Best Regards,

Svein Y. Willassen
http://willassen.blogspot.com/