[Netarchivesuite-users] NAS/Heritrix doesn't obey byte limits

Colin Samuel Rosenthal csr at kb.dk
Mon Mar 18 11:24:03 CET 2019

Hi Peter,

Which queueAssignmentPolicy are you using? This is defined in your crawler-beans template. We use the SeedUriDomainnameQueueAssignmentPolicy, which is coded so that in-line images are counted against the same quota as the seed URL from which they originate. This is important:

  1.  In snapshot harvests, because many domains use the same image hosting, so each of them needs its own quota; and
  2.  In selective harvests, because otherwise the image-hosting domain gets a separate quota, and you may end up a long way over your overall job quota.
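
To illustrate the difference, here is a minimal sketch in Python (not the actual Java policy code). The function names and sample URLs are made up for illustration, and it keys on the plain host name rather than the registered domain, which is a simplification. It only shows how the same fetches add up differently depending on whether bytes are charged to the seed or to the host of each fetched URI.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_of(uri):
    """Return the host part of a URI (empty string if there is none)."""
    return urlsplit(uri).hostname or ""


def queue_key(uri, seed_uri, by_seed):
    """Pick the queue (and hence the quota) that a fetched URI is charged to."""
    return host_of(seed_uri if by_seed else uri)


def bytes_per_queue(fetches, by_seed):
    """Sum fetched bytes per queue for (uri, seed_uri, size) tuples."""
    totals = defaultdict(int)
    for uri, seed_uri, size in fetches:
        totals[queue_key(uri, seed_uri, by_seed)] += size
    return dict(totals)


# Hypothetical fetches: two sites that both in-line images from img.example.
fetches = [
    ("http://a.example/index.html", "http://a.example/", 1000),
    ("http://img.example/a.png", "http://a.example/", 5000),
    ("http://b.example/index.html", "http://b.example/", 2000),
    ("http://img.example/b.png", "http://b.example/", 7000),
]

print(bytes_per_queue(fetches, by_seed=True))
print(bytes_per_queue(fetches, by_seed=False))
```

With seed-based assignment the images count towards a.example and b.example; with host-based assignment they all end up in a separate img.example queue with its own quota.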

The SeedUriDomainnameQueueAssignmentPolicy should be working in 5.4.2 (there were some further small fixes to it in 5.5).

Colin Rosenthal PhD
Senior IT Consultant
Royal Danish Library (Aarhus)

From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> on behalf of Bjarne Andersen <bja at kb.dk>
Sent: Monday, March 18, 2019 11:06 AM
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] NAS/Heritrix doesn't obey byte limits

I wonder whether the logic around limits and domains has changed at some point, so that in-lined objects (like images) are counted as belonging to a specific domain. In that case the limit will be reached not only by objects from the domain itself but, most likely sooner, by in-lined objects from other domains. I know this was a feature request in older versions of NetarchiveSuite, but I haven’t followed the development that closely in later years.

Domains going over the limit could be a result of very large objects fetched as some of the last objects from that domain (e.g. a 2Gb video-file) – the crawl.log should reveal that.
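
A quick way to check this is to list the largest objects in the crawl.log. The sketch below is in Python and assumes the usual Heritrix layout of whitespace-separated fields with the document size in the third column and the URI in the fourth; the function name and the sample lines are invented for illustration.

```python
def largest_fetches(lines, top=5):
    """Return the `top` largest (size, uri) pairs found in crawl.log lines."""
    rows = []
    for line in lines:
        fields = line.split()
        # Skip short lines and entries without a numeric size (e.g. "-").
        if len(fields) < 4 or not fields[2].isdigit():
            continue
        rows.append((int(fields[2]), fields[3]))
    return sorted(rows, reverse=True)[:top]


# Invented sample lines in the assumed crawl.log layout.
sample = [
    "2019-03-18T09:00:00.000Z 200 1500 http://a.example/ - - text/html #001 - - - -",
    "2019-03-18T09:00:01.000Z 200 2000000000 http://a.example/video.mp4 L http://a.example/ video/mp4 #002 - - - -",
    "2019-03-18T09:00:02.000Z -9998 - http://a.example/blocked L http://a.example/ unknown #003 - - - -",
]

for size, uri in largest_fetches(sample, top=2):
    print(size, uri)
```

For a real log, pass the open file object instead of the sample list, for example `largest_fetches(open("crawl.log"))`.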



From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Tue Hejlskov Larsen
Sent: Monday, March 18, 2019 10:07 AM
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] NAS/Heritrix doesn't obey byte limits

Hello Peter

What do your crawl logs tell you?

Best regards


From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Peter Svanberg
Sent: Monday, March 18, 2019 12:07 AM
To: netarchivesuite-users at ml.sbforge.org
Subject: [Netarchivesuite-users] NAS/Heritrix doesn't obey byte limits

Hello, NAS users and others!

We are experiencing a very strange behavior from NAS/Heritrix (see attached Excel file, with comments):

The harvest reports say "Stopped due to … byte/object limit reached" at very different levels: sometimes far above the limit (more than five times), sometimes far below it. We fail to see any pattern in this; it seems more or less random.

What are we doing wrong? Is it some error in the harvest template? (Attached below each table.)

Or, if it is some kind of bug, are there workarounds?

We would much appreciate any hints, as this is quite a problem for us, both for the on-going selective harvests and the upcoming big snapshot run!

(We are running version 5.4.2, I hope that it doesn’t affect this problem, as we can’t upgrade now.)

Best regards,


Peter Svanberg
Technical officer
Digital Collections Department, Newspapers, Radio and Television Division

National Library of Sweden
PO Box 5039
SE-104 51 Stockholm
Visits: Karlavägen 100, Stockholm
Phone: +46 10 709 32 78

E-mail: peter.svanberg at kb.se
Web: www.kb.se
