[Netarchivesuite-users] NAS/Heritrix and its webserver and network impact – ”politeness”

Peter Svanberg Peter.Svanberg at kb.se
Mon Apr 15 14:27:24 CEST 2019


No reaction on this subject?

I would also like to know how you handle the hops parameters:

org.archive.modules.deciderules.TooManyHopsDecideRule
    maxHops             NAS default 20, we use 5
org.archive.modules.deciderules.TransclusionDecideRule
    maxTransHops        NAS default 2, we use 5
    maxSpeculativeHops  NAS default 1, which we use

(I studied the Heritrix source code to document exactly what these parameters mean – I can send that if you want.)

-----

Peter Svanberg

National Library of Sweden
Phone: +46 10 709 32 78


From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On behalf of Peter Svanberg
Sent: 9 April 2019 09:51
To: netarchivesuite-users at ml.sbforge.org
Subject: [Netarchivesuite-users] NAS/Heritrix and its webserver and network impact – ”politeness”

We currently use the NAS defaults for the following parameters:

# How many multiples of last fetch elapsed time to wait before recontacting
# same server. ; Heritrix default 5.0
disposition.delayFactor=1.0

# Never wait more than this long, regardless of multiple; Heritrix default 30000
disposition.maxDelayMs=1000

# Always wait this long after one completion before recontacting same
# server, regardless of multiple; Heritrix default 3000
disposition.minDelayMs=300

# Maximum per-host bandwidth usage; Heritrix default 0 (no limit)
disposition.maxPerHostBandwidthUsageKbSec=500

As you can see, these differ from Heritrix's default values. How were they chosen?
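
Read literally, the comments on the three delay settings suggest that the wait before recontacting a server is the last fetch time multiplied by delayFactor, then clamped between minDelayMs and maxDelayMs. The following is a minimal sketch of that reading, not Heritrix's actual code: the function name and the sample fetch times are hypothetical, and it ignores the per-host bandwidth limit and any robots.txt Crawl-delay.

```python
def politeness_delay_ms(fetch_ms, delay_factor, min_delay_ms, max_delay_ms):
    """Wait time before recontacting the same server (assumed clamping model)."""
    delay = delay_factor * fetch_ms   # multiple of the last fetch's elapsed time
    delay = max(delay, min_delay_ms)  # always wait at least minDelayMs
    delay = min(delay, max_delay_ms)  # never wait more than maxDelayMs
    return int(delay)


# Values taken from the settings quoted above.
NAS = dict(delay_factor=1.0, min_delay_ms=300, max_delay_ms=1000)
HERITRIX = dict(delay_factor=5.0, min_delay_ms=3000, max_delay_ms=30000)

# Hypothetical fetch durations in milliseconds, for comparison only.
for fetch_ms in (50, 200, 2000):
    nas = politeness_delay_ms(fetch_ms, **NAS)
    heritrix = politeness_delay_ms(fetch_ms, **HERITRIX)
    print(f"fetch {fetch_ms} ms: NAS waits {nas} ms, Heritrix default waits {heritrix} ms")
```

Under this model, a server that answers in 200 ms or less is recontacted after 300 ms with the NAS values, but only after 3000 ms with the Heritrix defaults.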


The NAS defaults sometimes lead to more than two calls per second to the same server (checked in the logs; 888 calls in 419 seconds in one case).
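
As a rough check of the figures from that log, the arithmetic can be sketched as below. The implied fetch time rests on an assumption that may not hold: that requests to the server were strictly serialized in a single queue and that each wait was the 300 ms minDelayMs. The helper names are hypothetical.

```python
def calls_per_second(calls, seconds):
    """Average request rate over the observed period."""
    return calls / seconds


def mean_interval_ms(calls, seconds):
    """Average time between the starts of consecutive requests, in ms."""
    return 1000.0 * seconds / calls


calls, seconds = 888, 419   # figures from the log mentioned above
min_delay_ms = 300          # disposition.minDelayMs (NAS default)

rate = calls_per_second(calls, seconds)
interval = mean_interval_ms(calls, seconds)

# Assumption: interval = fetch time + minDelayMs (one serialized queue).
implied_fetch_ms = interval - min_delay_ms

print(f"rate: {rate:.2f} calls/s")
print(f"mean interval: {interval:.0f} ms")
print(f"implied mean fetch time: {implied_fetch_ms:.0f} ms")
```

This gives about 2.12 calls per second, or roughly 472 ms between calls, which would be consistent with fetches of around 170 ms each followed by the 300 ms minimum wait.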


Concerning

parallelQueues=50   ; NAS default 1

we differ considerably from the NAS (and also the Heritrix?) default. The NAS template files only say "TODO evaluate this default" (for the value 1).

What values do you use? Has anyone done any testing with different values? Have you been criticized by site owners for using too much webserver or network resources? What are the pros and cons of many parallel queues?

(Maybe something for today's NetarchiveSuite tele-conference?)


Best regards,
-----

Peter Svanberg
Technical officer
Digital Collections Department, Newspapers, Radio and Television Division

National Library of Sweden
PO Box 5039
SE-104 51 Stockholm
Visits: Karlavägen 100, Stockholm
Phone: +46 10 709 32 78

E-mail: peter.svanberg at kb.se
Web: www.kb.se

