[Netarchivesuite-users] NAS/Heritrix and its webserver and network impact – "politeness"
Peter.Svanberg at kb.se
Tue Apr 9 09:51:07 CEST 2019
We currently use the NAS defaults for the following parameters:
# How many multiples of last fetch elapsed time to wait before recontacting
# same server. ; Heritrix default 5.0
# Never wait more than this long, regardless of multiple; Heritrix default 30000
# Always wait this long after one completion before recontacting same
# server, regardless of multiple; Heritrix default 3000
# Maximum per-host bandwidth usage; Heritrix default 0 (no limit)
As you can see, these differ from Heritrix's default values. How were they chosen?
The NAS defaults sometimes lead to more than two calls per second on the same server (checked in logs; 888 calls in 419 seconds in one case).
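To make the arithmetic concrete, here is a minimal Python sketch of the politeness rule described in the comments above (a multiple of the last fetch time, bounded by a minimum and a maximum wait), together with the request rate from the log sample. The function and parameter names are made up for illustration, and the default values are the Heritrix defaults quoted above, not the NAS ones.

```python
def politeness_delay_ms(last_fetch_ms, delay_factor=5.0,
                        min_delay_ms=3000, max_delay_ms=30000):
    """Wait time before recontacting the same server.

    Sketch of the rule in the comments: delay_factor times the last
    fetch duration, never below min_delay_ms and never above
    max_delay_ms. Defaults are the quoted Heritrix defaults.
    """
    wanted = delay_factor * last_fetch_ms
    return float(min(max(wanted, min_delay_ms), max_delay_ms))


def requests_per_second(calls, seconds):
    """Average request rate over a log sample."""
    return calls / seconds


if __name__ == "__main__":
    # The case from the logs: 888 calls in 419 seconds.
    rate = requests_per_second(888, 419)
    print(f"observed rate: {rate:.2f} requests/s")
    print(f"average interval: {1000 / rate:.0f} ms")

    # Delay for a fast, a medium and a very slow fetch.
    for fetch_ms in (100, 1000, 10000):
        print(fetch_ms, "->", politeness_delay_ms(fetch_ms), "ms")
```

With the Heritrix defaults, a single queue would never recontact a server sooner than 3 seconds after a completed fetch, so a rate above two calls per second means either much lower delay settings or several queues hitting the same server.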
parallelQueues=50 ; NAS default 1
With this setting we differ considerably from the NAS (and perhaps also the Heritrix?) default. The NAS template files only say "TODO evaluate this default" (for the value 1).
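A rough way to reason about the effect of parallelQueues, assuming (this is an assumption, not something confirmed in this thread) that each parallel queue for a host applies the politeness delay independently: the per-host request rate is then bounded by the number of queues divided by the minimum delay. The helper names below are hypothetical, and the 3000 ms figure is the Heritrix default quoted above, not the NAS value.

```python
import math


def max_host_rate(parallel_queues, min_delay_ms):
    """Upper bound on requests/s to one host.

    Assumes every queue waits at least min_delay_ms between requests
    and that queues do not coordinate; fetch time is ignored, so real
    rates will be lower.
    """
    return parallel_queues * 1000.0 / min_delay_ms


def queues_needed(rate_per_s, min_delay_ms):
    """Smallest number of independent queues that could reach rate_per_s."""
    return math.ceil(rate_per_s * min_delay_ms / 1000.0)


if __name__ == "__main__":
    for queues in (1, 50):
        bound = max_host_rate(queues, 3000)
        print(f"{queues} queue(s): at most {bound:.2f} requests/s")

    # How many queues would the observed 888 calls in 419 s imply?
    print(queues_needed(888 / 419, 3000))
```

Under these assumptions, the observed rate would need at least 7 queues active against the same server if the minimum delay were 3 seconds; with a shorter minimum delay fewer queues would be enough.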
What values do you use? Has anyone done any testing with different values? Have you been criticized by site owners for using too much webserver or network resources? What are the pros and cons of many parallel queues?
(Maybe something for today's NetarchiveSuite tele-conference?)
Digital Collections Department, Newspapers, Radio and Television Division
National Library of Sweden
PO Box 5039
SE-104 51 Stockholm
Visits: Karlavägen 100, Stockholm
Phone: +46 10 709 32 78
E-mail: peter.svanberg at kb.se