We currently use the NAS defaults for the following parameters:

# How many multiples of last fetch elapsed time to wait before recontacting
# same server; Heritrix default 5.0

# Never wait more than this long, regardless of multiple; Heritrix default 30000

# Always wait this long after one completion before recontacting same
# server, regardless of multiple; Heritrix default 3000

# Maximum per-host bandwidth usage; Heritrix default 0 (no limit)
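
To make the interaction between the first three settings concrete, here is a minimal sketch of how I read the comments above: the wait is the multiple of the last fetch time, but never below the minimum and never above the maximum. The function and parameter names are made up for illustration, the values plugged in are the Heritrix defaults quoted above (not the NAS ones), and the per-host bandwidth limit is ignored.

```python
def politeness_delay_ms(last_fetch_ms, delay_factor=5.0,
                        min_delay_ms=3000, max_delay_ms=30000):
    """Wait time before recontacting the same server (hypothetical helper).

    Assumption: the delay is delay_factor * last fetch time, clamped to
    the range [min_delay_ms, max_delay_ms], as the comments describe.
    """
    delay = delay_factor * last_fetch_ms
    # Clamp: the minimum and maximum apply regardless of the multiple.
    return int(max(min_delay_ms, min(delay, max_delay_ms)))


# A fast 100 ms fetch is held back by the minimum delay.
print(politeness_delay_ms(100))     # 3000
# A 2 s fetch waits 5 x 2000 ms.
print(politeness_delay_ms(2000))    # 10000
# A slow 10 s fetch is capped by the maximum delay.
print(politeness_delay_ms(10000))   # 30000
```

If that reading is right, the Heritrix defaults would give at most one call every three seconds per queue.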

As you can see, these differ from Heritrix's default values. How were they chosen?

The NAS defaults sometimes lead to more than two calls per second on the same server (checked in logs; 888 calls in 419 seconds in one case).
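
For reference, the arithmetic behind that log example, as a quick check (the variable names are just for illustration; the two numbers are the ones from the logs):

```python
calls = 888      # requests to the same server, from the log
seconds = 419    # time span covered by those requests

rate = calls / seconds            # average calls per second
gap_ms = 1000 * seconds / calls   # average time between calls

print(f"{rate:.2f} calls/s, {gap_ms:.0f} ms between calls on average")
# 2.12 calls/s, 472 ms between calls on average
```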


parallelQueues=50   ; NAS default 1

Here we differ considerably from the NAS (and also the Heritrix?) default. The NAS template files only say "TODO evaluate this default" (for the value 1).

What values do you use? Has anyone done any testing with different values? Have you been criticized by site owners for using too much web server or network resources? What are the pros and cons of many parallel queues?

(Maybe something for today's NetarchiveSuite teleconference?)

