[Netarchivesuite-users] Heritrix/NAS parameters, again

Peter Svanberg Peter.Svanberg at kb.se
Thu Mar 24 01:04:53 CET 2022


I am returning to the Heritrix/NAS parameters, this time for the broad crawl (pass 2). Could you glance at the table below and tell me if anything stands out as stupid or questionable?

Should we raise the in and out recorder buffers? Does it matter?
Should we limit the size of each resource (to avoid 1 TB of nulls)? If so, to what?
Should we limit both the number and the total size of resources?

Thanks in advance!

Name                                      | Description                                                                    | Default | In NAS examples | We plan to use
crawlController.maxToeThreads             | Maximum number of threads processing URIs at the same time.                    | 25      | 50 and 100      | 50
crawlController.recorderOutBufferBytes    | Size in bytes of in-memory buffer to record outbound traffic                   | 16 KiB  | 4 KiB           | 64 KiB
crawlController.recorderInBufferBytes     | Size in bytes of in-memory buffer to record inbound traffic                    | 512 KiB | 64 KiB          | 512 KiB
TooManyHopsDecideRule.maxHops             | max link-hop-count from start                                                  | 20      | 20              | 20
TransclusionDecideRule.maxTransHops       | Maximum number of non-refers (non-'R') in non-'L'/'S' tail of path-from-seed   | 2       | 5, 10, 15 ...   | 5
TransclusionDecideRule.maxSpeculativeHops | Maximum number of speculative hops ('X') in non-'L'/'S' tail of path-from-seed | 1       | 1               | 0
PathologicalPathDecideRule.maxRepetitions | max identical, consecutive path-segments                                       | 2       | 3               | 3
fetchHttp.maxLengthBytes                  | fetched resource truncated after this limit (0 no limit)                       | 0       | 0               | 75 % of groupMaxAllKb?
quotaenforcer.groupMaxFetchSuccesses      | max number of resources (-1 no limit)                                          | -1      |                 | set?
quotaenforcer.groupMaxAllKb               | max total size of resources (-1 no limit)                                      | -1      |                 | depending on pass
bdb.cachePercent                          | Caching in BDB?                                                                | 40      | 40              | 40


-----
Peter Svanberg

National Library of Sweden
Phone: +46 10 709 32 78

