[Netarchivesuite-users] Heritrix/NAS parameters, again
Peter Svanberg
Peter.Svanberg at kb.se
Thu Mar 24 01:04:53 CET 2022
I am returning to the Heritrix/NAS parameters for our broad crawl (pass 2). Could you cast a glance at this table and tell me if anything stands out as stupid or questionable?
Should we raise the in and out buffers? Does it matter?
Should we limit the size of each resource (to avoid 1 TB of nulls)? If so, to what?
Should we limit both the number and the summed size of resources?
Thanks in advance!

crawlController.maxToeThreads
    Description:     Maximum number of threads processing URIs at the same time.
    Default:         25
    In NAS examples: 50 and 100
    We plan to use:  50

crawlController.recorderOutBufferBytes
    Description:     Size in bytes of in-memory buffer to record outbound traffic
    Default:         16 KiB
    In NAS examples: 4 KiB
    We plan to use:  64 KiB

crawlController.recorderInBufferBytes
    Description:     Size in bytes of in-memory buffer to record inbound traffic
    Default:         512 KiB
    In NAS examples: 64 KiB
    We plan to use:  512 KiB

TooManyHopsDecideRule.maxHops
    Description:     max link-hop-count from start
    Default:         20
    In NAS examples: 20
    We plan to use:  20

TransclusionDecideRule.maxTransHops
    Description:     Maximum number of non-refers (non-'R') in non-'L'/'S' tail of path-from-seed
    Default:         2
    In NAS examples: 5, 10, 15 ...
    We plan to use:  5

TransclusionDecideRule.maxSpeculativeHops
    Description:     Maximum number of speculative hops ('X') in non-'L'/'S' tail of path-from-seed
    Default:         1
    In NAS examples: 1
    We plan to use:  0

PathologicalPathDecideRule.maxRepetitions
    Description:     max identical, consecutive path-segments
    Default:         2
    In NAS examples: 3
    We plan to use:  3

fetchHttp.maxLengthBytes
    Description:     fetched resource truncated after this limit (0 no limit)
    Default:         0
    In NAS examples: 0
    We plan to use:  75 % of groupMaxAllKb?

quotaenforcer.groupMaxFetchSuccesses
    Description:     max number of resources (-1 no limit)
    Default:         -1
    We plan to use:  set?

quotaenforcer.groupMaxAllKb
    Description:     max total size of resources (-1 no limit)
    Default:         -1
    We plan to use:  depending on pass

bdb.cachePercent
    Description:     Caching in BDB?
    Default:         40
    In NAS examples: 40
    We plan to use:  40

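
To put rough numbers on two of the questions above, here is a small arithmetic sketch in Python. It rests on assumptions that are not confirmed anywhere in this thread: that each toe thread holds one outbound and one inbound recorder buffer at the configured size, that the "Kb" in groupMaxAllKb means 1024 bytes, and that the quota of 1 000 000 Kb is a purely hypothetical placeholder, since the real value depends on the pass. The function names are made up for illustration.

```python
KIB = 1024  # bytes per KiB


def recorder_buffer_bytes(max_toe_threads, out_buffer_bytes, in_buffer_bytes):
    """Total in-memory recorder buffer space, assuming one out/in pair per thread."""
    return max_toe_threads * (out_buffer_bytes + in_buffer_bytes)


def max_length_bytes(group_max_all_kb, fraction=0.75, bytes_per_kb=KIB):
    """Candidate fetchHttp.maxLengthBytes as a fraction of groupMaxAllKb.

    A negative quota (-1, no limit) maps to 0, which means no length limit.
    """
    if group_max_all_kb < 0:
        return 0
    return int(group_max_all_kb * bytes_per_kb * fraction)


# Planned values from the table: 50 threads, 64 KiB out, 512 KiB in.
planned = recorder_buffer_bytes(50, 64 * KIB, 512 * KIB)
# Defaults from the table: 25 threads, 16 KiB out, 512 KiB in.
default = recorder_buffer_bytes(25, 16 * KIB, 512 * KIB)

print(planned / KIB**2)  # 28.125 (MiB)
print(default / KIB**2)  # 12.890625 (MiB)

# Hypothetical quota of 1 000 000 Kb, only to show the unit conversion.
print(max_length_bytes(1_000_000))  # 768000000 (bytes)
```

If those assumptions hold, the planned buffer settings cost about 28 MiB of heap in total, so the buffer sizes mainly affect how often large responses spill to disk rather than overall memory use. Note also that groupMaxAllKb is given in Kb while fetchHttp.maxLengthBytes is given in bytes, so the 75 % figure needs a unit conversion.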
-----
Peter Svanberg
National Library of Sweden
Phone: +46 10 709 32 78