[Netarchivesuite-users] High NAS load again at broad crawl -- hints?
Peter Svanberg
Peter.Svanberg at kb.se
Fri Sep 19 14:55:20 CEST 2025
In our last broad crawl in 2024, with NAS 7.5, we had serious problems with high load on the harvesting servers, making jobs crash. A large share of those were due to NullPointerException places in the code, which were fixed in 7.6.
Our first crawl in 2025 with NAS 7.6 then went very well, with none of the above and much lower load on average. Nice, we thought: the 7.6 version and the reduction of the number of instances per harvest server made everything run smoothly. But no ...
The first pass of our next broad crawl has now returned to high load and crash behavior, although not quite as bad as in 2024. There is at least one more NullPointerException place to fix, we have learned, but it doesn't seem to happen as often as the previous ones. But there were also other crashes that I associate with high load:
dk.netarkivet.common.exceptions.IOFailure: Port 8213 already in use, or port is out of range
dk.netarkivet.common.exceptions.IOFailure: Heritrix3 could not be shut down
Connect to kw3-harvester12.kb.se:8243 [kw3-harvester12.kb.se/193.10.72.194] failed: Connection refused
All parameters and limits are the same between the earlier, smooth pass 1 and this less smooth pass 1, except for some minor differences in the templates. The only thing I can think of that could influence anything is some changes in the crawler trap regexes: values of type
<value>.*action=buy_now.*</value>
changed to
<value>https?://[^/]+/.*action=buy_now.*</value>
and another regex was made a bit more complicated. But I doubt that this could have that much of an impact.
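One way to check that doubt is to run both variants against a handful of URIs and compare what they match and how long they take. The following is only a minimal sketch: the sample URIs are made up, whole-string matching is an assumption about how the crawler trap values are applied, and Python's regex engine is not the Java one used by Heritrix, so the timings are indicative at best.

```python
import re
import timeit

# Old and new crawler trap patterns, copied from the <value> elements above.
OLD = re.compile(r".*action=buy_now.*")
NEW = re.compile(r"https?://[^/]+/.*action=buy_now.*")

# Hypothetical sample URIs, made up for illustration only.
SAMPLES = [
    "https://shop.example.com/cart?action=buy_now&id=42",
    "http://example.org/a/b/c?x=1&action=buy_now",
    "https://example.com/products/list?page=2",
    "https://example.com/" + "a/" * 200 + "index.html",
]


def compare(uri):
    """Return (old_matches, new_matches) using whole-string matching.

    Whole-string matching is an assumption about how the trap list is applied.
    """
    return (OLD.fullmatch(uri) is not None, NEW.fullmatch(uri) is not None)


def time_pattern(pattern, uri, number=2000):
    """Return the seconds taken by `number` whole-string match attempts."""
    return timeit.timeit(lambda: pattern.fullmatch(uri), number=number)


if __name__ == "__main__":
    for uri in SAMPLES:
        old_hit, new_hit = compare(uri)
        print(f"old={old_hit!s:5} new={new_hit!s:5} "
              f"old_t={time_pattern(OLD, uri):.4f}s "
              f"new_t={time_pattern(NEW, uri):.4f}s {uri[:60]}")
```

In this sketch the two variants agree on ordinary http/https URIs with a path. They differ for URIs with another scheme, or with no slash after the host, which only the old variant matches.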
Do you have any tips or hints?
Regards,
Peter Svanberg
Technical officer
Legal Deposit and Metadata Department
Digital Material Legal Deposit Unit
National Library of Sweden
PO Box 5039, SE-102 41 Stockholm
Visits: Karlavägen 96, Stockholm
+46 10-709 32 78
Peter.Svanberg at kb.se
www.kb.se<https://www.kb.se/>