[Netarchivesuite-users] Heritrix status -2, automatic blocking?
Peter Svanberg
Peter.Svanberg at kb.se
Mon Dec 13 11:10:10 CET 2021
Hello!
We are at last broad crawling, and have already encountered some things we would like your hints and advice on. First, on blocking.
We get Heritrix status -2 (HTTP connect failed) quite often.
* The current host/IP seems to be blocked; the same URL works from another host.
* Often X.se works, but media.X.se gives -2.
Could it be some kind of automatic blocking, due to excessive harvesting? But it seems to start quite early in the harvesting. Sometimes an http page seems to contain https links which don't work. We will investigate more.
After a -2 status there is a 2-minute delay until the next URL is fetched (given that this is the only remaining domain not yet harvested in the job). I can't figure out why: frontier.retryDelaySeconds=30 and frontier.maxRetries=3 should give only 1.5 minutes?
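To make the arithmetic behind that expectation explicit, here is a small Python sketch. It uses a naive model that I am only assuming: each retry waits the full retry delay and nothing else adds to the wait. It is not a description of how Heritrix actually schedules retries, and the function name is made up for illustration.

```python
def expected_retry_wait(retry_delay_seconds: int, max_retries: int) -> int:
    """Total seconds spent waiting under the naive model (assumption):
    every retry waits the full delay, with no other delay in between."""
    return retry_delay_seconds * max_retries


# Values from our settings: frontier.retryDelaySeconds=30, frontier.maxRetries=3
expected = expected_retry_wait(30, 3)

# The delay we actually see after a -2 status, in seconds (2 minutes)
observed = 120

# Seconds that the naive model does not account for
unexplained = observed - expected

print(f"expected {expected} s, observed {observed} s, unexplained {unexplained} s")
```

Under this model the retries explain 90 of the observed 120 seconds, so about 30 seconds come from something else.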
Should these or other parameters be changed to speed these cases up? And/or to lower the blocking risk?
Regards,
Peter Svanberg
Technical officer
Kungliga biblioteket
Stockholm
+46 10 709 32 78
Peter.Svanberg at kb.se
www.kb.se<https://www.kb.se/>