[Netarchivesuite-users] Heritrix status -2, automatic blocking?

Peter Svanberg Peter.Svanberg at kb.se
Mon Dec 13 11:10:10 CET 2021


Hello!

We are at last broad crawling, and have already run into some things we would like your hints and advice on. First, blocking.

We get Heritrix status -2 (HTTP connect failed) quite often.


* The current host/IP seems to be blocked; the same URL works from another host.

* Often X.se works, but media.X.se gives -2.

Could it be some kind of automatic blocking, due to excessive harvesting? It seems to start quite early in the harvest, though. Sometimes an http page seems to contain https links which don't work. We will investigate further.
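To compare what the harvester host can reach with what another host can reach, something like the following could be run on both machines. It is only a rough sketch: the function names can_connect and probe are made up for this example, it only tests whether a TCP connection can be opened on ports 80 and 443, and it says nothing about blocking that happens at the TLS or HTTP level.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper for this sketch; not part of Heritrix or NetarchiveSuite.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS lookup failures.
        return False


def probe(hosts, ports=(80, 443)):
    """Map every (host, port) pair to whether a TCP connection succeeded."""
    return {(host, port): can_connect(host, port) for host in hosts for port in ports}


# Example call, with the placeholder names from above replaced by real hosts:
#   for (host, port), ok in probe(["X.se", "media.X.se"]).items():
#       print(host, port, "ok" if ok else "connect failed")
```

If media.X.se fails from the harvester host but succeeds from another host at the same moment, that would point towards something specific to our IP rather than a problem on the site itself.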

After a -2 status there is a 2 minute delay until the next URL is fetched (given that this is the only remaining domain not yet harvested in the job). I can't figure out why. frontier.retryDelaySeconds=30 and frontier.maxRetries=3 should only give 1.5 minutes?
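The arithmetic behind that expectation, written out. This assumes each retry simply waits the full retryDelaySeconds once, which is my reading of the settings and not something I have confirmed in Heritrix; the function name is made up for the example.

```python
def expected_retry_wait(retry_delay_seconds: int, max_retries: int) -> int:
    """Total wait in seconds, ASSUMING one full delay per retry."""
    return retry_delay_seconds * max_retries


expected = expected_retry_wait(30, 3)  # frontier.retryDelaySeconds=30, frontier.maxRetries=3
observed = 120                         # the roughly 2 minute delay seen after a -2 status
print(expected, observed - expected)   # prints: 90 30
```

So about 30 seconds are unaccounted for under this assumption.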

Should these or other parameters be changed to speed these cases up? And/or to lower the blocking risk?

Regards,

Peter Svanberg
Technical officer

Kungliga biblioteket
Stockholm
+46 10 709 32 78
Peter.Svanberg at kb.se
www.kb.se<https://www.kb.se/>


