[Netarchivesuite-users] Strange slow non-existing-domain behavior

Colin Samuel Rosenthal csr at kb.dk
Wed May 8 10:11:50 CEST 2019


Hi Peter,


I'm not 100% sure, but I think our seed-based queue assignment policy actually ignores the value of the parallel queues parameter. There's a discussion of parallel queues on the Heritrix mailing list that might help: https://groups.yahoo.com/neo/groups/archive-crawler/conversations/messages/8788


regards,

Colin


--
Colin Rosenthal PhD
Senior IT Consultant
Royal Danish Library (Aarhus)
________________________________
From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> on behalf of Peter Svanberg <Peter.Svanberg at kb.se>
Sent: Monday, April 29, 2019 3:59:59 PM
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] Strange slow non-existing-domain behavior


Thank you so much!



I will look more closely at these files and compare with ours. But my first question is: why do you run with parallelQueues = 1? We use 50!



I have previously found the following:



I found a response on the mailing list about setting parallel queues in the queueAssignmentPolicy bean:



parallelQueues: the default value (and historical behavior) is '1'. If it is instead N, all URIs that previously went into the same single queue will go into N related queues (via a consistent hash mapping of the path portion of the URL). Each queue is considered separately for the traditional politeness rules of one connection at a time and snooze delays between fetches, so N queues mean that N fetches might proceed against one site at a time. Thus, it should only be used in an overlay setting, applied to sites that can handle multiple connections well.
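The mapping described in that passage can be illustrated with a minimal Python sketch. This is not Heritrix's actual code: the function name `queue_key`, the SHA-1 based hash, the `host+N` queue naming and the example.com URL are all assumptions made for illustration. It only shows the idea that, with N > 1, URLs on one host are spread over N sub-queues according to a stable hash of their path.

```python
import hashlib
from urllib.parse import urlsplit


def queue_key(url: str, parallel_queues: int = 1) -> str:
    """Hypothetical helper: pick a queue name for a URL.

    With parallel_queues == 1 every URL on a host shares one queue.
    With N > 1 the path is hashed so the host is split into N sub-queues.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parallel_queues <= 1:
        return host  # historical behavior: a single queue per host
    # A stable hash of the path, so the same path always lands in the same sub-queue.
    digest = hashlib.sha1(parts.path.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % parallel_queues
    return f"{host}+{bucket}"


# Illustrative URLs only (not taken from the crawl in this thread).
for u in ("http://example.com/a", "http://example.com/b", "http://example.com/c"):
    print(u, "->", queue_key(u, parallel_queues=50))
```

Because each sub-queue would get its own politeness delay, up to N fetches could hit the same host at once, which is why the quoted text recommends this only for sites known to tolerate it.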

------

https://wiki.searchtechnologies.com/index.php/Using_a_Custom_Heritrix_Configuration_File_(Aspire_2):



Remember to be careful about how many connections you should use per hostname, as it can cause problems with the site searched and it can be considered an attack by the web administrator.



But that doesn't seem to be correct now?  When I look at crawl.log, and sort on the field "started", the "politeness" parameters seem to be followed.



But maybe parallelq doesn’t have any performance significance? I suppose Heritrix makes a lot of parallel calls and downloads even with parallelq = 1, but everything is in the same queue? Could someone explain?



With kind regards,

-----

Peter Svanberg

National Library of Sweden
Phone: +46 10 709 32 78

E-mail: peter.svanberg at kb.se
Web: www.kb.se








From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Tue Hejlskov Larsen
Sent: April 26, 2019 13:38
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] Strange slow non-existing-domain behavior



Hello Peter



We also had trouble last year with timeouts: crawler-trap regex filters that hung indefinitely.



You can find all our current timeout settings below, in our selective harvester settings file and in our default_orderxml.



Best regards

Tue



Here is our harvester settings file:



                             :

                             :



From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Peter Svanberg
Sent: Friday, April 26, 2019 12:46 PM
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] Strange slow non-existing-domain behavior



Hmm, I realize I have two parameters having 300 second values:

                             frontier.retryDelaySeconds=300

                             frontier.snoozeLongMs=300000



But I don’t see any “,2t” or “,3t” in these passages and the harvester doesn’t do anything else, so why snooze?



And in another job I get 10-second pauses. And no “Details and Actions” page in the GUI … (Not a good NAS day. :()



But the weather is quite nice!



/Peter





From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Peter Svanberg
Sent: April 26, 2019 11:02
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] Strange slow non-existing-domain behavior



Now I have discovered similar behavior, but with a 404 status, a 300-second wait and no problem with the domain (wget gets a quick answer). Is it the same issue, solved in 5.5?



2019-04-26T08:14:38.108Z   404        449 http://adcove.se/contactform.error_changefontsize_no_size REX http://55b558c7-resources.builder.misssite.com/ea01a9c/en/translations.js?sections=widgets,mobile,shared_views,shared_components,cookie text/html #032 20190426081438022+85 sha1:7BBNH63Q5ARINIOLBLMA6MVZODMSSUSD http://www.adcove.se content-size:828

2019-04-26T08:19:38.243Z   404        449 http://adcove.se/contactform.error_changeFormTitle_no_value REX http://55b558c7-resources.builder.misssite.com/ea01a9c/en/translations.js?sections=widgets,mobile,shared_views,shared_components,cookie text/html #032 20190426081938163+80 sha1:7BBNH63Q5ARINIOLBLMA6MVZODMSSUSD http://www.adcove.se content-size:828

2019-04-26T08:24:38.389Z   404        449 http://adcove.se/contactform.error_changegoallink_no_source REX http://55b558c7-resources.builder.misssite.com/ea01a9c/en/translations.js?sections=widgets,mobile,shared_views,shared_components,cookie text/html #032 20190426082438299+90 sha1:7BBNH63Q5ARINIOLBLMA6MVZODMSSUSD http://www.adcove.se content-size:828
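One way to make the wait between these entries explicit is to compute the gap between the leading timestamps of consecutive crawl.log lines. The following is a small Python sketch; the helper name `gaps_in_seconds` is hypothetical, and the sample holds only the first three columns of the three 404 lines above, with the rest of each line omitted.

```python
from datetime import datetime


def gaps_in_seconds(log_lines):
    """Return the gaps, in seconds, between consecutive crawl.log lines.

    Only the first whitespace-separated field (the log timestamp) is used.
    """
    stamps = [
        datetime.strptime(line.split()[0], "%Y-%m-%dT%H:%M:%S.%fZ")
        for line in log_lines
        if line.strip()
    ]
    return [round((b - a).total_seconds(), 3) for a, b in zip(stamps, stamps[1:])]


# Leading columns copied from the three 404 lines in this message.
sample = [
    "2019-04-26T08:14:38.108Z   404        449",
    "2019-04-26T08:19:38.243Z   404        449",
    "2019-04-26T08:24:38.389Z   404        449",
]
print(gaps_in_seconds(sample))
```

For these lines the gaps come out at just over 300 seconds each, the same magnitude as the frontier.retryDelaySeconds=300 and frontier.snoozeLongMs=300000 values mentioned elsewhere in the thread. The sketch does not show which of the two settings (if either) causes the wait.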



-----

Peter Svanberg

National Library of Sweden
Phone: +46 10 709 32 78

E-mail: peter.svanberg at kb.se
Web: www.kb.se







From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Tue Hejlskov Larsen
Sent: March 21, 2019 06:16
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] Strange slow non-existing-domain behavior



Hi Peter



We also had trouble with DNS spam in 5.4.2.

Yes, it is fixed in 5.5.



Best regards

Tue



From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Peter Svanberg
Sent: Wednesday, March 20, 2019 11:33 PM
To: netarchivesuite-users at ml.sbforge.org
Subject: [Netarchivesuite-users] Strange slow non-existing-domain behavior



Hello again!



Spurred by your previous problem-solving answers, I continue.



Strange Heritrix behavior: it does a DNS lookup, which fails, and reports that with a -6 line. Then there is a 10-minute pause, then a new DNS lookup, and so on.



What happens during the pause? Is it waiting 600 seconds for the DNS lookup? Is it trying the request despite the failed lookup?



(Maybe one of the bugs fixed in 5.5?)



Log and template below.



Best regards,

-----

Peter Svanberg
Technical officer
Digital Collections Department, Newspapers, Radio and Television Division

National Library of Sweden
PO Box 5039
SE-104 51 Stockholm
Visits: Karlavägen 100, Stockholm
Phone: +46 10 709 32 78

E-mail: peter.svanberg at kb.se
Web: www.kb.se









crawl log:



2019-03-20T21:48:42.119Z    -6          - http://lookbackvideo7-a.akamaihd.net/ RRX https://www.facebook.com/ unknown #033 - - http://www.fbcdn.net 2t

2019-03-20T21:48:41.164Z    -1          - dns:lookbackvideo7-a.akamaihd.net RRXP http://lookbackvideo7-a.akamaihd.net/ text/dns #047 20190320214841119+45 - http://www.fbcdn.net 3t

2019-03-20T21:38:41.006Z    -6          - http://lookbackvideo6-a.akamaihd.net/ RRX https://www.facebook.com/ unknown #024 - - http://www.fbcdn.net 2t

2019-03-20T21:38:40.063Z    -1          - dns:lookbackvideo6-a.akamaihd.net RRXP http://lookbackvideo6-a.akamaihd.net/ text/dns #026 20190320213840006+56 - http://www.fbcdn.net 3t

2019-03-20T21:28:39.896Z    -6          - http://lookbackvideo5-a.akamaihd.net/ RRX https://www.facebook.com/ unknown #045 - - http://www.fbcdn.net 2t

2019-03-20T21:28:38.942Z    -1          - dns:lookbackvideo5-a.akamaihd.net RRXP http://lookbackvideo5-a.a
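To read entries like these more easily, a crawl.log line can be split into named fields. The Python sketch below assumes the usual twelve whitespace-separated crawl.log columns; the names `CrawlLogEntry`, `parse_line` and `STATUS_NOTES` are hypothetical, and the two status descriptions paraphrase the Heritrix status code list as I recall it, so check them against the documentation for your version.

```python
from typing import NamedTuple

# Assumed meanings of the two negative status codes seen in this log.
STATUS_NOTES = {
    "-1": "DNS lookup failed",
    "-6": "prerequisite domain lookup failed, fetch not attempted",
}


class CrawlLogEntry(NamedTuple):
    timestamp: str
    status: str
    size: str
    uri: str
    discovery_path: str
    referrer: str
    mime_type: str
    thread: str
    fetch_time: str  # start time plus duration in ms, or "-"
    digest: str
    source_tag: str
    annotations: str


def parse_line(line: str) -> CrawlLogEntry:
    """Split one complete crawl.log line into its twelve fields."""
    fields = line.split(None, 11)
    if len(fields) != 12:
        raise ValueError(f"expected 12 fields, got {len(fields)}")
    return CrawlLogEntry(*fields)


# First line of the crawl log in this message.
line = (
    "2019-03-20T21:48:42.119Z    -6          - http://lookbackvideo7-a.akamaihd.net/ "
    "RRX https://www.facebook.com/ unknown #033 - - http://www.fbcdn.net 2t"
)
entry = parse_line(line)
print(entry.status, STATUS_NOTES.get(entry.status, "?"), entry.annotations)
```

The truncated last line of the log would raise a ValueError here, which is intentional: the sketch only handles complete lines.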



template:



fetchDns.enabled=true

fetchDns.acceptNonDnsResolves=false

fetchDns.digestContent=true

fetchDns.digestAlgorithm=sha1



fetchHttp.enabled=true

fetchHttp.timeoutSeconds=1200

fetchHttp.soTimeoutMs=20000

fetchHttp.maxFetchKBSec=0

fetchHttp.maxLengthBytes=0

fetchHttp.ignoreCookies=false

fetchHttp.sslTrustLevel=OPEN

fetchHttp.defaultEncoding=UTF-8

fetchHttp.digestContent=true

fetchHttp.digestAlgorithm=sha1

fetchHttp.sendIfModifiedSince=true

fetchHttp.sendIfNoneMatch=true

fetchHttp.sendConnectionClose=true

fetchHttp.sendReferer=true

fetchHttp.sendRange=false



