[Netarchivesuite-users] Strange slow non-existing-domain behavior

Peter Svanberg Peter.Svanberg at kb.se
Mon Apr 29 15:59:59 CEST 2019


Thank you so much!

I will look more closely at these files and compare with ours. But my first question is: why do you run with parallelQueues = 1? We use 50!

I have previously found the following:

Found a response from the mailing list: Setting up parallel queues in the queueAssignmentPolicy bean.

parallelQueues: the default value (and historical behavior) is '1'. If instead it is N, all URIs that previously went into the same single queue will go into N related queues (via a consistent hash mapping of the path portion of the URL). Each queue is considered separately for traditional politeness, based on one connection at a time and snooze delays between fetches, so N queues mean that N fetches might proceed against one site at a time. Thus, it should only be used in an overlay setting, applied to sites that can handle multiple connections well.
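
To make the quoted description concrete, here is a minimal Python sketch of the idea of hashing the path portion of a URL into one of N related queues. The function name queue_key, the SHA-1 hash and the "host#bucket" naming are assumptions made for illustration only; they are not taken from Heritrix and its real implementation may differ.

```python
import hashlib
from urllib.parse import urlsplit


def queue_key(url: str, parallel_queues: int = 1) -> str:
    """Return a hypothetical queue name for a URL.

    With parallel_queues == 1, every URL on a host shares one queue.
    With N > 1, the path is hashed to pick one of N sub-queues.
    The hash and the naming scheme are illustrative assumptions,
    not what Heritrix actually uses.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parallel_queues <= 1:
        return host
    # hashlib gives a stable hash; the built-in hash() is randomized per run.
    digest = hashlib.sha1(parts.path.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % parallel_queues
    return f"{host}#{bucket}"
```

In this sketch, a given URL always lands in the same queue, and the URLs of one host are spread over at most N queues, each of which would then be subject to its own politeness delays.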
------
https://wiki.searchtechnologies.com/index.php/Using_a_Custom_Heritrix_Configuration_File_(Aspire_2):

Remember to be careful about how many connections you should use per hostname, as it can cause problems with the site searched and it can be considered an attack by the web administrator.

But that doesn't seem to be correct now?  When I look at crawl.log, and sort on the field "started", the "politeness" parameters seem to be followed.
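
A check like this can be sketched in Python, assuming the fetch field has the form seen in the crawl.log excerpts quoted further down in this thread (for example 20190426081438022+85: a 17-digit start time, a plus sign, and a duration in milliseconds). The function names are hypothetical, and the fields passed in are assumed to belong to one queue.

```python
from datetime import datetime, timedelta


def parse_fetch_field(field):
    """Split a 'start+duration' field into (start, end) datetimes.

    Assumes a yyyyMMddHHmmssSSS start time followed by '+' and a
    duration in milliseconds, as in the excerpts in this thread.
    """
    start_text, duration_ms = field.split("+")
    start = datetime.strptime(start_text, "%Y%m%d%H%M%S%f")
    return start, start + timedelta(milliseconds=int(duration_ms))


def idle_gaps(fields):
    """Seconds between the end of one fetch and the start of the next."""
    fetches = sorted(parse_fetch_field(f) for f in fields)
    return [
        (next_start - prev_end).total_seconds()
        for (_, prev_end), (next_start, _) in zip(fetches, fetches[1:])
    ]


# Fetch fields copied from the 404 excerpt later in this thread.
fields = ["20190426081438022+85", "20190426081938163+80", "20190426082438299+90"]
print([round(gap) for gap in idle_gaps(fields)])  # prints [300, 300]
```

Sorting on the start time and looking at the idle gaps is one way to see whether the delays between fetches match the configured politeness values.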

But maybe parallelQueues doesn’t have any performance significance? I suppose Heritrix makes a lot of parallel calls and downloads even with parallelQueues = 1, but everything is in the same queue? Could someone explain?

With kind regards,
-----

Peter Svanberg

National Library of Sweden
Phone: +46 10 709 32 78

E-mail: peter.svanberg at kb.se
Web: www.kb.se




From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Tue Hejlskov Larsen
Sent: 26 April 2019 13:38
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] Strange slow non-existing-domain behavior

Hello Peter

We also had trouble last year with timeouts caused by crawler-trap regex filters that hung indefinitely.

You can find all our current timeout settings below, in our selective harvester settings file and in our default_orderxml.

Best regards
Tue

Here is our harvester settings file:

                             :
                             :

From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Peter Svanberg
Sent: Friday, April 26, 2019 12:46 PM
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] Strange slow non-existing-domain behavior

Hmm, I realize I have two parameters with 300-second values:
                             frontier.retryDelaySeconds=300
                             frontier.snoozeLongMs=300000

But I don’t see any “,2t” or “,3t” in these passages and the harvester doesn’t do anything else, so why snooze?

And in another job I get 10-second pauses. And no “Details and Actions” page in the GUI … (Not a good NAS day. ☹)

But the weather is quite nice!

/Peter


From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Peter Svanberg
Sent: 26 April 2019 11:02
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] Strange slow non-existing-domain behavior

Now I see similar behavior, but with a 404 status, a 300-second wait, and no problem with the domain (wget gets a quick answer). Is it the same issue that was solved in 5.5?

2019-04-26T08:14:38.108Z   404        449 http://adcove.se/contactform.error_changefontsize_no_size REX http://55b558c7-resources.builder.misssite.com/ea01a9c/en/translations.js?sections=widgets,mobile,shared_views,shared_components,cookie text/html #032 20190426081438022+85 sha1:7BBNH63Q5ARINIOLBLMA6MVZODMSSUSD http://www.adcove.se content-size:828
2019-04-26T08:19:38.243Z   404        449 http://adcove.se/contactform.error_changeFormTitle_no_value REX http://55b558c7-resources.builder.misssite.com/ea01a9c/en/translations.js?sections=widgets,mobile,shared_views,shared_components,cookie text/html #032 20190426081938163+80 sha1:7BBNH63Q5ARINIOLBLMA6MVZODMSSUSD http://www.adcove.se content-size:828
2019-04-26T08:24:38.389Z   404        449 http://adcove.se/contactform.error_changegoallink_no_source REX http://55b558c7-resources.builder.misssite.com/ea01a9c/en/translations.js?sections=widgets,mobile,shared_views,shared_components,cookie text/html #032 20190426082438299+90 sha1:7BBNH63Q5ARINIOLBLMA6MVZODMSSUSD http://www.adcove.se content-size:828
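
The 300-second spacing can be read directly off the log timestamps. Below is a small Python sketch that does so, assuming every crawl.log line starts with a UTC timestamp in the form shown above; the function name log_gaps is hypothetical and the sample lines are shortened copies of the entries above.

```python
from datetime import datetime


def log_gaps(lines):
    """Return the gaps in seconds between consecutive crawl.log entries.

    Assumes each non-empty line starts with a timestamp such as
    2019-04-26T08:14:38.108Z. Timestamps are sorted first, so the
    order of the lines in the log does not matter.
    """
    stamps = sorted(
        datetime.strptime(line.split()[0], "%Y-%m-%dT%H:%M:%S.%fZ")
        for line in lines
        if line.strip()
    )
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


# Shortened copies of the three log lines above (only the first fields).
sample = [
    "2019-04-26T08:14:38.108Z   404        449 http://adcove.se/contactform.error_changefontsize_no_size",
    "2019-04-26T08:19:38.243Z   404        449 http://adcove.se/contactform.error_changeFormTitle_no_value",
    "2019-04-26T08:24:38.389Z   404        449 http://adcove.se/contactform.error_changegoallink_no_source",
]
print([round(gap) for gap in log_gaps(sample)])  # prints [300, 300]
```

Both gaps come out at about 300 seconds, the same value as frontier.retryDelaySeconds=300 and frontier.snoozeLongMs=300000.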

-----

Peter Svanberg

National Library of Sweden
Phone: +46 10 709 32 78

E-mail: peter.svanberg at kb.se
Web: www.kb.se



From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Tue Hejlskov Larsen
Sent: 21 March 2019 06:16
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] Strange slow non-existing-domain behavior

Hi Peter

We also had trouble with DNS spam in 5.4.2.
Yes, it is fixed in 5.5.

Best regards
Tue

From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Peter Svanberg
Sent: Wednesday, March 20, 2019 11:33 PM
To: netarchivesuite-users at ml.sbforge.org
Subject: [Netarchivesuite-users] Strange slow non-existing-domain behavior

Hello again!

Spurred by your previous problem-solving answers, I continue.

Strange Heritrix behavior: it does a DNS lookup, which fails, and reports that with a -6 line. Then there is a 10-minute pause, then a new DNS lookup, and so on.

What happens during the pause? Is it waiting on the DNS lookup for 600 seconds? Or trying the request despite the failed lookup?

(Maybe one of the bugs fixed in 5.5?)

Log and template below.

Best regards,
-----

Peter Svanberg
Technical officer
Digital Collections Department, Newspapers, Radio and Television Division

National Library of Sweden
PO Box 5039
SE-104 51 Stockholm
Visits: Karlavägen 100, Stockholm
Phone: +46 10 709 32 78

E-mail: peter.svanberg at kb.se
Web: www.kb.se




crawl log:

2019-03-20T21:48:42.119Z    -6          - http://lookbackvideo7-a.akamaihd.net/ RRX https://www.facebook.com/ unknown #033 - - http://www.fbcdn.net 2t
2019-03-20T21:48:41.164Z    -1          - dns:lookbackvideo7-a.akamaihd.net RRXP http://lookbackvideo7-a.akamaihd.net/ text/dns #047 20190320214841119+45 - http://www.fbcdn.net 3t
2019-03-20T21:38:41.006Z    -6          - http://lookbackvideo6-a.akamaihd.net/ RRX https://www.facebook.com/ unknown #024 - - http://www.fbcdn.net 2t
2019-03-20T21:38:40.063Z    -1          - dns:lookbackvideo6-a.akamaihd.net RRXP http://lookbackvideo6-a.akamaihd.net/ text/dns #026 20190320213840006+56 - http://www.fbcdn.net 3t
2019-03-20T21:28:39.896Z    -6          - http://lookbackvideo5-a.akamaihd.net/ RRX https://www.facebook.com/ unknown #045 - - http://www.fbcdn.net 2t
2019-03-20T21:28:38.942Z    -1          - dns:lookbackvideo5-a.akamaihd.net RRXP http://lookbackvideo5-a.a

template:

fetchDns.enabled=true
fetchDns.acceptNonDnsResolves=false
fetchDns.digestContent=true
fetchDns.digestAlgorithm=sha1

fetchHttp.enabled=true
fetchHttp.timeoutSeconds=1200
fetchHttp.soTimeoutMs=20000
fetchHttp.maxFetchKBSec=0
fetchHttp.maxLengthBytes=0
fetchHttp.ignoreCookies=false
fetchHttp.sslTrustLevel=OPEN
fetchHttp.defaultEncoding=UTF-8
fetchHttp.digestContent=true
fetchHttp.digestAlgorithm=sha1
fetchHttp.sendIfModifiedSince=true
fetchHttp.sendIfNoneMatch=true
fetchHttp.sendConnectionClose=true
fetchHttp.sendReferer=true
fetchHttp.sendRange=false

