[Netarchivesuite-users] My experiences - any hints, advices or comments?

Peter Svanberg Peter.Svanberg at kb.se
Mon May 27 13:30:31 CEST 2019


Hello all!

First, thanks very much, Colin, for your article Better QA With Logtrix<https://sbforge.org/display/NAS/Better+QA+With+Logtrix>! I've written some Python tools myself, but Logtrix and your extensions seem very interesting and I will test them soon!

Below are some experiences from the last few weeks. I would appreciate using you as a sounding board!


1)      Strange Facebook behaviour: in several selective harvests (but not all), requests to facebook.com seemed to be delayed by 10 seconds: the duration value was about 10000 (ms) higher than for similar requests to other domains. Has anyone seen that? Does Facebook sometimes consider me an evil robot and silently hand out some punishment ...?

2)      Often, at the end of a harvest job that is part of a snapshot (test) run, when just one domain is left, there is a 10 second delay between requests (not fetch duration, just waiting). It looks like a politeness setting, but I can't map any of those settings to 10 seconds ... (values below). Any hints?

3)      When you have limited a harvest (essentially with MaxBytesPerDomain), you want to make sure that the most important URLs for each seed have been fetched by the time the limit is hit. My Python crawl log digging showed that this was not the case with the default CostUriPrecedencePolicy. So I found HopsUriPrecedencePolicy, which sorts the queue on the length of the hops path (or on just the number of links, Ls, in the hops path, which is what I use). This seems to make the seeds more equal in how much of each is harvested before the limit stops the crawl. Are you using this or something else?

4)      The algorithm for TransclusionDecideRule allows for infinitely long hops paths (as long as they don't contain Ls or Ss; or Xs or non-Rs after Ls; recent example RRLLERRRREPRP; looping redirects never seem to stop!). Can you repeat the TooManyHopsDecideRule after the TransclusionDecideRule with a new value? (Example below.)

5)      To further optimize what we fetch, I would like to prioritize in-seed/in-domain/in-country URLs, without excluding out-of-ditto embedded images etc. But that seems hard to accomplish with decide rules or precedence policies. Or is there a way?
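
The kind of crawl log digging mentioned in points 1 and 3 can be sketched roughly as follows. This is only a minimal sketch, not the actual tools: it assumes whitespace-separated crawl.log lines with the URI in field 4, the hops path in field 5 and "fetch start+duration in ms" in field 9. The function names and the sample lines are made up for illustration.

```python
from collections import defaultdict
from statistics import median
from urllib.parse import urlsplit

# Made-up sample lines in the assumed crawl.log layout (not real harvest data).
SAMPLE = [
    "2019-05-27T11:30:31.123Z 200 1234 https://www.facebook.com/a LL https://example.org/ text/html #001 20190527113020873+10250 sha1:AAAA - -",
    "2019-05-27T11:30:45.500Z 200 2345 https://www.facebook.com/b LE https://example.org/ text/html #002 20190527113035190+10310 sha1:BBBB - -",
    "2019-05-27T11:30:46.000Z 200 3456 https://example.org/page L https://example.org/ text/html #003 20190527113045755+245 sha1:CCCC - -",
]


def parse_line(line):
    """Return (host, duration_ms, hops_path), or None if the line has no fetch timing."""
    fields = line.split()
    if len(fields) < 9 or "+" not in fields[8]:
        return None
    host = urlsplit(fields[3]).hostname
    if not host:  # e.g. dns: URIs have no hostname component
        return None
    duration_ms = int(fields[8].rsplit("+", 1)[1])
    return host, duration_ms, fields[4]


def median_duration_by_host(lines):
    """Median fetch duration in ms per host, to spot hosts that answer unusually slowly."""
    per_host = defaultdict(list)
    for line in lines:
        record = parse_line(line)
        if record:
            host, duration_ms, _ = record
            per_host[host].append(duration_ms)
    return {host: median(values) for host, values in per_host.items()}


def link_hops(hops_path):
    """Number of navigational links (Ls) in a hops path; a seed has path '-'."""
    return hops_path.count("L")


print(median_duration_by_host(SAMPLE))
print(sorted((parse_line(line)[2] for line in SAMPLE), key=link_hops))
```

Comparing the per-host medians shows whether one host stands out by about 10000 ms, and sorting on link_hops gives the order in which a queue sorted on the number of Ls would offer the URLs.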


Politeness settings:
disposition.delayFactor=1.0
disposition.maxDelayMs=1000
disposition.minDelayMs=300
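
As a sanity check on point 2, here is a minimal sketch of how these three values would combine, assuming the wait is the last fetch duration scaled by delayFactor and then clamped between minDelayMs and maxDelayMs. That reading is an assumption, and any other setting that may influence the wait is ignored here.

```python
def politeness_delay_ms(fetch_duration_ms, delay_factor=1.0,
                        min_delay_ms=300, max_delay_ms=1000):
    """Assumed model: scale the last fetch duration, then clamp to [min, max]."""
    wait = delay_factor * fetch_duration_ms
    return max(min_delay_ms, min(wait, max_delay_ms))


for duration in (100, 500, 10000):
    print(duration, politeness_delay_ms(duration))
```

Under that assumption the wait can never exceed 1000 ms, whatever the fetch duration, so a 10 second pause would have to come from somewhere other than these three settings.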

Decide rule example:
<bean class="...TooManyHopsDecideRule">
  <property name="maxHops" value="5" />
</bean>
<bean class="...TransclusionDecideRule">
  <property name="maxTransHops" value="3" />
  <property name="maxSpeculativeHops" value="0" />
</bean>
<bean class="...TooManyHopsDecideRule">
  <property name="maxHops" value="10" />
</bean>
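
The intended effect of the three rules above can be modelled in a few lines of Python. This is a rough sketch under several assumptions: that a later rule in the sequence overrides an earlier one whenever it returns a decision, that TooManyHopsDecideRule counts every hop in the path, and that TransclusionDecideRule looks only at the hops after the last L or S and does not count Rs against maxTransHops. The function names are hypothetical and nothing here is taken from the Heritrix source.

```python
ACCEPT, REJECT, NONE = "ACCEPT", "REJECT", "NONE"


def too_many_hops(path, max_hops):
    """Assumed model: reject when the total number of hops exceeds max_hops."""
    return REJECT if len(path) > max_hops else NONE


def transclusion(path, max_trans_hops, max_speculative_hops):
    """Assumed model: inspect the trailing hops after the last L or S."""
    if not path or path[-1] == "L":
        return NONE
    total = non_redirect = speculative = 0
    for hop in reversed(path):
        if hop in "LS":
            break
        total += 1
        if hop != "R":
            non_redirect += 1  # redirects are not counted against the limit
        if hop == "X":
            speculative += 1
    if total == 0 or speculative > max_speculative_hops:
        return NONE
    return ACCEPT if non_redirect <= max_trans_hops else NONE


def decide(path):
    """Apply the three rules in order; the last rule that gives a decision wins."""
    decision = NONE
    for verdict in (
        too_many_hops(path, 5),
        transclusion(path, 3, 0),
        too_many_hops(path, 10),
    ):
        if verdict != NONE:
            decision = verdict
    return decision


print(decide("L" + "R" * 6))   # short redirect chain after one link
print(decide("L" + "R" * 20))  # long redirect loop after one link
```

In this model a chain of redirects of any length passes the transclusion rule, and only the repeated TooManyHopsDecideRule with maxHops 10 puts an upper bound on it.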

Regards,

-----

Peter Svanberg
Technical officer
Digital Collections Department, Newspapers, Radio and Television Division

National Library of Sweden
PO Box 5039
SE-104 51 Stockholm
Visits: Karlavägen 100, Stockholm
Phone: +46 10 709 32 78

E-mail: peter.svanberg at kb.se
Web: www.kb.se

