[Netarchivesuite-users] My experiences - any hints, advices or comments?

Tue Jun 4 01:31:34 CEST 2019

Hello!

Sometimes it is good to formulate questions; it makes you think. I believe I have answers to (4) and (5) now. (The TransclusionDecideRule does just what I wanted in (4), but I had mixed it up a bit - the limit applies to non-Rs, not to Rs as I remembered it.)

BTW, do you set candidates.seedsRedirectNewSeeds=true? That is another way the harvest can wander outside the intended top domain. But hopefully a page that a top domain X redirects to *does* have content related to that top domain (in terms of language and subject). Or doesn't it?

It seems we are all struggling to tune NAS and Heritrix parameters to get just what we want and avoid problems. Could we find a way to share such experiences, to learn from each other, and to spare newcomers from having to reinvent the wheel? A dedicated mailing list/forum? A shared spreadsheet?

/Peter Svanberg

From: Peter Svanberg
Sent: 27 May 2019 13:31
To: 'netarchivesuite-users at ml.sbforge.org' <netarchivesuite-users at ml.sbforge.org>
Subject: My experiences - any hints, advices or comments?

Hello all!

First, thanks very much, Colin, for your article Better QA With Logtrix<https://sbforge.org/display/NAS/Better+QA+With+Logtrix>! I've written some Python tools myself, but Logtrix and your extensions seem very interesting and I will test them soon!

Below are some experiences from the last few weeks. I would appreciate using you as a sounding board!

1)      Strange Facebook behaviour: in several selective harvests (but not all), requests to facebook.com seemed to be delayed by 10 seconds - the duration value was about 10000 (ms) higher than for similar requests to other domains. Has anyone seen that? Does Facebook sometimes consider me an evil robot and silently apply some punishment ...?

2)      Often, at the end of a harvest job that is part of a snapshot (test) run, when just one domain is left, there is a 10-second delay between requests (not in the duration, just waiting). It looks like a politeness setting, but I can't map any of those settings to 10 seconds ... (values below). Any hints?

3)      When you have limited a harvest (essentially with MaxBytesPerDomain), you want to make sure that when the limit is hit, you have fetched the most important URLs for each seed. My Python crawl log digging showed that this was not the case with the default CostUriPrecedencePolicy. So I found HopsUriPrecedencePolicy, which sorts the queue by the length of the hops path (or just by the number of links, Ls, in the hops path, which is what I use). This seems to make the seeds more equal in how much of each is harvested before the limit stops the crawl. Are you using this or something else?

4)      The algorithm for TransclusionDecideRule allows infinitely long hops paths (as long as they don't contain Ls or Ss; or Xs or non-Rs after Ls; recent example RRLLERRRREPRP; looping redirects never seem to stop!). Can you repeat the TooManyHopsDecideRule after the TransclusionDecideRule with a new value? (Example below.)

5)      To further optimize what we fetch, I would like to prioritize in-seed/in-domain/in-country URLs, without excluding out-of-ditto embedded images etc. But that seems hard to accomplish with decide rules or precedence policies - or is it?
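
To make the ordering idea in (3) concrete, here is a minimal Python sketch of sorting a queue by the number of Ls in the hops path. It is only an illustration under stated assumptions, not Heritrix code: the function name hops_precedence, the base value of 1, the "lower value is fetched first" convention and the example URLs are all hypothetical.

```python
def hops_precedence(hops_path: str, links_only: bool = True, base: int = 1) -> int:
    """Return a precedence value for a URI; lower means fetched earlier.

    Assumption: with links_only=True only navigational links ("L") count,
    otherwise every hop in the path counts.
    """
    hops = hops_path.count("L") if links_only else len(hops_path)
    return base + hops


# Hypothetical queue entries: (URL, hops path from the seed).
queued = [
    ("http://a.example/deep/page", "LLLL"),
    ("http://b.example/", ""),
    ("http://a.example/img.png", "LE"),
    ("http://b.example/about", "L"),
]

# sorted() is stable, so entries with equal precedence keep their queue order.
ordered = sorted(queued, key=lambda item: hops_precedence(item[1]))

for url, path in ordered:
    print(hops_precedence(path), path or "(seed)", url)
```

With links_only=True an embedded image one link away from a seed ("LE") gets the same precedence as the page that embeds it ("L"), so shallow pages and their embeds from every seed come before deep pages from any single seed.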

Politeness settings:
disposition.delayFactor=1.0
disposition.maxDelayMs=1000
disposition.minDelayMs=300
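
As a rough sketch of how these three settings are assumed to combine (the wait is the last fetch duration times delayFactor, kept between minDelayMs and maxDelayMs), the following Python function may help. It is an assumption about the behaviour, not a copy of the Heritrix logic, and the name politeness_delay_ms is hypothetical.

```python
def politeness_delay_ms(
    fetch_duration_ms: float,
    delay_factor: float = 1.0,
    min_delay_ms: int = 300,
    max_delay_ms: int = 1000,
) -> int:
    """Wait before the next request to the same queue, in milliseconds.

    Assumption: the wait is delay_factor * fetch duration, raised to
    min_delay_ms and then capped at max_delay_ms.
    """
    wait = delay_factor * fetch_duration_ms
    wait = max(wait, min_delay_ms)
    wait = min(wait, max_delay_ms)
    return int(wait)


for duration in (100, 500, 10000):
    print(duration, "->", politeness_delay_ms(duration))
```

Under this reading, the three values above can never produce a wait longer than 1000 ms, which is why the 10-second gap in (2) does not seem to come from them.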

Decide rule example:
<bean class="...TooManyHopsDecideRule">
<property name="maxHops" value="5" />
</bean>
<bean class="...TransclusionDecideRule">
<property name="maxTransHops" value="3" />
<property name="maxSpeculativeHops" value="0" />
</bean>
<bean class="...TooManyHopsDecideRule">
<property name="maxHops" value="10" />
</bean>
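
The effect of this three-rule sequence can be sketched in Python as follows. This is a simplified model, not the Heritrix implementation. It assumes that the last rule with an opinion wins, that TooManyHopsDecideRule rejects when the hops path is longer than maxHops, and that TransclusionDecideRule looks only at the hops after the last L or S and accepts when the number of non-R hops there is at most maxTransHops. All function names are hypothetical.

```python
from typing import Optional

NAV_HOPS = {"L", "S"}  # hops that end the trailing transclusion run


def too_many_hops(path: str, max_hops: int) -> Optional[str]:
    """REJECT when the hops path is longer than max_hops, else no opinion."""
    return "REJECT" if len(path) > max_hops else None


def transclusion(path: str, max_trans_hops: int, max_speculative_hops: int) -> Optional[str]:
    """ACCEPT when the run after the last L/S has few enough non-R hops."""
    total = non_refer = speculative = 0
    for hop in reversed(path):
        if hop in NAV_HOPS:
            break
        total += 1
        if hop != "R":
            non_refer += 1
        if hop == "X":
            speculative += 1
    if total == 0 or speculative > max_speculative_hops:
        return None
    return "ACCEPT" if non_refer <= max_trans_hops else None


def decide(path: str) -> Optional[str]:
    """Apply the three rules in order; the last rule with an opinion wins."""
    decision = None
    verdicts = (
        too_many_hops(path, 5),
        transclusion(path, 3, 0),
        too_many_hops(path, 10),
    )
    for verdict in verdicts:
        if verdict is not None:
            decision = verdict
    return decision


for hops_path in ("LL", "LLLLLL", "LLLLLLRR", "LL" + "R" * 20):
    print(hops_path, "->", decide(hops_path))
```

In this model a path such as LLLLLLRR is rejected by the first rule but accepted again by the transclusion rule, while a long redirect chain (more than 10 hops in total) is rejected by the final rule no matter how many Rs it consists of.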

Regards,

-----

Peter Svanberg
Technical officer
Digital Collections Department, Newspapers, Radio and Television Division

National Library of Sweden
PO Box 5039
SE-104 51 Stockholm
Visits: Karlavägen 100, Stockholm
Phone: +46 10 709 32 78

E-mail: peter.svanberg at kb.se<mailto:peter.svanberg at kb.se>
Web: www.kb.se<http://www.kb.se>
