<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
        {font-family:"Cambria Math";
        panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
        {font-family:Verdana;
        panose-1:2 11 6 4 3 5 4 4 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0cm;
        margin-bottom:.0001pt;
        font-size:11.0pt;
        font-family:"Calibri",sans-serif;
        mso-fareast-language:EN-US;}
h1
        {mso-style-priority:9;
        mso-style-link:"Rubrik 1 Char";
        margin-top:12.0pt;
        margin-right:0cm;
        margin-bottom:0cm;
        margin-left:0cm;
        margin-bottom:.0001pt;
        page-break-after:avoid;
        font-size:16.0pt;
        font-family:"Calibri Light",sans-serif;
        color:#2E74B5;
        mso-fareast-language:EN-US;
        font-weight:normal;}
a:link, span.MsoHyperlink
        {mso-style-priority:99;
        color:#0563C1;
        text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
        {mso-style-priority:99;
        color:#954F72;
        text-decoration:underline;}
p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph
        {mso-style-priority:34;
        margin-top:0cm;
        margin-right:0cm;
        margin-bottom:0cm;
        margin-left:36.0pt;
        margin-bottom:.0001pt;
        font-size:11.0pt;
        font-family:"Calibri",sans-serif;
        mso-fareast-language:EN-US;}
span.E-postmall17
        {mso-style-type:personal-compose;
        font-family:"Calibri",sans-serif;
        color:windowtext;}
span.Rubrik1Char
        {mso-style-name:"Rubrik 1 Char";
        mso-style-priority:9;
        mso-style-link:"Rubrik 1";
        font-family:"Calibri Light",sans-serif;
        color:#2E74B5;}
.MsoChpDefault
        {mso-style-type:export-only;
        font-family:"Calibri",sans-serif;
        mso-fareast-language:EN-US;}
@page WordSection1
        {size:612.0pt 792.0pt;
        margin:70.85pt 70.85pt 70.85pt 70.85pt;}
div.WordSection1
        {page:WordSection1;}
/* List Definitions */
@list l0
        {mso-list-id:1143740229;
        mso-list-type:hybrid;
        mso-list-template-ids:780160372 69009425 69009433 69009435 69009423 69009433 69009435 69009423 69009433 69009435;}
@list l0:level1
        {mso-level-text:"%1\)";
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-18.0pt;}
@list l0:level2
        {mso-level-number-format:alpha-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-18.0pt;}
@list l0:level3
        {mso-level-number-format:roman-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:right;
        text-indent:-9.0pt;}
@list l0:level4
        {mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-18.0pt;}
@list l0:level5
        {mso-level-number-format:alpha-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-18.0pt;}
@list l0:level6
        {mso-level-number-format:roman-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:right;
        text-indent:-9.0pt;}
@list l0:level7
        {mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-18.0pt;}
@list l0:level8
        {mso-level-number-format:alpha-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-18.0pt;}
@list l0:level9
        {mso-level-number-format:roman-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:right;
        text-indent:-9.0pt;}
ol
        {margin-bottom:0cm;}
ul
        {margin-bottom:0cm;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="SV" link="#0563C1" vlink="#954F72">
<div class="WordSection1">
<p class="MsoNormal">Hello all!<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><span lang="EN-GB">First, thanks very much, Colin, for your article
</span><a href="https://sbforge.org/display/NAS/Better+QA+With+Logtrix"><span lang="EN-GB">Better QA With Logtrix</span></a><span lang="EN-GB">! I’ve written some Python tools myself, but Logtrix and your extensions seem very interesting and I will test them soon!<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB">Below are some experiences from the last few weeks. I would appreciate having you as a sounding board!<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB"><o:p> </o:p></span></p>
<p class="MsoListParagraph" style="text-indent:-18.0pt;mso-list:l0 level1 lfo1"><![if !supportLists]><span lang="EN-GB"><span style="mso-list:Ignore">1)<span style="font:7.0pt "Times New Roman"">     
</span></span></span><![endif]><span lang="EN-GB">Strange Facebook behaviour: in several selective harvests (but not all), requests to facebook.com seemed to be delayed by 10 seconds: the duration value was about 10000 (ms) higher than for similar
 requests to other domains. Has anyone seen that? Does Facebook sometimes consider me an evil robot and silently hand out some punishment …?<o:p></o:p></span></p>
<p class="MsoListParagraph" style="text-indent:-18.0pt;mso-list:l0 level1 lfo1"><![if !supportLists]><span lang="EN-GB"><span style="mso-list:Ignore">2)<span style="font:7.0pt "Times New Roman"">     
</span></span></span><![endif]><span lang="EN-GB">Often, at the end of a harvest job that is part of a snapshot (test) run, when just one domain is left, there is a 10-second delay between requests (not fetch duration, just waiting). It looks like a politeness thing,
 but I can’t map any of those settings to 10 seconds … (values below). Any hints?<o:p></o:p></span></p>
<p class="MsoListParagraph" style="text-indent:-18.0pt;mso-list:l0 level1 lfo1"><![if !supportLists]><span lang="EN-GB"><span style="mso-list:Ignore">3)<span style="font:7.0pt "Times New Roman"">     
</span></span></span><![endif]><span lang="EN-GB">When you have limited a harvest (essentially
</span><span lang="EN-GB" style="font-size:9.0pt;font-family:"Verdana",sans-serif;color:#333333;background:white">MaxBytesPerDomain), you want to optimize so that when the limit hits, you have fetched the most important URLs for each seed. My digging in the crawl log with Python
 showed that this was not the case with the default </span><span lang="EN-GB">CostUriPrecedencePolicy. So I found HopsUriPrecedencePolicy, which sorts the queue by the length of the hops path (or by just the number of links, Ls, in the hops path, which is what I use). This
 seems to treat the seeds more equally in how far they are harvested before the limit stops them. Are you using this or something else?<o:p></o:p></span></p>
<p class="MsoListParagraph" style="text-indent:-18.0pt;mso-list:l0 level1 lfo1"><![if !supportLists]><span lang="EN-GB"><span style="mso-list:Ignore">4)<span style="font:7.0pt "Times New Roman"">     
</span></span></span><![endif]><span lang="EN-GB">The algorithm for TransclusionDecideRule allows infinitely long hops paths (as long as the path doesn’t contain Ls or Ss; or Xs or non-Rs after Ls; a recent example is RRLLERRRREPRP; looping redirects never seem to
 stop!). Can you repeat the TooManyHopsDecideRule after the TransclusionDecideRule with a new value? (Example below.)<o:p></o:p></span></p>
<p class="MsoListParagraph" style="text-indent:-18.0pt;mso-list:l0 level1 lfo1"><![if !supportLists]><span lang="EN-GB"><span style="mso-list:Ignore">5)<span style="font:7.0pt "Times New Roman"">     
</span></span></span><![endif]><span lang="EN-GB">To further optimize what we fetch, I would like to prioritize in-seed/in-domain/in-country URLs, without excluding out-of-ditto embedded images etc. But that seems hard to accomplish with decide rules or precedence
 policies. Or is there a way?<o:p></o:p></span></p>
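<p class="MsoNormal"><span lang="EN-GB">To make point 3 concrete, the ordering I have in mind can be sketched in Python roughly as below. This only illustrates the idea and is not Heritrix code; the function name link_hops and the sample URLs and hops paths are made up for the example.<o:p></o:p></span></p>

```python
def link_hops(hops_path):
    """Count the navigational link hops (L) in a hops path."""
    return hops_path.count("L")


# Hypothetical queued URIs as (url, hops path) pairs, in queue order.
queued = [
    ("http://example.org/a/b/c", "LLL"),
    ("http://example.org/logo.png", "LE"),
    ("http://example.org/", ""),
    ("http://example.org/a", "L"),
    ("http://example.org/redirected", "LRR"),
]

# Fewer link hops first; sorted() is stable, so ties keep their queue order.
ordered = sorted(queued, key=lambda item: link_hops(item[1]))

for url, path in ordered:
    print(link_hops(path), path or "-", url)
```

<p class="MsoNormal"><span lang="EN-GB">With this ordering the seed page and its embeds and redirects come before anything that is several links away, which is the effect I am after when the byte limit cuts the harvest short.<o:p></o:p></span></p>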
<p class="MsoNormal"><span lang="EN-GB"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB">Politeness settings:<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB">disposition.delayFactor=1.0<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB">disposition.maxDelayMs=1000<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB">disposition.minDelayMs=300             <o:p>
</o:p></span></p>
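<p class="MsoNormal"><span lang="EN-GB">For reference, this is how I read those three settings, as a minimal Python sketch. It assumes the wait is simply the fetch duration times delayFactor, clamped between minDelayMs and maxDelayMs, and that no other setting is involved; the function name is my own.<o:p></o:p></span></p>

```python
def politeness_delay_ms(fetch_duration_ms, delay_factor=1.0,
                        min_delay_ms=300, max_delay_ms=1000):
    """Wait before the next request to the same host, in milliseconds.

    Assumption: wait = delay_factor * fetch duration, clamped to the
    configured minimum and maximum. Defaults are the values listed above.
    """
    wait = delay_factor * fetch_duration_ms
    return int(max(min_delay_ms, min(wait, max_delay_ms)))


for duration in (50, 600, 10000):
    print(duration, politeness_delay_ms(duration))
```

<p class="MsoNormal"><span lang="EN-GB">Under that assumption the wait can never exceed maxDelayMs (1000 ms), even after a very slow fetch, which is why the 10-second gaps puzzle me.<o:p></o:p></span></p>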
<p class="MsoNormal"><span lang="EN-GB"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB">Decide rule example:<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB">&lt;bean class="...TooManyHopsDecideRule"&gt;<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:18.0pt"><span lang="EN-GB">&lt;property name="maxHops" value="5" /&gt;<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB">&lt;/bean&gt;<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB">&lt;bean class="...TransclusionDecideRule"&gt;<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:18.0pt"><span lang="EN-GB">&lt;property name="maxTransHops" value="3" /&gt;<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:18.0pt"><span lang="EN-GB">&lt;property name="maxSpeculativeHops" value="0" /&gt;<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB">&lt;/bean&gt;<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB">&lt;bean class="...TooManyHopsDecideRule"&gt;<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:18.0pt"><span lang="EN-GB">&lt;property name="maxHops" value="10" /&gt;<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB">&lt;/bean&gt;<o:p></o:p></span></p>
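<p class="MsoNormal"><span lang="EN-GB">What I hope this chain would do can be sketched in Python as below. This is my reading of the rules, not the actual Heritrix code. It assumes that a later rule's ACCEPT or REJECT overrides earlier ones, that TooManyHopsDecideRule rejects when the hops path is longer than maxHops, and that TransclusionDecideRule only looks at the hops after the last L or S and does not count R hops against maxTransHops. The starting decision REJECT and the sample paths are made up.<o:p></o:p></span></p>

```python
ACCEPT, REJECT, NONE = "ACCEPT", "REJECT", "NONE"


def too_many_hops(max_hops):
    """Reject when the hops path is longer than max_hops (assumed behaviour)."""
    def rule(hops_path):
        return REJECT if len(hops_path) > max_hops else NONE
    return rule


def transclusion(max_trans_hops, max_speculative_hops):
    """Accept short runs of non-link hops after the last L or S (my reading)."""
    def rule(hops_path):
        trailing = 0      # hops after the last L or S
        non_refer = 0     # trailing hops other than R
        speculative = 0   # trailing X hops
        for hop in reversed(hops_path):
            if hop in "LS":
                break
            trailing += 1
            if hop != "R":
                non_refer += 1
            if hop == "X":
                speculative += 1
        if trailing == 0 or speculative > max_speculative_hops:
            return NONE
        return NONE if non_refer > max_trans_hops else ACCEPT
    return rule


def decide(rules, hops_path, initial=REJECT):
    """Run the rules in order; the last decision other than NONE wins."""
    decision = initial
    for rule in rules:
        result = rule(hops_path)
        if result != NONE:
            decision = result
    return decision


# The chain from the example above, with and without the final cap.
chain = [too_many_hops(5), transclusion(3, 0), too_many_hops(10)]
without_cap = chain[:2]

for path in ("LLLLLLE", "LLLLLLRRRRRE"):
    print(path, decide(without_cap, path), decide(chain, path))
```

<p class="MsoNormal"><span lang="EN-GB">In this sketch the embed after six links is still accepted, while the long redirect run is accepted without the final rule and rejected with it. Whether Heritrix really evaluates a repeated rule this way is exactly what I am asking.<o:p></o:p></span></p>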
<p class="MsoNormal"><span lang="EN-GB"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB">Regards,<o:p></o:p></span></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><span lang="EN-GB" style="font-size:9.0pt;font-family:"Arial",sans-serif;color:black;mso-fareast-language:SV">-----<br>
<br>
</span><span lang="EN-GB" style="font-family:"Arial",sans-serif;color:black;mso-fareast-language:SV">Peter Svanberg</span><span lang="EN-GB" style="mso-fareast-language:SV"><br>
</span><span lang="EN-GB" style="font-size:9.0pt;font-family:"Arial",sans-serif;mso-fareast-language:SV">Technical officer<br>
Digital Collections Department, Newspapers, Radio and Television Division</span><span lang="EN-GB" style="font-size:9.0pt;mso-fareast-language:SV">
<br>
<br>
</span><span lang="EN-GB" style="font-family:"Arial",sans-serif;mso-fareast-language:SV">National Library of Sweden</span><span lang="EN-GB" style="mso-fareast-language:SV"><br>
</span><span lang="EN-GB" style="font-size:9.0pt;font-family:"Arial",sans-serif;mso-fareast-language:SV">PO Box 5039
<br>
SE-104 51 Stockholm<br>
Visits: Karlavägen 100, Stockholm <br>
Phone: +46 10 709 <span style="color:black">32 78</span></span><span lang="EN-GB" style="font-size:9.0pt;mso-fareast-language:SV"><br>
<br>
</span><span lang="EN-GB" style="font-size:9.0pt;font-family:"Arial",sans-serif;mso-fareast-language:SV">E-mail</span><span lang="EN-GB" style="font-size:9.0pt;mso-fareast-language:SV">:
</span><span lang="EN-GB" style="font-size:9.0pt;font-family:"Arial",sans-serif;color:black;mso-fareast-language:SV">peter.svanberg@kb.se</span><span lang="EN-GB" style="font-size:9.0pt;color:black;mso-fareast-language:SV"><br>
</span><span lang="EN-GB" style="font-size:9.0pt;font-family:"Arial",sans-serif;mso-fareast-language:SV">Web</span><span lang="EN-GB" style="font-size:9.0pt;mso-fareast-language:SV">:
</span><span lang="EN-GB" style="font-size:9.0pt;font-family:"Arial",sans-serif;mso-fareast-language:SV">www.kb.se</span><span lang="EN-GB" style="font-size:9.0pt;mso-fareast-language:SV"><br>
<br>
</span><span lang="EN-GB" style="mso-fareast-language:SV"><o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB"><o:p></o:p></span></p>
</div>
</body>
</html>