<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:"Calibri Light";
panose-1:2 15 3 2 2 2 4 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:Verdana;
panose-1:2 11 6 4 3 5 4 4 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0cm;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif;
mso-fareast-language:EN-US;}
h1
{mso-style-priority:9;
mso-style-link:"Rubrik 1 Char";
margin-top:12.0pt;
margin-right:0cm;
margin-bottom:0cm;
margin-left:0cm;
margin-bottom:.0001pt;
page-break-after:avoid;
font-size:16.0pt;
font-family:"Calibri Light",sans-serif;
color:#2E74B5;
mso-fareast-language:EN-US;
font-weight:normal;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:#0563C1;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:#954F72;
text-decoration:underline;}
p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph
{mso-style-priority:34;
margin-top:0cm;
margin-right:0cm;
margin-bottom:0cm;
margin-left:36.0pt;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif;
mso-fareast-language:EN-US;}
span.Rubrik1Char
{mso-style-name:"Rubrik 1 Char";
mso-style-priority:9;
mso-style-link:"Rubrik 1";
font-family:"Calibri Light",sans-serif;
color:#2E74B5;}
p.msonormal0, li.msonormal0, div.msonormal0
{mso-style-name:msonormal;
mso-margin-top-alt:auto;
margin-right:0cm;
mso-margin-bottom-alt:auto;
margin-left:0cm;
font-size:12.0pt;
font-family:"Times New Roman",serif;}
span.E-postmall20
{mso-style-type:personal;
font-family:"Calibri",sans-serif;
color:windowtext;}
span.E-postmall21
{mso-style-type:personal;
font-family:"Calibri",sans-serif;
color:#1F497D;}
span.E-postmall22
{mso-style-type:personal-reply;
font-family:"Calibri",sans-serif;
color:#1F497D;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}
@page WordSection1
{size:612.0pt 792.0pt;
margin:70.85pt 70.85pt 70.85pt 70.85pt;}
div.WordSection1
{page:WordSection1;}
/* List Definitions */
@list l0
{mso-list-id:1143740229;
mso-list-type:hybrid;
mso-list-template-ids:780160372 69009425 69009433 69009435 69009423 69009433 69009435 69009423 69009433 69009435;}
@list l0:level1
{mso-level-text:"%1\)";
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-18.0pt;}
@list l0:level2
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-18.0pt;}
@list l0:level3
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
@list l0:level4
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-18.0pt;}
@list l0:level5
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-18.0pt;}
@list l0:level6
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
@list l0:level7
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-18.0pt;}
@list l0:level8
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-18.0pt;}
@list l0:level9
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
ol
{margin-bottom:0cm;}
ul
{margin-bottom:0cm;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="SV" link="#0563C1" vlink="#954F72">
<div class="WordSection1">
<p class="MsoNormal"><span lang="EN-GB" style="color:#1F497D">Question (2) resolved: by studying the Heritrix source code, I found the parameter<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB" style="color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal" style="text-indent:65.2pt"><span lang="EN-GB" style="color:#1F497D">disposition.respectCrawlDelayUpToSeconds<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB" style="color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB" style="color:#1F497D">(In my naivety I thought that metadata.robotsPolicyName=ignore would mean that *<b>all</b>* of robots.txt would be ignored, but …)<o:p></o:p></span></p>
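<p class="MsoNormal"><span lang="EN-GB" style="color:#1F497D">To illustrate how a parameter like this could produce the 10-second wait from question (2), here is a minimal Python sketch. It is not Heritrix code: the function name politeness_delay_ms and the way the four disposition settings combine are assumptions based on the parameter names, and the robots.txt Crawl-delay value is a made-up example.</span></p>
<pre>
```python
def politeness_delay_ms(fetch_duration_ms, delay_factor, min_delay_ms,
                        max_delay_ms, respect_crawl_delay_up_to_s,
                        robots_crawl_delay_s=None):
    """Hypothetical model of a per-host politeness wait (not Heritrix code)."""
    # Base wait: the fetch duration scaled by delayFactor,
    # clamped to the range [minDelayMs, maxDelayMs].
    wait = delay_factor * fetch_duration_ms
    wait = max(min_delay_ms, min(max_delay_ms, wait))
    # Assumption: a Crawl-delay from robots.txt can extend the wait,
    # but only up to respectCrawlDelayUpToSeconds.
    if robots_crawl_delay_s is not None:
        capped_ms = min(robots_crawl_delay_s, respect_crawl_delay_up_to_s) * 1000
        wait = max(wait, capped_ms)
    return int(wait)
```
</pre>
<p class="MsoNormal"><span lang="EN-GB" style="color:#1F497D">With delayFactor 1.0, minDelayMs 300 and maxDelayMs 1000, an assumed respectCrawlDelayUpToSeconds of 300 and a site announcing Crawl-delay: 10, politeness_delay_ms(500, 1.0, 300, 1000, 300, 10) gives 10000 ms. Under the same assumptions, setting respectCrawlDelayUpToSeconds to 0 brings the wait back to 500 ms.</span></p>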
<p class="MsoNormal"><span lang="EN-GB" style="color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB" style="color:#1F497D">Our latest (good!) video meeting revealed that we use different values for candidates.seedsRedirectNewSeeds.<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB" style="color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB" style="color:#1F497D">I found an old table from France on our sbforge site with NAS/Heritrix parameter values, which I am now converting to H3 names and expanding with more parameters. I’ll soon release a document and a page
where you can upload your templates (from which I will extract values into the table).<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB" style="color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB">Regards,<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB" style="font-size:9.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black;mso-fareast-language:SV">-----<br>
<br>
</span><span lang="EN-GB" style="font-family:&quot;Arial&quot;,sans-serif;color:black;mso-fareast-language:SV">Peter Svanberg</span><span lang="EN-GB" style="color:#1F497D;mso-fareast-language:SV"><br>
</span><span lang="EN-GB" style="font-size:9.0pt;color:#1F497D;mso-fareast-language:SV"><br>
</span><span lang="EN-GB" style="font-family:&quot;Arial&quot;,sans-serif;color:#1F497D;mso-fareast-language:SV">National Library of Sweden</span><span lang="EN-GB" style="color:#1F497D;mso-fareast-language:SV"><br>
</span><span lang="EN-GB" style="font-size:9.0pt;font-family:&quot;Arial&quot;,sans-serif;color:#1F497D;mso-fareast-language:SV">Phone: +46 10 709
</span><span lang="EN-GB" style="font-size:9.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black;mso-fareast-language:SV">32 78</span><span lang="EN-GB" style="font-size:9.0pt;color:#1F497D;mso-fareast-language:SV"><br>
<br>
</span><span lang="EN-GB" style="font-size:9.0pt;font-family:&quot;Arial&quot;,sans-serif;color:#1F497D;mso-fareast-language:SV">E-mail</span><span lang="EN-GB" style="font-size:9.0pt;color:#1F497D;mso-fareast-language:SV">:
</span><span lang="EN-GB" style="font-size:9.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black;mso-fareast-language:SV">peter.svanberg@kb.se</span><span lang="EN-GB" style="font-size:9.0pt;color:black;mso-fareast-language:SV"><br>
</span><span lang="EN-GB" style="font-size:9.0pt;font-family:&quot;Arial&quot;,sans-serif;color:#1F497D;mso-fareast-language:SV">Web</span><span lang="EN-GB" style="font-size:9.0pt;color:#1F497D;mso-fareast-language:SV">:
</span><span lang="EN-GB" style="font-size:9.0pt;font-family:&quot;Arial&quot;,sans-serif;color:#1F497D;mso-fareast-language:SV">www.kb.se</span><span lang="EN-GB" style="font-size:9.0pt;color:#1F497D;mso-fareast-language:SV"><br>
<br>
</span><span lang="EN-GB" style="color:#1F497D;mso-fareast-language:SV"><o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB" style="color:#1F497D"><o:p> </o:p></span></p>
<div>
<div style="border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0cm 0cm 0cm">
<p class="MsoNormal"><b><span style="mso-fareast-language:SV">From:</span></b><span style="mso-fareast-language:SV"> NetarchiveSuite-users &lt;netarchivesuite-users-bounces@ml.sbforge.org&gt;
<b>On behalf of </b>Peter Svanberg<br>
<b>Sent:</b> 4 June 2019 01:32<br>
<b>To:</b> netarchivesuite-users@ml.sbforge.org<br>
<b>Subject:</b> Re: [Netarchivesuite-users] My experiences - any hints, advices or comments?<o:p></o:p></span></p>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><span lang="EN-GB" style="color:#1F497D">Hello!<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB" style="color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB" style="color:#1F497D">Sometimes it is good to formulate questions; it makes you think. I hope I now have answers to (4) and (5). (The
</span><span lang="EN-GB">TransclusionDecideRule<span style="color:#1F497D"> does just what I wanted in (4), but I had mixed it up a bit: the limit applies to non-Rs, not to Rs as I remembered it.)<o:p></o:p></span></span></p>
<p class="MsoNormal"><span lang="EN-GB" style="color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB" style="color:#1F497D">BTW, do you set candidates.seedsRedirectNewSeeds=true? That is another way the harvest can wander outside the intended top-level domain. But hopefully a page that a seed in top-level
domain X redirects to *<b>has</b>* content related to that top-level domain (in terms of language and subject). Or not?<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB" style="color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB" style="color:#1F497D">It seems we are all struggling to optimize NAS and Heritrix parameters to get just what we want and avoid problems. Could we find a way to share such experiences, to learn from each other, and to
spare newcomers from having to reinvent the wheel? A dedicated mailing list/forum? A shared spreadsheet?<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB" style="color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB" style="color:#1F497D">/Peter Svanberg<o:p></o:p></span></p>
<p class="MsoNormal"><b><span lang="EN-GB" style="color:#1F497D"><o:p> </o:p></span></b></p>
<p class="MsoNormal"><span lang="EN-GB" style="color:#1F497D"><o:p> </o:p></span></p>
<div>
<div style="border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0cm 0cm 0cm">
<p class="MsoNormal"><b><span lang="EN-GB" style="mso-fareast-language:SV">From:</span></b><span lang="EN-GB" style="mso-fareast-language:SV"> Peter Svanberg
<br>
<b>Sent:</b> 27 May 2019 13:31<br>
<b>To:</b> 'netarchivesuite-users@ml.sbforge.org' &lt;<a href="mailto:netarchivesuite-users@ml.sbforge.org">netarchivesuite-users@ml.sbforge.org</a>&gt;<br>
<b>Subject:</b> My experiences - any hints, advices or comments?<o:p></o:p></span></p>
</div>
</div>
<p class="MsoNormal"><span lang="EN-GB"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB">Hello all!<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB">First, thanks very much, Colin, for your article
</span><a href="https://sbforge.org/display/NAS/Better+QA+With+Logtrix"><span lang="EN-GB">Better QA With Logtrix</span></a><span lang="EN-GB">! I’ve written some Python tools myself, but Logtrix and your extensions seem very interesting and I will test them soon!<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB">Below are some experiences from the last few weeks. I would appreciate using you as a sounding board!<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB"><o:p> </o:p></span></p>
<p class="MsoListParagraph" style="text-indent:-18.0pt;mso-list:l0 level1 lfo2"><![if !supportLists]><span lang="EN-GB"><span style="mso-list:Ignore">1)<span style="font:7.0pt &quot;Times New Roman&quot;">
</span></span></span><![endif]><span lang="EN-GB">Strange Facebook behaviour: in several selective harvests (but not all), requests to facebook.com seemed to be delayed by 10 seconds; the duration value was about 10000 ms higher than for similar
requests to other domains. Has anyone seen that? Does Facebook sometimes consider me an evil robot and silently hand out some punishment …?<o:p></o:p></span></p>
<p class="MsoListParagraph" style="text-indent:-18.0pt;mso-list:l0 level1 lfo2"><![if !supportLists]><span lang="EN-GB"><span style="mso-list:Ignore">2)<span style="font:7.0pt &quot;Times New Roman&quot;">
</span></span></span><![endif]><span lang="EN-GB">Often, at the end of a harvest job that is part of a snapshot (test) run, when just one domain is left, there is a 10-second delay between requests (not fetch duration, just waiting). It looks like a politeness mechanism,
but I can’t map any of those settings to 10 seconds … (values below). Any hints?<o:p></o:p></span></p>
<p class="MsoListParagraph" style="text-indent:-18.0pt;mso-list:l0 level1 lfo2"><![if !supportLists]><span lang="EN-GB"><span style="mso-list:Ignore">3)<span style="font:7.0pt &quot;Times New Roman&quot;">
</span></span></span><![endif]><span lang="EN-GB">When a harvest is limited (essentially by
</span><span lang="EN-GB" style="font-size:9.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#333333;background:white">MaxBytesPerDomain), you want the most important URLs for each seed to have been fetched by the time the limit is hit. My digging in the crawl logs with Python
showed that this was not the case with the default </span><span lang="EN-GB">CostUriPrecedencePolicy. So I found HopsUriPrecedencePolicy, which sorts the queue by the length of the hops path (or by just the number of links, Ls, in the hops path, which is what I use). This
seems to make the seeds more evenly harvested before the limit kicks in. Are you using this or something else?<o:p></o:p></span></p>
<p class="MsoListParagraph" style="text-indent:-18.0pt;mso-list:l0 level1 lfo2"><![if !supportLists]><span lang="EN-GB"><span style="mso-list:Ignore">4)<span style="font:7.0pt &quot;Times New Roman&quot;">
</span></span></span><![endif]><span lang="EN-GB">The algorithm for TransclusionDecideRule allows infinitely long hops paths (as long as the path doesn’t contain Ls or Ss, or Xs or non-Rs after Ls; a recent example is RRLLERRRREPRP; looping redirects never seem to
stop!). Can you repeat the TooManyHopsDecideRule after the TransclusionDecideRule with a new value? (Example below.)<o:p></o:p></span></p>
<p class="MsoListParagraph" style="text-indent:-18.0pt;mso-list:l0 level1 lfo2"><![if !supportLists]><span lang="EN-GB"><span style="mso-list:Ignore">5)<span style="font:7.0pt &quot;Times New Roman&quot;">
</span></span></span><![endif]><span lang="EN-GB">To further optimize what we fetch, I would like to prioritize in-seed/in-domain/in-country URLs, without excluding embedded images etc. that fall outside them. But that seems hard to accomplish with decide rules or precedence
policies. Or is it?<o:p></o:p></span></p>
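<p class="MsoNormal"><span lang="EN-GB">The hop-based ordering described in point 3 can be sketched in a few lines of Python. This is only an illustration of the idea, not the HopsUriPrecedencePolicy implementation: the helper names link_hops and order_queue, the navlinks_only flag and the sample hops paths are all hypothetical.</span></p>
<pre>
```python
def link_hops(hops_path):
    """Number of navigational link hops (L) in a hops path."""
    return hops_path.count("L")


def order_queue(candidates, navlinks_only=True):
    """Sort (url, hops_path) pairs so that URIs closest to the seed come first.

    Hypothetical helper: with navlinks_only=True only L hops are counted,
    otherwise the full length of the hops path is used.
    """
    def key(item):
        _url, path = item
        return link_hops(path) if navlinks_only else len(path)

    # sorted() is stable, so URIs with equal depth keep their discovery order.
    return sorted(candidates, key=key)
```
</pre>
<p class="MsoNormal"><span lang="EN-GB">For the made-up candidates [("a", "LL"), ("b", "RRRE"), ("c", "L")], counting only Ls orders them b, c, a, while counting all hops orders them c, a, b.</span></p>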
<p class="MsoNormal"><span lang="EN-GB"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB">Politeness settings:<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB">disposition.delayFactor=1.0<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB">disposition.maxDelayMs=1000<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB">disposition.minDelayMs=300 <o:p>
</o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB">Decide rule example:<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB">&lt;bean class="...TooManyHopsDecideRule"&gt;<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:18.0pt"><span lang="EN-GB">&lt;property name="maxHops" value="5" /&gt;<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB">&lt;/bean&gt;<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB">&lt;bean class="...TransclusionDecideRule"&gt;<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:18.0pt"><span lang="EN-GB">&lt;property name="maxTransHops" value="3" /&gt;<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:18.0pt"><span lang="EN-GB">&lt;property name="maxSpeculativeHops" value="0" /&gt;<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB">&lt;/bean&gt;<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB">&lt;bean class="...TooManyHopsDecideRule"&gt;<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left:18.0pt"><span lang="EN-GB">&lt;property name="maxHops" value="10" /&gt;<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB">&lt;/bean&gt;<o:p></o:p></span></p>
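<p class="MsoNormal"><span lang="EN-GB">To reason about how the three rules above might interact, here is a rough Python model. It is a sketch under stated assumptions, not Heritrix code: it assumes that the last rule with an opinion wins, that TooManyHopsDecideRule rejects when the hops path is longer than maxHops, and that TransclusionDecideRule accepts based on the hops after the last L or S (at most maxTransHops non-R hops and at most maxSpeculativeHops X hops). All function names and the initial decision are hypothetical.</span></p>
<pre>
```python
def transclusion_accepts(hops_path, max_trans_hops=3, max_speculative_hops=0):
    """Hypothetical model of the trailing-hops check (not Heritrix code)."""
    # Trailing run of hops after the last navigational link (L) or submit (S).
    cut = max(hops_path.rfind("L"), hops_path.rfind("S"))
    tail = hops_path[cut + 1:]
    if not tail:
        return False  # path ends in L or S: no transclusion to consider
    if tail.count("X") > max_speculative_hops:
        return False  # too many speculative hops
    non_r = len(tail) - tail.count("R")
    return max_trans_hops >= non_r  # redirects (R) are not counted


def too_many_hops_rejects(hops_path, max_hops):
    """Assumed behaviour: reject when the whole hops path is too long."""
    return len(hops_path) > max_hops


def decide(hops_path, initial="ACCEPT"):
    """Apply the three rules in order; assumption: the last opinion wins."""
    decision = initial
    if too_many_hops_rejects(hops_path, 5):
        decision = "REJECT"
    if transclusion_accepts(hops_path, 3, 0):
        decision = "ACCEPT"
    if too_many_hops_rejects(hops_path, 10):
        decision = "REJECT"
    return decision
```
</pre>
<p class="MsoNormal"><span lang="EN-GB">In this model an embed at the end of a seven-hop path ("LLLLLLE") is still accepted, while a redirect loop such as "L" followed by twelve Rs is accepted by the transclusion check but then rejected by the second hop limit.</span></p>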
<p class="MsoNormal"><span lang="EN-GB"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB">Regards,<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB"><o:p> </o:p></span></p>
<p class="MsoNormal" style="margin-bottom:12.0pt"><span lang="EN-GB" style="font-size:9.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black;mso-fareast-language:SV">-----<br>
<br>
</span><span lang="EN-GB" style="font-family:&quot;Arial&quot;,sans-serif;color:black;mso-fareast-language:SV">Peter Svanberg</span><span lang="EN-GB" style="mso-fareast-language:SV"><br>
</span><span lang="EN-GB" style="font-size:9.0pt;font-family:&quot;Arial&quot;,sans-serif;mso-fareast-language:SV">Technical officer<br>
Digital Collections Department, Newspapers, Radio and Television Division</span><span lang="EN-GB" style="font-size:9.0pt;mso-fareast-language:SV">
<br>
<br>
</span><span lang="EN-GB" style="font-family:&quot;Arial&quot;,sans-serif;mso-fareast-language:SV">National Library of Sweden</span><span lang="EN-GB" style="mso-fareast-language:SV"><br>
</span><span lang="EN-GB" style="font-size:9.0pt;font-family:&quot;Arial&quot;,sans-serif;mso-fareast-language:SV">PO Box 5039
<br>
SE-104 51 Stockholm<br>
Visits: Karlavägen 100, Stockholm <br>
Phone: +46 10 709 <span style="color:black">32 78</span></span><span lang="EN-GB" style="font-size:9.0pt;mso-fareast-language:SV"><br>
<br>
</span><span lang="EN-GB" style="font-size:9.0pt;font-family:&quot;Arial&quot;,sans-serif;mso-fareast-language:SV">E-mail</span><span lang="EN-GB" style="font-size:9.0pt;mso-fareast-language:SV">:
</span><a href="mailto:peter.svanberg@kb.se"><span lang="EN-GB" style="font-size:9.0pt;font-family:&quot;Arial&quot;,sans-serif;mso-fareast-language:SV">peter.svanberg@kb.se</span></a><span lang="EN-GB" style="font-size:9.0pt;color:black;mso-fareast-language:SV"><br>
</span><span lang="EN-GB" style="font-size:9.0pt;font-family:&quot;Arial&quot;,sans-serif;mso-fareast-language:SV">Web</span><span lang="EN-GB" style="font-size:9.0pt;mso-fareast-language:SV">:
</span><a href="http://www.kb.se"><span lang="EN-GB" style="font-size:9.0pt;font-family:&quot;Arial&quot;,sans-serif;mso-fareast-language:SV">www.kb.se</span></a><span lang="EN-GB" style="mso-fareast-language:SV"><o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB"><o:p> </o:p></span></p>
</div>
</body>
</html>