<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<!--[if !mso]><style>v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
w\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}
</style><![endif]--><style><!--
/* Font Definitions */
@font-face
{font-family:Mangal;
panose-1:0 0 4 0 0 0 0 0 0 0;}
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0cm;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif;
mso-fareast-language:EN-US;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:#0563C1;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:#954F72;
text-decoration:underline;}
span.E-postmall17
{mso-style-type:personal-compose;
font-family:"Calibri",sans-serif;
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;
mso-fareast-language:EN-US;}
@page WordSection1
{size:612.0pt 792.0pt;
margin:70.85pt 70.85pt 70.85pt 70.85pt;}
div.WordSection1
{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="SV" link="#0563C1" vlink="#954F72">
<div class="WordSection1">
<p class="MsoNormal"><span lang="EN-GB" style="font-family:"Arial",sans-serif">Post-meeting comments on the fact that NAS now adds four seeds for each new domain: http and https, each with and without a leading www. Does that get you all the pages there are, without too many duplicates?<o:p></o:p></span></p>
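<p class="MsoNormal"><span lang="EN-GB" style="font-family:"Arial",sans-serif">To make the four-seed scheme concrete, here is a minimal Python sketch. The function name seed_variants and the domain example.org are made up for illustration; this is not NAS code, only the combination of scheme and www prefix described above.<o:p></o:p></span></p>

```python
from itertools import product


def seed_variants(domain):
    """Return the four seed URLs for a domain: http/https, with and without www.

    Hypothetical helper for illustration; not taken from NAS.
    """
    # Normalise the input so "www.example.org" and "example.org" give the same seeds.
    bare = domain[4:] if domain.startswith("www.") else domain
    return [
        "{}://{}{}/".format(scheme, prefix, bare)
        for scheme, prefix in product(("http", "https"), ("", "www."))
    ]


seeds = seed_variants("example.org")
for seed in seeds:
    print(seed)
```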
<p class="MsoNormal"><span lang="EN-GB" style="font-family:"Arial",sans-serif"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB" style="font-family:"Arial",sans-serif">The www case is also affected by the Heritrix canonicalization. The StripWWWRule removes the www part unless the URL points to the top of the site:<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB" style="font-family:"Arial",sans-serif"><o:p> </o:p></span></p>
<p class="MsoNormal" style="margin-left:65.2pt;background:white"><span lang="EN-GB" style="font-size:10.5pt;font-family:"Arial",sans-serif;color:#474747;mso-fareast-language:SV">Strip any 'www' found on http/https URLs, IF they have some path/query component
(content after third slash). (Top 'slash page' URIs are left unstripped, so that we prefer crawling redundant top pages to missing an entire site only available from either the www-full or www-less hostname, but not both).<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB" style="font-family:"Arial",sans-serif"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB" style="font-family:"Arial",sans-serif">If you use this rule, which is the default, it seems to imply, partly contradicting the description above, that *<b>if</b>* the entrance pages with and without www are different, you get them both. But if pages further down are different while having the same URL path as their non-www counterparts, you still get just one of them, and which one you get is probably arbitrary. (Heritrix canonicalizes each URL before storing it in its index of what is already harvested.)<o:p></o:p></span></p>
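<p class="MsoNormal"><span lang="EN-GB" style="font-family:"Arial",sans-serif">A rough Python sketch of the effect described above. It only follows the quoted description of the rule and is not Heritrix's actual code; strip_www, dedup and example.org are names made up for illustration, and the "first URL wins" behaviour is an assumption about how an already-seen index would treat colliding keys.<o:p></o:p></span></p>

```python
from urllib.parse import urlsplit, urlunsplit


def strip_www(url):
    """Drop a leading 'www.' from http/https URLs that have a path or query.

    Top 'slash page' URLs are returned unchanged, as in the quoted description.
    Illustration only; not Heritrix's implementation.
    """
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return url
    has_content = parts.path not in ("", "/") or parts.query != ""
    if has_content and parts.netloc.lower().startswith("www."):
        parts = parts._replace(netloc=parts.netloc[4:])
    return urlunsplit(parts)


def dedup(urls):
    """Keep the first URL seen for each canonical key (assumed index behaviour)."""
    seen = {}
    for url in urls:
        seen.setdefault(strip_www(url), url)
    return list(seen.values())


queue = [
    "http://www.example.org/",
    "http://example.org/",
    "http://www.example.org/about.html",
    "http://example.org/about.html",
]
kept = dedup(queue)
for url in kept:
    print(url)
```

<p class="MsoNormal"><span lang="EN-GB" style="font-family:"Arial",sans-serif">In this sketch both top pages survive, but only whichever about.html was queued first is kept. The scheme is left untouched, so http and https variants still count as different keys here.<o:p></o:p></span></p>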
<p class="MsoNormal"><span lang="EN-GB" style="font-family:"Arial",sans-serif"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB" style="font-family:"Arial",sans-serif">Have you turned this rule off, generally?<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB" style="font-family:"Arial",sans-serif"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB" style="font-family:"Arial",sans-serif">Another default rule is LowercaseRule, which assumes that no two different pages have URLs that differ only in case. Hopefully that is true, but URLs are not case-insensitive in general: the scheme and host are, while the path and query can be case-sensitive.<o:p></o:p></span></p>
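<p class="MsoNormal"><span lang="EN-GB" style="font-family:"Arial",sans-serif">A small Python sketch of the risk, assuming the rule lowercases the whole URL string as the text above implies. Both function names and the example URLs are made up for illustration; lowercase_host_only is just one possible more conservative alternative, not something Heritrix is claimed to provide.<o:p></o:p></span></p>

```python
from urllib.parse import urlsplit, urlunsplit


def lowercase_all(url):
    """Lowercase the entire URL (assumed behaviour of a whole-URL lowercase rule)."""
    return url.lower()


def lowercase_host_only(url):
    """Lowercase only the scheme and host part, leaving path and query as they are."""
    parts = urlsplit(url)
    return urlunsplit(
        parts._replace(scheme=parts.scheme.lower(), netloc=parts.netloc.lower())
    )


a = "http://Example.org/Report.pdf"
b = "http://example.org/report.pdf"

# With whole-URL lowercasing the two URLs share one key, so only one is fetched.
collide_all = lowercase_all(a) == lowercase_all(b)
# With host-only lowercasing they stay distinct.
collide_host = lowercase_host_only(a) == lowercase_host_only(b)
print(collide_all, collide_host)
```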
<p class="MsoNormal"><span lang="EN-GB" style="font-family:"Arial",sans-serif"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB" style="font-family:"Arial",sans-serif">Another aspect of URL normalization/canonicalization is that the access infrastructure (Pywb indexing and UI) should use the same normalization as the crawler. (See
<a href="https://kris-sigur.blogspot.com/2015/03/uri-canonicalization-in-web-archiving.html">
https://kris-sigur.blogspot.com/2015/03/uri-canonicalization-in-web-archiving.html</a> )<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB" style="font-family:"Arial",sans-serif"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB" style="font-family:"Arial",sans-serif">(See also in
<a href="https://iipc.slack.com/archives/C2F63EUV7/p1674684907864079">Slack</a>.)<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB" style="font-family:"Arial",sans-serif"><o:p> </o:p></span></p>
<table class="MsoNormalTable" border="0" cellspacing="3" cellpadding="0">
<tbody>
<tr>
<td style="padding:0cm 0cm 0cm 0cm">
<p class="MsoNormal"><a href="https://www.kb.se/"><span style="font-size:9.0pt;font-family:"Arial",sans-serif;color:blue;mso-fareast-language:SV;text-decoration:none"><img border="0" width="113" height="170" style="width:1.1736in;height:1.7708in" id="_x0000_i1025" src="https://signaturloggor.kb.se/png/Outlook%20logo%20m%d0%a4rkbl%d0%96.png" alt="KB Logo"></span></a><span style="font-size:9.0pt;font-family:"Arial",sans-serif;color:black;mso-fareast-language:SV"><o:p></o:p></span></p>
</td>
<td style="padding:0cm 0cm 0cm 5.25pt">
<p class="MsoNormal" style="mso-margin-top-alt:2.0pt;margin-right:0cm;margin-bottom:1.0pt;margin-left:0cm">
<b><span lang="EN-GB" style="font-size:9.0pt;font-family:"Arial",sans-serif;color:black;mso-fareast-language:SV">Peter Svanberg</span></b><span lang="EN-GB" style="font-size:9.0pt;font-family:"Arial",sans-serif;color:black;mso-fareast-language:SV"><o:p></o:p></span></p>
<p class="MsoNormal" style="mso-margin-top-alt:2.0pt;margin-right:0cm;margin-bottom:1.0pt;margin-left:0cm">
<b><span lang="EN-GB" style="font-size:8.0pt;font-family:"Arial",sans-serif;color:black;mso-fareast-language:SV">Technical officer
</span></b><span lang="EN-GB" style="font-size:8.0pt;font-family:"Arial",sans-serif;color:black;mso-fareast-language:SV"><o:p></o:p></span></p>
<p class="MsoNormal" style="mso-margin-top-alt:2.0pt;margin-right:0cm;margin-bottom:1.0pt;margin-left:0cm">
<span lang="EN-GB" style="font-size:8.0pt;font-family:"Arial",sans-serif;color:black;mso-fareast-language:SV">Acquisitions and Metadata Department<br>
Film, Games, Sheet Music and Web Unit<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-GB" style="font-size:9.0pt;font-family:"Arial",sans-serif;color:black;mso-fareast-language:SV"><o:p> </o:p></span></p>
<p class="MsoNormal" style="mso-margin-top-alt:2.0pt;margin-right:0cm;margin-bottom:1.0pt;margin-left:0cm">
<b><span lang="EN-GB" style="font-size:8.0pt;font-family:"Arial",sans-serif;color:black;mso-fareast-language:SV">National Library of Sweden</span></b><span lang="EN-GB" style="font-size:8.0pt;font-family:"Arial",sans-serif;color:black;mso-fareast-language:SV"><o:p></o:p></span></p>
<p class="MsoNormal" style="mso-margin-top-alt:2.0pt;margin-right:0cm;margin-bottom:1.0pt;margin-left:0cm">
<span lang="EN-GB" style="font-size:8.0pt;font-family:"Arial",sans-serif;color:black;mso-fareast-language:SV">PO Box 5039, SE-102 41 Stockholm<o:p></o:p></span></p>
<p class="MsoNormal" style="mso-margin-top-alt:2.0pt;margin-right:0cm;margin-bottom:1.0pt;margin-left:0cm">
<span style="font-size:8.0pt;font-family:"Arial",sans-serif;color:black;mso-fareast-language:SV">Visits: Karlavägen 96, Stockholm<o:p></o:p></span></p>
<p class="MsoNormal" style="mso-margin-top-alt:2.0pt;margin-right:0cm;margin-bottom:1.0pt;margin-left:0cm">
<span style="font-size:8.0pt;font-family:"Arial",sans-serif;color:black;mso-fareast-language:SV">+46 10-709 32 78<o:p></o:p></span></p>
<p class="MsoNormal" style="mso-margin-top-alt:2.0pt;margin-right:0cm;margin-bottom:1.0pt;margin-left:0cm">
<span style="font-size:8.0pt;font-family:"Arial",sans-serif;color:black;mso-fareast-language:SV">Peter.Svanberg@kb.se<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:8.0pt;font-family:"Arial",sans-serif;color:black;mso-fareast-language:SV"><a href="https://www.kb.se/"><span style="color:blue">www.kb.se</span></a><o:p></o:p></span></p>
</td>
</tr>
</tbody>
</table>
</div>
</body>
</html>