[Netarchivesuite-users] The http/https and www. or not issue
Peter Svanberg
Peter.Svanberg at kb.se
Tue Feb 7 15:24:23 CET 2023
Post-meeting comments on the issue that NAS now adds four seeds for each new domain: http/https, each with and without a leading www. Do you then get all the pages there are, without too many duplicates?
The www case is also affected by the Heritrix canonicalization. The StripWWWRule rule removes the www part, unless the URL points to the top of the site:
Strip any 'www' found on http/https URLs, IF they have some path/query component (content after third slash). (Top 'slash page' URIs are left unstripped, so that we prefer crawling redundant top pages to missing an entire site only available from either the www-full or www-less hostname, but not both).
If you use this rule, which is the default, it seems to imply (partly contradicting the description above) that *if* the entrance pages with and without www are different, you get them both; but if pages further down are different while having the same URL path as their non-www counterparts, you still get just one of them, and which one you get is probably random. (Heritrix applies the canonicalization before storing the URL in its index of what has already been harvested.)
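To make the effect concrete, here is a minimal Python sketch of the behaviour described above. It is not Heritrix code: the regular expression is my own reading of the rule's description, and `strip_www` and `crawl_order` are hypothetical names used only for illustration. The example.org URLs are placeholders.

```python
import re

# Assumption: pattern written from the rule's description, not taken from Heritrix.
# It only matches when something follows the slash after the host name,
# so top "slash page" URLs are left unstripped.
_STRIP_WWW = re.compile(r"^(https?://)www\.([^/]*/.+)$", re.IGNORECASE)


def strip_www(url: str) -> str:
    """Drop a leading 'www.' from the host if the URL has a path/query part."""
    return _STRIP_WWW.sub(r"\1\2", url)


def crawl_order(urls):
    """Return the URLs that would be fetched when an already-seen index
    is keyed on the canonicalized form (hypothetical helper)."""
    seen = set()
    fetched = []
    for url in urls:
        key = strip_www(url)
        if key in seen:
            continue  # treated as already harvested
        seen.add(key)
        fetched.append(url)
    return fetched


discovered = [
    "http://www.example.org/",
    "http://example.org/",
    "http://www.example.org/news",
    "http://example.org/news",
]
print(crawl_order(discovered))
```

In this sketch both top pages are fetched, but only the first discovered of the two /news URLs is, so which variant ends up in the archive depends on discovery order.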
Have you turned this rule off, generally?
Another default rule is LowercaseRule, which implies that there shouldn't be any two different pages whose URLs differ only in casing. Hopefully true. But URLs are not case-insensitive: only the scheme and host are, while the path is case-sensitive.
Another aspect of URL normalization/canonicalization is that the access infrastructure (Pywb indexing and UI) should use the same normalization as the crawler. (See https://kris-sigur.blogspot.com/2015/03/uri-canonicalization-in-web-archiving.html )
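A small Python sketch of the casing point, under the assumption that the rule simply lowercases the whole URL string. The function names are hypothetical and this is not Heritrix code. It contrasts full lowercasing with lowercasing only the scheme and host, which are the parts that are case-insensitive by specification.

```python
from urllib.parse import urlsplit, urlunsplit


def lowercase_all(url: str) -> str:
    """Assumed behaviour of a lowercase rule: fold the entire URL."""
    return url.lower()


def lowercase_scheme_host(url: str) -> str:
    """Fold only scheme and host; leave path, query and fragment as they are.
    (Simplification: any userinfo in the netloc is lowercased too.)"""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(),
         parts.path, parts.query, parts.fragment)
    )


a = "http://example.org/About.html"
b = "http://example.org/about.html"

# With full lowercasing the two URLs share one key, so only one is harvested.
print(lowercase_all(a) == lowercase_all(b))
# With host-only folding they stay distinct.
print(lowercase_scheme_host(a) == lowercase_scheme_host(b))
```

On a server with a case-sensitive file system the two example URLs could be different pages, which is the case the full lowercasing would lose.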
(See also in Slack<https://iipc.slack.com/archives/C2F63EUV7/p1674684907864079>.)
Peter Svanberg
Technical officer
Acquisitions and Metadata Department
Film, Games, Sheet Music and Web Unit
National Library of Sweden
PO Box 5039, SE-102 41 Stockholm
Visits: Karlavägen 96, Stockholm
+46 10-709 32 78
Peter.Svanberg at kb.se
www.kb.se