[Netarchivesuite-users] Deduplication of 3xx-code non-text responses
Peter Svanberg
Peter.Svanberg at kb.se
Tue Jun 3 19:08:09 CEST 2025
Hello!
Sara mentioned a problem with deduplication of redirect responses. Did I understand correctly that there has been an increase in non-text content types on 3xx responses lately?
Formally, the content type of such a response should describe the type of the body of the response, if present. Nothing else. So I suppose those other type values refer to the content of the pointed-to URL, which is incorrect.
I checked some WARCs from our current broad crawl and found the following on 3xx responses:
Amount  Code  Content type
    83  301   application/binary
   166  302   application/binary
     1  303   application/binary
     1  301   application/rss+xml
     1  301   image/jpeg
     1  302   image/pjpeg
     4  301   image/png
     1  301   image/x-icon
 28560  301   text/html
  7195  302   text/html
   251  307   text/html
     6  308   text/html
     1  301   text/plain
 10241  302   text/plain
     3  308   text/plain
   161  301   unknown
   673  302   unknown
     8  307   unknown
    14  308   unknown
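A tally like the one above could be produced along the following lines. This is only a minimal sketch, not the script actually used: it assumes the status code and Content-Type header of each response record have already been pulled out of the WARCs (for example with a WARC reader library), and the names `normalize_content_type` and `tally_redirects` are made up for illustration. Treating a missing header as "unknown" is also an assumption, chosen to match the label in the table.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def normalize_content_type(value: Optional[str]) -> str:
    """Reduce a Content-Type header value to its bare media type.

    Missing or empty headers are reported as "unknown" (an assumption
    that mirrors the label used in the table above).
    """
    if not value or not value.strip():
        return "unknown"
    # Drop parameters such as "; charset=utf-8" and normalize case.
    return value.split(";", 1)[0].strip().lower() or "unknown"


def tally_redirects(records: Iterable[Tuple[int, Optional[str]]]) -> Counter:
    """Count (status code, media type) pairs, for 3xx responses only."""
    counts: Counter = Counter()
    for status, content_type in records:
        if 300 <= status <= 399:
            counts[(status, normalize_content_type(content_type))] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample data; real input would come from WARC records.
    sample = [
        (301, "text/html; charset=utf-8"),
        (302, "application/binary"),
        (302, None),
        (200, "image/png"),  # not a redirect, so it is ignored
    ]
    for (status, media_type), n in sorted(tally_redirects(sample).items()):
        print(f"{n:6d}  {status}  {media_type}")
```

Stripping the parameters before counting keeps `text/html; charset=utf-8` and `text/html` in the same row.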
How should this be solved? By adding a way to filter what should be deduplicated on response codes as well?
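To make the idea concrete, here is a rough sketch of such a filter. It is hypothetical and not part of NetarchiveSuite or Heritrix: the function name `should_deduplicate` is invented, and it assumes that deduplication is currently decided from the content type alone, with text types excluded.

```python
import re

# Assumption: text/* responses are excluded from deduplication by a
# content-type pattern, as the subject of this thread suggests.
TEXT_TYPE = re.compile(r"^text/", re.IGNORECASE)


def should_deduplicate(status: int, content_type: str) -> bool:
    """Return True if a response is a candidate for deduplication."""
    if 300 <= status <= 399:
        # Redirects are never deduplicated, whatever type they claim.
        return False
    # Otherwise fall back to the content-type rule: skip text types.
    return TEXT_TYPE.match(content_type) is None
```

With a check like this, a 301 response labelled `image/png` would be skipped, while a 200 response with the same type would still be deduplicated.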
Peter Svanberg
Technical officer
Legal Deposit and Metadata Department
Digital Material Legal Deposit Unit
National Library of Sweden
PO Box 5039, SE-102 41 Stockholm
Visits: Karlavägen 96, Stockholm
+46 10-709 32 78
Peter.Svanberg at kb.se
www.kb.se<https://www.kb.se/>