[Netarchivesuite-users] Deduplication of 3xx-code non-text responses

Peter Svanberg Peter.Svanberg at kb.se
Tue Jun 3 19:08:09 CEST 2025


Hello!

Sara mentioned a problem with deduplication of redirect responses. Did I understand correctly that there has been an increase in non-text content types on 3xx responses lately?

Formally, the content type of such a response should describe the type of the body of the response, if present, and nothing else. So I suppose those other type values refer to the content of the URL being pointed to, which is incorrect.

I checked some WARCs from our current broad crawl and found the following on 3xx responses:

Amount Code     Content type
    83  301     application/binary
   166  302     application/binary
      1 303     application/binary
      1 301     application/rss+xml

      1 301     image/jpeg
      1 302     image/pjpeg
      4 301     image/png
      1 301     image/x-icon

  28560 301     text/html
   7195 302     text/html
    251 307     text/html
      6 308     text/html
      1 301     text/plain
  10241 302     text/plain
      3 308     text/plain

    161 301     unknown
    673 302     unknown
      8 307     unknown
     14 308     unknown
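
A tally like the one above could be produced roughly as follows. This is only a minimal sketch, not the script actually used: it assumes the HTTP response header blocks have already been extracted from the WARC records as strings, and the names status_and_type and tally_redirects are made up for illustration. It also assumes that "unknown" means the response carried no Content-Type header.

```python
from collections import Counter


def status_and_type(http_header_block: str) -> tuple[int, str]:
    """Return (status code, bare content type) for one HTTP response header block."""
    lines = http_header_block.splitlines()
    # Status line looks like "HTTP/1.1 301 Moved Permanently".
    status = int(lines[0].split()[1])
    content_type = "unknown"  # assumption: label used when the header is missing
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "content-type":
            # Drop parameters such as "; charset=utf-8".
            content_type = value.split(";")[0].strip().lower() or "unknown"
            break
    return status, content_type


def tally_redirects(header_blocks) -> Counter:
    """Count (status, content type) pairs, keeping only 3xx responses."""
    counts = Counter()
    for block in header_blocks:
        status, content_type = status_and_type(block)
        if 300 <= status <= 399:
            counts[(status, content_type)] += 1
    return counts
```

Calling tally_redirects on the header blocks and printing the items sorted by content type would give a listing in the same shape as the table.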

How should this be solved? By adding a way to also filter on response codes when deciding what should be deduplicated?
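
To make the question concrete, such a filter might look something like the sketch below. It is purely hypothetical and not the NetarchiveSuite or Heritrix API: the function name should_deduplicate is invented, and it assumes that deduplication is currently applied to every response whose content type does not start with "text/".

```python
import re

# Assumption: text/* responses are already excluded from deduplication.
TEXT_TYPE = re.compile(r"^text/.*")


def should_deduplicate(status: int, content_type: str) -> bool:
    """Decide whether a response is a deduplication candidate (hypothetical rule)."""
    # Proposed extra check: never deduplicate redirects, whatever type they claim.
    if 300 <= status <= 399:
        return False
    # Existing-style check on the content type alone.
    return not TEXT_TYPE.match(content_type.strip().lower())
```

With this rule a 301 response labelled image/png is skipped, while a 200 response with the same content type is still deduplicated.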


Peter Svanberg
Technical officer
Legal Deposit and Metadata Department
Digital Material Legal Deposit Unit

National Library of Sweden
PO Box 5039, SE-102 41 Stockholm
Visits: Karlavägen 96, Stockholm
+46 10-709 32 78
Peter.Svanberg at kb.se
www.kb.se<https://www.kb.se/>

