[Netarchivesuite-users] Problems with non-fetched image resolutions (srcset data, responsive pages)
Bjarne Andersen
bja at kb.dk
Wed Aug 23 16:26:13 CEST 2023
I think this is most likely more of a playback problem than a crawl problem.
BUT - when it comes to crawling:
To crawl multiple versions of an image, I think you would need a more advanced crawler such as Browsertrix, which, as far as I recall, can be configured to load pages at different "sizes". So it is almost a site-by-site problem to crawl such sites. Alternatively, Heritrix has a "URL rewrite extractor" (I forget the specific name) where you can configure rules like "When you see a URL that matches this RegExp, also crawl a URL built from this pattern (based on parts of the original URL)".
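To make the rule idea concrete, here is a minimal sketch in plain Python. It is not Heritrix configuration; the URL pattern, the width list, and the function name `derive_urls` are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical rule: for image URLs ending in "_<width>w.jpg", also queue
# the same image at a few other widths. Pattern and widths are assumptions.
RULE = re.compile(r"^(?P<base>https?://.+_)(?P<width>\d+)(?P<suffix>w\.jpg)$")
EXTRA_WIDTHS = (300, 600, 900)


def derive_urls(url):
    """Return additional URLs to crawl for a URL matching RULE, else []."""
    match = RULE.match(url)
    if not match:
        return []
    seen_width = int(match.group("width"))
    return [
        f"{match.group('base')}{width}{match.group('suffix')}"
        for width in EXTRA_WIDTHS
        if width != seen_width
    ]
```

Note that a rule like this only helps when the variant URLs can be predicted from the original one. URLs that carry a per-variant signature parameter cannot be derived this way.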
AND - when it comes to playback
I guess you can set up advanced playback rules that strip parts of URLs, so that when playback requests one specific URL you can serve an archived URL that "looks like it". I am not familiar with the specific possibilities in pywb, but my guess is that such options already exist, or that there is some kind of plugin infrastructure where you can write your own rewrite rules for playback. Again, it would most likely need site-by-site configuration to make that work, so it is a very tough job for broad crawls of thousands of domains.
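As an illustration of serving a URL that "looks like" the requested one, the sketch below compares URLs after dropping query parameters that are assumed to select only a rendition. It is plain Python and does not use any pywb API; the ignored parameter names (`w`, `h`, `s`) and both function names are assumptions.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these query parameters only select a rendition of the same image.
IGNORED_PARAMS = {"w", "h", "s"}


def fuzzy_key(url):
    """Return the URL with ignored parameters removed and the rest sorted."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    )
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def find_lookalike(requested_url, archived_urls):
    """Return the first archived URL sharing the requested URL's key, or None."""
    key = fuzzy_key(requested_url)
    for candidate in archived_urls:
        if fuzzy_key(candidate) == key:
            return candidate
    return None
```

Which parameters are safe to ignore depends on the site, which is why this kind of rule tends to be a per-site configuration.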
The only good thing is that such advanced crawling and playback configurations could/should be shared within the community for common applications used all over the internet.
Best
Bjarne
From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Peter Svanberg
Sent: Wednesday, August 23, 2023 10:50 AM
To: netarchivesuite-users at ml.sbforge.org
Subject: [Netarchivesuite-users] Problems with non-fetched image resolutions (srcset data, responsive pages)
Hello! (sending this to Slack also)
We have problems with harvested responsive pages not having all image resolutions to show.
This was supposedly solved in Heritrix issue #477<https://github.com/internetarchive/heritrix3/issues/477> and #488<https://github.com/internetarchive/heritrix3/issues/478>, which seem to be present in the 3.4.0-NAS-7.4.3 version in NAS. (There are several 3.4.0 releases, but according to the dates in the jar files it is probably 3.4.0-20220727. Is the full release name stated somewhere?)
We get outlink lines in the warc like the description in #477<https://github.com/internetarchive/heritrix3/issues/477>: (this is one long line)
outlink: https://svd.vgc.no/v2/images/c3a843d6-8404-47ae-a3f5-2480efdd2709?fit=crop&h=551&q=80&upscale=true&w=980&s=c1baa7b31d1bfa8fbb1080da654cb82e48cc513d%20980w,%20https://svd.vgc.no/v2/images/c3a843d6-8404-47ae-a3f5-2480efdd2709?fit=crop&h=506&q=80&upscale=true&w=900&s=5cf1df8749c0482e2f7122ba36fa30c29ff5895e%20900w,%20https://svd.vgc.no/v2/images/c3a843d6-8404-47ae-a3f5-2480efdd2709?fit=crop&h=450&q=80&upscale=true&w=800&s=7a48dafd5bbd885311c40dc02ab942aeb6999f02%20800w,%20https://svd.vgc.no/v2/images/c3a843d6-8404-47ae-a3f5-2480efdd2709?fit=crop&h=394&q=80&upscale=true&w=700&s=5474c37ad3f95dd4a6659f1554ce6ee0722a941a%20700w,%20https://svd.vgc.no/v2/images/c3a843d6-8404-47ae-a3f5-2480efdd2709?fit=crop&h=338&q=80&upscale=true&w=600&s=6d93ce794d8d7a1d8006d8ee6b95c3ab8324fc5a%20600w,%20https://svd.vgc.no/v2/images/c3a843d6-8404-47ae-a3f5-2480efdd2709?fit=crop&h=281&q=80&upscale=true&w=500&s=125437d7fbf93241a2d43f4f992aefe98efc524b%20500w,%20https://svd.vgc.no/v2/images/c3a843d6-8404-47ae-a3f5-2480efdd2709?fit=crop&h=225&q=80&upscale=true&w=400&s=5193e3067db868c1852fa0f3960d92118ab41a25%20400w,%20https://svd.vgc.no/v2/images/c3a843d6-8404-47ae-a3f5-2480efdd2709?fit=crop&h=169&q=80&upscale=true&w=300&s=dd7d57437c2c130f6dfa47b3b2c0eb7fdf859a32%20300w E source/@srcset
So, wrong ExtractorHTML code in the jar file or bug still not solved? (Or have I misunderstood?)
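The outlink line above looks like a whole srcset value recorded as a single URL, with the width descriptors and separators (`%20980w,%20`) still embedded. For comparison, here is a simplified sketch of how a srcset value splits into separate candidates. It is not the ExtractorHTML code and not the full HTML-specification parsing algorithm; it assumes candidates are separated by a comma followed by whitespace, and the function name `parse_srcset` is hypothetical.

```python
import re


def parse_srcset(srcset):
    """Split a srcset value into (url, descriptor) pairs (simplified)."""
    candidates = []
    # Assumption: a comma followed by whitespace separates candidates.
    for part in re.split(r",\s+", srcset.strip()):
        fields = part.strip(", \t\r\n").split()
        if not fields:
            continue
        url = fields[0]
        descriptor = fields[1] if len(fields) > 1 else ""
        candidates.append((url, descriptor))
    return candidates
```

Applied to the value behind the outlink above, this would be expected to yield eight separate image URLs (980w down to 300w) rather than one long one.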
Side track: Are there not-too-hard ways to handle all existing warc files with srcset-pages where most resolutions are missing? Could Pywb use the resolutions available, instead of showing nothing?
(What happens now is that if the image for the current web browser window size is missing, no image is shown. If you make your browser window smaller, the image may suddenly show up. This is when there is just one image. Seems more complex when there are several.)
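One possible fallback policy, sketched in plain Python: when the requested resolution is missing, pick the smallest archived width that is at least as wide, otherwise the widest one available. This is only an illustration and not a pywb feature. It assumes the width is carried in a `w` query parameter (as in the URLs above) and that `archived_urls` are already known to be renditions of the same image; both function names are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def width_of(url):
    """Read the 'w' query parameter as an int, or None if absent or invalid."""
    values = parse_qs(urlsplit(url).query).get("w", [])
    try:
        return int(values[0])
    except (IndexError, ValueError):
        return None


def closest_available(requested_url, archived_urls):
    """Pick an archived rendition to serve in place of the requested one."""
    wanted = width_of(requested_url)
    sized = [(width_of(url), url) for url in archived_urls]
    sized = [(width, url) for width, url in sized if width is not None]
    if wanted is None or not sized:
        return None
    # Prefer the smallest width that still covers the request, else the widest.
    large_enough = [pair for pair in sized if pair[0] >= wanted]
    return min(large_enough)[1] if large_enough else max(sized)[1]
```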
Regards,
Peter Svanberg
Technical officer
Acquisitions and Metadata Department
Film, Games, Sheet Music and Web Unit
National Library of Sweden
PO Box 5039, SE-102 41 Stockholm
Visits: Karlavägen 96, Stockholm
+46 10-709 32 78
Peter.Svanberg at kb.se<mailto:Peter.Svanberg at kb.se>
www.kb.se<https://www.kb.se/>