[Netarchivesuite-curator] BnF NAS update for March
geraldine.camile at bnf.fr
Fri Mar 6 18:43:23 CET 2020
Dear All,
In February, we ran many tests: NAS Heritrix IIPC, Instagram and
Twitter. We hope to finish all the analysis in March.
During our latest tests with NAS Heritrix IIPC, we found that some images
are missing from the crawl log. Our first hypothesis is that these images,
which the current version of Heritrix does crawl, are noise: they come
from large image repositories and their link with the seed domain is not
obvious. Our second hypothesis is that the new extractor does not identify
them.
We also noticed that Heritrix (both the current version and the IIPC
version) cannot crawl some responsive images, which is why some small
images are missing from the home pages of news websites.
We tried to improve our crawling of Instagram profile pages. We used a
command-line tool, Instalooter
(https://instalooter.readthedocs.io/en/latest/), to extract metadata
(URLs, descriptions, comments, ...) about Instagram posts as JSON files.
Currently, we add the thumbnail URLs and picture URLs to the seed list so
that they are crawled. In the archives, we have the profile page with the
last 12 posts as thumbnails. For an Instagram crawl to be successful, we
still have to crawl the post page and be able to display the post metadata.
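
As an illustration of the seed-list step described above, here is a minimal
Python sketch. It assumes one JSON file per post in a directory, and the
field names "display_url", "thumbnail_src" and "shortcode" are assumptions
about the dumped metadata, not a documented schema; the function names are
hypothetical. Adjust them to whatever the JSON files actually contain.

```python
import json
from pathlib import Path
from typing import List, Optional

# Assumed field names in each post's JSON metadata (not a documented schema).
URL_FIELDS = ("display_url", "thumbnail_src")


def urls_from_post(post: dict) -> List[str]:
    """Return the picture and thumbnail URLs found in one post's metadata."""
    urls = []
    for field in URL_FIELDS:
        value = post.get(field)
        if isinstance(value, str) and value.startswith("http"):
            urls.append(value)
    return urls


def post_page_url(post: dict) -> Optional[str]:
    """Build the post page URL from the (assumed) 'shortcode' field, if present."""
    shortcode = post.get("shortcode")
    if not shortcode:
        return None
    return "https://www.instagram.com/p/" + shortcode + "/"


def build_seed_list(metadata_dir: str) -> List[str]:
    """Collect unique URLs from every *.json file, keeping first-seen order."""
    seen = {}
    for path in sorted(Path(metadata_dir).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            post = json.load(fh)
        candidates = urls_from_post(post)
        page = post_page_url(post)
        if page:
            candidates.append(page)
        for url in candidates:
            seen.setdefault(url, None)
    return list(seen)
```

The resulting list could then be pasted into the seed list of a harvest
definition, one URL per line.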
Since Twitter changed its tabs, we have had problems crawling hashtags:
the new tab https://twitter.com/hashtag/Recherche?f=live is not crawled
unless it is a seed URL or an additional URL. And in the Wayback, the new
tab redirects to the home page of the hashtag.
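
Since the new tab is only crawled when it is listed explicitly, the seed
URLs for a set of hashtags can be generated ahead of time. The Python sketch
below follows the URL pattern quoted above; the function name
hashtag_seed_urls is hypothetical.

```python
from urllib.parse import quote, urlencode

BASE = "https://twitter.com/hashtag/"


def hashtag_seed_urls(tag):
    """Return the hashtag home URL and its 'live' tab URL."""
    # Drop a leading '#' and percent-encode anything unsafe in a URL path.
    home = BASE + quote(tag.lstrip("#"), safe="")
    live = home + "?" + urlencode({"f": "live"})
    return [home, live]


if __name__ == "__main__":
    for url in hashtag_seed_urls("Recherche"):
        print(url)
```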
Best regards,
The BnF digital legal deposit team