[Netarchivesuite-curator] BnF NAS update for March

geraldine.camile at bnf.fr geraldine.camile at bnf.fr
Fri Mar 6 18:43:23 CET 2020

Dear All,

In February, we ran many tests: NAS Heritrix IIPC, Instagram and 
Twitter. We hope to finish the analysis in March.

During our last tests with NAS Heritrix IIPC, we found that some images 
are missing from the crawl log. Our first hypothesis is that these images, 
which are crawled by the current version of Heritrix, are noise: they come 
from large image repositories, and their link with the seed domain is not 
obvious. Our second hypothesis is that the new extractor doesn't identify 
We also noticed that Heritrix (both the current version and the IIPC 
version) cannot crawl some responsive images: that's why some small 
images are missing from the home pages of news websites.
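
As a rough way to check which responsive image candidates a page declares, the sketch below lists the URLs found in srcset attributes so they can be compared with the crawl log. It assumes the missing images are declared through srcset on img or source tags, which we have not confirmed; the names SrcsetCollector and srcset_urls are hypothetical, and the comma split is deliberately naive.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SrcsetCollector(HTMLParser):
    """Collect candidate image URLs declared in srcset attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Assumption: responsive images are declared on <img> or <source>.
        if tag not in ("img", "source"):
            return
        for name, value in attrs:
            if name != "srcset" or not value:
                continue
            # Naive parsing: candidates are comma-separated, and the URL is
            # the first whitespace-separated token of each candidate.
            # URLs that themselves contain commas are not handled.
            for candidate in value.split(","):
                parts = candidate.split()
                if not parts:
                    continue
                url = urljoin(self.base_url, parts[0])
                if url not in self.urls:
                    self.urls.append(url)


def srcset_urls(html, base_url):
    """Return absolute srcset URLs in document order, without duplicates."""
    parser = SrcsetCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.urls
```

The resulting list could be compared with the crawl log, or added as additional URLs, to see whether these candidates are the images that are missing.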

We tried to improve our crawling of Instagram profile pages. We used 
a command-line tool, Instalooter 
(https://instalooter.readthedocs.io/en/latest/), to extract metadata 
(URLs, descriptions, comments, ...) about Instagram posts as JSON files.
Currently, we add the thumbnail URLs and picture URLs to the seed list in 
order to crawl them. In the archives, we have the profile page with the 
last 12 posts as thumbnails. For an Instagram crawl to be successful, we 
have to crawl the post pages and be able to display the post metadata.
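
The step of turning the JSON files into seeds could look roughly like the sketch below. It is only an illustration: the key names display_url and thumbnail_src are assumptions about what the JSON dumps contain and should be adjusted to the actual files, and collect_urls and seeds_from_dumps are hypothetical helper names.

```python
import json
from pathlib import Path

# Hypothetical key names: adjust them to what the JSON dumps really contain.
URL_KEYS = ("display_url", "thumbnail_src")


def collect_urls(node, keys=URL_KEYS):
    """Recursively yield HTTP(S) URL strings stored under any of `keys`."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key in keys and isinstance(value, str) and value.startswith("http"):
                yield value
            else:
                # Keep descending: the URLs may be nested inside other objects.
                yield from collect_urls(value, keys)
    elif isinstance(node, list):
        for item in node:
            yield from collect_urls(item, keys)


def seeds_from_dumps(dump_dir, keys=URL_KEYS):
    """Return a sorted, de-duplicated URL list from the *.json files in dump_dir."""
    urls = set()
    for path in Path(dump_dir).glob("*.json"):
        with path.open(encoding="utf-8") as handle:
            urls.update(collect_urls(json.load(handle), keys))
    return sorted(urls)
```

Writing the output of seeds_from_dumps one URL per line gives a list that can be pasted into a seed list.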

Since Twitter changed its tabs, we have had problems crawling 
hashtags: the new tab https://twitter.com/hashtag/Recherche?f=live isn't 
crawled unless it is a seed URL or an additional URL. And in the Wayback, 
the new tab redirects to the home page of the hashtag.
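
Since the new tab has to be declared explicitly, the seed URLs for a list of hashtags could be generated with something like the sketch below. It only follows the URL pattern quoted above; the function name hashtag_seed_urls and the tabs parameter are hypothetical, and any tab value other than live would be an assumption.

```python
from urllib.parse import quote, urlencode


def hashtag_seed_urls(hashtag, tabs=("live",)):
    """Return the hashtag home URL followed by one URL per requested tab."""
    # Drop a leading '#' and percent-encode non-ASCII characters.
    tag = quote(hashtag.lstrip("#"), safe="")
    base = f"https://twitter.com/hashtag/{tag}"
    return [base] + [f"{base}?{urlencode({'f': tab})}" for tab in tabs]


if __name__ == "__main__":
    for url in hashtag_seed_urls("Recherche"):
        print(url)
```

The generated URLs can then be added as seed URLs or additional URLs.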

Best regards,

The BnF digital legal deposit team