[Netarchivesuite-curator] BnF NAS update for May

geraldine.camile at bnf.fr
Fri May 10 16:24:00 CEST 2019

Hello all,

In April, we organized an internal workshop on responsive websites. As a 
start, we selected a sample of websites (NUMBER, TYPES). We first tried to 
view the archived versions of these sites with more recent versions of 
Firefox and Chromium: half of the problems disappeared, which led us to 
conclude that many problems are in fact access issues rather than crawling 
issues. As we had crawled with a Firefox user agent, the visual quality 
was better in Firefox than in Chromium.

In a second step, we analysed the source code of the websites that had 
crawling issues. The conclusion of this analysis was that each site has 
its own peculiarities. To solve the crawling problems, we tried:
- to use various user agents (e.g. specific versions of the Firefox user 
agent, Chrome), but this did not significantly change the quality of the 
crawl; moreover, the choice of user agent must be consistent with the 
browser used for access.
- to crawl the websites with the IIPC Heritrix 3 version and the default 
JavaScript extractor. It solved some problems, but not all: it 
dramatically reduced the number of JavaScript-related 404 errors, thanks 
to a JavaScript extractor that seems to be more powerful.
- to crawl the websites with the latest release of Umbra included in NAS. 
During the tests, Umbra crashed, as it did during our first tests in 
December. It is very efficient for social networks such as Instagram or 
Pinterest, especially for crawling images. But due to the instability of 
the application, it is impossible to put it into production. We will 
probably test it again during the preparation of our broad crawl tests.

Following the fire at Notre-Dame Cathedral in Paris on April 15th, the 
IIPC contributed tremendously to the special crawl we launched. 256 
sites were selected by the community. We thank all the contributors for 
their help.

Best regards,
The BnF digital legal deposit team