<font size=2 face="sans-serif">Hello all,</font><br><br><br><font size=2 face="sans-serif">In April, we organized an internal workshop
on responsive websites. As a start, we selected a sample of websites (NOMBRE,
TYPES). We first tried to view the archives of these sites with a
more recent version of Firefox and Chromium: half of the problems disappeared,
which led us to conclude that many problems are in fact access issues
and not crawling issues. As we used a Firefox user agent for the crawl, the visual quality
was better with Firefox than with Chromium.</font><br><br><font size=2 face="sans-serif">In a second step, we analysed the source
code of the websites that had crawling issues. The conclusion of this
analysis was that each site has its own peculiarities.
To solve the crawling problems, we tried:</font><br><font size=2 face="sans-serif">- to use various user agents (e.g. a specific
version of the Firefox user agent, Chrome), but this did not significantly
change the quality of the crawl, and the choice of user agent must be
consistent with the browser used for access.</font><br><font size=2 face="sans-serif">- to crawl the websites with the IIPC
Heritrix 3 version and the default JavaScript extractor. It solved some
problems, but not all: it dramatically reduced the number of 404 errors
related to JavaScript, thanks to a JavaScript extractor that seemed to be
more powerful.</font><br><font size=2 face="sans-serif">- to crawl the websites with the latest
release of Umbra included in NAS. During the tests, Umbra crashed, as it did during
our first tests in December. It is very efficient for social networks such as
Instagram or Pinterest, especially for crawling images. But due to the instability
of the application, it is impossible to put it into production. We will probably
test it again during the preparation of our broad crawl tests.</font><br><br><font size=2 face="sans-serif">Further to the fire at Notre-Dame Cathedral
in Paris on April 15th, the IIPC contributed tremendously to the special
crawl we launched. 256 sites were selected by the community. We thank all
the contributors for their help.</font><br><br><br><font size=2 face="sans-serif">Best regards,</font><br><font size=2 face="sans-serif">The BnF digital legal deposit team</font><font face="sans-serif"><hr />
</font>