<font size=2 face="sans-serif">Hello all,</font><br><br><font size=2 face="sans-serif">We have finished analysing all the 2017
crawl reports. Over the year, we crawled 2.2 billion URLs, amounting to 145.06 TB
(70% for the broad crawl and 30% for the selective crawls). The quality
of the broad crawl is good despite the small budget per domain: 85.1% of
the domains were archived in full and, based on a visual inspection, 90% of
the captures are of good quality in the web archives. Our collections now
amount to 938.58 TB. The statistics also confirm that Heritrix 3 produces
fewer errors due to incorrect URLs.</font><br><br><font size=2 face="sans-serif">We did a short analysis on the disappearance
of websites based on our latest broad crawls. Between 2016 and 2017, 676,631
domains disappeared, which shows the value of our crawl activity. They
cover 361 different TLDs, or 52% of TLDs in the seed list, including both
old and new ones (e.g. .paris lost 4,098 domains).</font><br><br><font size=2 face="sans-serif">We also analysed IDNs, as it's the first
time we've been able to crawl them. However, after a peak in 2013 with more
than 50,000 IDNs, their number is decreasing (34,656 in 2017). From a sample
of 2,500 IDNs, at least 376, or more than 15%, are mirror websites (e.g. there
are 90 IDNs for numerix.fr). It is likely that producers reserved IDNs as
a precaution but did not keep them. Only 11 in the sample use non-Latin
characters, although IDNs are expanding in Arabic-speaking and Asian
regions, outside the French scope. Out of the 34,656 IDNs, 75% are active
(HTTP response 200). Based on a visual inspection, only 8% correspond to a
website with content (compared to 47% for the global broad crawl) and 14%
redirect to a non-IDN domain: this confirms our hypothesis on the reservation
of IDNs. Among websites with content, 70.8% are commercial, compared to
67.7% for the global broad crawl.</font><br><br><font size=2 face="sans-serif">Finally, we also studied seeds from
last year's election crawl in the broad crawl: out of 12,659 seeds, only
47% are still online. Based on a sample of 100 active websites, 47% have
been updated. This shows that the broad crawl is a real complement to the
selective elections crawl.</font><br><font size=2 face="sans-serif"><br>Best regards,</font><br><font size=2 face="sans-serif">The BnF digital legal deposit team</font><font face="sans-serif"><hr />
</font>