[Netarchivesuite-curator] BnF NAS update for February
peter.stirling at bnf.fr
Fri Feb 9 09:31:32 CET 2018
Hello all,

We have finished analysing all the 2017 crawl reports. Over the year we
crawled 2.2 billion URLs and 145.06 TB (70% for the broad crawl and 30%
for the selective crawls). The quality of the broad crawl is good despite
the small budget per domain: 85.1% of the domains were archived in full
and, after a visual inspection, 90% of the captures are of good quality in
the web archives. Our collections now amount to 938.58 TB. The statistics
also confirm that Heritrix 3 produces fewer errors due to incorrect URLs.
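
As a rough check on the figures above, the arithmetic can be sketched as
follows. This is only an illustration: it assumes that the 70%/30% split
applies to the data volume and that 1 TB means 10**12 bytes, neither of
which is stated in the report. The variable names are made up for the
example.

```python
# Figures quoted in the report for 2017.
TOTAL_URLS = 2.2e9        # URLs crawled over the year
TOTAL_TB = 145.06         # data collected over the year, in TB
COLLECTION_TB = 938.58    # size of the whole collection, in TB
BROAD_SHARE = 0.70        # assumption: share of the volume, not of the URLs
SELECTIVE_SHARE = 0.30

# Assumption: decimal terabytes (1 TB = 10**12 bytes).
BYTES_PER_TB = 10**12

broad_tb = TOTAL_TB * BROAD_SHARE
selective_tb = TOTAL_TB * SELECTIVE_SHARE
avg_bytes_per_url = TOTAL_TB * BYTES_PER_TB / TOTAL_URLS
share_of_collection = TOTAL_TB / COLLECTION_TB

print(f"Broad crawl:      {broad_tb:.2f} TB")
print(f"Selective crawls: {selective_tb:.2f} TB")
print(f"Average per URL:  {avg_bytes_per_url / 1000:.1f} kB")
print(f"2017 share of the collection: {share_of_collection:.1%}")
```

Under these assumptions the broad crawl accounts for roughly 101.5 TB, the
selective crawls for roughly 43.5 TB, and 2017 for about 15% of the
collection.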

We did a short analysis of the disappearance of websites based on our
latest broad crawls. Between 2016 and 2017, 676,631 domains disappeared,
which shows the value of our crawl activity. They cover 361 different
TLDs, or 52% of the TLDs in the seed list, including both old and new ones
(e.g. .paris lost 4,098 domains).
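
The TLD figures imply a total number of TLDs in the seed list that the
report does not give. A minimal sketch of that inference, assuming "52%"
is rounded to the nearest percent (variable names are illustrative):

```python
# Figures quoted in the report.
DISAPPEARED_DOMAINS = 676_631
TLDS_AFFECTED = 361
TLD_SHARE = 0.52          # assumption: rounded to the nearest percent
PARIS_LOST = 4_098

# Point estimate and bounds for the number of TLDs in the seed list.
implied_total_tlds = TLDS_AFFECTED / TLD_SHARE
low_estimate = TLDS_AFFECTED / (TLD_SHARE + 0.005)
high_estimate = TLDS_AFFECTED / (TLD_SHARE - 0.005)

# Share of all disappeared domains that were under .paris.
paris_share = PARIS_LOST / DISAPPEARED_DOMAINS

print(f"Implied TLDs in seed list: about {implied_total_tlds:.0f} "
      f"(between {low_estimate:.0f} and {high_estimate:.0f})")
print(f".paris share of disappeared domains: {paris_share:.2%}")
```

This is an estimate derived from rounded percentages, not a figure from
the crawl reports themselves.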

We also analysed IDNs (internationalised domain names), as it's the first
time we've been able to crawl them. However, after a peak in 2013 with more
than 50,000 IDNs, their number is decreasing (34,656 in 2017). From a
sample of 2,500 IDNs, at least 376, or more than 15%, are mirror websites
(e.g. there are 90 IDNs for numerix.fr). It is likely that producers
reserved IDNs as a precaution but did not keep them. Only 11 in the sample
use non-Latin characters, although IDNs are expanding in Arabic-speaking
and Asian regions, outside the French scope. Out of the 34,656 IDNs, 75%
are active (HTTP response 200). After a visual inspection, only 8%
correspond to a website with content (compared to 47% for the broad crawl
as a whole) and 14% redirect to a non-IDN domain: this supports our
hypothesis about the reservation of IDNs. Among websites with content,
70.8% are commercial, compared to 67.7% for the broad crawl as a whole.
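
The IDN proportions above can be reproduced from the raw counts. The
sketch below uses only the numbers quoted in this message; the 2013 peak
is taken as exactly 50,000, which is a lower bound ("more than 50,000"),
so the computed decline is a minimum. Variable names are illustrative.

```python
# Figures quoted in the report.
IDNS_2017 = 34_656
IDNS_PEAK_2013 = 50_000       # assumption: lower bound of "more than 50,000"
SAMPLE_SIZE = 2_500
MIRRORS_IN_SAMPLE = 376
NON_LATIN_IN_SAMPLE = 11
ACTIVE_SHARE = 0.75           # share of IDNs answering with HTTP 200

mirror_share = MIRRORS_IN_SAMPLE / SAMPLE_SIZE
non_latin_share = NON_LATIN_IN_SAMPLE / SAMPLE_SIZE
active_idns = round(IDNS_2017 * ACTIVE_SHARE)
min_decline = (IDNS_PEAK_2013 - IDNS_2017) / IDNS_PEAK_2013

print(f"Mirror websites in sample:   {mirror_share:.2%}")
print(f"Non-Latin IDNs in sample:    {non_latin_share:.2%}")
print(f"Active IDNs (approx.):       {active_idns:,}")
print(f"Decline since 2013 peak:     at least {min_decline:.1%}")
```

The 8% and 14% figures are left out because the message does not say
whether they are measured against all IDNs, the active ones or a sample.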

Finally, we also studied seeds from last year's election crawl in the
broad crawl: out of 12,659 seeds, only 47% are still online. Based on a
sample of 100 active websites, 47% have been updated. This shows that the
broad crawl is a real complement to the selective elections crawl.

Best regards,
The BnF digital legal deposit team