[Netarchivesuite-curator] BnF NAS update for February

peter.stirling at bnf.fr
Fri Feb 9 09:31:32 CET 2018


Hello all,

We have finished analysing all the 2017 crawl reports. Over the year we 
crawled 2.2 billion URLs and 145.06 TB (70% for the broad crawl and 30% 
for the selective crawls). The quality of the broad crawl is good despite 
the small budget per domain: 85.1% of the domains were archived completely 
and, based on a visual inspection, 90% of the captures are of good quality 
in the web archives. Our collections now total 938.58 TB. The statistics 
also confirm that Heritrix 3 produces fewer errors due to incorrect URLs.

We did a short analysis of the disappearance of websites based on our 
latest broad crawls. Between 2016 and 2017, 676,631 domains disappeared, 
which shows the value of our crawl activity. The lost domains span 361 
different TLDs, or 52% of TLDs in the seed list, including both old and 
new ones (e.g. .paris lost 4,098 domains).
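
For anyone wanting to run a similar comparison, here is a minimal sketch 
of one possible approach: a set difference between two years' domain 
lists, counted per TLD. It is an assumption for illustration only, since 
we do not describe here how disappearance was actually determined. The 
function name and the sample domains are hypothetical.

```python
from collections import Counter


def disappeared_by_tld(previous, current):
    """Return the domains present in `previous` but absent from `current`,
    together with a count of those lost domains per TLD."""
    lost = set(previous) - set(current)
    # The TLD is taken as the last dot-separated label of the domain.
    counts = Counter(domain.rsplit(".", 1)[-1].lower() for domain in lost)
    return lost, counts


# Hypothetical toy data, not taken from the real seed lists.
domains_2016 = {"example.fr", "shop.paris", "cafe.paris", "blog.bzh"}
domains_2017 = {"example.fr", "blog.bzh"}

lost, per_tld = disappeared_by_tld(domains_2016, domains_2017)
print(sorted(lost))
print(per_tld.most_common())
```

On real data the two inputs would be the lists of domains answering in 
each broad crawl, however that status is defined.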

We also analysed IDNs, as this is the first time we have been able to 
crawl them. However, after a peak in 2013 with more than 50,000 IDNs, 
their number has been decreasing (34,656 in 2017). From a sample of 2,500 
IDNs, at least 376, or more than 15%, are mirror websites (e.g. there are 
90 IDNs for numerix.fr). It is likely that producers reserved IDNs as a 
precaution but did not keep them. Only 11 in the sample use non-Latin 
characters, although IDNs are expanding in Arabic-speaking and Asian 
regions, outside the French scope. Out of the 34,656 IDNs, 75% are active 
(HTTP response 200). After a visual inspection, only 8% correspond to a 
website with content (compared to 47% for the global broad crawl) and 14% 
redirect to a non-IDN domain: this supports our hypothesis that IDNs were 
reserved as a precaution. Among websites with content, 70.8% are 
commercial, compared to 67.7% for the global broad crawl.
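
As an illustration of how IDNs can be picked out of a seed list, here is 
a minimal sketch. It assumes the seeds are stored in their ASCII 
(Punycode) form, where IDN labels carry the "xn--" prefix. It is not the 
method we used: the helper names and sample domains are hypothetical, and 
Python's built-in "idna" codec follows the older IDNA 2003 rules, so it 
may reject some valid domains.

```python
import unicodedata


def is_idn(domain):
    """True if any label of an ASCII-form domain uses the 'xn--' ACE prefix."""
    return any(label.lower().startswith("xn--") for label in domain.split("."))


def to_unicode(domain):
    """Decode an ASCII-form domain to Unicode with the stdlib 'idna' codec."""
    return domain.encode("ascii").decode("idna")


def uses_non_latin(domain):
    """True if the decoded domain contains a letter outside the Latin script.
    The script is approximated from the Unicode character name."""
    return any(
        ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN")
        for ch in to_unicode(domain)
    )


# Hypothetical sample: two IDNs converted to ASCII form, plus a plain domain.
sample = [name.encode("idna").decode("ascii") for name in ("café.fr", "пример.fr")]
sample.append("example.fr")

idns = [d for d in sample if is_idn(d)]
non_latin = [d for d in idns if uses_non_latin(d)]
print(f"{len(idns)} IDNs out of {len(sample)} domains, "
      f"{len(non_latin)} with non-Latin characters")
```

The same counting gives the sample figure quoted above: 376 out of 2,500 
is 15.04%.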

Finally, we also studied seeds from last year's election crawl in the 
broad crawl: out of 12,659 seeds, only 47% are still online. Based on a 
sample of 100 active websites, 47% have been updated. This shows that the 
broad crawl is a real complement to the selective election crawl.

Best regards,
The BnF digital legal deposit team