[Netarchivesuite-curator] BnF NAS update for January
peter.stirling at bnf.fr
Tue Jan 13 10:15:16 CET 2015
Hello all,
Our 2014 broad crawl finished at the beginning of December after 50 days.
We gathered a total of 1.7 billion URLs, amounting to 67 TB. The volume was
much bigger than expected because the average size per URL increased
(42 KB in 2014, 35 KB in 2013). Our explanation is that the greater
bandwidth (867 KB in 2014, 529 KB in 2013) made it possible to harvest
complete video files; we suppose that many of them were truncated in 2013.
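As a rough cross-check of the figures above, the average size per URL can be
recomputed from the reported totals. This is only a sketch: it assumes that
"TB" and "KB" may be either binary or decimal units, which the message does
not state, and the function name is purely illustrative.

```python
# Cross-check of the reported 2014 broad crawl figures.
# Assumption: the message does not say whether TB/KB are binary or decimal,
# so both interpretations are computed.
TOTAL_URLS = 1.7e9   # 1.7 billion URLs, as reported
TOTAL_TB = 67        # 67 TB, as reported


def average_kb_per_url(total_tb, total_urls, binary=True):
    """Return the mean size per URL in KB (hypothetical helper)."""
    base = 1024 if binary else 1000
    total_bytes = total_tb * base ** 4   # TB -> bytes
    return total_bytes / total_urls / base  # bytes per URL -> KB


avg_binary = average_kb_per_url(TOTAL_TB, TOTAL_URLS)
avg_decimal = average_kb_per_url(TOTAL_TB, TOTAL_URLS, binary=False)
print(f"binary units:  {avg_binary:.1f} KB per URL")
print(f"decimal units: {avg_decimal:.1f} KB per URL")
```

With binary units the result rounds to 42 KB, in line with the 2014 figure
quoted above; with decimal units it comes out nearer 39 KB.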
We had some difficulties with the workspaces used to generate the different
kinds of indexes. Because each deduplication index took too long to
generate, the schedule for focused crawls was disrupted. At the same time,
indexing the harvested URLs required resources used by the crawlers, so we
decided to wait until the end of the broad crawl to finish this process.
The seed list had 4.1 million domains. We paid attention to new geographic
TLDs, as France gained two new extensions in 2014 (.paris for the capital
and .bzh for Brittany). In addition, we specifically analysed and checked
23,600 active websites from the French overseas departments.
Since last week we have also been collecting sites relating to the
terrorist attacks in France, with URLs selected both by librarians at the
BnF and by IIPC members. We would like to thank you for the sites you
have sent us for these crawls.
Best regards,
The BnF digital legal deposit team