[Netarchivesuite-curator] BnF NAS update for December
peter.stirling at bnf.fr
Thu Dec 17 13:52:23 CET 2015
Hello all,
Our 2015 broad crawl finished at the beginning of November after 42 days.
This was much shorter than in previous years due to better management of
the bandwidth and improved communication between the crawlers and the
storage bays. For the first time, the indexing was finished at the same
time as the crawl, thanks to the production engineer, who designed a new
workflow.
From a seed list of 4.4 million domains, we collected a total of 1.6
billion URLs for a total volume of 62 TB. We observed that 550,000 seed
domains have disappeared since last year, which represents a large number
of documents now only present at the BnF. As with last year, the
monitoring done by the curators was very light because each job stopped
automatically after three days (which generally means that only 4 or 5
websites that respond very slowly are left unfinished).
The quality assurance was done using statistics as usual. We also paid
particular attention to two aspects: new gTLDs and e-books. The number of
gTLDs increased from 470 in 2014 to 650 in 2015 with extensions such as
.hiphop or .ninja. If this rapid growth continues, it may cause some
problems for the configuration of NAS in the process of generating jobs
and in the control of domains. For e-books, we wanted to observe the
number and the quality of these documents. We extracted all URLs ending in
.epub and .mobi from the broad crawl: 9,200 files from 475 domains (ten of
which account for half of the files). We found a wide range of themes,
from history to health or religion, and from poetry to science fiction and
children's literature. Most of the files were e-books, but some were
articles, press releases or technical notes.
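A minimal sketch of this kind of extraction, for anyone who wants to try
something similar. It is not the workflow we actually used: it assumes the
crawled URLs are already available as a plain iterable of strings, it
groups files by host name rather than by registered domain, and the
function names (ebook_urls, files_per_host, hosts_covering_half) are
invented for the illustration.

```python
from collections import Counter
from urllib.parse import urlsplit

# Extensions treated as e-books (assumption: matched on the URL path only).
EBOOK_EXTENSIONS = (".epub", ".mobi")


def ebook_urls(urls):
    """Yield the URLs whose path ends in .epub or .mobi (case-insensitive)."""
    for url in urls:
        path = urlsplit(url).path.lower()
        if path.endswith(EBOOK_EXTENSIONS):
            yield url


def files_per_host(urls):
    """Count e-book URLs per host name (hypothetical stand-in for 'domain')."""
    counts = Counter()
    for url in ebook_urls(urls):
        host = urlsplit(url).hostname  # lower-cased by urlsplit, None if absent
        if host:
            counts[host] += 1
    return counts


def hosts_covering_half(counts):
    """Return how many of the largest hosts account for at least half the files."""
    total = sum(counts.values())
    running = 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        running += n
        if running * 2 >= total:
            return rank
    return 0
```

Calling files_per_host on the URL list gives the number of files per host,
len() of the result gives the number of hosts, and hosts_covering_half
gives the kind of concentration figure quoted above. Matching on the path
alone ignores query strings, so a URL such as .../novel.epub?dl=1 is still
counted, but e-books served from URLs without a file extension are missed.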
In October and November we also performed a crawl of sites relating to the
refugee crisis. Librarians from the Department of Philosophy, History and
Social Science selected sites, some of them drawn from an existing
selective crawl on the theme of Solidarity and others that were added
specially. These sites were also sent for crawling as part of the IIPC
collection using Archive-It. As the broad crawl was also running during
this period, it will no doubt include material related to the crisis, and
we also have our daily crawl of news sites.
Best regards,
The BnF digital legal deposit team