[Netarchivesuite-curator] BnF NAS update for December

peter.stirling at bnf.fr
Thu Dec 17 13:52:23 CET 2015


Hello all,

Our 2015 broad crawl finished at the beginning of November after 42 days. 
This was much shorter than in previous years due to better management of 
the bandwidth and improved communication between the crawlers and the 
storage bays. For the first time, the indexing was finished at the same 
time thanks to the production engineer, who designed a new workflow.

From a seed list of 4.4 million domains, we collected a total of 1.6 
billion URLs for a total volume of 62 TB. We observed that 550,000 seed 
domains have disappeared since last year, which represents a large number 
of documents now only present at the BnF. As with last year, the 
monitoring done by the curators was very light because each job stopped 
automatically after three days (which generally means that only the 4 or 5 
websites that respond very slowly are left unfinished). 

The quality assurance was done using statistics as usual. We also paid 
particular attention to two aspects: new gTLDs and e-books. The number of 
gTLDs increased from 470 in 2014 to 650 in 2015 with extensions such as 
.hiphop or .ninja. If this rapid growth continues, it may cause problems 
for the configuration of NAS, both in generating jobs and in checking 
domains. For e-books, we wanted to observe the number and the quality of 
these documents. We extracted all URLs ending in .epub or .mobi from the 
broad crawl, which gave 9,200 files from 475 domains (ten of which account 
for half of the files). We found a wide range of themes, from history to 
health or religion, and from poetry to science fiction and children's 
literature. Most of the files were e-books, but some were articles, press 
releases or technical notes.
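
For anyone who wants to try a similar count on their own crawl, here is a 
minimal sketch in Python. It is not the procedure we used at the BnF, only 
an illustration. It assumes a plain list of crawled URLs is available, and 
all function names and sample URLs are hypothetical. It also groups files by 
host name, which is a simplification of "domain".

```python
from collections import Counter
from urllib.parse import urlsplit

# Extensions treated as e-books in this sketch (assumption: match on URL path only).
EBOOK_EXTENSIONS = (".epub", ".mobi")


def ebook_urls(urls):
    """Yield the URLs whose path ends in one of the e-book extensions."""
    for url in urls:
        # Use the path so that a query string such as "?download=1" is ignored.
        path = urlsplit(url).path.lower()
        if path.endswith(EBOOK_EXTENSIONS):
            yield url


def files_per_host(urls):
    """Count files per host name (a rough stand-in for the registered domain)."""
    return Counter(urlsplit(url).hostname for url in urls)


def hosts_covering(counts, share=0.5):
    """Return how many of the largest hosts are needed to reach `share` of all files."""
    total = sum(counts.values())
    running = 0
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        running += count
        if running >= share * total:
            return rank
    return 0


# Hypothetical sample data, for illustration only.
sample = [
    "http://example.org/books/a.epub",
    "http://example.org/books/b.MOBI",
    "http://example.org/index.html",
    "http://example.net/c.epub?download=1",
    "http://example.com/d.epub",
    "http://example.com/e.pdf",
]

ebooks = list(ebook_urls(sample))
counts = files_per_host(ebooks)
print(len(ebooks), "files on", len(counts), "hosts")
print(hosts_covering(counts), "host(s) account for half of the files")
```

On a real crawl the input would more likely come from the crawl logs or the 
index, and host names would need to be reduced to registered domains before 
counting.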

In October and November we also performed a crawl of sites relating to the 
refugee crisis. Librarians from the Department of Philosophy, History and 
Social Science selected sites, some of them drawn from an existing 
selective crawl on the theme of Solidarity and others that were added 
specially. These sites were also sent for crawling as part of the IIPC 
collection using Archive-It. As the broad crawl was also running during 
this period, it will no doubt include material related to the crisis, and 
we also have our daily crawl of news sites. 

Best regards,
The BnF digital legal deposit team