[Netarchivesuite-curator] BnF NAS update for April

peter.stirling at bnf.fr
Fri Apr 11 15:57:05 CEST 2014


Hello all,

This month we thought we'd give you an overview of all the project crawls 
we are running this year, as several of them have taken place during the 
past month.

We have several crawls relating to events and anniversaries in 2014:
- The centenary of the First World War - this is a project that began last 
November and will continue until 2018 with three or four crawls per year.
- The 250th anniversary of the death of Jean-Philippe Rameau (covered in 
our last monthly update).
- Local and European elections - the French local elections took place
last month and we are preparing the crawls in the lead-up to the European
elections in May.
- Winter Olympic and Paralympic Games - as part of the IIPC project.

There are also project crawls on specific themes or types of document 
(these are all continued from previous years):
- News and subscription news sites - crawled every day.
- Online personal and literary journals - the first crawl took place in 
March, the second will be in August.
- Solidarity and social movements - planned for May and June.
- Travel journals - planned for June.
- Auction catalogues - planned for July.
- French and American official publications - two separate crawls both 
planned for July.
- Dailymotion videos - planned for August.

We also maintain our "ongoing crawls", i.e. all the sites selected by BnF
departments according to their collection policies, which are collected at
different frequencies: once a year, twice a year, monthly or weekly.

Since our storage budget is the same in 2014 as in 2013, the number of
project crawls and the increase in the number of domains in our broad
crawl mean that we are trying to optimise our ongoing crawls. We are working
with the librarians who select sites to limit the number of sites that are 
included in multiple crawls, and to make sure that the sites collected 
more frequently than once a year change often enough to justify this. 
We've also removed the largest budget from the twice-yearly crawl, and
we've changed the way Heritrix handles queues for sites with a "domain"
depth. Previously we had one queue per host, so the budget allocated to a
domain was multiplied by the number of its hosts. We now have a single
queue, and therefore a single budget, for each domain. This doesn't seem
to have had an impact on the speed of crawls.
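
To illustrate the effect of the queue change, here is a minimal sketch in
Python. It is not our Heritrix configuration: the function names, the
example hosts and the budget of 1000 are all hypothetical, and the domain
is naively taken to be the last two labels of the host name.

```python
def queue_key(host: str, per_domain: bool) -> str:
    """Return the name of the queue a host is assigned to.

    Assumption: the domain is the last two labels of the host name.
    A real crawler would need a public-suffix list to get this right.
    """
    if not per_domain:
        return host
    return ".".join(host.split(".")[-2:])


def total_budget(hosts, budget_per_queue: int, per_domain: bool) -> int:
    """Total budget available to a set of hosts: one budget per queue."""
    queues = {queue_key(h, per_domain) for h in hosts}
    return len(queues) * budget_per_queue


# Hypothetical site with three hosts under the same domain.
hosts = ["www.example.fr", "blog.example.fr", "images.example.fr"]

# One queue per host: the budget is multiplied by the number of hosts.
print(total_budget(hosts, 1000, per_domain=False))  # 3000

# One queue per domain: a single budget for the whole domain.
print(total_budget(hosts, 1000, per_domain=True))   # 1000
```

With queues per host, every additional host on a domain adds another full
budget; with a queue per domain, the total stays fixed however many hosts
the domain has.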

Best regards,
The BnF digital legal deposit team


