[Netarchivesuite-curator] BnF NAS update for April
peter.stirling at bnf.fr
Fri Apr 11 15:57:05 CEST 2014
Hello all,
This month we thought we'd give you an overview of all the project crawls
we are running this year, as several of them have taken place during the
past month.
We have several crawls relating to events and anniversaries in 2014:
- The centenary of the First World War - this is a project that began last
November and will continue until 2018 with three or four crawls per year.
- The 250th anniversary of the death of Jean-Philippe Rameau (covered in
our last monthly update).
- Local and European elections - the French local elections took place
last month and we are preparing the crawls in the lead up to the European
elections in May.
- Winter Olympic and Paralympic Games - as part of the IIPC project.
There are also project crawls on specific themes or types of document
(these are all continued from previous years):
- News and subscription news sites - crawled every day.
- Online personal and literary journals - the first crawl took place in
March, the second will be in August.
- Solidarity and social movements - planned for May and June.
- Travel journals - planned for June.
- Auction catalogues - planned for July.
- French and American official publications - two separate crawls both
planned for July.
- Dailymotion videos - planned for August.
In addition, we maintain our "ongoing crawls", i.e. all the sites
selected by BnF departments according to their collection policies which
are collected at different frequencies: once a year, twice a year, monthly
or weekly.
Since our storage budget is the same in 2014 as in 2013, the number of
project crawls and the increase in the number of domains in our broad
crawl mean we are trying to optimise our ongoing crawls. We are working
with the librarians who select sites to limit the number of sites that are
included in multiple crawls, and to make sure that the sites collected
more frequently than once a year change often enough to justify this.
We've also removed the largest budget from the twice-yearly crawl, and
we've changed the way Heritrix handles queues for sites with a "domain"
depth - previously we had queues per host, so the budget allocated was
multiplied by the number of hosts. We now have a single queue and
therefore a single budget for each domain. This doesn't seem to have had
an impact on the speed of crawls.
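To illustrate the arithmetic behind this change, here is a minimal sketch in Python. It is not Heritrix or NetarchiveSuite code: the function names, the example hostnames and the budget of 1000 (taken here to be a number of URLs per queue) are all assumptions made for the example, and the domain is approximated naively as the last two labels of the hostname, which ignores suffixes such as "co.uk".

```python
from collections import defaultdict

def by_host(host):
    # Hypothetical queue key: one queue per hostname (the previous behaviour).
    return host

def by_domain(host):
    # Hypothetical queue key: one queue per domain (the new behaviour).
    # Naive assumption: the domain is the last two labels of the hostname.
    return ".".join(host.split(".")[-2:])

def effective_budget(hosts, budget_per_queue, queue_key):
    # Group the hosts into queues, then give each queue the same budget.
    queues = defaultdict(list)
    for host in hosts:
        queues[queue_key(host)].append(host)
    return len(queues) * budget_per_queue

# Assumed example: three hosts belonging to one selected domain.
hosts = ["www.example.fr", "blog.example.fr", "images.example.fr"]
budget = 1000  # assumed budget per queue

per_host_total = effective_budget(hosts, budget, by_host)
per_domain_total = effective_budget(hosts, budget, by_domain)
print(per_host_total, per_domain_total)
```

With queues per host, the three hosts together can consume three times the allocated budget. With a single queue per domain, the total stays at the budget that was set for the site.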
Best regards,
The BnF digital legal deposit team