[Netarchivesuite-curator] NAS news from the BnF
peter.stirling at bnf.fr
Tue Jan 10 10:53:46 CET 2012
Dear All,
As agreed during our meeting in Paris in November, here is a round-up of
recent developments involving NetarchiveSuite at the BnF.
- Harvesting with NetarchiveSuite
2011 was the first year in which the BnF used NetarchiveSuite alone to
organise all of its crawls. We ran one broad crawl (as in 2010), but we also
started a daily crawl for newspapers and focused crawls with different
schedules. We finished the year with 1.6 billion harvested URLs (1.2 billion
in 2010) for a compressed weight of 57 TB (43 TB in 2010).
The main challenge for 2012 will be to manage crawls for the French
presidential and general elections. In particular, we will try to collect
Twitter several times a day.
- Snapshot harvest at the Bibliothèque nationale de France
As announced in October 2011, we finished our snapshot harvest on December
26th. The three main figures: 1 billion harvested URLs, a compressed weight
of 32.6 TB, and a duration of 11 weeks. The crawl went well except for an
incident between step 1 and step 2: fewer domains were registered for step 2
than in 2010, although the number of seeds was bigger. This was due to a
parameter that counts generic URL errors against a domain's harvest budget
(errorpenalty in the order.xml): we used it routinely when we ran Heritrix
on its own, but this time, with NetarchiveSuite, the report indicated only
999 harvested URLs instead of 1,000 when one error occurred. We therefore
ran step 2 as two crawls: the first for 49,000 domains (those that had
reached the limit of 1,000 URLs in NAS), the second for 53,000 domains
(those that had stopped short of the limit at 997, 998 URLs and so on, due
to the errorpenalty value).
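For reference, the setting sits in the frontier section of the Heritrix 1
order.xml. A minimal sketch, assuming the standard Heritrix 1.x frontier
element names; the values, and the reading of the budget mechanism in the
comments, are illustrative rather than our exact production profile:

  <newObject name="frontier" class="org.archive.crawler.frontier.BdbFrontier">
    <!-- Budget charged against a domain's queue for each fetch error.
         With a non-zero value, a domain that hits errors exhausts its
         budget a few URLs short of the nominal limit, which is how a
         domain can end at 997-999 URLs instead of 1,000. -->
    <integer name="error-penalty-amount">1</integer>
    <!-- Total per-queue budget; illustrative value standing in for the
         per-domain URL limit. -->
    <long name="queue-total-budget">1000</long>
  </newObject>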
During monitoring, we noticed a large number of e-business sites, online
directories and ads, and municipal websites. We decided to stop collecting
websites whose crawls lasted too long (about 3 to 5 per job): for step 1,
this generally meant after 1 day; for step 2, generally after 4 days. The
BnF would be interested in hearing about your own observations from your
2011 broad crawls.
- New curator tool: BnF Collecte du Web (BCWeb)
Since April 2011 we have been developing a curator tool for use by our
network of 80 subject librarians at the BnF, with the aim of opening it up
to external partners, notably for the election crawls. The tool, known as
BCWeb, organises URLs according to the collection types we already use, by
thematic department and project. It allows selectors to define the
parameters to be used by NAS (depth, frequency and budget) as well as
documentary fields such as keywords and notes. Search and browse features
allow users to keep their selections up to date. An administration module
allows the web archiving team to load the data into NAS: adding new URLs
to Harvest Definitions and updating or deleting old ones.
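To give an idea of the data involved, here is a purely hypothetical sketch
of a single BCWeb selection as it might be exported for loading into NAS;
the element names are invented for illustration and are not BCWeb's actual
format:

  <selection>
    <!-- Hypothetical record layout, for illustration only -->
    <url>http://www.example.fr/</url>
    <!-- NAS parameters chosen by the selector -->
    <depth>2</depth>
    <frequency>weekly</frequency>
    <budget>10000</budget>
    <!-- Documentary fields -->
    <keywords>elections, politics</keywords>
    <note>Hypothetical example entry</note>
  </selection>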
BCWeb has been developed using the "Scrum" method of agile software
development. This meant we were able to develop the main functions early
on, but in the past couple of months we have encountered performance
problems and difficulties integrating the graphical elements of some pages.
We have
loaded the complete database of almost 14,000 URLs and are finalising the
full-scale tests of the import into NAS. We aim to have the tool online by
February, to allow the selection of sites for the presidential elections
in April.
If you would like any more information on any of these points please let
us know.
Best regards,
The BnF digital legal deposit team