[Netarchivesuite-curator] NAS news from the BnF

peter.stirling at bnf.fr peter.stirling at bnf.fr
Tue Jan 10 10:53:46 CET 2012


Dear All,

As agreed during our meeting in Paris in November, here is a round-up of 
recent developments involving NetarchiveSuite at the BnF. 

- Harvesting with NetarchiveSuite
2011 was the first year in which the BnF used only NetarchiveSuite to 
organize all its crawls. We ran one broad crawl (as in 2010), but we 
also started a daily crawl for newspapers and focused crawls with 
different schedules. We finished the year with 1.6 billion harvested 
URLs (1.2 billion in 2010), for a compressed size of 57 TB (43 TB in 
2010).

The main challenge for 2012 will be to manage crawls for the French 
presidential and general elections. We will notably try to collect Twitter 
several times a day.


- Snapshot harvest at the Bibliothèque nationale de France
As announced in October 2011, we finished our snapshot harvest on December 
26th. The three main figures are: 1 billion harvested URLs, a compressed 
size of 32.6 TB, and a duration of 11 weeks. The crawl went well, except 
for an incident between step 1 and step 2: we had fewer domains 
registered for step 2 than in 2010, although the number of seeds was 
larger. This was due to a parameter that counts generic URL errors 
(errorpenalty in the order.xml): we commonly used it when we only had 
Heritrix, but this time, with NetarchiveSuite, the report showed only 999 
harvested URLs instead of 1,000 when one error occurred. So we ran step 2 
twice: first for 49,000 domains (those that had reached the limit of 
1,000 URLs in NAS), then for 53,000 domains (those that had fallen just 
short of the limit, with 997 or 998 URLs and so on, because of the 
errorpenalty value).
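
The split between the two step 2 runs can be illustrated with a minimal 
sketch. It is not how NAS or our reports do it: the function name, the 
tolerance of 5 URLs and the example domains are assumptions made only for 
illustration; only the limit of 1,000 URLs comes from the crawl described 
above.

```python
from typing import Dict, List, Tuple

URL_LIMIT = 1000  # per-domain URL limit in step 1 (from the message above)


def split_for_step2(
    harvested: Dict[str, int],
    limit: int = URL_LIMIT,
    tolerance: int = 5,  # assumed margin for URLs "lost" to error penalties
) -> Tuple[List[str], List[str]]:
    """Return (domains that reached the limit, domains just short of it)."""
    reached = sorted(d for d, n in harvested.items() if n >= limit)
    just_short = sorted(
        d for d, n in harvested.items() if limit - tolerance <= n < limit
    )
    return reached, just_short


# Hypothetical per-domain counts taken from a step 1 report.
counts = {"a.example.fr": 1000, "b.example.fr": 999,
          "c.example.fr": 997, "d.example.fr": 120}
first_run, second_run = split_for_step2(counts)
```

Here first_run holds the domains for the first step 2 run and second_run 
those that would otherwise have been missed.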

During the monitoring, we noticed a large number of e-business sites, 
online directories and ads, and municipal websites. We decided to stop 
collecting websites that took too long (about 3 to 5 per job): for step 
1, this generally meant after 1 day; for step 2, generally after 4 days.

The BnF would be interested in hearing about your own observations from 
your 2011 broad crawls.


- New curator tool: BnF Collecte du Web (BCWeb)
Since April 2011 we have been developing a curator tool for use by our 
network of 80 subject librarians at the BnF, with the aim of opening it up 
to external partners, notably for the election crawls. The tool, known as 
BCWeb, organises URLs using the types of collection we already use, by 
thematic departments and projects. It allows selectors to define the 
parameters to be used by NAS (depth, frequency and budget) as well as 
documentary fields such as keywords and notes. Search and browse features 
allow users to keep their selections up to date. An administration module 
allows the web archiving team to load the data into NAS: adding new URLs 
to Harvest Definitions and updating or deleting old ones.
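
As a rough illustration of the load step, the sketch below compares the 
selections held in a curator tool with those already loaded and works out 
what to add, update or delete. It does not reflect the actual BCWeb or 
NAS data model: the Selection class, its field types and the plan_sync 
function are all hypothetical; only the field names (depth, frequency, 
budget, keywords, notes) are taken from the description above.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Selection:
    """Hypothetical record for one selected URL."""
    url: str
    depth: str          # assumed to be a label, e.g. "site" or "page"
    frequency: str      # assumed to be a label, e.g. "daily"
    budget: int         # assumed to be a maximum number of URLs
    keywords: Tuple[str, ...] = ()
    notes: str = ""


def plan_sync(
    loaded: Dict[str, Selection], wanted: Dict[str, Selection]
) -> Tuple[List[str], List[str], List[str]]:
    """Return (URLs to add, URLs to update, URLs to delete)."""
    to_add = sorted(u for u in wanted if u not in loaded)
    to_update = sorted(
        u for u in wanted if u in loaded and wanted[u] != loaded[u]
    )
    to_delete = sorted(u for u in loaded if u not in wanted)
    return to_add, to_update, to_delete


# Hypothetical example data.
old = Selection("http://a.example.fr/", "site", "weekly", 1000)
new = Selection("http://a.example.fr/", "site", "daily", 1000)
gone = Selection("http://b.example.fr/", "page", "monthly", 100)
added = Selection("http://c.example.fr/", "site", "daily", 500)

plan = plan_sync(
    loaded={old.url: old, gone.url: gone},
    wanted={new.url: new, added.url: added},
)
```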

BCWeb has been developed using the "Scrum" method of agile software 
development. This meant we were able to develop the main functions early 
on, but in the past couple of months we have encountered problems with 
performance and the integration of graphic elements of some pages. We have 
loaded the complete database of almost 14,000 URLs and are finalising the 
full-scale tests of the import into NAS. We aim to have the tool online by 
February, to allow the selection of sites for the presidential elections 
in April.


If you would like any more information on any of these points please let 
us know.

Best regards,

The BnF digital legal deposit team


