[Netarchivesuite-curator] BnF NAS update for August

peter.stirling at bnf.fr peter.stirling at bnf.fr
Thu Aug 16 13:56:39 CEST 2012


Hello all,

Here's our update for August, focusing on our Election crawl.

On the 8th of August, the last harvest of the electoral project ended. 
Over a period of seven months, monthly, weekly, daily and single captures 
have been made of websites selected by librarians for their relation to 
the French presidential and parliamentary elections. The result is more 
than 350 million URLs, and 20.38 TB of data (compressed: 10.67 TB).

We have focused our efforts on harvesting the social Web, especially 
Twitter and Facebook, but Pinterest and Flickr too. The well-known problem 
of the # in the URL has been an insurmountable obstacle to the harvest of 
some sites (Google+, Pearltrees). But solutions were found for others. Thus 
Twitter was collected 4 times a day with a special harvest template: the 
crawler declared itself not as a browser, but as a robot. This allowed us 
to have access to the URL without the problematic <#!> sequence, and 
therefore to collect tweets. But now Twitter's URLs seem to work without 
this sequence, even in a normal browser, making them easier to collect.
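
To illustrate the difference between the two URL forms, here is a minimal 
Python sketch that rewrites a "#!" URL into its plain-path equivalent. It is 
only an illustration under stated assumptions, not the harvest template 
itself: the function name strip_hashbang and the account name 
example_account are hypothetical, and it assumes that the text after "#!" 
maps directly onto the path, as appears to be the case for Twitter.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_hashbang(url: str) -> str:
    """Rewrite a '#!' URL to a plain-path URL; return other URLs unchanged.

    Hypothetical helper for illustration only. It assumes the text after
    '#!' is the real path on the same host.
    """
    parts = urlsplit(url)
    # Only fragments that start with "!" are hashbang fragments.
    if not parts.fragment.startswith("!"):
        return url
    # Drop the leading "!" and make sure the result is an absolute path.
    path = parts.fragment[1:]
    if not path.startswith("/"):
        path = "/" + path
    # Rebuild the URL with the new path and no fragment.
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))


if __name__ == "__main__":
    # "example_account" is a made-up account name.
    print(strip_hashbang("https://twitter.com/#!/example_account"))
```

Such a rewrite would only help for sites that actually serve the plain-path 
form; it does nothing for sites that build their pages entirely from the 
fragment.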

This project was also the occasion to see our new nomination tool (BCWeb) 
working with NAS on a large scale. It proved to be very useful, even though 
we sometimes had to adjust the frequency of certain captures (for example, 
to harvest more often over the election weekends).

If you have any questions please don't hesitate to get in touch.

Best regards,

The BnF digital legal deposit team
