[Netarchivesuite-curator] BnF NAS update for August
peter.stirling at bnf.fr
Thu Aug 16 13:56:39 CEST 2012
Hello all,
Here's our update for August, focusing on our Election crawl.
On the 8th of August, the last harvest of the electoral project ended.
Over a period of seven months, monthly, weekly, daily and single captures
were made of websites selected by librarians for their relevance to
the French presidential and parliamentary elections. The result is more
than 350 million URLs and 20.38 TB of data (10.67 TB compressed).
We focused our efforts on harvesting the social Web, especially
Twitter and Facebook, but also Pinterest and Flickr. The well-known problem
of the # in the URL proved an insurmountable obstacle to harvesting
some sites (Google+, Pearltrees), but solutions were found for others.
Twitter was collected 4 times a day with a special harvest template in
which the crawler declared itself as a robot rather than as a browser.
This gave us access to URLs without the problematic <#!> sequence, and
therefore allowed us to collect tweets. Twitter's URLs now seem to work
without this sequence, even in a normal browser, which makes them easier
to collect.
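For anyone unfamiliar with the # problem, here is a minimal Python sketch of why such URLs are hard to harvest. It is only an illustration, not part of our harvest template: the function name request_target and the account name are made up for the example. It relies on the fact that the part of a URL after the # (the fragment) is handled by the browser and is not sent to the server.

```python
from urllib.parse import urlsplit


def request_target(url: str) -> str:
    """Return the path and query that a client sends to the server for `url`.

    The fragment (everything after '#') is resolved on the client side and
    is not part of the HTTP request, so a crawler that does not run
    JavaScript only ever asks for the part before the '#'.
    """
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


# Hypothetical account name, used for illustration only.
hashbang_url = "https://twitter.com/#!/example_account"
plain_url = "https://twitter.com/example_account"

print(request_target(hashbang_url))      # only the site root is requested
print(urlsplit(hashbang_url).fragment)   # the account name lives in the fragment
print(request_target(plain_url))         # the account page itself is requested
```

With the <#!> form, every account resolves to the same request for the site root, so the crawler never sees the account page; with the plain form, each account has its own path that can be fetched directly.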
This project was also an opportunity to see our new nomination tool (BCWeb)
working with NAS on a large scale. It proved very useful, even when
we sometimes had to adjust the frequency of certain captures (for example,
to harvest more often during the election weekends).
If you have any questions please don't hesitate to get in touch.
Best regards,
The BnF digital legal deposit team