[Netarchivesuite-curator] BnF NAS update for January

peter.stirling at bnf.fr peter.stirling at bnf.fr
Tue Jan 9 11:01:09 CET 2018


Hello all,

First of all, we wish you a very happy new year and all the best for 
2018! We have a change in the team: Ange Aniesa has left to take up a 
position in another department at the BnF. We wish him all the best.

In December, we organized a week-long workshop within the team on 
collecting Twitter, to build on last year's election crawls, where we used 
Heritrix 3 to collect more than 3,500 Twitter accounts or hashtags twice a 
day, with a depth of page + 1 click. This allowed us to crawl the time 
line for each seed (i.e. 40 tweets per day per seed) and a part of the 
context (the time line of other accounts or hashtags mentioned in the 
seed).

The goal of this workshop was to continue this crawl during the year by 
creating a new specific harvest definition, and to improve its quality. 
The quality of the crawl depends on the number of seeds. First we tested 
dividing the seed list between several jobs, and then we tested putting 
all the seeds in one job and dividing the queue twitter.com into 10 
separate queues. The quality is better when the seed list is divided 
between several jobs than when it is divided between several queues 
within one job, apparently because the division between queues isn't 
equal: some queues crawled more than 15,000 URLs while others crawled 
fewer than 1,500 URLs. We need to continue the tests.
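
As an illustration only, and not our actual NetarchiveSuite or Heritrix 
configuration, dividing a seed list evenly between several jobs could be 
sketched in Python as below. The function name split_seeds, the choice of 
10 jobs and the placeholder seed URLs are all assumptions made for the 
example.

```python
def split_seeds(seeds, n_jobs):
    """Distribute seeds round-robin so job sizes differ by at most one."""
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    # Job i receives seeds i, i + n_jobs, i + 2 * n_jobs, ...
    return [seeds[i::n_jobs] for i in range(n_jobs)]


# Hypothetical seed list: placeholder URLs, not real crawl seeds.
seeds = [f"https://twitter.com/account{i}" for i in range(3500)]

jobs = split_seeds(seeds, 10)
sizes = [len(job) for job in jobs]
print(sizes)
```

A round-robin split like this keeps the number of seeds per job balanced, 
whereas the number of URLs actually crawled per job still depends on how 
active each account or hashtag is.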

During this workshop we also studied the API services. The free service 
allows us to collect less information than the crawl by Heritrix: fewer 
tweets, fewer images, less context and no links. It would also be more 
difficult to then give access to these data and preserve them. We 
therefore decided to abandon this approach.

The new crawl will start at the beginning of the year and crawl twice a 
day, with only a small number of accounts at the beginning, but the seed 
list will grow step by step thanks to the curators. This is the best way 
to cover current events, in addition to our existing crawls of news 
websites.

Finally, we are pleased to announce that we have published the seed lists 
for our focused crawls on the new BnF site dedicated to APIs and datasets. 
These lists are based on exports from BCWeb and include the crawl settings 
and descriptive elements added by the curators. We hope this will help 
researchers to make better use of our collections. There are two pages on 
the site, one for election crawls (
http://api.bnf.fr/liste-des-adresses-URL-des-collectes-du-web-electoral-par-la-BnF
) and one for other focused crawls (
http://api.bnf.fr/liste-des-adresses-url-des-collectes-ciblees-du-web-francais-par-la-bnf
).

Best regards,
The BnF digital legal deposit team

