[Netarchivesuite-curator] BnF NAS update for January
peter.stirling at bnf.fr
Tue Jan 9 11:01:09 CET 2018
Hello all,
First of all we wish you a very happy new year and all the best for 2018!
We have a change in the team: Ange Aniesa has left to take up a position
in another department at the BnF. We wish him all the best.
In December, we organized a week-long workshop within the team on
collecting Twitter, to build on last year's election crawls, where we used
Heritrix 3 to collect more than 3,500 Twitter accounts or hashtags twice a
day, with a depth of page + 1 click. This allowed us to crawl the
timeline for each seed (i.e. 40 tweets per day per seed) and a part of the
context (the timelines of other accounts or hashtags mentioned in the
seed).
The goal of this workshop was to continue this crawl during the year by
creating a new specific harvest definition, and to improve its quality.
The quality of the crawl depends on the number of seeds. First we tested
dividing the seed list between several jobs, and then we tested putting
all the seeds in one job and dividing the queue twitter.com into 10
separate queues. The quality is better when the seed list is shared
between several jobs than in several queues within one job, apparently
because the division between queues isn't equal: some queues crawled more
than 15,000 URLs while others crawled fewer than 1,500 URLs. We need to
continue the tests.
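As an illustration only, dividing a seed list evenly between several jobs
can be sketched in Python as below. The function name split_seeds, the
placeholder account URLs and the figure of 10 jobs are assumptions made
for the example; they do not reflect the actual harvest definition or any
NetarchiveSuite setting.

```python
def split_seeds(seeds, n_jobs):
    """Deal seeds out round-robin so job sizes differ by at most one."""
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    # Slice i takes every n_jobs-th seed starting at position i.
    return [seeds[i::n_jobs] for i in range(n_jobs)]


# Hypothetical seed list; real seeds would come from the curators' selection.
seeds = [f"https://twitter.com/account{i}" for i in range(3500)]
jobs = split_seeds(seeds, 10)
sizes = [len(job) for job in jobs]
print(min(sizes), max(sizes))  # prints "350 350"
```

Dealing the seeds out in this way guarantees that the jobs receive the
same number of seeds, to within one. It says nothing about how many URLs
each job then crawls.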
During this workshop we also studied the Twitter API services. The free
service would allow us to collect less information than the Heritrix
crawl: fewer tweets, fewer images, less context and no links. It would
also be more difficult to give access to these data afterwards and to
preserve them. We therefore decided to abandon this approach.
The new crawl will start at the beginning of the year and crawl twice a
day, with only a small number of accounts at the beginning, but the seed
list will grow step by step thanks to the curators. This is the best way
to cover current events, in addition to our existing crawls of news
websites.
Finally, we are pleased to announce that we have published the seed lists
for our focused crawls on the new BnF site dedicated to APIs and datasets.
These lists are based on exports from BCWeb and include the crawl settings
and descriptive elements added by the curators. We hope this will help
researchers to make better use of our collections. There are two pages on
the site, one for election crawls (
http://api.bnf.fr/liste-des-adresses-URL-des-collectes-du-web-electoral-par-la-BnF
) and one for other focused crawls (
http://api.bnf.fr/liste-des-adresses-url-des-collectes-ciblees-du-web-francais-par-la-bnf
).
Best regards,
The BnF digital legal deposit team