[Netarchivesuite-curator] BnF NAS update for July

auriane.quoix at bnf.fr
Mon Jul 4 20:30:16 CEST 2022


Dear all,

Last week, we launched our "Auction house" crawl, which covers the
websites of French auction houses. About 200 websites were selected. Last
year, we were blacklisted by some large auction sites, so this time we set
up a specific harvest system for auction.fr, where many of the websites
are hosted. Before starting the harvest, we added filters to all the other
jobs in progress, and we created a special queue management setup that
groups the URLs of all hosts belonging to a website into one queue. This
avoids sending too many requests at the same time and also limits the
harvest to 100 000 URLs per website.
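
The queue grouping and per-website cap described above can be sketched
roughly as follows. This is only an illustrative Python sketch, not the
configuration used for the crawl: the class SiteQueues, the host-to-site
mapping, and the example host names are all hypothetical.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit

# Cap taken from the message: at most 100 000 URLs per website.
MAX_URLS_PER_SITE = 100_000


class SiteQueues:
    """Hypothetical frontier: one queue per website, with a URL cap."""

    def __init__(self, host_to_site=None, budget=MAX_URLS_PER_SITE):
        # Assumed mapping from host name to the website it belongs to.
        self.host_to_site = dict(host_to_site or {})
        self.budget = budget
        self.queues = defaultdict(deque)   # queue key -> pending URLs
        self.accepted = defaultdict(int)   # queue key -> URLs accepted so far

    def key_for(self, url):
        # All hosts of one website share a key; unknown hosts use their own name.
        host = (urlsplit(url).hostname or "").lower()
        return self.host_to_site.get(host, host)

    def enqueue(self, url):
        # Refuse the URL once the website has used up its budget.
        key = self.key_for(url)
        if self.accepted[key] >= self.budget:
            return False
        self.queues[key].append(url)
        self.accepted[key] += 1
        return True


if __name__ == "__main__":
    frontier = SiteQueues(
        host_to_site={
            "www.house-a.example": "house-a",
            "img.house-a.example": "house-a",
        },
        budget=2,  # tiny budget so the cap is visible
    )
    for u in (
        "https://www.house-a.example/sale/1",
        "https://img.house-a.example/lot.jpg",
        "https://www.house-a.example/sale/2",
    ):
        print(u, frontier.enqueue(u))
```

Because every host of a website lands in the same queue, a crawler that
works through one queue at a time with a delay between requests never
sends parallel requests to that website, and the counter enforces the
per-website limit.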

The LIFRANUM crawl, carried out in partnership with researchers from
Jean Moulin University Lyon 3 and Lumière University Lyon 2, is about to
be launched.
The project aims to identify and map the corpus of French-language digital
literature (sites, blogs, social networks). About 1100 sites will be
crawled for this harvest, with a specific budget of 15 000 URLs. The
harvest should last one to two weeks.

Finally, we are continuing the preparations for our 2022 broad crawl.


Best regards,

The BnF digital legal deposit team
