[Netarchivesuite-curator] BnF NAS Update for October

peter.stirling at bnf.fr peter.stirling at bnf.fr
Tue Oct 2 14:27:09 CEST 2012


Hello all,

Here is our update for October.

Preparation for our next broad crawl has started. Using a separate 
preload tool, we have gathered several sources: all domains from .fr and 
.re as usual and, for the first time, .nc (New Caledonia), supplemented 
by our curators' selections. In total, the seed list contains 2.4 million 
domains. The processing we apply to these domains (identification, 
deduplication, validation) is not straightforward, and our developer had 
a few problems to solve in September.
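
To illustrate the kind of processing involved, here is a minimal sketch 
in Python. It is not the BnF preload tool: the function names, the 
normalization rules (lowercasing, dropping a leading "www.") and the 
simple ASCII-only validity pattern are all assumptions made for the 
example.

```python
import re

# Assumed, deliberately simple hostname pattern (ASCII labels only).
# A real tool would likely also handle internationalized domain names.
LABEL = r"[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?"
DOMAIN_RE = re.compile(rf"(?:{LABEL}\.)+{LABEL}")


def normalize(raw):
    """Identification step (assumed rules): trim, lowercase, drop 'www.'."""
    domain = raw.strip().lower().rstrip(".")
    if domain.startswith("www."):
        domain = domain[4:]
    return domain


def is_valid(domain):
    """Validation step: keep only syntactically plausible domain names."""
    return DOMAIN_RE.fullmatch(domain) is not None


def build_seed_list(*sources):
    """Merge several sources into one list, dropping invalid and duplicate
    entries while preserving first-seen order."""
    seen = set()
    seeds = []
    for source in sources:
        for raw in source:
            domain = normalize(raw)
            if is_valid(domain) and domain not in seen:
                seen.add(domain)
                seeds.append(domain)
    return seeds


if __name__ == "__main__":
    registry = ["Example.fr", "www.example.fr ", "bad_domain.fr"]
    curators = ["example.nc", "example.fr."]
    print(build_seed_list(registry, curators))
```

With millions of domains, the in-memory set used here for deduplication 
may need to be replaced by a database or a sorted-file merge.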

On the other hand, we have decided to use the most recent version of 
NetarchiveSuite, 3.20, and had a pleasant surprise: when we launched the 
IndexServer to prepare the deduplication index from last year, it took 
only 9 hours (compared with 4 days in 2011, in the same technical 
environment).

We will soon be ready to start the crawl.

Also, the BnF has added some new information to the wiki: lists of 
global crawler traps by category (contacts, printing, registration...) 
that your institutions could use; the default template for broad crawls, 
highlighting the parameters specifically modified by BnF; and a template 
we have started testing for collecting password-protected content.

https://sbforge.org/display/NAS/Crawler+Traps
https://sbforge.org/display/NAS/BnF+default+template
https://sbforge.org/display/NAS/BnF+template+for+password
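
As a rough illustration of how trap lists grouped by category might be 
applied, here is a small Python sketch. It assumes traps are regular 
expressions matched against the whole URL. The patterns and function 
names below are hypothetical and are not taken from the wiki lists.

```python
import re

# Hypothetical trap patterns grouped by category. The real lists are on
# the wiki page linked above; these are invented for the example.
TRAPS_BY_CATEGORY = {
    "contacts": [r".*/contact(s)?/.*"],
    "printing": [r".*[?&]print=1.*"],
    "registration": [r".*/(register|signup)([/?].*)?"],
}


def compile_traps(traps_by_category):
    """Compile each category's patterns once, case-insensitively."""
    return {
        category: [re.compile(p, re.IGNORECASE) for p in patterns]
        for category, patterns in traps_by_category.items()
    }


def matching_category(url, compiled):
    """Return the first category whose pattern matches the whole URL,
    or None if the URL matches no trap."""
    for category, patterns in compiled.items():
        for pattern in patterns:
            if pattern.fullmatch(url):
                return category
    return None


COMPILED = compile_traps(TRAPS_BY_CATEGORY)

if __name__ == "__main__":
    for url in (
        "http://example.fr/article?id=3&print=1",
        "http://example.fr/register",
        "http://example.fr/news/",
    ):
        print(url, "->", matching_category(url, COMPILED))
```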

Best regards,
The BnF web archiving team
