[Netarchivesuite-curator] BnF NAS Update for October
peter.stirling at bnf.fr
Tue Oct 2 14:27:09 CEST 2012
Hello all,
Here is our update for October.
Preparation for our next broad crawl has started. In a separate
preload tool, we have gathered several sources: all domains from .fr and
.re as usual and, for the first time, .nc (New Caledonia), supplemented
by our curators' selections. In total, there are 2.4 million domains
in the seed list. The processing we apply to these domains (identification,
deduplication, validation) is not straightforward, and our developer had a
few problems to solve in September.
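For readers who have not built a seed list before, the deduplication and validation steps mentioned above could be sketched roughly as follows. This is only an illustration under our own assumptions, not the BnF preload tool: the function names (`normalize`, `is_valid`, `build_seed_list`) and the purely syntactic validation rule are hypothetical.

```python
import re

# Assumed rule for illustration: each label is 1-63 characters of letters,
# digits or hyphens, and does not start or end with a hyphen.
_LABEL = re.compile(r"(?!-)[a-z0-9-]{1,63}(?<!-)")


def normalize(raw):
    """Lower-case an entry and strip whitespace, scheme, path and trailing dot."""
    host = raw.strip().lower()
    host = re.sub(r"^[a-z][a-z0-9+.-]*://", "", host)  # drop a leading scheme
    host = host.split("/", 1)[0]                       # drop any path
    return host.rstrip(".")


def is_valid(domain):
    """Syntactic check only: at least two well-formed labels, 253 chars max."""
    labels = domain.split(".")
    if len(domain) > 253 or len(labels) < 2:
        return False
    return all(_LABEL.fullmatch(label) for label in labels)


def build_seed_list(sources):
    """Merge several sources, keeping the first occurrence of each valid domain."""
    seen = set()
    seeds = []
    for source in sources:
        for raw in source:
            domain = normalize(raw)
            if domain in seen or not is_valid(domain):
                continue
            seen.add(domain)
            seeds.append(domain)
    return seeds


if __name__ == "__main__":
    registry = ["Example.fr", "http://example.fr/", "bad_domain.fr"]
    curators = ["exemple.nc.", "example.fr", "-x.re"]
    print(build_seed_list([registry, curators]))
```

A real process at the scale of 2.4 million domains would presumably also need checks that this sketch leaves out, such as whether a domain actually resolves.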
On a more positive note, we have decided to use the most recent version
of NetarchiveSuite, 3.20, and had a pleasant surprise: when we launched the
IndexServer to prepare the deduplication index from last year, it took only
9 hours (compared with 4 days in 2011, in the same technical environment).
We will soon be ready to start the crawl.
The BnF has also added some new information to the wiki: lists of global
crawler traps grouped by category (contacts, printing, registration...)
that your institutions could reuse; our default template for broad crawls,
highlighting the parameters specifically modified by the BnF; and a
template we have started testing to collect password-protected content.
https://sbforge.org/display/NAS/Crawler+Traps
https://sbforge.org/display/NAS/BnF+default+template
https://sbforge.org/display/NAS/BnF+template+for+password
Best regards,
The BnF web archiving team