[Netarchivesuite-devel] Harvesting IDN
Tue Larsen
tlr at kb.dk
Fri Oct 5 09:31:50 CEST 2012
Hi Sara
In NAS the IDN domains are harvested under the punycode name.
Unfortunately, the job stats are not updated in 3.18.3 or earlier versions, but this will be fixed in 4.0.
Currently, it is not possible to search for the 54,000 Danish IDN domains in the Danish Wayback using the UTF-8 domain string.
Workaround: use the punycode form of the IDN, obtained with e.g. http://mct.verisign-grs.com/conversiontool/convertServlet?input=skr%C3%A6dderiet.dk&type=UTF8
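If you prefer to do the conversion locally instead of through the web tool, something like the following Python sketch might work. It relies on Python's built-in "idna" codec, which implements the older IDNA 2003 rules, so results may differ from a registry's IDNA 2008 handling for a few characters. The function names to_punycode and seed_to_punycode are made up for this illustration and are not part of NAS or Heritrix.

```python
from urllib.parse import urlsplit, urlunsplit


def to_punycode(hostname: str) -> str:
    """Return the ASCII (xn--) form of a hostname.

    Uses Python's built-in "idna" codec (IDNA 2003). Labels that are
    already plain ASCII are returned unchanged.
    """
    return hostname.encode("idna").decode("ascii")


def seed_to_punycode(url: str) -> str:
    """Rewrite only the host part of a seed URL to its punycode form.

    Assumption for this sketch: the URL has a host and carries no
    user:password part (any such part would be dropped here).
    """
    parts = urlsplit(url)
    host = to_punycode(parts.hostname)
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(to_punycode("skrædderiet.dk"))
    print(seed_to_punycode("http://www.armedéfense.fr/"))
```

The resulting xn-- string can then be used as the search term in Wayback in place of the UTF-8 domain string.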
All the best
Tue
________________________________________
From: netarchivesuite-devel-bounces at ml.sbforge.org [netarchivesuite-devel-bounces at ml.sbforge.org] on behalf of sara.aubry at bnf.fr [sara.aubry at bnf.fr]
Sent: 5 October 2012 09:20
To: netarchivesuite-devel at ml.sbforge.org
Subject: [Netarchivesuite-devel] Harvesting IDN
Hello everyone,
We're about to start our 2012 broad crawl but still have problems with
International domains.
For the first time, we got around 30 000 international domains on our .fr
registry list.
We managed to load them into NetarchiveSuite correctly, and they display
correctly, but when the seeds are handed over to Heritrix, they are
transformed and not harvested correctly.
For instance:
http://www.armedéfense.fr turns into http://www.armed/?fense.fr
Have you experienced this issue?
Another question: we noticed that xn-- equivalents were added to the
seedlists.
Does NAS include a feature to transform IDN at job generation time?
Best,
Sara
_______________________________________________
Netarchivesuite-devel mailing list
Netarchivesuite-devel at ml.sbforge.org
http://ml.sbforge.org/mailman/listinfo/netarchivesuite-devel