[Netarchivesuite-devel] Harvesting IDN
sara.aubry at bnf.fr
Fri Oct 5 09:20:35 CEST 2012
Hello everyone,
We're about to start our 2012 broad crawl but still have problems with
internationalized domain names (IDNs).
For the first time, our .fr registry list includes around 30,000
internationalized domains.
We managed to load them into NetarchiveSuite, where they display
correctly, but when the seeds are handed over to Heritrix they are
transformed and not harvested correctly.
For instance:
http://www.armedéfense.fr becomes http://www.armed/?fense.fr
Have you experienced this issue?
Another question: we noticed that xn-- (Punycode) equivalents were added
to the seed lists.
Does NAS include a feature that converts IDNs at job generation time?
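To make the question concrete, here is a minimal sketch of the kind of
conversion meant by "xn-- equivalents". It assumes Java's standard
java.net.IDN class. The class name IdnSeedSketch and both helper methods
are hypothetical and written only for illustration; this is not taken
from the NAS or Heritrix code base, and it only handles simple
scheme://host/path seeds (no port or user info).

```java
import java.net.IDN;

public class IdnSeedSketch {

    // Convert a Unicode host name to its ASCII-compatible (xn--) form.
    static String toAsciiHost(String unicodeHost) {
        return IDN.toASCII(unicodeHost);
    }

    // Rewrite a simple "scheme://host/path" seed so the host part is ASCII only.
    // Assumption: the seed carries no port or user info.
    static String toAsciiSeed(String seed) {
        int schemeEnd = seed.indexOf("://");
        int hostStart = schemeEnd < 0 ? 0 : schemeEnd + 3;
        int hostEnd = seed.indexOf('/', hostStart);
        if (hostEnd < 0) {
            hostEnd = seed.length();
        }
        String host = seed.substring(hostStart, hostEnd);
        return seed.substring(0, hostStart) + toAsciiHost(host) + seed.substring(hostEnd);
    }

    public static void main(String[] args) {
        // \u00e9 is "é"; the escape keeps this source file independent of its encoding.
        String seed = "http://www.armed\u00e9fense.fr";
        System.out.println(toAsciiSeed(seed));
    }
}
```

A seed converted this way contains only ASCII characters, so it should
no longer depend on how each component decodes the accented character.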
Best,
Sara