[Netarchivesuite-devel] Harvesting IDN

sara.aubry at bnf.fr sara.aubry at bnf.fr
Fri Oct 5 09:20:35 CEST 2012


Hello everyone,

We're about to start our 2012 broad crawl but still have problems with 
internationalized domain names (IDN).
For the first time, we received around 30 000 internationalized domains on 
our .fr registry list.
We managed to load them into NetarchiveSuite and they display correctly 
there, but when the seeds are handed over to Heritrix, they are 
transformed and not harvested correctly.
For instance: 
http://www.armedéfense.fr becomes http://www.armed/?fense.fr
Have you experienced this issue? 

Another question: we noticed that xn-- (Punycode) equivalents were added 
to the seedlists.
Does NAS include a feature to convert IDNs at job generation time?
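
For reference, the xn-- form of a host name can be computed with the JDK's 
java.net.IDN class. The sketch below is only an illustration of that 
conversion, not the code NetarchiveSuite or Heritrix actually runs; the 
class name IdnSeeds and the helpers toAsciiHost and toAsciiSeed are 
hypothetical, and the seed parsing is deliberately simplistic (it assumes 
no user:password@ part in the URL).

```java
import java.net.IDN;

// Hypothetical helper class, for illustration only.
class IdnSeeds {

    /** Converts a Unicode host name to its ASCII-compatible (xn--) form. */
    static String toAsciiHost(String host) {
        return IDN.toASCII(host);
    }

    /**
     * Rewrites only the host part of a seed URL, leaving scheme, port,
     * path and query untouched. Assumes the URL has no userinfo part.
     */
    static String toAsciiSeed(String seed) {
        int start = seed.indexOf("://");
        start = (start < 0) ? 0 : start + 3;
        // The host ends at the first port, path, query or fragment delimiter.
        int end = start;
        while (end < seed.length() && "/:?#".indexOf(seed.charAt(end)) < 0) {
            end++;
        }
        return seed.substring(0, start)
                + toAsciiHost(seed.substring(start, end))
                + seed.substring(end);
    }

    public static void main(String[] args) {
        // \u00e9 is "é"; the escape avoids source-encoding problems.
        String seed = "http://www.armed\u00e9fense.fr";
        System.out.println(toAsciiSeed(seed));
    }
}
```

Only the label containing non-ASCII characters is rewritten with the xn-- 
prefix; plain ASCII labels such as www and fr pass through unchanged.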

Best,

Sara





