[Netarchivesuite-users] Heritrix version and deduplication

nicolas.giraud at bnf.fr nicolas.giraud at bnf.fr
Wed Apr 29 14:55:26 CEST 2009


We are currently using Heritrix 1.14 in production, so the production team 
would feel more comfortable keeping the same version when we move NAS into 
production. I understand that the supplied version of Heritrix is a 
patched 1.12.1, with code added to handle deduplication. The production 
team therefore has two main questions:

1) Is there a way to properly turn off deduplication? We ask because we 
use Wayback, and deduplication information would not be shown to the end 
user, which the librarians do not accept. However, I believe there might 
be a way to generate CDX indexes from the deduplication logs. Any insight?

2) Is there a way to replace the supplied Heritrix version with 1.14, 
possibly losing the deduplication features?

My personal opinion is that deduplication is a major feature, and I would 
like to use it in production, but I would like some background information 
to be able to discuss alternatives with the production team.


