[Netarchivesuite-users] Heritrix version and deduplication

Søren Vejrup Carlsen svc at kb.dk
Fri May 1 14:51:16 CEST 2009


Hi Nicolas.

ad 1: Unfortunately, there is no way to turn off deduplication completely. You can only turn it off in the harvester, by removing the DeDuplicator object from the template. The deduplication index will still be retrieved by the indexserver anyway. (I will file a bug for this: the index should only be fetched if deduplication is enabled in the harvester template.)
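
As a rough illustration of removing the DeDuplicator object from a template, here is a minimal Python sketch. It assumes the template is a Heritrix 1.x style order.xml in which processors are declared as `newObject` elements with a `name` attribute. The sample XML, the `is.hi.bok.deduplicator.DeDuplicator` class name, and the function name `remove_deduplicator` are illustrative assumptions, not taken from an actual NetarchiveSuite template, so check them against your own template first.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed template used only for illustration.
SAMPLE_TEMPLATE = """<crawl-order>
  <controller>
    <map name="write-processors">
      <newObject name="DeDuplicator" class="is.hi.bok.deduplicator.DeDuplicator">
        <boolean name="enabled">true</boolean>
      </newObject>
      <newObject name="Archiver" class="org.archive.crawler.writer.ARCWriterProcessor">
        <boolean name="enabled">true</boolean>
      </newObject>
    </map>
  </controller>
</crawl-order>"""


def remove_deduplicator(order_xml: str, name: str = "DeDuplicator") -> str:
    """Return the template with every <newObject name=NAME> element removed."""
    root = ET.fromstring(order_xml)
    # Snapshot the element list first so removal does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "newObject" and child.get("name") == name:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(remove_deduplicator(SAMPLE_TEMPLATE))
```

Note that this only stops the harvester from deduplicating; as described above, the deduplication index would still be fetched.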
 
ad 2: It should be quite easy to do so. However, the deduplicator software would need to be recompiled against the same Heritrix version that is used.
Otherwise, it should just be a matter of updating the contents of lib/heritrix
and upgrading the few libraries in lib/heritrix-dependencies.
 
/Søren
 

From: netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk [mailto:netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk] On behalf of nicolas.giraud at bnf.fr
Sent: 29 April 2009 14:55
To: netarchivesuite-users at lists.gforge.statsbiblioteket.dk
Subject: [Netarchivesuite-users] Heritrix version and deduplication

 


Hi,

We are currently using Heritrix 1.14 in production, so the production team would feel more comfortable keeping the same version when we move NAS into production. I understand that the supplied version of Heritrix is a patched 1.12.1, with code added to handle deduplication. The production team therefore has two main questions:

1) Is there a way to properly turn off deduplication? We use Wayback, and deduplication information would not appear to the end user, which the librarians are not OK with. I believe there might be a way to generate CDX indexes from the deduplication logs, though. Any insight?

2) Is there a way to replace the supplied Heritrix version with 1.14, possibly losing the deduplication features?

My personal opinion is that deduplication is a major feature, and I would like to use it in production, but I would like some background information to be able to discuss alternatives with the production team.

Cheers,
Nicolas


