[Netarchivesuite-users] Heritrix version and deduplication
Søren Vejrup Carlsen
svc at kb.dk
Fri May 1 14:51:16 CEST 2009
Hi Nicolas.
ad 1: Unfortunately, there is no way to turn off deduplication completely. You can only turn it off in the harvester, by
removing the DeDuplicator object from the template. The deduplication index will still be retrieved by the indexserver anyway. (I will create a bug for this issue: the index should only be fetched if
deduplication is enabled in the harvester template.)
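As a rough illustration of "removing the DeDuplicator object from the template", here is a minimal Python sketch. It assumes the template is a Heritrix 1.x order.xml in which the deduplicator appears as a newObject element whose class attribute ends in ".DeDuplicator"; the sample fragment and the helper name remove_deduplicator are made up for the example, so check the element and class names against your own template first.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modelled on a Heritrix 1.x order.xml. The element
# layout and class names are assumptions, not copied from a real template.
SAMPLE_TEMPLATE = """<crawl-order>
  <controller>
    <map name="write-processors">
      <newObject name="DeDuplicator" class="is.hi.bok.deduplicator.DeDuplicator">
        <boolean name="enabled">true</boolean>
      </newObject>
      <newObject name="Archiver" class="org.archive.crawler.writer.ARCWriterProcessor">
        <boolean name="enabled">true</boolean>
      </newObject>
    </map>
  </controller>
</crawl-order>"""


def remove_deduplicator(template_xml: str) -> str:
    """Return the template with every DeDuplicator newObject removed."""
    root = ET.fromstring(template_xml)
    # ElementTree has no parent pointers, so visit each element as a
    # potential parent and inspect a snapshot of its children.
    for parent in list(root.iter()):
        for child in list(parent):
            is_dedup = (child.tag == "newObject"
                        and child.get("class", "").endswith(".DeDuplicator"))
            if is_dedup:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


stripped = remove_deduplicator(SAMPLE_TEMPLATE)
```

This only changes what the harvester does with the template; it does not affect whether the deduplication index is requested.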
ad 2: It should be quite easy to do so. However, it would be necessary to recompile the deduplicator software against the same Heritrix version that is used.
Otherwise, it should just be a matter of updating the contents of lib/heritrix
and upgrading the few libraries in lib/heritrix-dependencies.
/Søren
From: netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk [mailto:netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk] On behalf of nicolas.giraud at bnf.fr
Sent: 29 April 2009 14:55
To: netarchivesuite-users at lists.gforge.statsbiblioteket.dk
Subject: [Netarchivesuite-users] Heritrix version and deduplication
Hi,
Currently we are using Heritrix 1.14 in production, so the prod team would feel more comfortable keeping the same version when we move NAS into production. I understand that the supplied version of Heritrix is a patched 1.12.1, with code added to handle deduplication. So the production team has two main questions:
1) Is there a way to properly turn off deduplication? This is because we use Wayback, and deduplication information would not appear to the end user, which the librarians are not OK with. But I believe there might be a way to generate CDX indexes from the deduplication logs. Any insight?
2) Is there a way to replace the supplied Heritrix version with 1.14, maybe losing deduplication features?
My personal opinion is that deduplication is a major feature, and I would like to use it in production, but I would like some background information to be able to discuss alternatives with the production team.
Cheers,
Nicolas