[Netarchivesuite-users] Deduplicating text?

Bjarne Andersen bja at statsbiblioteket.dk
Mon May 11 16:40:50 CEST 2009


The indexes generated by the IndexServer apply this filter as well, so even if you disable the filter in the templates you would still only get deduplication on non-text/* MIME types.

However, there should be nothing wrong with deduplicating everything, since we use post-deduplication (all objects are downloaded anyway). Because of the indexing, though, it will require a bit of coding to let you set the filter to the same value in both the indexing and the templates. Maybe the IndexServer should read the value from the template?
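
To illustrate why the two filters have to agree, here is a minimal sketch in Java. It is not the actual IndexServer or DeDuplicator code: the class and method names are hypothetical, and it assumes that a blacklist filter simply excludes any MIME type that fully matches the regular expression, and that an object can only be deduplicated if it passed the filter both when the index was built and when the harvest runs.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of NetarchiveSuite or the DeDuplicator.
public class DedupFilterSketch {

    // Assumption: in blacklist mode a MIME type is eligible unless it matches
    // the filter regex. An empty or null filter excludes nothing.
    public static boolean eligible(String blacklistRegex, String mimeType) {
        if (blacklistRegex == null || blacklistRegex.isEmpty()) {
            return true;
        }
        return !Pattern.matches(blacklistRegex, mimeType);
    }

    // An object can only be deduplicated if it was indexed (index-side filter)
    // and is also considered during the harvest (template-side filter).
    public static boolean canDeduplicate(String indexFilter,
                                         String templateFilter,
                                         String mimeType) {
        return eligible(indexFilter, mimeType)
                && eligible(templateFilter, mimeType);
    }

    public static void main(String[] args) {
        String textFilter = "^text/.*";

        // Both sides blacklist text/*: images are still deduplicated.
        System.out.println(canDeduplicate(textFilter, textFilter, "image/png"));

        // Filter removed from the template only: text is still not
        // deduplicated, because it never made it into the index.
        System.out.println(canDeduplicate(textFilter, "", "text/html"));

        // Filter removed on both sides: text can be deduplicated.
        System.out.println(canDeduplicate("", "", "text/html"));
    }
}
```

Under these assumptions, removing the filter from the template alone changes nothing for text/* objects, which is why the value would have to be set in both places.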

best
Bjarne Andersen
________________________________________
From: netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk [netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk] On behalf of nicolas.giraud at bnf.fr [nicolas.giraud at bnf.fr]
Sent: 11 May 2009 16:31
To: netarchivesuite-users at lists.gforge.statsbiblioteket.dk
Subject: [Netarchivesuite-users] Deduplicating text?

Hi,

I have noticed that the Deduplicator configuration in the harvest templates blacklists all text-based mime types:

<newObject name="DeDuplicator" class="is.hi.bok.deduplicator.DeDuplicator">
    <boolean name="enabled">true</boolean>
    <map name="filters">
    </map>
    <string name="index-location"/>
    <string name="matching-method">By URL</string>
    <boolean name="try-equivalent">true</boolean>
    <boolean name="change-content-size">false</boolean>
    <string name="mime-filter">^text/.*</string>
    <string name="filter-mode">Blacklist</string>
    <string name="analysis-mode">Timestamp</string>
    <string name="log-level">SEVERE</string>
    <string name="origin"/>
    <string name="origin-handling">Use index information</string>
    <boolean name="stats-per-host">true</boolean>
</newObject>

I'd like to know why. Is there a performance issue? What happens if I remove the filter?

Cheers,
Nicolas





