[Netarchivesuite-users] Deduplicating text?

nicolas.giraud at bnf.fr nicolas.giraud at bnf.fr
Mon May 11 16:31:14 CEST 2009


Hi,

I have noticed that the DeDuplicator configuration in the harvest
templates blacklists all text-based MIME types:

<newObject name="DeDuplicator" class="is.hi.bok.deduplicator.DeDuplicator">
    <boolean name="enabled">true</boolean>
    <map name="filters">
    </map>
    <string name="index-location"/>
    <string name="matching-method">By URL</string>
    <boolean name="try-equivalent">true</boolean>
    <boolean name="change-content-size">false</boolean>
    <string name="mime-filter">^text/.*</string>
    <string name="filter-mode">Blacklist</string>
    <string name="analysis-mode">Timestamp</string>
    <string name="log-level">SEVERE</string>
    <string name="origin"/>
    <string name="origin-handling">Use index information</string>
    <boolean name="stats-per-host">true</boolean>
</newObject>

I'd like to know why. Is there a performance issue? What would happen
if I removed the filter?
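
To make the question concrete, here is a minimal sketch of how I read the
two settings. It assumes that "mime-filter" is an ordinary regular
expression matched against the content type, and that in "Blacklist" mode
matching documents are skipped by the DeDuplicator. The function name
is_deduplicated is hypothetical and only for illustration; it is not part
of the DeDuplicator or NetarchiveSuite.

```python
import re

# Pattern copied from the "mime-filter" setting in the template.
MIME_FILTER = re.compile(r"^text/.*")


def is_deduplicated(mime_type: str, filter_mode: str = "Blacklist") -> bool:
    """Return True if a document of this MIME type would be deduplicated.

    Assumed semantics (not confirmed): in Blacklist mode, types matching
    the filter are skipped; in Whitelist mode, only matching types are
    processed.
    """
    matches = MIME_FILTER.fullmatch(mime_type) is not None
    if filter_mode == "Blacklist":
        return not matches
    return matches


if __name__ == "__main__":
    for mime in ["text/html", "text/css", "image/png", "application/pdf"]:
        print(mime, is_deduplicated(mime))
```

If that reading is right, text/html and text/css are never deduplicated
with the current template, while images and PDFs are.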

Cheers,
Nicolas



