[Netarchivesuite-users] Deduplicating text?
nicolas.giraud at bnf.fr
Mon May 11 16:31:14 CEST 2009
Hi,
I have noticed that the DeDuplicator configuration in the harvest
templates blacklists all text-based MIME types:
<newObject name="DeDuplicator"
class="is.hi.bok.deduplicator.DeDuplicator">
<boolean name="enabled">true</boolean>
<map name="filters">
</map>
<string name="index-location"/>
<string name="matching-method">By URL</string>
<boolean name="try-equivalent">true</boolean>
<boolean name="change-content-size">false</boolean>
<string name="mime-filter">^text/.*</string>
<string name="filter-mode">Blacklist</string>
<string name="analysis-mode">Timestamp</string>
<string name="log-level">SEVERE</string>
<string name="origin"/>
<string name="origin-handling">Use index information</string>
<boolean name="stats-per-host">true</boolean>
</newObject>
I'd like to know why. Is there a performance issue? What would happen
if I removed the filter?
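To make sure I am reading the two filter settings correctly, here is a minimal Python sketch of how I understand them. It is not the DeDuplicator code: the function name `is_considered`, the whole-string regex match, and the `Whitelist` mode are my assumptions. Only the `mime-filter` and `filter-mode` values come from the template above.

```python
import re

# Values copied from the template above.
MIME_FILTER = r"^text/.*"
FILTER_MODE = "Blacklist"


def is_considered(mime_type, mime_filter=MIME_FILTER, filter_mode=FILTER_MODE):
    """Return True if a document with this MIME type would be deduplicated.

    Assumptions (not taken from the DeDuplicator source):
    - the filter is applied as a whole-string regex match;
    - "Blacklist" skips matching types, "Whitelist" keeps only matching types.
    """
    matches = re.fullmatch(mime_filter, mime_type) is not None
    if filter_mode == "Blacklist":
        return not matches
    return matches  # assumed "Whitelist" behaviour


if __name__ == "__main__":
    for mime in ("text/html", "text/css", "image/png", "application/pdf"):
        verdict = "deduplicated" if is_considered(mime) else "skipped"
        print(mime, "->", verdict)
```

If this reading is right, every `text/*` document bypasses deduplication and is written on each harvest, while images, PDFs and other non-text types are checked against the index.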
Cheers,
Nicolas