[Netarchivesuite-users] Optimizing deduplication index generation

nicolas.giraud at bnf.fr nicolas.giraud at bnf.fr
Mon Jun 22 13:54:54 CEST 2009


During my broad-harvest tests, I've noticed that generating the 
deduplication index takes a very long time. So far I've harvested about 
70 GB of data, which is not very much, yet generating the index for a new 
broad-harvest job takes about one hour. Is there a way to store the 
previous indices and only incrementally generate the delta for the jobs 
that were not previously taken into account? Does index generation already 
work that way, or does it start over every time? If not, why does it take 
so long?
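For context, the incremental scheme the question proposes could look roughly like this. This is a minimal sketch with hypothetical helper names, not NetarchiveSuite's actual implementation: cache one small index per completed job, and when a new harvest needs a combined deduplication index, build indices only for the jobs not yet cached, then merge.

```python
# Hypothetical sketch of incremental deduplication-index generation.
# build_job_index and merge are illustrative stand-ins, not NetarchiveSuite APIs.

def build_dedup_index(job_ids, cache, build_job_index, merge):
    """Return a merged index for job_ids, building only uncached jobs."""
    missing = [j for j in job_ids if j not in cache]
    for j in missing:
        cache[j] = build_job_index(j)  # the expensive step, done once per job
    return merge([cache[j] for j in job_ids])

# Toy stand-ins: a per-job "index" is just a set of harvested URLs.
def build_job_index(job_id):
    return {f"url-{job_id}-{n}" for n in range(3)}

def merge(indices):
    combined = set()
    for index in indices:
        combined |= index
    return combined

cache = {}
first = build_dedup_index([1, 2], cache, build_job_index, merge)
# The second call builds only job 3's index; jobs 1 and 2 come from the cache.
second = build_dedup_index([1, 2, 3], cache, build_job_index, merge)
```

With this pattern, adding one new job to a harvest costs one per-job index build plus a merge, instead of re-indexing every job from scratch.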

Best regards,

Consider the environment before printing this mail.
