[Netarchivesuite-users] Optimizing deduplication index generation

Mon Jun 22 14:00:53 CEST 2009

Currently it starts from scratch every time, which, as you noticed, is not very efficient. At netarchive.dk, indexing for the broad crawls currently takes around 10 days on 18 TB of data.
You could optimize by, e.g., merging existing indices, but I'm not sure you would always want to just add new material to the index: that way the indices will keep growing and will hold objects that are no longer on the web.
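As a rough sanity check on these figures, the two indexing rates in this thread can be compared. This is only back-of-the-envelope arithmetic: it assumes decimal units (1 TB = 1000 GB), that the 10 days cover the full 18 TB, and that indexing time scales roughly linearly with data volume. The function name is made up for the illustration.

```python
# Back-of-the-envelope comparison of the two indexing rates quoted in this thread.
# Assumptions: decimal units (1 TB = 1000 GB) and roughly linear scaling.

def gb_per_hour(size_gb: float, hours: float) -> float:
    """Average amount of harvested data indexed per hour."""
    return size_gb / hours

# netarchive.dk: about 18 TB indexed in about 10 days
netarchive_rate = gb_per_hour(18 * 1000, 10 * 24)

# The test setup in the question: about 70 GB indexed in about 1 hour
test_rate = gb_per_hour(70, 1)

print(f"netarchive.dk: {netarchive_rate:.0f} GB/hour")
print(f"test harvest:  {test_rate:.0f} GB/hour")
```

Under those assumptions the two rates come out in the same range (about 75 GB/hour versus about 70 GB/hour), which would suggest that one hour for 70 GB is consistent with the cost of a full rebuild rather than a sign of a misconfiguration.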
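One way to get incremental behaviour without letting the index grow forever could be to cache a small index per job and rebuild only the merged view from the jobs you actually want to deduplicate against. The sketch below is a plain in-memory illustration of that idea, not NetarchiveSuite code and not its API: `IndexCache`, `merged_index`, the `build` callback and the URL-to-digest layout are all hypothetical.

```python
from typing import Callable, Dict, Iterable

# Hypothetical model: a per-job index maps a URL to the digest of its content.
JobIndex = Dict[str, str]


class IndexCache:
    """Keeps one index per job so that each job is indexed only once."""

    def __init__(self) -> None:
        self._per_job: Dict[int, JobIndex] = {}
        self.builds = 0  # number of jobs that actually had to be indexed

    def get(self, job_id: int, build: Callable[[int], JobIndex]) -> JobIndex:
        # Only jobs missing from the cache (the "delta") are indexed.
        if job_id not in self._per_job:
            self._per_job[job_id] = build(job_id)
            self.builds += 1
        return self._per_job[job_id]


def merged_index(cache: IndexCache, job_ids: Iterable[int],
                 build: Callable[[int], JobIndex]) -> JobIndex:
    """Merge the cached per-job indices for the selected jobs only.

    Jobs are merged in ascending id order, so an entry from a later job
    overrides an entry for the same URL from an earlier job.
    """
    merged: JobIndex = {}
    for job_id in sorted(job_ids):
        merged.update(cache.get(job_id, build))
    return merged


# Usage with made-up data standing in for the real per-job crawl records.
fake_jobs = {
    1: {"http://example.org/a": "sha1:AAA"},
    2: {"http://example.org/b": "sha1:BBB"},
    3: {"http://example.org/a": "sha1:CCC"},
}
cache = IndexCache()
build = lambda job_id: dict(fake_jobs[job_id])

first = merged_index(cache, [1, 2], build)   # indexes jobs 1 and 2
second = merged_index(cache, [2, 3], build)  # indexes only job 3
```

Because the merged view is recomposed from the selected jobs each time, leaving an old job out of the selection drops its entries, so objects that are no longer on the web do not accumulate.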
Best
Bjarne Andersen

Sent from my HTC Touch Pro

----- Original message -----
From: nicolas.giraud at bnf.fr <nicolas.giraud at bnf.fr>
Sent: 22 June 2009 13:55
To: netarchivesuite-users at lists.gforge.statsbiblioteket.dk <netarchivesuite-users at lists.gforge.statsbiblioteket.dk>
Subject: [Netarchivesuite-users] Optimizing deduplication index generation

Hi,

During my broad harvest tests, I've noticed that generating the deduplication index takes a very long time. So far I've harvested about 70 GB of data, which is not very much, and generating the index for a new broad harvest job takes about one hour. Is there a way to store the previous indices and incrementally generate only the delta for the jobs that were not previously taken into account? Does index generation already work that way, or does it start over every time? If not, why does it take so long?

Best regards,
Nicolas
