[Netarchivesuite-devel] Creating the deduplication index
sara.aubry at bnf.fr
Wed May 26 08:11:56 CEST 2010
Hello everyone,
We launched our second stage on May 18 and the IndexServer is still
creating the deduplication index.
Because our local storage space was too small (it actually filled up
very quickly when we first tried it),
we had to use an NFS-mounted partition to store the cache directory.
Here is where we're at this morning:
# du -h cache
123M cache/FULL_CRAWL_LOG/3-cache
123M cache/FULL_CRAWL_LOG
62G cache/cdxdata
4.0K cache/cdxindex
158G cache/crawllog
4.0K cache/dedupcrawllogindex/empty-cache
34G cache/dedupcrawllogindex/1-2-3-4-06813bb20ca5916ec43d0ff7a0e43fb6-cache.luceneDir
34G cache/dedupcrawllogindex
4.0K cache/fullcrawllogindex/empty-cache
37M cache/fullcrawllogindex/3-cache
37M cache/fullcrawllogindex
254G cache
Comparing with our test crawls, we noticed that indexing over NFS is
about half as fast as on a local disk.
But it is still running, and we had no other real solution.
We have 2 questions:
- Is there any way to precisely estimate the target size of the index?
Extrapolating from the figures we had when running test crawls, it should be
close to 44 GB, but we are not sure.
- Is there a way to follow the progress of the indexing process, that is,
where Lucene is and which file it is currently handling?
Looking at some tmp files, we noticed that indexing is not sequential (i.e.
not one job/crawl.log after another).
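In the meantime, a crude way to watch progress is to inspect the cache directory itself: the growing size of the `*.luceneDir` directory, and the modification times of files under the cache. This is only a sketch; the `cache` layout below mirrors the `du` output above, and the demo setup lines just create a stand-in tree so the commands run anywhere (skip them against a real cache):

```shell
# Demo setup only -- stand-in for the real IndexServer cache tree (skip on a real install)
mkdir -p cache/dedupcrawllogindex/demo-cache.luceneDir cache/crawllog
touch cache/dedupcrawllogindex/demo-cache.luceneDir/_0.cfs cache/crawllog/job-1.log

# Total size of the Lucene index dir so far (compare against the ~44 GB estimate)
du -sh cache/dedupcrawllogindex/*.luceneDir

# Most recently modified files under the cache, newest first -- a rough cursor
# into what the indexer is touching right now
ls -lt cache/crawllog | head
```

Rerunning the `du` line periodically gives a rough growth rate, though with non-sequential indexing it is only an approximation of overall progress.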
Many thanks for your help.
Sara
Before printing, think of the environment.