[Netarchivesuite-devel] Creating the deduplication index

Bjarne Andersen bja at statsbiblioteket.dk
Thu May 27 16:27:47 CEST 2010


Hi Sara.

Monitoring the indexing process is definitely not easy in the current system. I think there is an FR for better monitoring (otherwise we should create one).
A simple solution would be to have the indexer log a line whenever it starts on a new job number, e.g. "Now starting job ZZZZ - number X of Y jobs". That would indicate how far the process has come and how close it is to the end.
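For illustration, a minimal sketch of what that could look like (the class
and method names are hypothetical, not the actual NetarchiveSuite indexer
code, and I'm assuming commons-logging):

    import java.util.List;
    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;

    public class IndexerProgress {
        private static final Log log = LogFactory.getLog(IndexerProgress.class);

        /** Log one line per job so an operator can follow progress. */
        public void indexJobs(List<Long> jobIds) {
            int total = jobIds.size();
            int current = 0;
            for (long jobId : jobIds) {
                current++;
                log.info("Now starting job " + jobId
                        + " - number " + current + " of " + total + " jobs");
                // ... index the crawl log for this job ...
            }
        }
    }

Grepping the log for "Now starting job" would then give a running progress count.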

About the size of the index: it should be possible to calculate the approximate size. Since you use object limits instead of byte limits, your index might end up a bit bigger or smaller than ours, but 44 GB does seem right. I remember our last index being 90 GB on 20 TB of data.
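As a rough rule of thumb from those numbers (assuming the index size scales roughly linearly with the amount of data crawled): 90 GB / 20 TB is about 4.5 GB of index per TB, so a 44 GB index would correspond to roughly 10 TB of crawled data.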

During the process the index might grow quite a lot bigger, since Lucene once in a while optimizes the index and needs extra free disk space while doing so.
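For illustration, the optimize step looks something like this (a minimal
sketch assuming a Lucene 3.x-era API; the directory path is just an example):

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class OptimizeSketch {
        public static void main(String[] args) throws Exception {
            FSDirectory dir = FSDirectory.open(
                    new File("cache/dedupcrawllogindex/working"));
            IndexWriter writer = new IndexWriter(dir,
                    new StandardAnalyzer(Version.LUCENE_30),
                    IndexWriter.MaxFieldLength.UNLIMITED);
            // optimize() merges all segments into one; the old segments are
            // only deleted after the merge completes, so disk usage can
            // temporarily reach roughly 2-3x the final index size.
            writer.optimize();
            writer.close();
        }
    }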

best
Bjarne
________________________________________
From: sara.aubry at bnf.fr [sara.aubry at bnf.fr]
Sent: 26 May 2010 08:11
To: netarchivesuite-devel at lists.gforge.statsbiblioteket.dk; Bjarne Andersen
Subject: Creating the deduplication index

Hello everyone,

We launched our second stage on May 18 and the IndexServer is still
creating the deduplication index.
Because our local storage space was too small (it actually filled up
very quickly when we first tried it),
we had to use an NFS-mounted partition to store the cache directory.

Here is where we're at this morning:
# du -h cache
123M    cache/FULL_CRAWL_LOG/3-cache
123M    cache/FULL_CRAWL_LOG
62G     cache/cdxdata
4.0K    cache/cdxindex
158G    cache/crawllog
4.0K    cache/dedupcrawllogindex/empty-cache
34G     cache/dedupcrawllogindex/1-2-3-4-06813bb20ca5916ec43d0ff7a0e43fb6-cache.luceneDir
34G     cache/dedupcrawllogindex
4.0K    cache/fullcrawllogindex/empty-cache
37M     cache/fullcrawllogindex/3-cache
37M     cache/fullcrawllogindex
254G    cache

Comparing with our test crawls, we noticed that indexing was about half
as fast as on a local disk.
But it is still working, and we had no other real solution.

We have 2 questions:
- Is there any way to evaluate the target size of the index precisely?
Comparing with the figures from our test crawls, it should be
close to 44 GB, but we are not sure.

- Is there a way to follow the progress of the indexing process: where
Lucene is at, and which file it is handling?
Looking at some tmp files, we noticed that indexing is not sequential
(i.e. one job/crawl.log after the other).

Many thanks for your help.

Sara






