[Netarchivesuite-devel] Creating the deduplication index
sara.aubry at bnf.fr
Wed May 26 08:11:56 CEST 2010
Hello everyone,
We launched our second stage on May 18 and the IndexServer is still
creating the deduplication index.
Because our local storage space was too small (it actually filled up
very quickly when we first tried it),
we had to use an NFS-mounted partition to store the cache directory.
Here is where we're at this morning:
# du -h cache
123M cache/FULL_CRAWL_LOG/3-cache
123M cache/FULL_CRAWL_LOG
62G cache/cdxdata
4.0K cache/cdxindex
158G cache/crawllog
4.0K cache/dedupcrawllogindex/empty-cache
34G cache/dedupcrawllogindex/1-2-3-4-06813bb20ca5916ec43d0ff7a0e43fb6-cache.luceneDir
34G cache/dedupcrawllogindex
4.0K cache/fullcrawllogindex/empty-cache
37M cache/fullcrawllogindex/3-cache
37M cache/fullcrawllogindex
254G cache
Comparing with our test crawls, we noticed that indexing over NFS is
about half as fast as on a local disk.
But it is still running, and we had no other real solution.
We have 2 questions:
- Is there any way to precisely estimate the target size of the index?
Extrapolating from the figures we had when running test crawls, it should be
close to 44 GB, but we are not sure.
- Is there a way to follow the progress of the indexing process, that is,
where Lucene is and which file it is currently handling?
Looking at some tmp files, we noticed that indexing is not sequential (i.e.
not one job/crawl.log after another).
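In the meantime, a crude way to watch progress is to inspect the cache directory itself: the growing size of the `*.luceneDir` directory, and the modification times of files under the cache. This is only a sketch; the `cache` layout below mirrors the `du` output above, and the demo setup lines just create a stand-in tree so the commands run anywhere (skip them against a real cache):

```shell
# Demo setup only -- stand-in for the real IndexServer cache tree (skip on a real install)
mkdir -p cache/dedupcrawllogindex/demo-cache.luceneDir cache/crawllog
touch cache/dedupcrawllogindex/demo-cache.luceneDir/_0.cfs cache/crawllog/job-1.log

# Total size of the Lucene index dir so far (compare against the ~44 GB estimate)
du -sh cache/dedupcrawllogindex/*.luceneDir

# Most recently modified files under the cache, newest first -- a rough cursor
# into what the indexer is touching right now
ls -lt cache/crawllog | head
```

Rerunning the `du` line periodically gives a rough growth rate, though with non-sequential indexing it is only an approximation of overall progress.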
Many thanks for your help.
Sara
Before printing, think of the environment.