[Netarchivesuite-devel] troubleshooting deduplication
Tue Larsen
tlr at kb.dk
Mon Sep 19 11:40:34 CEST 2011
Hi Sara
We have 2 steps in our broad crawl.
Each step creates its own index; the second builds upon the first.
Our last step-2 index was 83 GB.
It took:
ca. 2.5 hours to fetch all the crawl logs from the bitarchive servers (15 x 6 bitapps using 4 ftpservers),
ca. 2 hours to extract the CDX files from the crawl logs, and
4 days to build the 83 GB index.
The step-1 index also took 4 days to build.
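Just to put those numbers side by side, here is a back-of-the-envelope sum of the step-2 phases above (nothing more than arithmetic on the reported figures):

```java
public class IndexTimeBudget {

    // Sum the three step-2 phases reported above, in hours.
    static double totalHours(double fetchLogsHours, double extractCdxHours,
                             double buildIndexDays) {
        return fetchLogsHours + extractCdxHours + buildIndexDays * 24;
    }

    public static void main(String[] args) {
        // ca. 2.5 h fetching crawl logs, ca. 2 h CDX extraction,
        // 4 days building the 83 GB Lucene index.
        double total = totalHours(2.5, 2.0, 4);
        System.out.printf("total: %.1f hours (~%.1f days)%n", total, total / 24);
        // Index building dominates; log fetching and CDX extraction
        // together are under 5% of the total.
    }
}
```

So for this setup, faster log transfer would barely matter; the Lucene index build is the bottleneck.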
The index server is an HP 380 G7, 64-bit, running RHEL 6, with
10 GB RAM and
8 x Intel(R) Xeon(R) E5620 CPUs @ 2.40 GHz.
Disk drives are 15K RPM on / (root) and 10K RPM on /msa*.
The disk partitions have separate I/O controllers for / and /msa*;
the /msa* partitions are on a separate MSA box.
Filesystem                    Size  Used  Avail  Use%  Mounted on
/dev/mapper/vg_kb01-LogVol00  863G  303G  517G    37%  /
tmpfs                         5.2G  185k  5.2G     1%  /dev/shm
/dev/sda1                     508M   80M  403M    17%  /boot
/dev/mapper/vg002-lv002       1.8T  1.1T  680G    60%  /msa01
/dev/mapper/vg003-lv003       1.8T  369G  1.4T    22%  /msa02
/dev/mapper/vg004-lv004       1.8T  947G  736G    57%  /msa03
The crawl logs are on /msa03,
the CDX files on /msa02, and
the Lucene index on /.
All the best
Tue
________________________________________
From: netarchivesuite-devel-bounces at ml.sbforge.org [netarchivesuite-devel-bounces at ml.sbforge.org] on behalf of Søren Vejrup Carlsen [svc at kb.dk]
Sent: 16 September 2011 16:26
To: netarchivesuite-devel at ml.sbforge.org
Subject: Re: [Netarchivesuite-devel] troubleshooting deduplication
Hi Sara.
About the configuration of the indexserver, please send an email directly to Tue Larsen (tlr at kb.dk).
But I know that we have now reduced the indexing time from 14 days to 4 days by getting faster disks and moving to a 64-bit architecture (Red Hat Enterprise Linux 6).
Only about a quarter of that time (I believe) is spent fetching the logs and CDX files to the indexserver, which are required to do the indexing.
About 2) No, we haven't tried that.
About 3) The current workflow when the harvestcontroller starts a job is as follows:
With the job comes a list of prior jobs (given as a list of job numbers) from which to build a deduplication index
for the harvesting job. A request for an index based on these numbers is then sent to the cache.
If the index isn't in the cache, the request is forwarded to the indexserver via JMS.
The cache-lookup mechanism is the following: it looks for a file defined by the method
dk.netarkivet.archive.indexserver.MultiFileBasedCache#getCacheFile:

File getCacheFile(Set<T> ids) {
    String fileName = FileUtils.generateFileNameFromSet(ids, "-cache");
    return new File(getCacheDir(), fileName);
}
It returns something like a File object whose name is the first 4 job IDs (separated by '-'), then an MD5 sum, suffixed with "-cache".
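To make that naming scheme concrete, here is a minimal, hypothetical sketch of a helper building filenames of that shape. The real FileUtils.generateFileNameFromSet in NetarchiveSuite may differ in details such as ID ordering and exactly what is hashed; the class and method below are illustrations, not the actual implementation:

```java
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

public class CacheFileNameSketch {

    // Hypothetical re-creation of the naming described above: the first
    // four job IDs joined by '-', then an MD5 sum over the full ID list,
    // then the given suffix (e.g. "-cache").
    static String generateFileNameFromSet(Set<Long> ids, String suffix) {
        TreeSet<Long> sorted = new TreeSet<>(ids); // stable order for hashing
        String allIds = sorted.stream().map(String::valueOf)
                .collect(Collectors.joining("-"));
        String firstFour = sorted.stream().limit(4).map(String::valueOf)
                .collect(Collectors.joining("-"));
        return firstFour + "-" + md5Hex(allIds) + suffix;
    }

    static String md5Hex(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        Set<Long> ids = Set.of(1L, 2L, 3L, 4L, 5L);
        File cacheFile = new File("/tmp/cache",
                generateFileNameFromSet(ids, "-cache"));
        System.out.println(cacheFile.getName());
        // prints something like 1-2-3-4-<32 hex chars>-cache
    }
}
```

Hashing the full ID set keeps the filename short and stable even for requests covering thousands of job IDs, while the leading IDs keep the file recognizable to a human browsing the cache directory.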
I hope this helps with your troubleshooting.
/Søren
---------------------------------------------------------------------------
Søren Vejrup Carlsen, Department of Digital Preservation, Royal Library, Copenhagen, Denmark
tel: (+45) 33 47 48 41
email: svc at kb.dk
----------------------------------------------------------------------------
Non omnia possumus omnes
--- Macrobius, Saturnalia, VI, 1, 35 -------
-----Original message-----
From: netarchivesuite-devel-bounces at ml.sbforge.org [mailto:netarchivesuite-devel-bounces at ml.sbforge.org] On behalf of sara.aubry at bnf.fr
Sent: 16 September 2011 14:02
To: netarchivesuite-devel at ml.sbforge.org
Cc: bert.wendland at bnf.fr; christophe.yven at bnf.fr
Subject: [Netarchivesuite-devel] troubleshooting deduplication
Hello everyone,
As I mentioned during our last teleconference, we are testing NetarchiveSuite 3.16.1 and a new architecture to launch our annual broad crawl.
We activated the harvest on August 23 (almost 3 weeks ago!) and the deduplication index is still not ready!
1) Could you tell us the configuration of your index server (CPU, RAM, local disk space vs. NFS partition), how long your deduplication process took, and for how much data?
2) Is it possible (have you ever tested this?) to generate a deduplication index in a test environment and use it in your production environment?
We hope to be able to complete our deduplication process and use the index...
3) When a job starts, how does the index server know that an index has already been created?
Many thanks for your answers.
Sara
_______________________________________________
Netarchivesuite-devel mailing list
Netarchivesuite-devel at ml.sbforge.org
http://ml.sbforge.org/mailman/listinfo/netarchivesuite-devel