[Netarchivesuite-devel] troubleshooting deduplication

Tue Larsen tlr at kb.dk
Mon Sep 19 11:40:34 CEST 2011


Hi Sara

We have 2 steps in our broad crawl.
Each step creates its own index; the second one builds upon the first.
Our last step 2 index was 83 GB.

It took
ca. 2.5 hours to get all the crawl logs from the bitarchive servers (15 x 6 bitapps using 4 x ftpservers),
ca. 2 hours to extract the CDX files from the crawl logs, and
4 days to build the 83 GB index.

Step 1 also took 4 days.

The index server is an hp380g7, 64-bit, running RHEL 6:
10 GB RAM,
8 x Intel(R) Xeon(R) CPU E5620 @ 2.40GHz.
Disk drives are 15K rpm on / (root) and 10K rpm on /msa*.

The disk partitions have separate I/O controllers for / (root) and /msa*;
the /msa* partitions are on a separate MSA box.

/dev/mapper/vg_kb01-LogVol00
                       863G   303G   517G  37% /
tmpfs                  5.2G   185k   5.2G   1% /dev/shm
/dev/sda1              508M    80M   403M  17% /boot
/dev/mapper/vg002-lv002
                       1.8T   1.1T   680G  60% /msa01
/dev/mapper/vg003-lv003
                       1.8T   369G   1.4T  22% /msa02
/dev/mapper/vg004-lv004
                       1.8T   947G   736G  57% /msa03

The crawl logs are on /msa03,
the CDX files on /msa02,
and the Lucene index on /.

All the best
Tue
________________________________________
From: netarchivesuite-devel-bounces at ml.sbforge.org [netarchivesuite-devel-bounces at ml.sbforge.org] on behalf of Søren Vejrup Carlsen [svc at kb.dk]
Sent: 16 September 2011 16:26
To: netarchivesuite-devel at ml.sbforge.org
Subject: Re: [Netarchivesuite-devel] troubleshooting deduplication

Hi Sara.
About the configuration of the index server, please send an email directly to Tue Larsen (tlr at kb.dk).
But I do know that we have now reduced the indexing time from 14 days to 4 days by getting faster disks and moving to a 64-bit architecture (Red Hat Enterprise Linux 6).

Only about a quarter of that time (I believe) is spent fetching the logs and CDX files that the index server needs to do the indexing.

About 2): No, we haven't tried that.
About 3): The current workflow when the harvest controller starts a job is as follows:
the job carries a list of prior jobs (given as a list of job numbers) from which to build a deduplication index
for the harvesting job. A request for an index based on these numbers is then sent to the cache.
If it isn't in the cache, the request is forwarded to the index server via JMS.
The cache-lookup mechanism is the following. It looks for the file returned by the method
dk.netarkivet.archive.indexserver.MultiFileBasedCache#getCacheFile:

File getCacheFile(Set<T> ids) {
        String fileName = FileUtils.generateFileNameFromSet(ids, "-cache");
        return new File(getCacheDir(), fileName);
}

It returns a File object whose name consists of the first 4 job IDs (separated by '-'), followed by an MD5 sum, suffixed with "-cache".
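As a rough illustration, the naming scheme described above could be sketched like this. Note this is a hypothetical re-implementation for clarity, not the actual NetarchiveSuite FileUtils code; the class name CacheFileNameSketch and the choice of hashing the full '-'-joined ID list are my own assumptions.

```java
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

public class CacheFileNameSketch {

    // Hypothetical sketch of the naming scheme: the first 4 job IDs joined
    // by '-', then an MD5 sum (here, of the full sorted ID list -- an
    // assumption), then the suffix "-cache".
    static String generateFileNameFromSet(Set<Long> ids, String suffix) {
        TreeSet<Long> sorted = new TreeSet<>(ids); // deterministic order
        StringBuilder allIds = new StringBuilder();
        StringBuilder prefix = new StringBuilder();
        int count = 0;
        for (Long id : sorted) {
            allIds.append(id).append('-');
            if (count < 4) {
                prefix.append(id).append('-');
            }
            count++;
        }
        return prefix + md5Hex(allIds.toString()) + suffix;
    }

    // Hex-encoded MD5 digest of a string (32 hex characters).
    static String md5Hex(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        Set<Long> ids = new TreeSet<>(Arrays.asList(1L, 2L, 3L, 4L, 5L));
        String name = generateFileNameFromSet(ids, "-cache");
        // Name looks like "1-2-3-4-<32 hex chars>-cache"
        System.out.println(name);
        // The cache then simply checks whether this file exists:
        System.out.println(new File("/somecachedir", name).getPath());
    }
}
```

Under this scheme the cache lookup is just a file-existence check: the same set of job IDs always maps to the same file name, so a previously built index is found without querying the index server.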


I hope this helps with your troubleshooting.

/Søren

---------------------------------------------------------------------------
Søren Vejrup Carlsen, Department of Digital Preservation, Royal Library, Copenhagen, Denmark
tlf: (+45) 33 47 48 41
email: svc at kb.dk
----------------------------------------------------------------------------
Non omnia possumus omnes
--- Macrobius, Saturnalia, VI, 1, 35 -------



-----Original message-----
From: netarchivesuite-devel-bounces at ml.sbforge.org [mailto:netarchivesuite-devel-bounces at ml.sbforge.org] On behalf of sara.aubry at bnf.fr
Sent: 16 September 2011 14:02
To: netarchivesuite-devel at ml.sbforge.org
Cc: bert.wendland at bnf.fr; christophe.yven at bnf.fr
Subject: [Netarchivesuite-devel] troubleshooting deduplication

Hello everyone,

As I mentioned during our last teleconference, we are testing NetarchiveSuite 3.16.1 and a new architecture to launch our annual broad crawl.

We activated the harvest on August 23 (almost 3 weeks ago!) and the deduplication index is still not ready!

1) Could you tell us the configuration of your index server (CPU, RAM, local disk space vs. NFS partition), and how long your deduplication process took, for how much data?

2) Is it possible (have you ever tested) to generate a deduplication index in a test environment and use it in your production environment?
We hope to be able to end our deduplication process and use the index...

3) When a job starts, how does the index server know that an index has already been created?

Many thanks for your answers.

Sara



_______________________________________________
Netarchivesuite-devel mailing list
Netarchivesuite-devel at ml.sbforge.org
http://ml.sbforge.org/mailman/listinfo/netarchivesuite-devel


