[Netarchivesuite-devel] troubleshooting deduplication

Søren Vejrup Carlsen svc at kb.dk
Fri Sep 16 16:26:40 CEST 2011


Hi Sara.
About the configuration of the indexserver, please send an email directly to Tue Larsen (tlr at kb.dk).
But I know that we have now reduced the indexing time from 14 days to 4 days by getting faster disks and moving to a 64-bit architecture (Red Hat Enterprise Linux 6).

Only about a quarter of that time (I believe) is spent fetching the crawl logs and CDX files to the indexserver, which are required to do the indexing.

About 2) No, we haven't tried that.
About 3) The current workflow when the harvestcontroller starts a job is as follows:
The job carries a list of prior jobs (given as a list of job numbers) from which to build a deduplication index
for the harvesting job. A request for an index based on these numbers is then sent to the cache.
If the index isn't in the cache, the request is forwarded to the indexserver via JMS.
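Roughly, the flow above could be sketched like this (class and method names here are illustrative placeholders, not the real NetarchiveSuite API; in the real system a cache miss sends a JMS message rather than making a direct call, and the file name below is simplified):

```java
import java.io.File;
import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;

// Illustrative sketch of the harvestcontroller-side lookup described above.
public class DedupIndexFlowSketch {
    static final File CACHE_DIR = new File("dedupcache"); // assumed location

    // Deterministic cache file for a set of prior job IDs (simplified naming).
    static File getCacheFile(SortedSet<Long> jobIds) {
        StringBuilder name = new StringBuilder();
        for (Long id : jobIds) name.append(id).append('-');
        return new File(CACHE_DIR, name + "cache");
    }

    // Returns the cached index if present; otherwise the real system would
    // forward the request to the indexserver over JMS and wait for the file.
    static File lookupIndex(SortedSet<Long> jobIds) {
        File cached = getCacheFile(jobIds);
        if (cached.exists()) {
            return cached;  // cache hit: reuse the existing index
        }
        // cache miss: here a JMS request to the indexserver would be sent
        return null;
    }

    public static void main(String[] args) {
        SortedSet<Long> priorJobs = new TreeSet<>(Arrays.asList(101L, 102L, 103L));
        System.out.println(getCacheFile(priorJobs).getName());
    }
}
```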
The cache-lookup mechanism is the following. It looks for a file defined by the method
dk.netarkivet.archive.indexserver.MultiFileBasedCache#getCacheFile:

    protected File getCacheFile(Set<T> ids) {
        String fileName = FileUtils.generateFileNameFromSet(ids, "-cache");
        return new File(getCacheDir(), fileName);
    }

It returns a File object whose name consists of the first 4 job IDs (separated by '-'), followed by an MD5 sum, suffixed by "-cache".
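As a rough reconstruction of that naming scheme from the description above (this is an assumption-laden sketch, not the actual FileUtils.generateFileNameFromSet implementation; the exact input to the checksum may differ):

```java
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;

// Sketch of the described cache-file naming: the first four job IDs joined
// by '-', then an MD5 checksum over the full ID list, then the suffix.
public class CacheNameSketch {
    static String generateFileName(SortedSet<Long> ids, String suffix) throws Exception {
        StringBuilder all = new StringBuilder();     // all IDs, fed to MD5
        StringBuilder prefix = new StringBuilder();  // first 4 IDs, kept readable
        int shown = 0;
        for (Long id : ids) {
            all.append(id).append('-');
            if (shown++ < 4) prefix.append(id).append('-');
        }
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(all.toString().getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return prefix + hex.toString() + suffix;
    }

    public static void main(String[] args) throws Exception {
        SortedSet<Long> jobs = new TreeSet<>(Arrays.asList(1L, 2L, 3L, 4L, 5L, 6L));
        // e.g. "1-2-3-4-<32 hex chars>-cache"
        System.out.println(generateFileName(jobs, "-cache"));
    }
}
```

The point of the scheme is that the name is deterministic for a given ID set, so the harvestcontroller can check for the file's existence without contacting the indexserver.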


I hope this helps with your troubleshooting.

/Søren

---------------------------------------------------------------------------
Søren Vejrup Carlsen, Department of Digital Preservation, Royal Library, Copenhagen, Denmark
tlf: (+45) 33 47 48 41
email: svc at kb.dk
----------------------------------------------------------------------------
Non omnia possumus omnes
--- Macrobius, Saturnalia, VI, 1, 35 -------



-----Original Message-----
From: netarchivesuite-devel-bounces at ml.sbforge.org [mailto:netarchivesuite-devel-bounces at ml.sbforge.org] On behalf of sara.aubry at bnf.fr
Sent: 16 September 2011 14:02
To: netarchivesuite-devel at ml.sbforge.org
Cc: bert.wendland at bnf.fr; christophe.yven at bnf.fr
Subject: [Netarchivesuite-devel] troubleshooting deduplication

Hello everyone,

As I mentioned during our last teleconference, we are testing NetarchiveSuite 3.16.1 and a new architecture to launch our annual broad crawl.

We activated the harvest on August 23 (almost 3 weeks ago!) and the deduplication index is still not ready!

1) Could you tell us the configuration of your index server (CPU, RAM, local disk space vs. NFS partition), and how long your deduplication process took for how much data?

2) Is it possible (have you ever tested) to generate a deduplication index in a test environment and use it in your production environment?
We hope to be able to end our deduplication process and use the index... 

3) When a job starts, how does the index server know that an index has already been created?

Many thanks for your answers.

Sara
 



