[Netarchivesuite-users] how do I configure NS to have a minimum or fixed value of 3500 domains per job?

Bjarne Andersen bja at statsbiblioteket.dk
Mon Mar 8 16:04:54 CET 2010


What about disabling the splitting by setting the current parameters to something so high/low that every job always ends up with 3500 domains?
(I can't figure out the exact values, but I imagine it is possible.)

Of course, this means that you would never know how large the jobs are going to be. 3500 very large domains in one job would overwhelm the harvester unless it really has a lot of local disk space.
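
As a rough, untested sketch of that idea: assuming the three limits under <jobs> are the ones that trigger size-based splitting, and assuming they are read as 64-bit integers, raising them to a very large value should leave configChunkSize as the only limit. The values below are placeholders, not recommended settings, and other criteria (for example different harvest templates) may still split jobs.

```xml
<scheduler>
    <jobs>
        <!-- Assumption: limits this large are never exceeded, so no size-based split occurs. -->
        <maxRelativeSizeDifference>9223372036854775807</maxRelativeSizeDifference>
        <minAbsoluteSizeDifference>9223372036854775807</minAbsoluteSizeDifference>
        <maxTotalSize>9223372036854775807</maxTotalSize>
    </jobs>
    <!-- Kept at the value from the original configuration. -->
    <configChunkSize>3500</configChunkSize>
</scheduler>
```

The other scheduler elements from the original configuration are left out here only for brevity; they would stay as they are.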

best
Bjarne
________________________________________
From: netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk [netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk] On behalf of sara.aubry at bnf.fr [sara.aubry at bnf.fr]
Sent: 8 March 2010 16:01
To: netarchivesuite-users at lists.gforge.statsbiblioteket.dk
Cc: bert.wendland at bnf.fr; nicolas.giraud at bnf.fr; PAUL.FIEVRE at bnf.fr
Subject: Re: [Netarchivesuite-users] how do I configure NS to have a minimum or fixed value of 3500 domains per job?

Søren,

is there a way (even a tricky one) to set these parameters to be close to
3500 domains?

Sara








From: Søren Vejrup Carlsen <svc at kb.dk>
Date: 08/03/2010 15:55
Sent by: <netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk>
Reply to: <netarchivesuite-users at lists.gforge.statsbiblioteket.dk>
To: "netarchivesuite-users at lists.gforge.statsbiblioteket.dk" <netarchivesuite-users at lists.gforge.statsbiblioteket.dk>
Cc: "bert.wendland at bnf.fr" <bert.wendland at bnf.fr>, "nicolas.giraud at bnf.fr" <nicolas.giraud at bnf.fr>, "PAUL.FIEVRE at bnf.fr" <PAUL.FIEVRE at bnf.fr>
Subject: Re: [Netarchivesuite-users] how do I configure NS to have a minimum or fixed value of 3500 domains per job?



Hi Sara.
> is there a way to set these parameters to have a minimum or fixed value of domains per job?
No, there isn't currently, but it would probably be a good idea.

/Søren
-----Original message-----
From: netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk
[mailto:netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk] On
behalf of sara.aubry at bnf.fr
Sent: 8 March 2010 15:43
To: netarchivesuite-users at lists.gforge.statsbiblioteket.dk
Cc: bert.wendland at bnf.fr; nicolas.giraud at bnf.fr; PAUL.FIEVRE at bnf.fr
Subject: [Netarchivesuite-users] how do I configure NS to have a minimum or
fixed value of 3500 domains per job?

Hello everyone,

We are trying to configure the job-generation part of NetarchiveSuite to
produce jobs with 3500 domains/configurations each.
After we set configChunkSize to 3500, NS created jobs with either 1000
or 2500 configurations in the first stage and only 98 configurations in
the second stage, which really slows down our crawl (after a few
hours, we have many jobs with only a few active queues that can last
several days...).

We re-read the configuration manual: smart guesses are made about target
size, but is there a way to set these parameters to have a minimum or fixed
value of domains per job?

Thanks for your help!

Sara

/////////////////////////////////////


Here is our current scheduler configuration:

<scheduler>
        <errorFactorPrevResult>10</errorFactorPrevResult>
        <errorFactorBestGuess>20</errorFactorBestGuess>
        <expectedAverageBytesPerObject>38000</expectedAverageBytesPerObject>
        <maxDomainSize>5000</maxDomainSize>
        <jobs>
                <maxRelativeSizeDifference>100</maxRelativeSizeDifference>
                <minAbsoluteSizeDifference>2000</minAbsoluteSizeDifference>
                <maxTotalSize>500000</maxTotalSize>
        </jobs>
        <configChunkSize>3500</configChunkSize>
        <splitByObjectLimit>true</splitByObjectLimit>
</scheduler>





Avant d'imprimer, pensez à l'environnement.
Consider the environment before printing this mail.

_______________________________________________
NetarchiveSuite-users mailing list
NetarchiveSuite-users at lists.gforge.statsbiblioteket.dk
https://lists.gforge.statsbiblioteket.dk/mailman/listinfo/netarchivesuite-users










