[Netarchivesuite-users] Snapshot Harvest

sara.aubry at bnf.fr sara.aubry at bnf.fr
Mon Jul 6 15:36:37 CEST 2009


Hi Andreas,

At BnF, we limit harvests by number of objects rather than by size, and
we have filed a feature request for NetarchiveSuite to support this.
For snapshot harvests, we limit the harvest of each domain in three steps:
- first to 5,000 URLs,
- then we raise the limit to 7,500,
- finally, we raise it to 10,000.
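
As a rough illustration of the stepped limits above, here is a minimal
Python sketch. It is not NetarchiveSuite code: the function name is
hypothetical, and it assumes that only domains which reached their current
limit are harvested again at the next step.

```python
# Hypothetical sketch, not part of NetarchiveSuite.
# Per-domain URL limits taken from the three steps listed above.
OBJECT_LIMITS = [5_000, 7_500, 10_000]


def next_limit(objects_harvested, current_limit):
    """Return the next per-domain limit, or None if no further step applies.

    Assumption: a domain that finished below its current limit is complete
    and is not harvested again.
    """
    if objects_harvested < current_limit:
        return None  # the domain was completed within the current limit
    higher = [limit for limit in OBJECT_LIMITS if limit > current_limit]
    return higher[0] if higher else None  # None once the last step is reached
```

For example, a domain stopped at 5,000 URLs would move on to the 7,500
step, while one that ended at 1,200 URLs would be left alone.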
 
Sara


Message from: Kåre Fiedler Christiansen <kfc at statsbiblioteket.dk>
06/07/2009 13:59

Sent by:
<netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk>

Please reply to:
<netarchivesuite-users at lists.gforge.statsbiblioteket.dk>

To:
"netarchivesuite-users at lists.gforge.statsbiblioteket.dk"
<netarchivesuite-users at lists.gforge.statsbiblioteket.dk>
Cc:

Subject:
Re: [Netarchivesuite-users] Snapshot Harvest



On Thu, 2009-06-25 at 11:39 +0200, aponb at gmx.at wrote:
> Just a short question about how to handle Snapshot Harvests. What is
> your recommendation about the max number of bytes per domain. With which
> value did you start in the past and what were the next values? How many
> iterations of a harvest did you need before completing a Snapshot
> Harvest?

Apologies for the very late reply.

In the Danish netarchive, we currently take two steps:

First a harvest limited to 10 MB per domain, which weeds out all the
insignificant small domains, and then one limited to 4 GB that builds on
it. However, most domains are stopped before the 4 GB limit by the default
1 GB domain limit.
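
The interaction between the two limits can be sketched as taking the
smaller of the harvest-wide limit and the per-domain limit. This is only
an illustration in Python: the function name is hypothetical, and decimal
units (1 GB = 1,000,000,000 bytes) are an assumption.

```python
# Hypothetical sketch, not part of NetarchiveSuite.
MB = 1_000_000  # decimal megabyte (assumption)
GB = 1_000 * MB  # decimal gigabyte (assumption)


def effective_limit(harvest_limit, domain_limit=1 * GB):
    """Return the byte limit that actually stops a domain: the lower of the two."""
    return min(harvest_limit, domain_limit)
```

With the default 1 GB domain limit, the 10 MB step is bounded by the
harvest limit, while the 4 GB step is bounded by the domain limit unless
that limit has been raised for the domain.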

After the harvest we extract all domains that reached the 1GB limit, and
manually evaluate whether we want to increase the domain limit for that
domain, and look for crawler traps that might have caused an excessive
size.
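
The extraction step could look roughly like the following Python sketch.
It assumes a per-domain byte count is available as a mapping from domain
name to bytes harvested; the function name, the input format, and the
sample figures are all made up for illustration.

```python
# Hypothetical sketch, not part of NetarchiveSuite.
GB = 1_000_000_000  # decimal gigabyte (assumption)


def domains_at_limit(bytes_per_domain, limit=1 * GB):
    """Return, sorted by name, the domains whose harvested bytes reached the limit."""
    return sorted(
        domain for domain, size in bytes_per_domain.items() if size >= limit
    )


# Made-up figures, for illustration only.
report = {
    "small.example": 3_000_000,
    "big.example": 1 * GB,
    "mid.example": 400_000_000,
}
print(domains_at_limit(report))
```

The resulting list is what would then be reviewed by hand, both for
raising the domain limit and for spotting crawler traps.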

I hope this helps.

Best,
  Kåre

_______________________________________________
NetarchiveSuite-users mailing list
NetarchiveSuite-users at lists.gforge.statsbiblioteket.dk
https://lists.gforge.statsbiblioteket.dk/mailman/listinfo/netarchivesuite-users






