[Netarchivesuite-users] Snapshot Harvest

sara.aubry at bnf.fr
Mon Jul 6 15:36:37 CEST 2009

Hi Andreas,

At BnF, we use object counts rather than size, and we have a feature
request on NetarchiveSuite to take this functionality into account.
For snapshot harvests, we limit each domain:
- first to 5,000 URLs,
- then we raise it to 7,500,
- last, we raise it to 10,000.
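As an illustration only (this is not NetarchiveSuite code; the function and constant names are hypothetical), BnF's stepped per-domain object limits could be sketched like this:

```python
# Illustrative sketch of BnF's stepped per-domain URL limits.
# NetarchiveSuite manages such limits through its own GUI/database;
# this just models the "raise between snapshot rounds" policy.

URL_LIMIT_STEPS = [5_000, 7_500, 10_000]

def next_url_limit(current_limit):
    """Return the next, larger per-domain URL limit, or None at the cap."""
    for step in URL_LIMIT_STEPS:
        if step > current_limit:
            return step
    return None

# A domain starts at the first step and is raised between rounds:
limit = URL_LIMIT_STEPS[0]     # 5,000 URLs for the first snapshot
limit = next_url_limit(limit)  # raised to 7,500 for the next round
limit = next_url_limit(limit)  # raised to 10,000 for the last round
```

The point of the stepped policy is that each raise only applies to domains that actually exhausted the previous limit, keeping the total harvest budget under control.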

Message from: Kåre Fiedler Christiansen <kfc at statsbiblioteket.dk> 
                      06/07/2009 13:59

Sent by: 
<netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk>

Please reply to 
<netarchivesuite-users at lists.gforge.statsbiblioteket.dk>

"netarchivesuite-users at lists.gforge.statsbiblioteket.dk" 
<netarchivesuite-users at lists.gforge.statsbiblioteket.dk>

Re: [Netarchivesuite-users] Snapshot Harvest

On Thu, 2009-06-25 at 11:39 +0200, aponb at gmx.at wrote:
> Just a short question about how to handle snapshot harvests. What is
> your recommendation for the maximum number of bytes per domain? With
> which value did you start in the past, and what were the next values?
> How many iterations of a harvest did you need before completing a
> snapshot?

Apologies for the very late reply.

In the Danish netarchive, we currently take two steps:

one at 10 MB/domain, where we weed out all the insignificant small
domains, and then one at 4 GB building on that. However, most domains
are stopped before the 4 GB limit by the default 1 GB per-domain limit.

After the harvest we extract all domains that reached the 1 GB limit,
manually evaluate whether we want to increase the limit for each such
domain, and look for crawler traps that might have caused an excessive
crawl size.
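A minimal sketch of that post-harvest triage step, assuming a simple "domain,bytes" report format (the report layout and names here are hypothetical, not NetarchiveSuite's actual reporting API):

```python
# Illustrative sketch: flag domains that hit the per-domain byte limit
# so a curator can review them (raise the limit, or hunt crawler traps).
# The CSV-like "domain,bytes_harvested" report format is an assumption.

ONE_GB = 1_000_000_000  # the default 1 GB per-domain byte limit

def domains_at_limit(report_lines, limit=ONE_GB):
    """Yield domains whose harvested size reached the per-domain limit."""
    for line in report_lines:
        domain, size = line.strip().split(",")
        if int(size) >= limit:
            yield domain

report = [
    "example.dk,1000000000",  # hit the 1 GB limit -> needs manual review
    "small.dk,52428800",      # 50 MB, well under the limit
]
flagged = list(domains_at_limit(report))
print(flagged)
```

Domains that come out of this list either get a higher limit for the next round or are inspected for crawler traps (calendars, session IDs, and similar URL generators) before being re-harvested.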

I hope this helps.


NetarchiveSuite-users mailing list
NetarchiveSuite-users at lists.gforge.statsbiblioteket.dk

