[Netarchivesuite-users] Snapshot Harvest

Kåre Fiedler Christiansen kfc at statsbiblioteket.dk
Mon Jul 6 13:59:25 CEST 2009


On Thu, 2009-06-25 at 11:39 +0200, aponb at gmx.at wrote:
> Just a short question about how to handle Snapshot Harvests. What is
> your recommendation about the max number of bytes per domain? With which
> value did you start in the past and what were the next values? How many
> iterations of a harvest did you need before completing a Snapshot Harvest?

Apologies for the very late reply.

In the Danish netarchive, we currently run a snapshot harvest in two
steps:

The first uses a limit of 10 MB per domain, which weeds out all the
insignificant small domains. The second uses a limit of 4 GB per domain
and builds on the first. However, most domains are stopped before the
4 GB limit by the default domain limit of 1 GB.
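
To illustrate how the two kinds of limit interact, here is a minimal
sketch. It is not NetarchiveSuite code: the function name, the constants
and the use of decimal units (1 MB = 10^6 bytes) are assumptions made
for the example. It only assumes that a domain is stopped by whichever
limit it hits first.

```python
# Assumption: decimal units; the real settings may be expressed differently.
MB = 10**6
GB = 10**9

# Values taken from the description above.
STEP_LIMITS = [10 * MB, 4 * GB]   # per-domain limit of each harvest step
DEFAULT_DOMAIN_LIMIT = 1 * GB     # default limit configured on a domain


def effective_limit(harvest_limit, domain_limit=DEFAULT_DOMAIN_LIMIT):
    """Hypothetical helper: the cap that actually stops a domain,
    i.e. the smaller of the harvest-wide and the per-domain limit."""
    return min(harvest_limit, domain_limit)


# Effective cap per step for a domain that still has the default limit.
caps = [effective_limit(limit) for limit in STEP_LIMITS]
```

Under these assumptions the second step is capped at 1 GB for any
domain whose limit has not been raised, which is why few domains ever
reach 4 GB.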

After the harvest, we extract all domains that reached the 1 GB limit,
manually evaluate whether we want to increase the limit for each of
those domains, and look for crawler traps that might have caused the
excessive size.
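
The extraction step could be sketched roughly as follows. This is only
an illustration: the input format (a mapping from domain name to bytes
harvested), the function name and the sample data are all made up for
the example, and the real figures would have to come from the harvest
reports.

```python
GB = 10**9  # assumption: decimal units
DEFAULT_DOMAIN_LIMIT = 1 * GB


def domains_at_limit(harvested_bytes, limit=DEFAULT_DOMAIN_LIMIT):
    """Hypothetical helper: return, in sorted order, the domains whose
    harvested byte count reached or exceeded the given limit."""
    return sorted(
        domain for domain, size in harvested_bytes.items() if size >= limit
    )


# Made-up sample data, not real harvest results.
example = {
    "example.dk": 1 * GB,
    "small.example": 3_000_000,
    "big.example": 1_000_000_123,
}

# Candidates for manual review (raise the limit, or look for a crawler trap).
to_review = domains_at_limit(example)
```

The resulting list is only a starting point. Deciding whether a domain
deserves a higher limit or is caught in a crawler trap remains a manual
evaluation.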

I hope this helps.

Best,
  Kåre



