[Netarchivesuite-users] Preparing the second step of a broad crawl : a question for librarians and engineers

Bjarne Andersen bja at statsbiblioteket.dk
Thu Feb 18 09:11:59 CET 2010

The librarians at KB do not do any analysis between our 1st and 2nd step. All domains that hit the limit after step 1 (10 MB in our case) are included in the 2nd step.

After the 2nd step, all domains that hit the next limit (currently 1 GB by default, though a number of domains have been raised to 2 GB, 4 GB or 6 GB) are analysed to decide whether their limit should be raised further. In this process we decide for each domain either to raise the limit to the next "level" (2 GB, 4 GB, 6 GB) or to lower it by exactly one byte (e.g. 999999999 or 1999999999 bytes), just so that we can distinguish domains already analysed after previous harvests from new ones.
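The one-byte marker can be illustrated with a small Python sketch. This is only an illustration of the rule described above, not code from NetarchiveSuite: the names LEVELS, is_analysed and next_limit are hypothetical, and a decimal gigabyte (10**9 bytes) is assumed because that matches the 999999999 example.

```python
GB = 10**9  # assumption: decimal gigabyte, consistent with 999999999 bytes

# The limit "levels" mentioned above (hypothetical constant name).
LEVELS = [1 * GB, 2 * GB, 4 * GB, 6 * GB]


def is_analysed(limit):
    """True if the limit sits exactly one byte under a level, which marks
    a domain that was reviewed and deliberately not raised."""
    return (limit + 1) in LEVELS


def next_limit(current_limit, raise_it):
    """Return the new byte limit for a domain that hit current_limit.

    raise_it=True  -> move to the next level
    raise_it=False -> lower by exactly one byte as a "reviewed" marker
    """
    if current_limit not in LEVELS:
        raise ValueError("limit is not one of the known levels")
    if not raise_it:
        return current_limit - 1
    idx = LEVELS.index(current_limit)
    if idx + 1 == len(LEVELS):
        raise ValueError("no higher level is defined in this sketch")
    return LEVELS[idx + 1]
```

For example, next_limit(1 * GB, False) gives 999999999, and is_analysed(999999999) is then true, so the domain can be skipped in the next review.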

As far as I know, we do not do any analysis of statistics across jobs, e.g. stop reason or harvest time. We compile statistics at a simple byte level per harvest (all jobs in one harvest) directly on the ARC files stored in the bit archive, since the numbers in NetarchiveSuite do not take deduplication into account and do not count anything harvested from outside the included domains (such as inline objects).
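A byte count per harvest taken directly from the files could look roughly like the sketch below. It is not our actual script. It assumes the ARC file names start with <jobID>-<harvestID>-, which you should check against your own installation, and the function name bytes_per_harvest and the directory argument are hypothetical.

```python
import os
import re
from collections import defaultdict

# Assumed file name layout: <jobID>-<harvestID>-<anything>.arc or .arc.gz
ARC_NAME = re.compile(r"^(\d+)-(\d+)-.*\.arc(\.gz)?$")


def bytes_per_harvest(archive_dir):
    """Sum the on-disk size of ARC files under archive_dir, per harvest ID."""
    totals = defaultdict(int)
    for root, _dirs, files in os.walk(archive_dir):
        for name in files:
            match = ARC_NAME.match(name)
            if match is None:
                continue  # not an ARC file in the assumed naming scheme
            harvest_id = int(match.group(2))
            totals[harvest_id] += os.path.getsize(os.path.join(root, name))
    return dict(totals)
```

Because this counts what is actually stored, it reflects deduplication and includes objects fetched from outside the included domains.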

After each step you could do some cleaning as you propose: look for leftover ARC files (that failed to upload), back up the DB (we do that automatically every night), and clean up the /oldjobs directories for leftovers.
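The search for leftover ARC files can be sketched in a few lines of Python. This is a minimal example under assumptions: the function name find_leftover_arcs is hypothetical, the path you pass in is whatever your oldjobs directory is called, and it only matches files ending in .arc or .arc.gz.

```python
from pathlib import Path

# Suffixes treated as ARC files in this sketch (assumption).
ARC_SUFFIXES = (".arc", ".arc.gz")


def find_leftover_arcs(oldjobs_dir):
    """Return a sorted list of ARC files found anywhere under oldjobs_dir."""
    root = Path(oldjobs_dir)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.name.endswith(ARC_SUFFIXES)
    )
```

Any file it lists was never uploaded or was not removed after upload, so it is worth checking before the directory is cleaned.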

From: netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk [netarchivesuite-users-bounces at lists.gforge.statsbiblioteket.dk] On behalf of sara.aubry at bnf.fr [sara.aubry at bnf.fr]
Sent: 17 February 2010 17:28
To: netarchivesuite-users at lists.gforge.statsbiblioteket.dk
Subject: [Netarchivesuite-users] Preparing the second step of a broad crawl : a question for librarians and engineers

Hi again,

We are close to the end of our first step crawl (200 URLs per domain,
which is our equivalent to 10 MB)
and preparing a list of actions to run before launching our second step.

I know librarians from KB monitoring the crawl are using an Excel file
containing a list of all domains
which reached the max object limit to compare the current crawl with the
previous one
and decide which domains have to be included in the next step. Are you
still doing this analysis ?
Do you compile any general statistics like the distribution between the
different stop reasons,
the total number of jobs, job duration, bytes or documents harvested?

On the architecture side, we were thinking about :
- backing up the database,
- looking at old_jobs on the crawlers to look for possible ARC files,
- backing up and cleaning up the HarvestController (current / old directories).
Are we missing another important task?

Thanks again for your help!

