[Netarchivesuite-users] Preparing the second step of a broad crawl : a question for librarians and engineers

sara.aubry at bnf.fr sara.aubry at bnf.fr
Wed Feb 17 17:28:08 CET 2010

Hi again,

We are close to the end of our first-step crawl (200 URLs per domain, 
which is our equivalent of 10 MB), 
and we are preparing a list of actions to run before launching our second step.

I know that librarians from KB monitoring the crawl use an Excel file 
containing a list of all domains 
which reached the max object limit, to compare the current crawl with the 
previous one 
and decide which domains have to be included in the next step. Are you 
still doing this analysis? 
Do you also compile general statistics, such as the distribution between 
the different stop reasons, 
the total number of jobs, job duration, or bytes and documents harvested?
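To illustrate the kind of statistics we have in mind, here is a minimal sketch that tallies stop reasons from a domain report. The CSV layout and the `stop_reason` column name are assumptions for the example, not NetarchiveSuite's actual report format:

```python
import csv
from collections import Counter
from io import StringIO

def stop_reason_distribution(report_csv):
    """Count how many domains ended with each stop reason.

    Assumes a CSV export with a 'stop_reason' column; the column
    name and format are hypothetical, not NetarchiveSuite's own.
    """
    reader = csv.DictReader(StringIO(report_csv))
    return Counter(row["stop_reason"] for row in reader)

# Example with made-up data:
report = """domain,objects,stop_reason
example.com,200,OBJECT_LIMIT
example.org,57,DOMAIN_COMPLETED
example.net,200,OBJECT_LIMIT
"""
print(stop_reason_distribution(report))
```

The same tally could of course be done directly in Excel with a pivot table; the point is only that the distribution is cheap to compute once the report is exported.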

On the architecture side, we were thinking about:
- backing up the database,
- looking at old_jobs on the crawlers for possible leftover ARC files,
- backing up and cleaning up the HarvestController directories (current / old).
Are we missing another important task?
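The housekeeping steps above could be sketched roughly as follows. All paths are placeholders for our installation, not NetarchiveSuite defaults, and the database dump command depends on the DBMS in use:

```shell
#!/bin/sh
# Sketch of the pre-second-step housekeeping; all paths below are
# placeholders for our installation, not NetarchiveSuite defaults.

BACKUP_DIR=./backup
HC_DIR=./harvestController

mkdir -p "$BACKUP_DIR"

# 1. Back up the harvest database (actual command depends on the DBMS;
#    shown commented out, with PostgreSQL as an example).
# pg_dump harvestdb > "$BACKUP_DIR/harvestdb-$(date +%F).sql"

# 2. Look for ARC files left behind under old_jobs on the crawlers.
if [ -d "$HC_DIR" ]; then
    find "$HC_DIR" -path '*old_jobs*' \
         \( -name '*.arc' -o -name '*.arc.gz' \) -print
fi

# 3. Archive the HarvestController current/old directories before cleanup.
# tar -czf "$BACKUP_DIR/hc-$(date +%F).tar.gz" "$HC_DIR/current" "$HC_DIR/old"
```

Any ARC file turned up by step 2 would need to be checked against the archive before the directories are cleaned.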

Thanks again for your help!


Consider the environment before printing this mail.   
