[Netarchivesuite-users] Preparing the second step of a broad crawl : a question for librarians and engineers
sara.aubry at bnf.fr
Wed Feb 17 17:28:08 CET 2010
Hi again,
We are close to the end of our first-step crawl (200 URLs per domain,
which is our equivalent of 10 MB)
and are preparing a list of actions to run before launching our second-step
crawl.
I know the librarians at KB who monitor the crawl use an Excel file
containing a list of all domains
that reached the max object limit, to compare the current crawl with the
previous one
and decide which domains have to be included in the next step. Are you
still doing this analysis?
Do you compile any general statistics, such as the distribution between the
different stop reasons,
the total number of jobs, job duration, bytes or documents harvested, or
anything else?
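For what it's worth, here is a minimal sketch of the kind of aggregation we have in mind, assuming per-job data has been exported with hypothetical fields `stop_reason`, `duration_s` and `bytes` (these names are illustrative only and do not come from NetarchiveSuite itself):

```python
from collections import Counter

# Hypothetical export of per-job data; field names are illustrative only.
jobs = [
    {"stop_reason": "OBJECT_LIMIT", "duration_s": 3600, "bytes": 9_800_000},
    {"stop_reason": "SIZE_LIMIT", "duration_s": 5400, "bytes": 10_000_000},
    {"stop_reason": "DOWNLOAD_COMPLETE", "duration_s": 120, "bytes": 250_000},
]

# Distribution of stop reasons across all jobs.
stop_reasons = Counter(job["stop_reason"] for job in jobs)

# Overall totals: number of jobs, cumulative duration and harvested bytes.
total_jobs = len(jobs)
total_duration_s = sum(job["duration_s"] for job in jobs)
total_bytes = sum(job["bytes"] for job in jobs)

print(stop_reasons)
print(total_jobs, total_duration_s, total_bytes)
```

The same counts could of course be produced directly with SQL against the harvest database; this is just the shape of the report we are after.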
On the architecture side, we were thinking about:
- backing up the database,
- looking at old_jobs on the crawlers for possible leftover ARC files,
- backing up and cleaning up the HarvestController directories (current / old).
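As a rough sketch of the second check, assuming leftover files sit somewhere under the crawlers' old_jobs directories and end in `.arc` or `.arc.gz` (the exact layout will differ per installation):

```python
import pathlib
import tempfile

def find_leftover_arcs(root):
    """Return all ARC files (plain or gzipped) found anywhere under root."""
    root = pathlib.Path(root)
    return sorted(
        p for p in root.rglob("*")
        if p.suffix == ".arc" or p.name.endswith(".arc.gz")
    )

# Demonstrate on a throwaway tree standing in for a crawler's old_jobs area.
with tempfile.TemporaryDirectory() as tmp:
    job_dir = pathlib.Path(tmp) / "old_jobs" / "job-42" / "arcs"
    job_dir.mkdir(parents=True)
    (job_dir / "sample-1.arc.gz").touch()
    (job_dir / "crawl.log").touch()
    leftovers = find_leftover_arcs(tmp)
    print([p.name for p in leftovers])
```

Any ARC files this turns up would then be uploaded to the archive before the directories are cleaned.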
Are we missing another important task?
Thanks again for your help!
Sara
Consider the environment before printing this mail.