[Netarchivesuite-devel] Happy new year, informations and a question
sara.aubry at bnf.fr
Wed Jan 6 10:46:30 CET 2010
Hello Netarchive and ONB teams,
I would first like to wish you all a very happy new year, full of
interesting projects, fun and surprises.
As 2010 begins, we are really looking forward to continuing joint
development activities on NetarchiveSuite,
which remains our main focus as we set up a new hardware and software
infrastructure for running broad and ongoing focused crawls.
Here is some news about what we have done over the past few months.
We have been working on three different tools:
* a pre-selection tool which gives us the ability to:
- filter and merge our domain list (ca. 1.5 million), a seed list selected by
librarians (ca. 8000) and the host reports
from our 2008 broad crawl (ca. 3.5 million),
- produce some statistics,
- create TLDs, domains and seeds and ingest them directly into NAS.
It took us a while to develop, and we ran into lots of problems (DB
management, syntax, volume, timeouts...).
It is still not done, but we are close to finishing.
* a monitoring tool for librarians which is integrated into NAS and gives
the ability to:
- sort, filter and paginate the job list page,
- and see data on running jobs on one page (stats given by Heritrix like
duration, queues, number of files,...).
Here again we are close to finishing.
* a QA tool which sums up stats from Heritrix reports for a given set of
jobs or harvest.
It has been subcontracted and, for now, we do not know whether it is close
to being finished.
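The filter-and-merge step of the pre-selection tool can be sketched roughly
as follows. This is only an illustrative Python sketch, not the tool
described above: the function names, the sample entries and the validation
pattern are all assumptions. In particular, the pattern is a generic
approximation of a domain name and not the exact syntax that NAS enforces,
and the sketch does not reduce host names from host reports to their
registered domains.

```python
import re

# Generic approximation of a domain name: dot-separated labels made of
# letters, digits and hyphens. Assumption: NAS applies its own rules,
# which may be stricter or looser than this pattern.
DOMAIN_RE = re.compile(
    r"^(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,}$"
)


def normalize(name):
    """Lower-case a raw entry and strip whitespace and a trailing dot."""
    return name.strip().lower().rstrip(".")


def merge_domain_lists(*sources):
    """Merge several iterables of names into (valid, rejected) sorted lists.

    Duplicates are removed; entries that do not match DOMAIN_RE are kept
    aside so that they can be reviewed instead of being ingested.
    """
    valid, rejected = set(), set()
    for source in sources:
        for raw in source:
            name = normalize(raw)
            if not name:
                continue  # skip empty lines
            (valid if DOMAIN_RE.match(name) else rejected).add(name)
    return sorted(valid), sorted(rejected)


# Hypothetical sample data standing in for the real lists.
domain_list = ["bnf.fr", "Example.FR ", "bad_domain.fr"]
host_report = ["example.fr", "gallica.bnf.fr.", ""]

valid, rejected = merge_domain_lists(domain_list, host_report)
print(valid)
print(rejected)
```

Keeping the rejected entries in a separate list makes it possible to check
them by hand before anything is ingested, rather than discovering
non-compliant names only once a harvest has been deactivated.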
We have also set up and installed NetarchiveSuite on a virtualized
architecture (one admin + 30 crawlers on 6 physical machines)
and tried to launch our first broad crawl. We ran into all possible
problems: licences, disk crashes, network connectivity and a limited number of
connections, data integrity (we ingested domains which were not strictly
compliant with NAS syntax, and NAS deactivated the harvest)
and, last but not least, Heritrix/NAS configurations (we never managed to
finish a job in a short period of time, and NAS interrupted the jobs,
assigning the "closed connexion" stop reason to all domains, even the completed
ones).
Nicolas is currently working on data integrity. Bert, our crawl engineer,
is trying to find a good combination of the parameters used
in Heritrix and NAS to manage politeness, queues, retries and timeouts
(which we think are the source of the problem).
It would be really helpful if you (Netarchive and ONB) could send us, as
examples, the values you are using for
settings.harvester.harvesting.heritrix.inactivityTimeout and
settings.harvester.harvesting.heritrix.noResponseTimeout in the
settings.xml
and the default order.xml you are using for production.
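To make it easier to compare values between installations, the two settings
named above can be read from a settings file with a few lines of Python.
This is a minimal sketch under assumptions: it supposes that each dotted
setting name maps to a chain of nested XML elements, the helper names are
hypothetical, and the sample document (its namespace URI and the values 111
and 222) is a placeholder, not a recommended or real configuration.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Return an element tag without its '{namespace}' prefix, if any."""
    return tag.rsplit("}", 1)[-1]


def get_setting(root, dotted_key):
    """Look up a dotted key such as 'settings.a.b' under the root element.

    Namespaces are ignored. Returns the element text, or None when the
    path does not exist.
    """
    parts = dotted_key.split(".")
    if local_name(root.tag) != parts[0]:
        return None
    node = root
    for part in parts[1:]:
        node = next((c for c in node if local_name(c.tag) == part), None)
        if node is None:
            return None
    return (node.text or "").strip()


# Placeholder document: the namespace and both values are invented.
SAMPLE = """<settings xmlns="urn:example:settings">
  <harvester><harvesting><heritrix>
    <inactivityTimeout>111</inactivityTimeout>
    <noResponseTimeout>222</noResponseTimeout>
  </heritrix></harvesting></harvester>
</settings>"""

root = ET.fromstring(SAMPLE)
for key in (
    "settings.harvester.harvesting.heritrix.inactivityTimeout",
    "settings.harvester.harvesting.heritrix.noResponseTimeout",
):
    print(key, "=", get_setting(root, key))
```

For a real file, `ET.parse(path).getroot()` would replace the
`ET.fromstring(SAMPLE)` call.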
As soon as things slow down a little bit, we will split the work we have
done into pieces and discuss with you whether and how it could be
integrated into the NAS main trunk.
Once again,
all the best for 2010,
and thanks in advance for your help.
Sara