[Netarchivesuite-devel] Happy new year, informations and a question

sara.aubry at bnf.fr sara.aubry at bnf.fr
Wed Jan 6 10:46:30 CET 2010


Hello Netarchive and ONB teams, 


I would first like to wish you all a very happy new year, full of 
interesting projects, fun and surprises.
As 2010 begins, we are really looking forward to continuing our joint 
development activities on NetarchiveSuite,
which remains our main focus as we set up a new hardware and software 
infrastructure for running broad and ongoing focused crawls.

Here is some news about what we have done over the past few months.


We have been working on three different tools:

* a pre-selection tool which gives us the ability to:
- filter and merge our domain list (ca. 1.5 million), a seed list selected by 
librarians (ca. 8,000) and the host reports 
from our 2008 broad crawl (ca. 3.5 million),
- produce some statistics,
- create TLDs, domains and seeds and ingest them directly into NAS.
It took us a while to develop it, as we ran into lots of problems (DB 
management, syntax, volume, timeouts...), and 
it is still not done, but we are close to the end.

* a monitoring tool for librarians which is integrated into NAS and gives 
the ability to:
- sort, filter and paginate the job list page,
- see data on running jobs on one page (stats given by Heritrix such as 
duration, queues, number of files,...).
Here again we are close to the end.

* a QA tool which sums up stats from Heritrix reports for a given set of 
jobs or a given harvest.
It has been subcontracted, and for now we don't know whether it is close to 
the end or not.


We have also set up and installed NetarchiveSuite on a virtualized 
architecture (one admin + 30 crawlers on 6 physical machines) 
and tried to launch our first broad crawl. We ran into all possible 
problems: licences, a disk crash, network connectivity and a limited number 
of connections, data integrity (we ingested domains which were not strictly 
compliant with NAS syntax, and NAS deactivated the harvest) 
and, last but not least, Heritrix/NAS configurations (we never managed to 
finish a job within a short period of time, and NAS interrupted the jobs, 
assigning the "closed connexion" stop reason to all domains, even the 
completed ones).

Nicolas is currently working on data integrity. Bert, our crawl engineer, 
is trying to find a good combination of the parameters used 
in Heritrix and NAS to manage politeness, queues, retries and timeouts 
(which we think are the source of the problem). 
It would be really helpful if you (Netarchive and ONB) could send us, as 
examples, the values you are using for 
settings.harvester.harvesting.heritrix.inactivityTimeout and 
settings.harvester.harvesting.heritrix.noResponseTimeout in the 
settings.xml, 
and the default order.xml you are using for production.
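
To make the request concrete, here is a minimal sketch of how the two 
values could be read from a settings file. It assumes that the dotted 
setting names map directly onto nested XML elements and that the file has 
no XML namespace. The helper name read_setting, the SAMPLE text and the 
value 1800 are hypothetical placeholders for illustration, not values taken 
from any production installation.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: element nesting is assumed from the dotted setting
# names, and 1800 is only a placeholder value.
SAMPLE = """<settings>
  <harvester>
    <harvesting>
      <heritrix>
        <inactivityTimeout>1800</inactivityTimeout>
        <noResponseTimeout>1800</noResponseTimeout>
      </heritrix>
    </harvesting>
  </harvester>
</settings>"""


def read_setting(xml_text, dotted_name):
    """Return the text of the element addressed by a dotted setting name."""
    root = ET.fromstring(xml_text)
    parts = dotted_name.split(".")
    # The first component must match the root element itself.
    if parts[0] != root.tag:
        raise KeyError(dotted_name)
    # The remaining components are walked as a simple child path.
    node = root.find("/".join(parts[1:]))
    if node is None or node.text is None:
        raise KeyError(dotted_name)
    return node.text.strip()


if __name__ == "__main__":
    for name in (
        "settings.harvester.harvesting.heritrix.inactivityTimeout",
        "settings.harvester.harvesting.heritrix.noResponseTimeout",
    ):
        print(name, "=", read_setting(SAMPLE, name))
```

Running the same helper against the text of a real settings file would 
list the two values in a form that is easy to paste into a reply.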

As soon as things slow down a little, we will split the work we have done 
into pieces and discuss with you whether and how it could be integrated 
into the NAS main trunk. 

Once again, 
all the best for 2010,
and thanks in advance for your help.


Sara



