[Netarchivesuite-devel] Happy new year, informations and a question

Claus Lomborg clo at statsbiblioteket.dk
Thu Jan 7 09:58:44 CET 2010


Hi Sara

A happy new year to you too.

Both our settings (inactivityTimeout and noResponseTimeout) in Netarchive are: 1800 (ms)
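
For illustration only, here is a minimal Python sketch of how the two values could be read from a settings.xml fragment. The XML nesting is an assumption inferred from the dotted setting names (settings.harvester.harvesting.heritrix.*); a real settings.xml may declare an XML namespace and contains many other settings. read_setting is a hypothetical helper, not part of NetarchiveSuite.

```python
import xml.etree.ElementTree as ET

# Hypothetical settings.xml fragment: the nesting is inferred from the
# dotted setting names and is an assumption, not a copy of a real file.
SETTINGS_XML = """
<settings>
  <harvester>
    <harvesting>
      <heritrix>
        <inactivityTimeout>1800</inactivityTimeout>
        <noResponseTimeout>1800</noResponseTimeout>
      </heritrix>
    </harvesting>
  </harvester>
</settings>
"""

PREFIX = "settings.harvester.harvesting.heritrix."


def read_setting(root, dotted_name):
    """Return the text of the element addressed by a dotted setting name."""
    parts = dotted_name.split(".")
    # The first component must match the root element itself.
    if parts[0] != root.tag:
        raise KeyError(dotted_name)
    node = root.find("/".join(parts[1:]))
    if node is None or node.text is None:
        raise KeyError(dotted_name)
    return node.text.strip()


root = ET.fromstring(SETTINGS_XML)
timeouts = {
    name: int(read_setting(root, PREFIX + name))
    for name in ("inactivityTimeout", "noResponseTimeout")
}
print(timeouts)
```

Comparing the dictionaries produced from two installations' files would show at a glance where the timeout values differ.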

Best of luck with the difficulties.

Claus Lomborg, Netarkivet


> -----Original Message-----
> From: sara.aubry at bnf.fr [mailto:sara.aubry at bnf.fr]
> Sent: Wednesday, January 06, 2010 10:47 AM
> To: netarchivesuite-devel at lists.gforge.statsbiblioteket.dk; Claus Lomborg
> Subject: Happy new year, informations and a question
> 
> Hello Netarchive and ONB teams,
> 
> 
> I would first like to wish you all a very happy new year, full of
> interesting projects, fun and surprises.
> As 2010 begins, we are really looking forward to continuing joint
> development activities on NetarchiveSuite,
> which remains our main focus as we set up a new hardware and software
> infrastructure for running broad and ongoing focused crawls.
> 
> Here is some news about what we did in the past few months.
> 
> 
> We have been working on three different tools:
> 
> * a pre-selection tool which gives us the ability to:
> - filter and merge our domain list (ca. 1.5 million), a seed list
> selected by librarians (ca. 8000) and the host reports
> from our 2008 broad crawl (ca. 3.5 million),
> - produce some stats,
> - create TLDs, domains and seeds and ingest them directly into NAS.
> It took us a while to develop it, we ran into lots of problems (DB
> management, syntax, volume, timeouts...) and
> it is still not done, but we are close to the end.
> 
> * a monitoring tool for librarians which is integrated into NAS
> and gives the ability to:
> - sort, filter and paginate the job list page,
> - and see data on running jobs on one page (stats given by Heritrix
> like duration, queues, number of files,...).
> Here again we are close to the end.
> 
> * a QA tool which sums up stats from Heritrix reports for a given
> set of jobs or harvests.
> It has been subcontracted and for now, we don't know whether it is
> close to the end or not.
> 
> 
> We have also set up and installed NetarchiveSuite on a virtualized
> architecture (one admin + 30 crawlers on 6 physical machines)
> and tried to launch our first broad crawl. We ran into all possible
> problems: licences, disk crash, network connectivity and limited
> number of connections, data integrity (we ingested domains which were
> not strictly compliant with NAS syntax and NAS deactivated the harvest)
> and last but not least Heritrix/NAS configurations (we never
> managed to finish a job in a short period of time and NAS interrupted
> them, assigning a "closed connexion" stop reason to all domains, even
> the completed ones).
> 
> Nicolas is currently working on data integrity. Bert, our crawl
> engineer, is trying to find a good combination of the parameters used
> in Heritrix and NAS to manage politeness, queues, retries and
> timeouts (which we think are the source of the problem).
> It would be really helpful if you (Netarchive and ONB) could send
> us, as examples, the values you are using for
> settings.harvester.harvesting.heritrix.inactivityTimeout and
> settings.harvester.harvesting.heritrix.noResponseTimeout in the
> settings.xml
> and the default order.xml you are using for production.
> 
> We'll split all the work we have done into pieces and see with you if
> and how it could be integrated into the NAS main trunk as soon as
> things slow down a little bit.
> 
> Once again,
> all the best for 2010,
> and thanks in advance for your help.
> 
> 
> Sara
> 
> 
> 
> 
