[Netarchivesuite-devel] Happy new year, informations and a question

aponb at gmx.at aponb at gmx.at
Thu Jan 7 12:20:05 CET 2010


Hi Sara!

We are also using these values. 1800 seconds = 30 minutes.
See 
https://lists.gforge.statsbiblioteket.dk/pipermail/netarchivesuite-users/2008-June/000067.html
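
For illustration only, here is a rough sketch of how these two values might appear in settings.xml. It assumes that the dotted setting names map directly onto nested elements; the namespace attributes on the root element and all other settings are left out, so treat it as a fragment to compare against your own file rather than a complete configuration.

```xml
<!-- Sketch only: element nesting is assumed from the dotted setting names -->
<settings>
  <harvester>
    <harvesting>
      <heritrix>
        <!-- 1800 seconds = 30 minutes -->
        <inactivityTimeout>1800</inactivityTimeout>
        <noResponseTimeout>1800</noResponseTimeout>
      </heritrix>
    </harvesting>
  </harvester>
</settings>
```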

Regards
a.

> Hi Sara
>
> A happy new year to you too.
>
> Both our settings (inactivityTimeout and noResponseTimeout) in Netarchive are: 1800 (seconds)
>
> Best of luck with the difficulties.
>
> Claus Lomborg, Netarkivet
>
>
>   
>> -----Original Message-----
>> From: sara.aubry at bnf.fr [mailto:sara.aubry at bnf.fr]
>> Sent: Wednesday, January 06, 2010 10:47 AM
>> To: netarchivesuite-devel at lists.gforge.statsbiblioteket.dk; Claus
>> Lomborg
>> Subject: Happy new year, informations and a question
>>
>> Hello Netarchive and ONB teams,
>>
>>
>> I would first like to wish you all a very happy new year, full of
>> interesting projects, fun and surprises.
>> As 2010 begins, we are really looking forward to continuing joint
>> development activities on NetarchiveSuite,
>> which remains our main focus as we set up a new hardware and software
>> infrastructure for running broad and ongoing focused crawls.
>>
>> Here is some news about what we have done in the past few months.
>>
>>
>> We have been working on three different tools:
>>
>> * a pre-selection tool which gives us the ability to:
>> - filter and merge our domain list (ca. 1.5 million), a seed list
>> selected by librarians (ca. 8000) and the host reports
>> from our 2008 broad crawl (ca. 3.5 million),
>> - give some stats,
>> - create TLDs, domains and seeds and ingest them directly into NAS.
>> It took us a while to develop it, we ran into lots of problems (DB
>> management, syntax, volume, timeouts...) and
>> it is still not done, but we are close to the end.
>>
>> * a monitoring tool for librarians which is integrated into NAS
>> and gives the ability to:
>> - sort, filter and paginate the job list page,
>> - and see data on running jobs on one page (stats given by Heritrix
>> like duration, queues, number of files,...).
>> Here again we are close to the end.
>>
>> * a QA tool which sums up stats from Heritrix reports for a given
>> set of jobs or harvests.
>> It has been subcontracted and for now, we don't know if it is
>> close to the end or not.
>>
>>
>> We have also set up and installed NetarchiveSuite on a virtualized
>> architecture (one admin + 30 crawlers on 6 physical machines)
>> and tried to launch our first broad crawl. We ran into all possible
>> problems: licences, disk crash, network connectivity and a limited
>> number of connections, data integrity (we ingested domains which were
>> not strictly compliant with NAS syntax and NAS deactivated the harvest)
>> and last but not least Heritrix/NAS configurations (we never
>> managed to finish a job in a short period of time and NAS interrupted
>> them, assigning the "closed connexion" stop reason to all domains, even
>> the completed ones).
>>
>> Nicolas is currently working on data integrity. Bert, our crawl
>> engineer, is trying to find a good combination of the parameters used
>> in Heritrix and NAS to manage politeness, queues, retries and
>> timeouts (which we think are the source of the problem).
>> It would be really helpful if you (Netarchive and ONB) could send
>> us as examples the values you are using for
>> settings.harvester.harvesting.heritrix.inactivityTimeout and
>> settings.harvester.harvesting.heritrix.noResponseTimeout in the
>> settings.xml
>> and the default order.xml you are using for production.
>>
>> We'll cut into pieces all the work we have done and see with you if
>> and how it could be integrated into the NAS main trunk as soon as
>> things slow down a little bit.
>>
>> Once again,
>> all the best for 2010,
>> and thanks in advance for your help.
>>
>>
>> Sara
>>
>>
>>
>>
>>     
>
>
>   







