[Netarchivesuite-devel] Happy new year, informations and a question

Søren Vejrup Carlsen svc at kb.dk
Wed Jan 6 11:50:26 CET 2010


Hi Sara.
>connections, data integrity (we ingested domains which were not strictly compliant with NAS syntax and NAS deactivated the harvest) and last but not least Heritrix/NAS configurations (we never managed to finish a job in a short period of time and NAS interrupted them, assigning the "closed connection" stop reason to all domains, even the completed ones)
This seems like an ugly bug in NetarchiveSuite. Could you report it in our bug tracking system, along with the relevant log files?

>domains which were not strictly compliant with NAS syntax and NAS deactivated the harvest
I did send a mail to Nicolas about this before Christmas.
The issue was not about NAS compliance as such, but that names like "www.editions-debaisieux.fr" are not considered valid domains by NAS.
However, "editions-debaisieux.fr" is considered a perfectly valid domain by NAS.
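
To illustrate the distinction, here is a minimal sketch in Python. The rule it encodes (exactly one label followed by a TLD) is an assumption made for illustration only, not the actual NetarchiveSuite validation logic, and is_domain and to_domain are hypothetical helper names.

```python
import re

# ASSUMPTION: a "domain" is exactly one label plus a TLD, e.g. "example.fr".
# This is an illustrative rule, not the real NetarchiveSuite check.
DOMAIN_RE = re.compile(r"^[a-z0-9]([a-z0-9-]*[a-z0-9])?\.[a-z]{2,}$")


def is_domain(name: str) -> bool:
    """Return True if name looks like a bare domain under the assumed rule."""
    return DOMAIN_RE.match(name.lower()) is not None


def to_domain(hostname: str) -> str:
    """Keep only the last two labels of a hostname.

    Naive on purpose: it does not handle multi-part suffixes such as "co.uk".
    """
    labels = hostname.lower().strip(".").split(".")
    return ".".join(labels[-2:])


if __name__ == "__main__":
    for name in ("www.editions-debaisieux.fr", "editions-debaisieux.fr"):
        print(name, is_domain(name), to_domain(name))
```

A pre-selection step could run something like to_domain over each hostname before ingesting it, so that only bare domains reach NAS.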
 

Best regards

---------------------------------------------------------------------------
Søren Vejrup Carlsen, NetarchiveSuite developer (and QA)
Department of Digital Preservation, Royal Library, Copenhagen, Denmark 
tlf: (+45) 33 47 48 41
email: svc at kb.dk
----------------------------------------------------------------------------
Non omnia possumus omnes
--- Macrobius, Saturnalia, VI, 1, 35 -------


-----Original message-----
From: netarchivesuite-devel-bounces at lists.gforge.statsbiblioteket.dk [mailto:netarchivesuite-devel-bounces at lists.gforge.statsbiblioteket.dk] On behalf of sara.aubry at bnf.fr
Sent: 6 January 2010 10:47
To: netarchivesuite-devel at lists.gforge.statsbiblioteket.dk; clo at statsbiblioteket.dk
Subject: [Netarchivesuite-devel] Happy new year, informations and a question

Hello Netarchive and ONB teams, 


I would first like to wish you all a very happy new year, full of interesting projects, fun and surprises.
As 2010 begins, we are really looking forward to continuing our joint development activities on NetarchiveSuite, which remains our main focus as we set up a new hardware and software infrastructure for running broad and ongoing focused crawls.

Here is some news about what we have done in the past few months.


We have been working on three different tools:

* a pre-selection tool which gives us the ability to:
- filter and merge our domain list (ca. 1.5 million), a seed list selected by librarians (ca. 8000) and the host reports from our 2008 broad crawl (ca. 3.5 million),
- produce some statistics,
- create TLDs, domains and seeds and ingest them directly into NAS.
It took us a while to develop it, we ran into lots of problems (DB management, syntax, volume, timeouts...) and it is still not done, but we are close to the end.

* a monitoring tool for librarians which is integrated into NAS and gives the ability to:
- sort, filter and paginate the job list page,
- and see data on running jobs on one page (stats given by Heritrix such as duration, queues, number of files...).
Here again we are close to the end.

* a QA tool which sums up stats from Heritrix reports for a given set of jobs or a given harvest.
It has been subcontracted and, for now, we don't know whether it is close to the end or not.


We have also set up and installed NetarchiveSuite on a virtualized architecture (one admin + 30 crawlers on 6 physical machines) and tried to launch our first broad crawl. We ran into all possible problems: licences, disk crashes, network connectivity and a limited number of connections, data integrity (we ingested domains which were not strictly compliant with NAS syntax and NAS deactivated the harvest) and last but not least Heritrix/NAS configurations (we never managed to finish a job in a short period of time and NAS interrupted them, assigning the "closed connection" stop reason to all domains, even the completed ones).

Nicolas is currently working on data integrity. Bert, our crawl engineer, is trying to find a good combination of the parameters used in Heritrix and NAS to manage politeness, queues, retries and timeouts (which we think are the source of the problem).
It would be really helpful if you (Netarchive and ONB) could send us, as examples, the values you are using for settings.harvester.harvesting.heritrix.inactivityTimeout and settings.harvester.harvesting.heritrix.noResponseTimeout in the settings.xml, and the default order.xml you are using for production.

As soon as things slow down a little, we will split all the work we have done into pieces and discuss with you whether and how it could be integrated into the NAS main trunk.

Once again,
all the best for 2010,
and thanks in advance for your help.


Sara







