[Netarchivesuite-devel] Happy new year, informations and a question

sara.aubry at bnf.fr sara.aubry at bnf.fr
Wed Jan 6 16:17:14 CET 2010


Hi Soren,

Thanks for your answer!
I'm not 100% sure it's a bug, but I reported it in the bug tracking system: 
https://gforge.statsbiblioteket.dk/tracker/index.php?func=detail&aid=1851&group_id=7&atid=105
and also uploaded the log files.
Any look at, or advice on, the config information would be helpful!

Sara







Message from: Søren Vejrup Carlsen <svc at kb.dk> 
                      06/01/2010 11:50

Sent by: 
<netarchivesuite-devel-bounces at lists.gforge.statsbiblioteket.dk>

Please reply to 
<netarchivesuite-devel at lists.gforge.statsbiblioteket.dk>



To
"netarchivesuite-devel at lists.gforge.statsbiblioteket.dk" 
<netarchivesuite-devel at lists.gforge.statsbiblioteket.dk>
Cc

Subject
Re: [Netarchivesuite-devel] Happy new year, informations and a question



Hi Sara.
>connections, data integrity (we ingested domains which were not strictly 
>compliant with NAS syntax and NAS deactivated the harvest) and last but 
>not least Heritrix/NAS configurations (we never managed to finish a job 
>in a short period of time and NAS interrupted them assigning "closed 
>connexion" stop reason to all domains even the completed ones)
This seems like an ugly bug in NetarchiveSuite. Could you report it as a bug 
in our bug tracking system along with the relevant log files?

>domains which were not strictly compliant with NAS syntax and NAS 
>deactivated the harvest
I sent a mail to Nicolas about this before Christmas.
The issue was not about compliance with NAS syntax, but that host names like 
"www.editions-debaisieux.fr" are not considered valid domains by NAS.
However, "editions-debaisieux.fr" is considered a perfectly valid domain by 
NAS.
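
To illustrate the difference, here is a minimal sketch of how a host name could be reduced to a domain name before ingest. The helper to_domain is hypothetical and is not the validation NAS actually performs: it naively keeps the last two labels, so it would give wrong results for multi-label suffixes such as "co.uk".

```python
def to_domain(host):
    """Naively reduce a host name to 'name.tld' by keeping its last two labels.

    Hypothetical helper for illustration only; this is not NAS's own rule and
    it ignores multi-label public suffixes such as 'co.uk'.
    """
    labels = [label for label in host.strip().lower().rstrip(".").split(".") if label]
    if len(labels) < 2:
        raise ValueError("not enough labels for a domain: %r" % host)
    return ".".join(labels[-2:])


for host in ("www.editions-debaisieux.fr", "editions-debaisieux.fr"):
    print(host, "->", to_domain(host))
```

Both inputs come out as "editions-debaisieux.fr", which is the form described above as valid.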
 

Best  Regards

---------------------------------------------------------------------------
Søren Vejrup Carlsen, NetarchiveSuite developer (and QA)
Department of Digital Preservation, Royal Library, Copenhagen, Denmark 
tlf: (+45) 33 47 48 41
email: svc at kb.dk
----------------------------------------------------------------------------
Non omnia possumus omnes
--- Macrobius, Saturnalia, VI, 1, 35 -------


-----Original message-----
From: netarchivesuite-devel-bounces at lists.gforge.statsbiblioteket.dk 
[mailto:netarchivesuite-devel-bounces at lists.gforge.statsbiblioteket.dk] On 
behalf of sara.aubry at bnf.fr
Sent: 6 January 2010 10:47
To: netarchivesuite-devel at lists.gforge.statsbiblioteket.dk; 
clo at statsbiblioteket.dk
Subject: [Netarchivesuite-devel] Happy new year, informations and a question

Hello Netarchive and ONB teams, 


I would first like to wish you all a very happy new year, full of 
interesting projects, fun and surprises.
As 2010 begins, we are really looking forward to continuing joint 
development activities on NetarchiveSuite, which remains our main focus as 
we set up a new hardware and software infrastructure for running broad and 
ongoing focused crawls.

Here is some news about what we did in the past few months.


We have been working on three different tools:

* a pre-selection tool which gives us the ability to:
- filter and merge our domain list (ca. 1.5 million), a seed list selected by 
librarians (ca. 8000) and the host reports from our 2008 broad crawl (ca. 
3.5 million),
- give some stats,
- create TLDs, domains and seeds and ingest them directly into NAS.
It took us a while to develop it, we ran into lots of problems (DB 
management, syntax, volume, timeout...) and it is still not done, but we 
are close to the end.

* a monitoring tool for librarians which is integrated into NAS and gives 
the ability to:
- sort, filter and paginate the job list page,
- and see data on running jobs on one page (stats given by Heritrix like 
duration, queues, number of files,...).
Here again we are close to the end.

* a QA tool which sums up stats from Heritrix reports for a given set of 
jobs or harvests.
It has been subcontracted and for now, we don't know if it is close to 
the end or not.


We have also set up and installed NetarchiveSuite on a virtualized 
architecture (one admin + 30 crawlers on 6 physical machines) and tried to 
launch our first broad crawl. We ran into all possible problems: licences, 
disk crash, network connectivity and a limited number of connections, data 
integrity (we ingested domains which were not strictly compliant with NAS 
syntax and NAS deactivated the harvest) and, last but not least, 
Heritrix/NAS configurations (we never managed to finish a job in a short 
period of time, and NAS interrupted them, assigning the "closed connexion" 
stop reason to all domains, even the completed ones).

Nicolas is currently working on data integrity. Bert, our crawl engineer, 
is trying to find a good combination of the parameters used in Heritrix 
and NAS to manage politeness, queues, retries and timeouts (which we think 
are the source of the problem). 
It would be really helpful if you (Netarchive and ONB) could send us, as 
examples, the values you are using for 
settings.harvester.harvesting.heritrix.inactivityTimeout  and 
settings.harvester.harvesting.heritrix.noResponseTimeout  in the 
settings.xml and the default order.xml you are using for production.
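
To make the request concrete, here is a minimal sketch of reading those two values out of a settings file. It assumes the XML nesting simply follows the dotted setting names; the sample fragment and its numbers are placeholders, not values from any real installation, and read_setting is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the nesting is inferred from the dotted setting
# names, and the numbers are placeholders, not recommended values.
SAMPLE_SETTINGS = """\
<settings>
  <harvester>
    <harvesting>
      <heritrix>
        <inactivityTimeout>1800</inactivityTimeout>
        <noResponseTimeout>1800</noResponseTimeout>
      </heritrix>
    </harvesting>
  </harvester>
</settings>
"""


def local_name(tag):
    """Drop any '{namespace}' prefix that ElementTree puts on a tag."""
    return tag.rsplit("}", 1)[-1]


def read_setting(root, dotted_path):
    """Follow a dotted path from the root element; return its text or None."""
    parts = dotted_path.split(".")
    if local_name(root.tag) != parts[0]:
        return None
    node = root
    for part in parts[1:]:
        # Pick the first child whose local tag name matches this path step.
        node = next((c for c in node if local_name(c.tag) == part), None)
        if node is None:
            return None
    return (node.text or "").strip()


root = ET.fromstring(SAMPLE_SETTINGS)
for name in ("inactivityTimeout", "noResponseTimeout"):
    path = "settings.harvester.harvesting.heritrix." + name
    print(path, "=", read_setting(root, path))
```

For a real file, ET.parse(filename).getroot() would replace the ET.fromstring call; the namespace stripping is there in case the file declares a default XML namespace.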

We'll split the work we have done into pieces and discuss with you whether 
and how it could be integrated into the NAS main trunk, as soon as things 
slow down a little. 

Once again,
all the best for 2010,
and thanks in advance for your help.


Sara




Avant d'imprimer, pensez à l'environnement. 
Consider the environment before printing this mail. 

_______________________________________________
Netarchivesuite-devel mailing list
Netarchivesuite-devel at lists.gforge.statsbiblioteket.dk
https://lists.gforge.statsbiblioteket.dk/mailman/listinfo/netarchivesuite-devel





