[Netarchivesuite-devel] Happy new year, informations and a question
sara.aubry at bnf.fr
Wed Jan 6 16:17:14 CET 2010
Hi Soren,
Thanks for your answer!
I'm not 100% sure it's a bug, but I reported it in the bug tracking system:
https://gforge.statsbiblioteket.dk/tracker/index.php?func=detail&aid=1851&group_id=7&atid=105
and also uploaded the log files.
Any look at, or advice on, the config information would be helpful!
Sara
Message from: Søren Vejrup Carlsen <svc at kb.dk>
06/01/2010 11:50
Sent by:
<netarchivesuite-devel-bounces at lists.gforge.statsbiblioteket.dk>
Please reply to
<netarchivesuite-devel at lists.gforge.statsbiblioteket.dk>
To
"netarchivesuite-devel at lists.gforge.statsbiblioteket.dk"
<netarchivesuite-devel at lists.gforge.statsbiblioteket.dk>
Cc
Subject
Re: [Netarchivesuite-devel] Happy new year, informations and a question
Hi Sara.
>connections, data integrity (we ingested domains which were not strictly
>compliant with NAS syntax and NAS deactivated the harvest) and last but
>not least Heritrix/NAS configurations (we never managed to finish a job
>in a short period of time and NAS interrupted them assigning "closed
>connexion" stop reason to all domains even the completed ones)
This seems like an ugly bug in NetarchiveSuite. Could you report it in our
bug tracking system along with the relevant log files?
>domains which were not strictly compliant with NAS syntax and NAS
>deactivated the harvest
I sent a mail to Nicolas about this before Christmas.
The issue was not NAS compliance in general, but that entries like
"www.editions-debaisieux.fr" are not considered valid domains by NAS.
However, editions-debaisieux.fr is considered a perfectly valid domain by
NAS.
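As an illustration only, a pre-ingest step could strip a leading "www." label
from host names before they are submitted as domains. The sketch below is
Python; the function names and the simplified two-label pattern are
assumptions made for this example and are not the validation rule that NAS
itself applies.

```python
import re

# Simplified, hypothetical pattern for a two-label name such as "name.tld".
# This is NOT the pattern NetarchiveSuite uses; it only illustrates the idea.
TWO_LABEL_DOMAIN = re.compile(r"^[a-z0-9]([a-z0-9-]*[a-z0-9])?\.[a-z]{2,}$")


def to_domain(host: str) -> str:
    """Lower-case a host name and drop one leading 'www.' label, if present."""
    host = host.strip().lower().rstrip(".")
    if host.startswith("www."):
        host = host[len("www."):]
    return host


def looks_like_domain(name: str) -> bool:
    """Return True if the name matches the simplified two-label pattern."""
    return TWO_LABEL_DOMAIN.match(name) is not None


if __name__ == "__main__":
    for raw in ("www.editions-debaisieux.fr", "editions-debaisieux.fr"):
        print(raw, "->", to_domain(raw), looks_like_domain(to_domain(raw)))
```

Note that this simplified pattern rejects names under multi-label public
suffixes (for example "example.co.uk"), so a real pre-selection step would
need a proper suffix list.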
Best Regards
---------------------------------------------------------------------------
Søren Vejrup Carlsen, NetarchiveSuite developer (and QA)
Department of Digital Preservation, Royal Library, Copenhagen, Denmark
tlf: (+45) 33 47 48 41
email: svc at kb.dk
----------------------------------------------------------------------------
Non omnia possumus omnes
--- Macrobius, Saturnalia, VI, 1, 35 -------
-----Original message-----
From: netarchivesuite-devel-bounces at lists.gforge.statsbiblioteket.dk
[mailto:netarchivesuite-devel-bounces at lists.gforge.statsbiblioteket.dk] On
behalf of sara.aubry at bnf.fr
Sent: 6 January 2010 10:47
To: netarchivesuite-devel at lists.gforge.statsbiblioteket.dk;
clo at statsbiblioteket.dk
Subject: [Netarchivesuite-devel] Happy new year, informations and a question
Hello Netarchive and ONB teams,
I would first like to wish you all a very happy new year, full of
interesting projects, fun and surprises.
As 2010 begins, we are really looking forward to continuing joint
development activities on NetarchiveSuite, which remains our main focus as
we set up a new hardware and software infrastructure for running broad and
ongoing focused crawls.
Here is some news about what we did in the past few months.
We have been working on three different tools:
* a pre-selection tool which gives us the ability to:
- filter and merge our domain list (ca. 1.5 million), a seed list selected by
librarians (ca. 8000) and the host reports from our 2008 broad crawl (ca.
3.5 million),
- produce some statistics,
- create TLDs, domains and seeds and ingest them directly into NAS.
It took us a while to develop it, we ran into lots of problems (DB
management, syntax, volume, timeouts...) and it is still not done, but we
are close to the end.
* a monitoring tool for librarians which is integrated into NAS and gives
the ability to:
- sort, filter and paginate the job list page,
- and see data on running jobs on one page (stats given by Heritrix like
duration, queues, number of files,...).
Here again we are close to the end.
* a QA tool which sums up stats from Heritrix reports for a given set of
jobs or harvest.
It has been subcontracted and, for now, we don't know whether it is close to
the end or not.
We have also set up and installed NetarchiveSuite on a virtualized
architecture (one admin + 30 crawlers on 6 physical machines) and tried to
launch our first broad crawl. We ran into all possible
problems: licences, disk crash, network connectivity and limited number of
connections, data integrity (we ingested domains which were not strictly
compliant with NAS syntax and NAS deactivated the harvest) and last but
not least Heritrix/NAS configurations (we never managed to finish a job in
a short period of time and NAS interrupted them, assigning the "closed
connexion" stop reason to all domains, even the completed ones).
Nicolas is currently working on data integrity. Bert, our crawl engineer,
is trying to find a good combination in the parameters used in Heritrix
and NAS to manage politeness, queues, retries and timeouts (which we think
are the source of the problem).
It would be really helpful if you (Netarchive and ONB) could send us as
examples the values you are using for
settings.harvester.harvesting.heritrix.inactivityTimeout and
settings.harvester.harvesting.heritrix.noResponseTimeout in the
settings.xml and the default order.xml you are using for production.
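To make comparing values across installations easier, the two settings could
be pulled out of a settings.xml with a short script. This is only a minimal
sketch in Python: it assumes that the dotted setting names map onto nested
XML elements, the helper names are hypothetical, and the sample values are
placeholders, not recommendations.

```python
import xml.etree.ElementTree as ET

KEYS = (
    "settings.harvester.harvesting.heritrix.inactivityTimeout",
    "settings.harvester.harvesting.heritrix.noResponseTimeout",
)

# Placeholder document for illustration only; the values are not recommendations.
SAMPLE = """<settings>
  <harvester><harvesting><heritrix>
    <inactivityTimeout>1800</inactivityTimeout>
    <noResponseTimeout>1800</noResponseTimeout>
  </heritrix></harvesting></harvester>
</settings>"""


def local_name(tag):
    """Strip an XML namespace prefix of the form '{uri}name', if any."""
    return tag.rsplit("}", 1)[-1]


def read_setting(root, dotted_path):
    """Follow a dotted path through nested elements; return text or None."""
    parts = dotted_path.split(".")
    if local_name(root.tag) != parts[0]:
        return None
    node = root
    for part in parts[1:]:
        # Take the first child whose local name matches this path segment.
        node = next((c for c in node if local_name(c.tag) == part), None)
        if node is None:
            return None
    return (node.text or "").strip()


if __name__ == "__main__":
    # For a real file, use ET.parse(path).getroot() instead of the sample.
    root = ET.fromstring(SAMPLE)
    for key in KEYS:
        print(key, "=", read_setting(root, key))
```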
We'll split all the work we have done into pieces and discuss with you
whether and how it could be integrated into the NAS main trunk as soon as
things slow down a little bit.
Once again,
all the best for 2010,
and thanks in advance for your help.
Sara
Avant d'imprimer, pensez à l'environnement.
Consider the environment before printing this mail.
_______________________________________________
Netarchivesuite-devel mailing list
Netarchivesuite-devel at lists.gforge.statsbiblioteket.dk
https://lists.gforge.statsbiblioteket.dk/mailman/listinfo/netarchivesuite-devel