[Netarchivesuite-users] NAS broad crawl questions
sara.aubry at bnf.fr
Fri Sep 13 18:59:22 CEST 2019
Regarding the limit, if you talked to Tue about crawler configuration,
then you're probably ok.
Generation of all snapshot jobs takes precedence over the generation of
selective ones.
At BnF, before we launch the broad crawl, we make sure our daily crawls
have started because the whole generation for about 1000 jobs takes
between 4 and 5 hours.
If you do have a snapshot harvest controller that is truly available
(with no grey dot), then the second job should start.
Common problems (at least some we encountered) are:
- access problems to the arc repository
- unwanted characters in seed lists causing the deactivation of the
harvest definition
- broker out of memory
Sara
From: "Peter Svanberg" <Peter.Svanberg at kb.se>
To: "netarchivesuite-users at ml.sbforge.org"
<netarchivesuite-users at ml.sbforge.org>
Date: 13/09/2019 18:44
Subject: Re: [Netarchivesuite-users] NAS broad crawl questions
Sent by: "NetarchiveSuite-users"
<netarchivesuite-users-bounces at ml.sbforge.org>
10000 is what the default limits give. Should we change that?
One job started and ended, but the next snapshot job didn’t start. That’s
what is strange.
Then later no selective job was started either. Everything seems to have
stopped/paused, except snapshot job creation.
We will dig further in logs etc.
/Peter
On 13 Sep 2019, at 18:15, "sara.aubry at bnf.fr" <sara.aubry at bnf.fr> wrote:
Hello Peter,
That's great news, just the start of a big adventure!
Just about everything is bound to happen during the first broad crawl!
10 000 domains per job is quite big, we do only 5 000 but you probably
have big crawlers.
If you only had a single crawler started on the Snapshot channel, it's
normal that only one job started.
That's very cautious. We also do this to make sure that we don't fail
about 1000 jobs in a row...
A grey dot with no hostname means that your job is over and being
post-processed, with data transferred to the arc repository.
To check on this, look at the end of your HarvestController log file.
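As a rough illustration of that check, here is a minimal Python sketch that prints the last lines of a log file. The function name last_lines and the command-line usage are assumptions made for this example only; they are not part of NetarchiveSuite, and the actual log path depends on your installation.

```python
import sys
from collections import deque
from typing import Iterable, List


def last_lines(lines: Iterable[str], count: int = 50) -> List[str]:
    """Return the final `count` lines of an iterable, without trailing newlines."""
    # A deque with maxlen keeps only the newest `count` items while streaming,
    # so a large log file is never held in memory as a whole.
    return [line.rstrip("\n") for line in deque(lines, maxlen=count)]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python tail_log.py <path to your HarvestController log>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for line in last_lines(log, 50):
            print(line)
```

On a Unix host, `tail -n 50` on the same file gives the same result; the sketch is only useful if you want to script the check.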
If everything went well, you can start another crawler, see if you are
crawling well, and then launch your other crawlers.
Job generation can be quite long.
Best,
Sara
From: "Peter Svanberg" <Peter.Svanberg at kb.se>
To: "netarchivesuite-users at ml.sbforge.org" <
netarchivesuite-users at ml.sbforge.org>
Date: 13/09/2019 18:03
Subject: [Netarchivesuite-users] NAS broad crawl questions
Sent by: "NetarchiveSuite-users" <
netarchivesuite-users-bounces at ml.sbforge.org>
This Wednesday at 11:02 we started our first NAS broad crawl, tadaa! (Pär
has pictures showing Thomas and I pressing the mouse button, clicking on
“Activate”.)
It started well, with the job creation process. The first job, though,
contained only one domain – maybe because it was special, with lots of
non-default seeds. Then there was job two, containing 9999 domains, and
then the process continued, with 10000 domains in each job.
After that, the first snapshot job started running. But after it had
finished, no more snapshot jobs were started.
Later, our selective harvests started and ran as scheduled. But when they
were finished, nothing seems to happen in the job finishing and job
starting area. The “All Running Jobs” page just contains job rows with a
grey dot (crawl finished) and no host name. But the job creation process
continues, with now soon 100 jobs with 10000 domains each.
1) Do you have any hints on what could have happened? Is the admin
host so occupied with job creation that it can’t handle anything else? But
it wasn’t during the first hours. Where could we look to find out what
could be wrong? (In log files, of course, but what should we look for?)
We will let the job creation finish (which will happen approximately
Sunday after 18:00) and see what happens then.
Then, concerning starting a broad crawl:
2) We were advised to just have one harvester process running when the
snapshot harvest is activated, which we did. But when could more processes
be started? After the first snapshot job is started? Or should we wait
until all jobs are created?
Regards,
-----
Peter Svanberg
Technical officer
Digital Collections Department, Newspapers, Radio and Television Division
National Library of Sweden
PO Box 5039
SE-104 51 Stockholm
Visits: Karlavägen 100, Stockholm
Phone: +46 10 709 32 78
E-mail: peter.svanberg at kb.se
Web: www.kb.se
_______________________________________________
NetarchiveSuite-users mailing list
NetarchiveSuite-users at ml.sbforge.org
https://ml.sbforge.org/mailman/listinfo/netarchivesuite-users
European Heritage Days (Journées européennes du patrimoine) 2019 - Saturday 21
and Sunday 22 September at the BnF sites
Before printing, think of the environment.