[Netarchivesuite-users] NAS broad crawl questions

Tue Sep 24 10:08:56 CEST 2019

Well, more mysterious things happen:

·        We stopped everything and reset the queues X_COMMON_HARVESTER_STATUS_TOPIC (where harvesters report in as ready for new jobs), which was slowly growing, and X_COMMON_HCHAN_VAL_RESP (responses from the HarvestManager to the harvesters requesting HarvestChannel registrations). The latter had traffic of 10000 queue messages per second (in and out). This probably explains why the openmq process uses >200 % CPU. But why are all these messages sent? (The figure 10000 only happens to coincide with the number of domains per job, I suppose?)

·        The CPU usage decreased, but after a while it went up again: > 200 % for openmq and 100 % for the postgres DB process.

·         The GUI is still out of sync, although the “All jobs” page is slowly getting more in sync.

Question:

·        To have just one harvester running at start of crawl, we killed all the others, but maybe that confuses the admin? Is it better if we kill the harvesters and admin, empty the queues and then start admin and just one harvester?

Regards,

-----

Peter Svanberg

National Library of Sweden
Phone: +46 10 709 32 78

E-mail: peter.svanberg at kb.se
Web: www.kb.se

From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Peter Svanberg
Sent: 17 September 2019 19:37
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] NAS broad crawl questions

Thank you for that. We had increased the limits on the admin host, and now we have increased them on the harvesters as well. But we have not seen any “too many open files” errors; shouldn’t those show up in some log file?
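One way to check the logs for this is a small script along the following lines. It is only a sketch: the function name and the `*.log*` file pattern are assumptions, the directory to search has to be supplied by whoever runs it, and compressed rotated logs are not handled. It matches the message text case-insensitively, since the JVM usually reports it as "Too many open files".

```python
import sys
from pathlib import Path

# Text to look for, compared case-insensitively.
NEEDLE = "too many open files"


def find_fd_errors(log_dir, pattern="*.log*"):
    """Return (path, line number, line) for each log line mentioning the error."""
    hits = []
    for path in sorted(Path(log_dir).rglob(pattern)):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going on odd bytes in a log file.
        with path.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if NEEDLE in line.lower():
                    hits.append((str(path), number, line.rstrip()))
    return hits


if __name__ == "__main__":
    # Directory to scan is given on the command line; defaults to the current one.
    log_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    for path, number, line in find_fd_errors(log_dir):
        print(f"{path}:{number}: {line}")
```

An empty result only shows that no scanned file contains the message; it does not prove that the limit was never reached.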

Our scheduling of 1.8 million domains in 180 jobs of 10000 each took about 4 days (5 domains/second). This is step 1, i.e. the first run (with a 500 kByte limit). How many domains did you schedule in 6-9 hours?
Could you explain more? Do you deliberately delay the scheduling, and if so, how? With some NAS parameter?
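The rate quoted above can be checked with a few lines of arithmetic. This is just a worked calculation using the figures from this message (180 jobs of 10000 domains, about 4 days); the function name is made up for the illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def scheduling_rate(domains, days):
    """Average number of domains scheduled per second."""
    return domains / (days * SECONDS_PER_DAY)


# Figures from the message: 180 jobs of 10000 domains each, about 4 days.
domains = 180 * 10_000
rate = scheduling_rate(domains, 4)
print(f"{domains} domains in 4 days = {rate:.1f} domains/second")
# prints "1800000 domains in 4 days = 5.2 domains/second"
```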
Other strange things:

·        The admin host’s openmq process uses 200 % CPU even *after* the scheduling is done (when it shouldn’t have much to do).

·        The GUI is only partially updating, and out of sync with reality … when snapshot job n+30 is ready, it reports that job n is ready, with date and time from job n+30.

·        Jobs are run on different servers, but never more than one at a time on a server (though with different harvesters on the server at different times). And it doesn’t seem to use all the servers either.

·        There is sometimes a lot of delay between “Requested to check the validity of harvest channel 'SNAPSHOT'” and “Received message stating that channel 'SNAPSHOT' is valid.”.

·        Sometimes some process thinks that a harvest job is finished and moves files, while the harvest process itself continues and gets an exception when it doesn’t find the files to write to.

As Sara said, “About everything should happen during the first broad crawl!”

Regards,

From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of Tue Hejlskov Larsen
Sent: 16 September 2019 09:12
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] NAS broad crawl questions

Our 5 broad crawl servers (up to 10 harvesters per server) have the following setup:

[prod at kb-prod-har-001 ~]$ cat /etc/security/limits.d/90-nproc.conf
# Default limit for number of user's processes to prevent
# accidental fork bombs.
# See rhbz #432903 for reasoning.

*          soft    nproc     40000
root       soft    nproc     unlimited

/etc/security/limits.conf
prod             soft    nofile          20000
prod             hard    nofile          20000
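To see whether such limits are actually in effect for a process, something like the sketch below can be run as the harvester user (here `prod`) on Linux. It uses Python's standard `resource` module; the thresholds are simply the values from the listing above, and the helper names are made up for the illustration.

```python
import resource

# Thresholds copied from the limits listing above (assumed to be the target values).
WANTED = {
    "nofile": (resource.RLIMIT_NOFILE, 20000),
    "nproc": (resource.RLIMIT_NPROC, 40000),
}


def meets(limit, wanted):
    """True if a limit is unlimited or at least the wanted value."""
    return limit == resource.RLIM_INFINITY or limit >= wanted


def check_limits():
    """Return {name: (soft, hard, soft_ok)} for the current process."""
    report = {}
    for name, (res, wanted) in WANTED.items():
        soft, hard = resource.getrlimit(res)
        report[name] = (soft, hard, meets(soft, wanted))
    return report


if __name__ == "__main__":
    for name, (soft, hard, ok) in check_limits().items():
        status = "ok" if ok else "below wanted value"
        print(f"{name}: soft={soft} hard={hard} ({status})")
```

The result reflects the limits of the session the script is started from, so it should be started the same way as the harvesters for the comparison to mean anything.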

The proftpd server on each harvester has no session limit and is niced between -10 and -20.

We don’t have any problems with the scheduling of selective jobs during broad crawl job scheduling. In DK it takes about 6-9 hours to schedule about 350 jobs in step 2.
Some years ago we had that problem too: no selective jobs were scheduled during broad crawl job scheduling. That is no longer the case.

The broad crawl job scheduling in DK is delayed by long HarvestJobManager timeouts, because of the previous scheduling issues.
Even though we still have submitted jobs on the queue, it is only temporary and no longer a production issue.
The job scheduling is delayed so much that it can take a couple of hours to get all harvesters running with jobs. Most of the time 5-10 harvesters are just “sleeping”, even though there are a lot of jobs in the “new” queue.

Sometimes I can provoke some of the waiting harvesters to take a job by restarting another “listening” harvester.

Best regards

Tue

From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> On Behalf Of sara.aubry at bnf.fr
Sent: Friday, September 13, 2019 6:59 PM
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] NAS broad crawl questions

Regarding the limit, if you talked to Tue about crawler configuration, then you're probably ok.
Generation of all snapshot jobs takes precedence over the generation of selective ones.
At BnF, before we launch the broad crawl, we make sure our daily crawls have started, because the whole generation for about 1000 jobs takes between 4 and 5 hours.
If you do have a snapshot harvest controller that is truly available (with no grey dot), then the second job should start.
Common problems (at least some we encountered) are:
- access problems with the arc repository
- unwanted characters in seed lists causing the deactivation of the harvest definition
- broker out of memory

Sara

From:        "Peter Svanberg" <Peter.Svanberg at kb.se>
To:        "netarchivesuite-users at ml.sbforge.org" <netarchivesuite-users at ml.sbforge.org>
Date:        13/09/2019 18:44
Subject:        Re: [Netarchivesuite-users] NAS broad crawl questions
Sent by:        "NetarchiveSuite-users" <netarchivesuite-users-bounces at ml.sbforge.org>
________________________________

10000 is what the default limits give. Should we change that?

One job started and ended but the next snapshot job didn’t start. That’s what is strange.

Then later no selective job was started either. Everything seems to have stopped/paused, except snapshot job creation.

We will dig further in logs etc.

/Peter

On 13 Sep 2019 at 18:15, "sara.aubry at bnf.fr" <sara.aubry at bnf.fr> wrote:

Hello Peter,

That's great news, just the start of a big adventure!
About everything should happen during the first broad crawl!

10 000 domains per job is quite big, we do only 5 000 but you probably have big crawlers.

If you only had a single crawler started on the Snapshot channel, it's normal that only one job started.
That's very cautious. We also do this to make sure that we don't fail about 1000 jobs in a row...

Grey dot with no hostname means that your job is over and being post-processed with data transferred to the arc repository.
To check on this, look at the end of your HarvestController log file.
If everything went well, you can start another crawler, see if you are crawling well, and then launch your other crawlers.

Job generation can be quite long.

Best,

Sara

From:        "Peter Svanberg" <Peter.Svanberg at kb.se>
To:        "netarchivesuite-users at ml.sbforge.org" <netarchivesuite-users at ml.sbforge.org>
Date:        13/09/2019 18:03
Subject:        [Netarchivesuite-users] NAS broad crawl questions
Sent by:        "NetarchiveSuite-users" <netarchivesuite-users-bounces at ml.sbforge.org>
________________________________

This Wednesday at 11:02 we started our first NAS broad crawl, tadaa! (Pär has pictures showing Thomas and me pressing the mouse button, clicking on “Activate”.)

It started well, with the job creation process. The first job, though, contained only one domain – maybe because it was special, with lots of non-default seeds. Then there was job two, containing 9999 domains, and then the process continued, with 10000 domains in each job.

After that, the first snapshot job started running. But after it was finished, no more snapshot jobs were started.

Later, our selective harvests started and ran as scheduled. But when they were finished, nothing seemed to happen in the job finishing and job starting area. The “All Running Jobs” page just contains job rows with a grey dot (crawl finished) and no host name. But the job creation process continues, with now soon 100 jobs with 10000 domains each.

1)     Do you have any hints on what could have happened? Is the admin host so occupied with job creation that it can’t handle anything else? But it wasn’t during the first hours. Where could we look to find out what could be wrong? (In log files, of course, but what should we look for?)

We will let the job creation finish (which will happen approximately Sunday after 18:00) and see what happens then.

Then, concerning starting a broad crawl:

2)     We were advised to just have one harvester process running when the snapshot harvest is activated, which we did. But when could more processes be started? After the first snapshot job is started? Or should we wait until all jobs are created?

Regards,

-----

Peter Svanberg
Technical officer
Digital Collections Department, Newspapers, Radio and Television Division

National Library of Sweden
PO Box 5039
SE-104 51 Stockholm
Visits: Karlavägen 100, Stockholm
Phone: +46 10 709 32 78

E-mail: peter.svanberg at kb.se
Web: www.kb.se

_______________________________________________
NetarchiveSuite-users mailing list
NetarchiveSuite-users at ml.sbforge.org<mailto:NetarchiveSuite-users at ml.sbforge.org>
https://ml.sbforge.org/mailman/listinfo/netarchivesuite-users
________________________________

Journées européennes du patrimoine 2019<https://www.bnf.fr/fr/actualites/journees-europeennes-du-patrimoine-2019> - Saturday 21 and Sunday 22 September at the BnF sites

Please consider the environment before printing.
