[Netarchivesuite-users] Messaging (openmq) problems: looping (?) messages, > 6000 per second in and out
Tue Hejlskov Larsen
tlr at kb.dk
Thu Nov 9 13:14:25 CET 2023
Peter
I have never heard of that before.
Here is the result of the same command on our platform (see below).
We currently have 125 crawling Heritrix instances on 15 physical and 5 virtual servers with different specs at 2 locations.
The newest and most powerful broadcrawl server has 18 Heritrix instances with about 3 TB of fast disk, 64 GB RAM and 32 CPUs. Every server has its own active firewall.
I can send you more server specs if you need them.
Displaying destination metrics where:
----------------------------------------------
Destination Name              Destination Type
----------------------------------------------
PROD_COMMON_HCHAN_VAL_RESP    Queue
On the broker specified by:
-------------------------
Host         Primary Port
-------------------------
localhost    7676
---------------------------------------------------------------------------------------
   Msgs/sec         Msg Bytes/sec          Msg Count       Total Msg Bytes (k)  Largest
    In     Out         In        Out  Current  Peak   Avg  Current  Peak   Avg  Msg (k)
---------------------------------------------------------------------------------------
     0       0          0          0       83   190    85       64   147    60      < 1
  1713    1714    1360036    1360491       69   190    85       53   147    60      < 1
  1670    1672    1325993    1327578       71   190    85       55   147    60      < 1
  1666    1667    1322654    1323287       67   190    85       51   147    60      < 1
  1720    1717    1365280    1362899       79   190    85       61   147    60      < 1
  1754    1754    1392373    1392530       73   190    85       56   147    60      < 1
  1680    1678    1333359    1332097       85   190    85       65   147    60      < 1
  1690    1692    1341715    1342826       78   190    85       60   147    60      < 1
We have no connection issues or overload during the current broadcrawl harvest. The harvesters start and stop without any intervention, and all harvesters are running at the moment.
The only critical point is starting the broadcrawl: it must be done in steps, which is very important, and we need to be patient if we have started too many at once, because it can take 2-3 hours before all harvesters are connected and running.
It is implemented that way because we earlier had a lot of problems with jobs that moved from the New to the Submitted queue, and some hung there forever until we restarted the broker without emptying the queues.
We will only have a quiet platform when we cold start the whole platform in connection with a deploy of a new NAS version. That will not happen until January 2024.
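The stepwise start described above could be sketched roughly as follows. This is only an illustration in Python: `start_in_steps`, the `start` callback, the batch size and the pause length are all hypothetical and are not taken from NetarchiveSuite or from the actual procedure at kb.dk.

```python
import time
from typing import Callable, List, Sequence


def start_in_steps(
    instances: Sequence[str],
    start: Callable[[str], None],
    batch_size: int = 5,            # assumed value, tune for your platform
    pause_seconds: float = 600.0,   # assumed value, tune for your platform
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Start harvester instances in small batches, pausing between batches
    so that each batch has time to connect to the broker."""
    started: List[str] = []
    for i in range(0, len(instances), batch_size):
        for name in instances[i:i + batch_size]:
            start(name)  # hypothetical hook, e.g. a call to your start script
            started.append(name)
        # Pause only if there are more batches to come.
        if i + batch_size < len(instances):
            sleep(pause_seconds)
    return started


if __name__ == "__main__":
    # Dry run: print instead of starting anything, and do not really sleep.
    start_in_steps(
        [f"harvester-{n}" for n in range(1, 8)],
        start=lambda name: print("starting", name),
        batch_size=3,
        sleep=lambda seconds: print(f"waiting {seconds:.0f} s"),
    )
```

Passing `sleep` as a parameter keeps the sketch testable without waiting; in real use the default `time.sleep` applies.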
Best regards
Tue
________________________________
From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> on behalf of Peter Svanberg <Peter.Svanberg at kb.se>
Sent: 9 November 2023 11:35
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] Messaging (openmq) problems: looping (?) messages, > 6000 per second in and out
Thank you Tue, I’ll look into those config aspects. The imqcmd command was:
imqcmd metrics dst -passfile passfile -u admin -t q -n PLIKT_COMMON_HCHAN_VAL_RESP -m rts
But I forgot to mention that this happens on totally inactive servers, both admin and harvesters!
So some strange looping seems to start under certain conditions, where the openmq process keeps sending messages to the harvester instances as fast as it can. (See my second e-mail from yesterday.)
-----
Peter Sv.
From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> on behalf of Tue Hejlskov Larsen
Sent: 8 November 2023 15:46
To: netarchivesuite-users at ml.sbforge.org
Subject: Re: [Netarchivesuite-users] Messaging (openmq) problems: looping (?) messages, > 6000 per second in and out
Hello Peter
A couple of years ago we had similar problems with too high a volume of messages between the harvesters, the HarvestJobManager and the GUIApplication. The GUI hung and the message queues were overloaded with too much traffic. We lowered the number of calls, minimized retries, and delayed answers and responses in the settings files and perhaps also in the code. Colin did the last part; I don't remember where in the code.
I have attached the relevant common snippets from our GUIApplication, HarvestJobManager and broadcrawl Harvester settings files.
If the trick is hidden in the NAS code section for default values, I need to investigate that further together with Colin.
Which options do you use for your imqcmd listing?
Normally I only use imqcmd list dst to see if there is something wrong with the JMS queues.
Best regards
Tue
<jms>
..
<retries>10</retries>
..
</jms>
..
<jmx>
..
<timeout>120</timeout>
..
</jmx>
..
<monitor>
..
<jmxProxyTimeout>500</jmxProxyTimeout>
..
<reregisterDelay>10</reregisterDelay>
</monitor>
<heritrix>
<inactivityTimeout>1800</inactivityTimeout>
<noresponseTimeout>1800</noresponseTimeout>
<crawlLoopWaitTime>60</crawlLoopWaitTime>
..
</heritrix>
<frontier>
<!-- 2 minutes -->
<frontierReportWaitTime>120</frontierReportWaitTime>
..
</frontier>
________________________________
From: NetarchiveSuite-users <netarchivesuite-users-bounces at ml.sbforge.org> on behalf of Peter Svanberg <Peter.Svanberg at kb.se>
Sent: 8 November 2023 14:04
To: netarchivesuite-users at ml.sbforge.org
Subject: [Netarchivesuite-users] Messaging (openmq) problems: looping (?) messages, > 6000 per second in and out
We have recurring problems (in July in both test and production and now just in test) with lots of messaging in the queue for sending HarvesterRegistrationResponse.
Over 6000 messages per second (see below) in both directions, figures which are normally 0.
It is cured by stopping all harvesters, restarting openmq on the admin server and then starting the harvesters again.
Has anyone else had this? Any hints on what it could be? Or on whether we should debug in some way before we restart?
Currently the openmq process gets 120 % CPU while the NAS processes get less than 1 % now and then. I suppose this indicates that something is looping internally in the openmq process. But why …?
Displaying destination metrics where:
----------------------------------------------
Destination Name              Destination Type
----------------------------------------------
PLIKT_COMMON_HCHAN_VAL_RESP   Queue
On the broker specified by:
-------------------------
Host         Primary Port
-------------------------
localhost    7676
---------------------------------------------------------------------------------------
   Msgs/sec         Msg Bytes/sec          Msg Count       Total Msg Bytes (k)  Largest
    In     Out         In        Out  Current  Peak   Avg  Current  Peak   Avg  Msg (k)
---------------------------------------------------------------------------------------
     0       0          0          0        9    16    10        6    12     8      < 1
  5848    5847    4650296    4649825       10    16    10        7    12     8      < 1
  5920    5921    4707657    4707976       10    16    10        7    12     8      < 1
  5818    5817    4626266    4625949       12    16    10        9    12     8      < 1
  5784    5784    4599286    4599763        8    16    10        6    12     8      < 1
  5970    5969    4747221    4746267       15    16    10       11    12     8      < 1
  5830    5831    4635982    4636777       10    16    10        7    12     8      < 1
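Metrics rows like the ones above can also be checked mechanically, for example before deciding to restart the broker. The following is a minimal sketch in Python, assuming the imqcmd metrics output has been captured as plain text. The names `Sample`, `parse_metrics` and `looks_like_loop` and the thresholds are illustrative assumptions and are not part of OpenMQ or NetarchiveSuite. It flags samples where the in and out rates are both high while the queue depth stays small, which is the pattern reported here.

```python
from typing import List, NamedTuple


class Sample(NamedTuple):
    msgs_in: int        # Msgs/sec, In
    msgs_out: int       # Msgs/sec, Out
    bytes_in: int       # Msg Bytes/sec, In
    bytes_out: int      # Msg Bytes/sec, Out
    count_current: int  # Msg Count, Current


def parse_metrics(text: str) -> List[Sample]:
    """Parse the data rows; headers, separators and other text are skipped."""
    samples = []
    for line in text.splitlines():
        fields = line.split()
        # A data row has 10 numeric columns before the "Largest Msg (k)" column.
        if len(fields) < 10 or not all(f.isdigit() for f in fields[:10]):
            continue
        samples.append(Sample(*(int(f) for f in fields[:5])))
    return samples


def looks_like_loop(s: Sample, rate_threshold: int = 1000, max_depth: int = 100) -> bool:
    """High traffic in both directions on a nearly empty queue (assumed thresholds)."""
    return (
        s.msgs_in >= rate_threshold
        and s.msgs_out >= rate_threshold
        and s.count_current <= max_depth
    )


# Three rows copied from the output above, in the whitespace-separated form.
SAMPLE_OUTPUT = """\
In Out In Out Current Peak Avg Current Peak Avg Msg (k)
0 0 0 0 9 16 10 6 12 8 < 1
5848 5847 4650296 4649825 10 16 10 7 12 8 < 1
5920 5921 4707657 4707976 10 16 10 7 12 8 < 1
"""

if __name__ == "__main__":
    for s in parse_metrics(SAMPLE_OUTPUT):
        avg_size = s.bytes_in / s.msgs_in if s.msgs_in else 0.0
        print(s.msgs_in, s.msgs_out, f"avg {avg_size:.0f} bytes/msg", looks_like_loop(s))
```

Note that a rate threshold alone would also flag the healthy production figures quoted earlier in this thread (about 1700 msgs/sec during an active broadcrawl), so a check like this is only meaningful on a platform that is expected to be idle.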
Peter Svanberg
Technical officer
Acquisitions and Metadata Department
Film, Games, Sheet Music and Web Unit
National Library of Sweden
PO Box 5039, SE-102 41 Stockholm
Visits: Karlavägen 96, Stockholm
+46 10-709 32 78
Peter.Svanberg at kb.se
www.kb.se