Overview
The customer has reported high CPU utilization on CGNAT collector nodes. The problem is observed on more than one node.
Solution
Start by asking the customer for the information needed to troubleshoot the issue.
To resolve the high CPU utilization issue on the collector nodes, the following measures were taken:
- Investigation: Compare the thread settings in each service's XML configuration file with the number of CPU cores available on the system. Excessive thread allocation usually leads to CPU spikes. In the reported example, the lscpu and nproc commands reported 24 cores available on the system.
- Adjusting Thread Allocation: To address this issue, the thread allocation for each service was adjusted to optimize CPU usage. Before the change, the thread allocation in the services was as follows:
- Service 1 (parsingservice): Minimum thread = 15, Maximum thread = 25
- Service 2 (processingservice): Minimum thread = 15, Maximum thread = 25
- Service 3 (distributionservice): Minimum thread = 15, Maximum thread = 20
- Service 4 (distributionservice): Minimum thread = 10, Maximum thread = 15
- Pre-activity CPU utilization:
- Average CPU utilization: 70.65%
- Average idle CPU: 17.25%
- Average load average: 26.20, 25.06, 24.99
- Uptime: 515 days
- The thread allocation changes can be made along the lines described in Slowness in Processing Service. Each service where a change is made must be restarted.
- Updated Thread Allocation: The post-activity thread allocation in the services is now as follows:
- Service 1 (parsingservice): Minimum thread = 8, Maximum thread = 10
- Service 2 (processingservice): Minimum thread = 8, Maximum thread = 10
- Service 3 (distributionservice): Minimum thread = 4, Maximum thread = 6
- Service 4 (distributionservice): Minimum thread = 4, Maximum thread = 6
- Monitoring CPU Utilization: After implementing the necessary changes, the CPU utilization was monitored again. The post-activity CPU logs showed improved CPU utilization with reduced load on the system.
- Post-activity CPU utilization:
- Average CPU utilization: 42.93%
- Average idle CPU: 49.71%
- Average load average: 14.01, 14.20, 15.84
- Uptime: 515 days
- Documentation and Communication: The changes made in the system were documented and shared with the customer, along with the activity completion mail and activity logs for their reference.
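To put the pre- and post-activity figures in perspective, the load averages can be divided by the core count. The sketch below is plain arithmetic on the numbers reported above. It assumes the three values in each log are the usual 1-, 5-, and 15-minute load averages; the function name `load_per_core` is illustrative and not part of the product.

```python
CORES = 24  # reported by lscpu/nproc in the example

# Load averages from the reported logs (assumed to be 1-, 5-, 15-minute values).
PRE_LOAD = (26.20, 25.06, 24.99)
POST_LOAD = (14.01, 14.20, 15.84)


def load_per_core(load_averages, cores):
    """Divide each load average by the core count, rounded to two decimals."""
    return tuple(round(value / cores, 2) for value in load_averages)


if __name__ == "__main__":
    print("pre-activity load per core: ", load_per_core(PRE_LOAD, CORES))
    print("post-activity load per core:", load_per_core(POST_LOAD, CORES))
```

A value above 1.0 per core generally indicates that more tasks are waiting to run than there are cores. Before the change the node sat at roughly 1.04 to 1.09 per core; afterwards it dropped to roughly 0.58 to 0.66.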
Points to consider
- Identify whether any service has a backlog by checking the input directory specified in the service's XML configuration file.
- Identify which service has the most traffic using DCOUNTERSTATUS.
- The idea is to reduce the threads in services with lower backlog/traffic and allocate more threads to the service with the highest backlog/traffic.
- The recommended RAM-to-maximum-thread ratio is 7:1.
- The collection service usually does not need to be fine-tuned.
- The sum of the minimum threads of all the services should not exceed the number of cores available on the system.
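The rule about the sum of minimum threads can be checked with simple arithmetic. The following is a minimal Python sketch, assuming the minimum thread values have already been read by hand from each service's XML configuration file. The dictionary names and the `check_min_threads` helper are illustrative and not part of the product; the figures are those from the reported example (24 cores).

```python
# Minimum thread values from the reported example, keyed by service label.
PRE_ACTIVITY_MIN = {
    "Service 1 (parsingservice)": 15,
    "Service 2 (processingservice)": 15,
    "Service 3 (distributionservice)": 15,
    "Service 4 (distributionservice)": 10,
}
POST_ACTIVITY_MIN = {
    "Service 1 (parsingservice)": 8,
    "Service 2 (processingservice)": 8,
    "Service 3 (distributionservice)": 4,
    "Service 4 (distributionservice)": 4,
}


def check_min_threads(min_threads, cores):
    """Return (total, ok): the sum of minimum threads and whether it fits the core count."""
    total = sum(min_threads.values())
    return total, total <= cores


if __name__ == "__main__":
    cores = 24  # reported by nproc in the example; substitute the value from the live node
    for label, alloc in (("pre-activity", PRE_ACTIVITY_MIN),
                         ("post-activity", POST_ACTIVITY_MIN)):
        total, ok = check_min_threads(alloc, cores)
        status = "within" if ok else "exceeds"
        print(f"{label}: sum of minimum threads = {total}, {status} {cores} cores")
```

With these figures the pre-activity sum is 55, more than double the 24 available cores, while the post-activity sum is exactly 24.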
Conclusion
By optimizing the thread allocation and reducing the CPU load, the high CPU utilization issue on the collector nodes can be resolved. Ask the customer to monitor the system.