Overview

A customer reports that they have High Availability (HA) configured between the Primary collector and the Secondary collector using the "keepalived" service. They expect the node to switch to the Secondary collector when either the service or the node stops on the Primary collector. However, this is not working: they shared an instance where the service went down and HA did not switch over.

Solution

As a first step, ask the customer for the following:

- Any custom scripts they have implemented to restart keepalived in order to achieve HA.
  - This includes cron jobs, changes to startup scripts such as rc.local, and the custom scripts that these invoke.
- The output of systemctl status keepalived.service from both nodes.
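The items above can be gathered in one pass on each node. The following is a minimal sketch, not a supported tool: the function names are hypothetical, the /etc/rc.local path is an assumption (some distributions use a different location), and crontab -l only shows the crontab of the user running the script.

```python
import subprocess
from typing import Callable, Dict, List

# Commands to capture on each node. The rc.local path is an assumption;
# adjust it to wherever the customer's startup script actually lives.
DIAGNOSTIC_COMMANDS: Dict[str, List[str]] = {
    "keepalived_status": ["systemctl", "status", "keepalived.service", "--no-pager"],
    "crontab": ["crontab", "-l"],
    "rc_local": ["cat", "/etc/rc.local"],
}


def run_command(argv: List[str]) -> str:
    """Run one command and return its combined stdout/stderr.

    A non-zero exit code is not treated as an error, because
    `systemctl status` exits non-zero when the unit is not running,
    and that output is exactly what we want to capture.
    """
    try:
        result = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"<could not run {' '.join(argv)}: {exc}>"
    return result.stdout


def collect_diagnostics(
    runner: Callable[[List[str]], str] = run_command,
) -> Dict[str, str]:
    """Return a mapping of diagnostic name to captured output."""
    return {name: runner(argv) for name, argv in DIAGNOSTIC_COMMANDS.items()}
```

Calling collect_diagnostics() on the Primary and on the Secondary node and attaching both results to the ticket gives a side-by-side view of the service state and the scripts involved. The runner parameter exists only so the collection logic can be exercised without touching a real system.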

Scenario 1:

Usually, the issue is caused by a bug in the script invocation in rc.local or in the crontab. In one instance, it was noticed that the nohup command was used without a whitespace.

Ask the customer to fix this.
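A quick way to spot this kind of mistake in the collected rc.local or crontab text is sketched below. It assumes the missing whitespace is the one between nohup and the command it launches; the function name and the sample lines in the usage note are hypothetical and are not the customer's actual script.

```python
import re
from typing import List, Tuple

# Matches "nohup" at the start of a word when it is immediately followed by a
# non-whitespace character, e.g. "nohup/opt/x.sh". File names such as
# "nohup.out" and paths such as "/tmp/nohup.out" are deliberately not matched.
GLUED_NOHUP = re.compile(r"(?<![\w./-])nohup(?!\.out\b)(?=\S)")


def find_glued_nohup(text: str) -> List[Tuple[int, str]]:
    """Return (line number, line) pairs where nohup is glued to the next token."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Skip blank lines and shell/crontab comments.
        if not stripped or stripped.startswith("#"):
            continue
        if GLUED_NOHUP.search(stripped):
            hits.append((number, stripped))
    return hits
```

For example, a hypothetical crontab entry such as "@reboot nohup/opt/ha/check.sh &" would be reported, while "@reboot nohup /opt/ha/check.sh &" would not. This is a heuristic only; any line it flags should still be reviewed by hand before asking the customer to change it.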

Scenario 2:

The keepalived service must be running on both nodes. In this case, it was noticed that the service was not running on the Secondary node.
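Where SSH access to both collectors is available, the check can be scripted. The sketch below assumes key-based SSH login and uses hypothetical host names; it relies on systemctl is-active, which prints the unit state (for example "active" or "inactive").

```python
import subprocess
from typing import Callable, Dict, Iterable


def is_active_over_ssh(host: str) -> str:
    """Return the state printed by `systemctl is-active` on a remote host."""
    argv = [
        "ssh", "-o", "BatchMode=yes", host,
        "systemctl", "is-active", "keepalived.service",
    ]
    try:
        result = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return "unreachable"
    # An empty stdout usually means the SSH connection itself failed.
    return result.stdout.strip() or "unknown"


def nodes_not_running(
    hosts: Iterable[str],
    probe: Callable[[str], str] = is_active_over_ssh,
) -> Dict[str, str]:
    """Return only the hosts whose keepalived state is not "active"."""
    states = {host: probe(host) for host in hosts}
    return {host: state for host, state in states.items() if state != "active"}
```

An empty result from nodes_not_running(["primary-collector", "secondary-collector"]) means keepalived is active on both nodes; any host that appears in the result needs the service started before failover can work.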

If the customer asks for an RCA, note that these are On-Premise systems, so monitoring and changes to customer scripts are the responsibility of Managed Services.

If the script was written by the Skyvera (formerly STL) team, ask the customer to share the details of the HA UAT (User Acceptance Test) document. If the required changes are simple, make them in the script, ask the customer to test it, and then move it to production.

High Availability not working on CGNAT Deployment
