Note: JetPatch is designed to keep critical functionality operational even while the JetPatch Manager Console is down during the automatic failover period. This is because the JetPatch connector running on all endpoints is independent of manager status, so execution and monitoring of agent and patch management operations can continue even when the manager is down. The main capability that is unavailable while the manager is down is changing policies and remediation plans, until the standby server becomes active, which should take only a handful of minutes. Endpoint connector logs are also not lost, since connector logs are independent of manager logs.
With the following high-availability solution, upon JetPatch Manager Console failure, a single load balancer automatically switches to a standby server with replicated data:
For database high availability, use standard third-party high availability solutions for the PostgreSQL database. Otherwise, see our setup guide.
For this configuration, you set up two JetPatch Manager Consoles, designated as Active and Standby. The Active server's data (configuration files, VAIs) is routinely backed up to the standby server via SSH (using rsync). A third-party standard load balancer (NGINX) receives client requests and connects to both JetPatch Manager Consoles in Monitor mode.
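For reference, a minimal NGINX configuration for this role might look like the following TCP pass-through sketch. This is only an illustration: the hostnames and port are placeholders, it assumes the console is served over port 443, and it requires NGINX's stream module (packaged separately on some distributions, e.g. nginx-mod-stream). Adjust it to your environment or use your organization's standard load-balancer configuration.
stream {
    upstream jetpatch_console {
        # Client traffic goes to the Active server; the Standby is used only when the Active is unreachable.
        server active.example.local:443;
        server standby.example.local:443 backup;
    }
    server {
        listen 443;
        proxy_pass jetpatch_console;
    }
}
Note that a stream block belongs at the top level of /etc/nginx/nginx.conf (outside the http block).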
Note: You can also run the dr.py script without a fencing device (NGINX server). This is supported for Python 3 only. If you choose this option, perform the steps described under "On the NGINX server" below on the standby server, but in the "Configure active/standby monitoring" step run the following command instead:
./dr.py --master=[main-server] --slave=[standby-server] --monitor --standby
During normal operation, the load balancer directs client requests only to the Active server; the Standby server can be down. The load balancer monitors its connection to the Active server; upon failure, it attempts once to stop and then restart the Active server. If that doesn't restore the connection, the load balancer automatically starts the Standby server, which has all the same data as the Active server, and begins directing client requests to it.
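Purely as an illustration of the sequence just described (this is not the actual dr.py implementation; the hostnames, health check, and recovery actions below are placeholders), the monitoring logic can be pictured as a loop like this:
#!/bin/bash
# Illustrative sketch only -- the real logic lives inside dr.py.
ACTIVE=active.example.local      # placeholder hostnames
STANDBY=standby.example.local
is_up() { curl -ksf --max-time 10 "https://$1/" > /dev/null; }   # placeholder health check

while true; do
    if ! is_up "$ACTIVE"; then
        # One attempt to stop and then restart the Active server (placeholder action).
        ssh root@"$ACTIVE" reboot
        sleep 120
        if ! is_up "$ACTIVE"; then
            # Restart did not help: start the Standby server (environment-specific)
            # and begin directing client requests to it instead.
            break
        fi
    fi
    sleep 30
done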
To deploy this high-availability configuration:
- Completely set up two JetPatch Manager Consoles. Decide which will be Active and which Standby.
- Prepare a server with NGINX. It must have a hostname that is resolvable and reachable from the Active/Standby endpoints. The NGINX service should be installed and always running on the Active, Standby, and Load-Balancer (NGINX) servers. If you are using a load-balancer service other than NGINX, the Load-Balancer server must be able to run the dr.py Python script.
- Download the HA software files. Use the Python3 zip file.
- On the Active server (i.e., SSH into the JetPatch Manager Console), install rsync by running:
yum install -y rsync
- On the standby server, do the following:
- Place the contents of disaster-recovery_py3 in:
/usr/share/intigua
- Install rsync by running:
yum install -y rsync
- Establish a secure, trusted connection to the active server (to enable rsync to retrieve data) by running:
ssh-keygen
ssh-copy-id root@<active>
where <active> is the active server's resolvable name or IP address.
You will be prompted for the active server's root password.
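To confirm that the trusted connection works (and that rsync will not prompt for a password), you can run a quick test such as:
ssh root@<active> 'echo connection ok'
If this prints "connection ok" without asking for a password, key-based access is set up correctly.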
- Create the following directory:
/etc/intigua
- Configure synchronization. Go to /usr/share/intigua and run:
./sync.py --src=<active> --createconfig
where <active> is the Active server's resolvable name or IP address.
- Copy intigua-sync to:
/etc/init.d
- Start synchronization by running:
service intigua-sync start
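Because intigua-sync is installed as a SysV init script under /etc/init.d, you will likely also want it to start automatically after a reboot. On RHEL/CentOS systems this is typically done with chkconfig (this is an assumption about your environment; adjust if you use a different init system):
chkconfig --add intigua-sync
chkconfig intigua-sync on
service intigua-sync status   # only if the init script implements a status action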
- Optionally, test synchronization by running:
./sync.py --src=<active>
where <active> is the active server's resolvable name or IP address.
Check that the script is running, and that its output shows data being synchronized.
- On the NGINX server, do the following:
- Place the contents of disaster-recovery_py3 in:
/usr/share/intigua
- Establish a secure, trusted connection to the Active server (to enable stopping and restarting it upon its initial failure) by running:
ssh-keygen
ssh-copy-id root@<active>
where <active> is the Active server's resolvable name or IP address.
You will be prompted for the Active server's root password.
- Establish a secure, trusted connection to the Standby server (to enable starting the Standby server upon the Active server's failure) by running:
ssh-copy-id root@<standby>
where <standby> is the Standby server's resolvable name or IP address.
You will be prompted for the Standby server's root password.
- Configure Active/Standby monitoring by running the monitor script:
The monitor script should be started in a screen session (install screen with yum install -y screen), so it keeps running even after the terminal session is closed.
Start a screen session by running:
screen
You can detach from the screen session at any time by pressing Ctrl+a, then d.
To resume your screen session, run:
screen -r
If you have multiple screen sessions running on the machine, append the session ID after the -r switch. To find the session ID, list the currently running screen sessions with:
screen -ls
The DR script (dr.py) needs to run with the master and slave parameters:
./dr.py --master=<active> --slave=<standby> --monitor
where <active> and <standby> are the respective servers' resolvable names or IP addresses.
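Putting the above together, a typical sequence on the Load-Balancer server could look like this (the session name is arbitrary, and the working directory assumes dr.py was placed in /usr/share/intigua):
screen -S jetpatch-dr                                    # start a named screen session
cd /usr/share/intigua
./dr.py --master=<active> --slave=<standby> --monitor
# press Ctrl+a, then d, to detach; later, reattach with:
screen -r jetpatch-dr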
- Direct a browser to the NGINX server's name or IP address, log in as an Administrator, and configure JetPatch connector management services' Communication setting to the NGINX server's hostname (not IP address). Make sure this setting is deployed to all endpoints.
All users should direct their browsers to the NGINX server.
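As a quick sanity check (the URL scheme and path are assumptions; adjust to your deployment), you can confirm from an endpoint that the load balancer's hostname resolves and that the console responds through it:
nslookup <nginx-hostname>
curl -k https://<nginx-hostname>/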
How to restore the HA cluster after an Active server fault
After a failure has occurred and the previously Active server's fault has been resolved, it is ready to be restored to the cluster. Execute the following procedure to reinitialize the cluster:
1. Run the dr.py script on the Load-Balancer server with the --init parameter first:
./dr.py --master=<active> --slave=<standby> --init
where <active> and <standby> are the respective servers' resolvable names or IP addresses.
2. Then restart monitoring by running:
./dr.py --master=<active> --slave=<standby> --monitor
where <active> and <standby> are the respective servers' resolvable names or IP addresses.
If you are running without a fencing device (see the Note above), run the following commands on the standby server instead:
./dr.py --master=[main-server] --slave=[standby-server] --init --standby
./dr.py --master=[main-server] --slave=[standby-server] --monitor --standby
How to perform a Manager software upgrade procedure when HA is enabled
When upgrading the system to a new version (e.g. an RPM upgrade), it is important to shut down the DR script (dr.py) before the upgrade; otherwise, keeping DR active can lead to a split-brain situation.
You can use the Linux kill command and the dr.py process pid to shut down the process. After shutting down the script, first upgrade the primary server (see the upgrade procedure) and then upgrade the standby server. Only then can you restore the DR script (dr.py) to normal operation as described in the section "How to restore the HA cluster after an Active server fault".
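For example, one way to locate the dr.py process ID and stop it on the Load-Balancer server (standard Linux tooling; adjust if your environment differs):
pgrep -f dr.py          # print the pid(s) of the running DR script
kill $(pgrep -f dr.py)  # stop it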