High Availability VIP Management With ucarp

Previously, we explored using keepalived to manage a virtual IP address (VIP). For the purposes of high availability this works, and it is indeed the most popular way of doing so on Linux, but the approach has several drawbacks.

The configuration for keepalived is notoriously finicky, the software itself is loaded with features far beyond the simple task of managing a virtual IP address, the VRRP protocol is patent-encumbered by Cisco for the time being (meaning a lawsuit is at least theoretically possible), and worst of all the VRRP protocol itself is completely and utterly insecure. That means any VRRP implementation, whether it's keepalived or uvrrpd, should be considered unacceptable for enterprise use.

You can configure a password for your VRRP instances, but the password is sent in the clear. This means that if an intermediate switch or router is compromised, taking down your entire website is as simple as crafting a VRRP packet that contains the password and the right VRRP metrics to win the router election. Presumably you wouldn't like that, which is why you went through the trouble of architecting a high availability solution to begin with.

There has to be a better solution than that. Luckily there is. Enter the CARP protocol and its Linux-compatible implementation, ucarp. The ucarp daemon is both easier to use and more secure than keepalived.

Basics of the CARP protocol

CARP ("Common Address Redundancy Protocol") was developed by the OpenBSD project to counter the patent-encumbered and insecure VRRP protocol (the protocol used by keepalived and uvrrpd). It is intended to specialize only in VIP management. Being more narrowly focused than keepalived isn't usually a liability for ucarp, as VIP management is usually all administrators were using keepalived for in the first place.

The similarity to VRRP is more than skin deep; the entire protocol is structured as a drop-in replacement for VIP management with VRRP. Multiple pools can be managed, peers are discovered via multicast (although keepalived also allows manually specifying peers for unicast communication), in each pool one server will be MASTER while the others will be in a BACKUP state, and an up/down script is kicked off on each node as it changes state. It even shares the multicast IP address reserved for VRRP (224.0.0.18, if you were wondering).

The primary differences between the two:

  • Terminology: What VRRP calls a "Virtual Router" CARP refers to as a "VIP," and what VRRP calls a "physical router" is simply a "node" in CARP. You'll note that CARP's terms line up with common industry usage rather than the vocabulary of a niche corner of networking.
  • Ease of use: Each implementation of the CARP protocol is substantially easier to use than keepalived itself. For instance, ucarp is far less finicky about your use of whitespace.
  • Security: Whereas VRRP sends the password in cleartext, CARP produces an HMAC for each advertisement and hashes the password (both using SHA1). Even though SHA1's reputation has declined in recent years, it's still better than cleartext.

Theory Behind High Availability Protocols

Let's suppose you wanted to construct a pair of nginx load balancers to route HTTP traffic to backend web servers. The trick here is: how do we get one of our load balancers to take over the load currently going to the other node?

We have two choices here:

  • Load balance through DNS. Each time a client resolves the hostname, the A records come back in a different order, so if one server is down the client will (hopefully) retry against a server that is actually up. This also requires all nodes to be in use to some degree (in other words, an active-active configuration).
  • DNS points to a virtual IP address (VIP) that's shared between the nodes, such that whichever nginx instance is supposed to be handling the traffic need only have the VIP assigned. These nodes necessarily have to be on the same subnet.

The first option, besides being out of scope for this article, results in random slowdowns and depends on each HTTP client's retry mechanism eventually failing over to the other servers, either by trying the returned results in order or by re-resolving the name so a different address comes up first (if it does so at all). In reality, for incredibly large sites you would likely combine the two approaches (DNS actively load balancing between several VIPs) so that the active load is spread around, but to keep it simple we'll pretend DNS is just a single A record pointing at our VIP.

As the nodes boot up, the cluster has to determine which will be the master node and which will be the backup nodes. CARP does this by determining which node will send out advertisements most frequently (and would therefore be quickest to be noticed missing if it disappeared) and promotes that node to master. If two or more nodes have the same advertisement interval, the node with the lowest IP address is selected as the default master (a simple tie breaker). Consequently, if you want a particular node to come up as the default master and to regain that status whenever it comes back, increase the frequency of its advertisements.

Similar to other HA protocols, when a CARP node becomes master it informs the neighboring routers and switches that it now owns the IP address. It does this by issuing what's called a gratuitous ARP reply, announcing the new IP-to-MAC mapping to everyone on the segment. This causes the neighbors to update their local ARP tables, and traffic that would previously have been sent out the switch port associated with the old master will now exit the switch port for the new master (where the IP address has hopefully been assigned).
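If you ever want to trigger this announcement by hand (for testing, or for addresses ucarp doesn't know about), something like the following works. This is just a sketch assuming the iputils flavor of arping (Debian package iputils-arping) and the interface/VIP used later in this article:

arping -U -c 3 -I enp0s3 10.10.10.100    # unsolicited ("gratuitous") ARP for the VIP

ucarp performs this step for you automatically when a node is promoted to master.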

Configuring Two Ubuntu Web Servers

OK, so now we have a basic understanding of what the CARP protocol is, how ucarp specifically fits into the mix, and roughly how the two interact. Let's do something useful with this information.

Let's continue with our example of two nginx load balancers. I'll forgo explaining the nginx load balancing itself as out of scope, but you should also be aware of how each individual instance (once it's receiving traffic) will get that traffic to the backend nodes that need it. First, let's lay out the IP addresses I'll be using:

  • node01: will have a physical (aka “real” in CARP-ese) IP address of 10.10.10.79
  • node02: will have a physical IP address of 10.10.10.80
  • The VIP that the two nodes will be managing will be 10.10.10.100

OK, so once the VMs are spun up and the IPs statically assigned, we install ucarp and nginx with apt-get install -y nginx ucarp (I'll spare you the uninteresting command output). Once they're installed, we need to configure the network interface and nginx.

The nginx load balancing configuration is fairly simple and omitted here for brevity. Essentially you must listen on all interfaces: if you bind to just one IP, then when the VIP migrates onto the machine nginx won't be listening on it and that traffic will go unanswered. An example nginx virtual host (without any load balancing) might be:

server {

  listen 80 default_server;
  root /var/www/html;
  index index.php index.html;
  server_name _;

  location / {
    try_files $uri $uri/ =404;
  }

}

As you can see, nothing all that special. The important point is just to make sure the listen directive doesn’t specify a particular IP address. Now let’s get to the actual high availability portion, configuring the IP address to migrate over.

Before getting into a fully enterprise-grade solution, let’s start you off easy. The Debian version of ucarp is actually pretty neat in that it integrates with the regular networking functionality of the OS. Let’s look at an example /etc/network/interfaces configuration:

auto lo
iface lo inet loopback

auto enp0s3
iface enp0s3 inet static
  address 10.10.10.79
  netmask 255.255.255.0
  gateway 10.10.10.1
  dns-nameservers 8.8.8.8
  ucarp-vid      1
  ucarp-vip      10.10.10.100
  ucarp-password rascaldev
  ucarp-advbase  1
  ucarp-advskew  0
  ucarp-master   no

iface enp0s3:ucarp inet static
  address 10.10.10.100
  netmask 255.255.255.0

Before going over each of the new lines in the configuration file, let's explain the structure. In most high availability systems a virtual IP is just considered a resource that is traded around the member nodes. As such, the high availability configuration is placed on each node's physical interface. In the simplest of test environments it's usually alright for the production network interface and the high availability interface to be the same; however, for reasons that will become apparent later, on high-workload production systems it's often beneficial to move the cluster communication onto a different network.

So with the high availability configuration going on the physical interface, where do we put the VIP? You can see our VIP on the ucarp-vip directive, but don't be fooled: that's only there for the conversations the ucarp instances have with each other. When a node in this sort of configuration becomes the master, the Debian scripts attempt to bring up the subinterface named <high-availability-interface>:ucarp. In our case the high availability interface is enp0s3, so the interface brought online upon becoming master is enp0s3:ucarp, which you can see we've defined below our physical interface. This is just a regular interface, and it can take all the normal directives any other interface can.

Now that we know how the file is organized and how things actually work, let’s break down your high availability configuration:

  • ucarp-vid: The VIP identification number. It can be anything from 1 to 255 and uniquely identifies this VIP pool as opposed to all others. As a consequence you can't have more than 255 HA pools on a single subnet, but in practice that's hardly ever an issue.
  • ucarp-vip: The IP that the networking service will hand to the ucarp daemon as our main VIP. This isn't what actually assigns the IP address to an interface; it's only used in computing the HMAC that this node gives to its neighbors for authentication/integrity.
  • ucarp-password: A short text string that is combined with the rest of the information in a node's advertisement; when an advertisement is received, both the main VIP and this string are used to validate that the other node really belongs to our cluster.
  • ucarp-advbase: The base amount of time (in seconds) between advertisement messages. This needs to be the same on all nodes and defaults to 1 second, which is usually sufficient.
  • ucarp-advskew: This particular node's skew away from the base interval. The skew is included in the node's advertisements so other nodes know how often to expect to hear from it. (Note the units aren't whole seconds: per the CARP protocol each unit of skew adds a small fraction of a second, 1/256 s, to the interval, so a higher skew means slightly less frequent advertisements.) Usually this can be left at zero, since usually you don't care which node is master, but if you want a particular node to always be master whenever it's available, give the other nodes a higher skew; the cluster always prefers the node sending advertisements most often (node02's example stanza below shows this).
  • ucarp-master: Configures this node's initial state on startup. It's usually best to start as a backup and assume a master already exists, rather than potentially adding a VIP that's currently in use elsewhere.
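For reference, node02's stanza would be nearly identical. Only the physical address changes, plus (optionally) a higher ucarp-advskew if you'd like node01 to be the preferred master; the skew value of 50 below is an arbitrary illustration:

auto enp0s3
iface enp0s3 inet static
  address 10.10.10.80
  netmask 255.255.255.0
  gateway 10.10.10.1
  dns-nameservers 8.8.8.8
  ucarp-vid      1
  ucarp-vip      10.10.10.100
  ucarp-password rascaldev
  ucarp-advbase  1
  ucarp-advskew  50
  ucarp-master   no

iface enp0s3:ucarp inet static
  address 10.10.10.100
  netmask 255.255.255.0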

So far so good. If you put the above configuration in place and reboot each node, you'll eventually see the VIP appear on whichever node was elected master:

root@node01:~# ip add
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:7b:22:d5 brd ff:ff:ff:ff:ff:ff
    inet 10.10.10.79/24 brd 10.10.10.255 scope global enp0s3
       valid_lft forever preferred_lft forever
    inet 10.10.10.100/24 brd 10.10.10.255 scope global secondary enp0s3:ucarp
       valid_lft forever preferred_lft forever
    inet6 2620:0:691:5337:a00:27ff:fe7b:22d5/64 scope global mngtmpaddr dynamic 
       valid_lft 2591902sec preferred_lft 604702sec
    inet6 fe80::a00:27ff:fe7b:22d5/64 scope link 
       valid_lft forever preferred_lft forever
root@node01:~#

If you ever want to demote an individual node, just send the SIGUSR2 signal to the ucarp process you want demoted or do a killall -SIGUSR2 ucarp to demote all instances to backup.

Go ahead and test this configuration by accessing the website in your browser, with the page printing some node-specific data so you can see which node is currently answering on the VIP. As you reload, have another window open in which you reboot one of your nodes, and watch how the other node eventually notices and promotes itself to master.
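If you'd rather watch the failover from a terminal, a quick polling loop does the job too. This is just a sketch; it assumes you've dropped a file like /var/www/html/whoami.txt on each node containing that node's hostname:

#!/bin/bash

# hit the VIP once a second and report which node answered (or that nothing did)
while true; do
  curl -s --max-time 1 http://10.10.10.100/whoami.txt || echo "no response"
  sleep 1
done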

Maximizing Availability

Depending on the speed of the connected switches and devices, you may have noticed a problem with the current means of VIP management: failover was automatic and reasonably fast, but there was a period of time (usually between 2 and 4 seconds on my machine) during which the website was unavailable. Sometimes it would seem to kick over immediately; other times the browser would spin for a bit and then come back on the new node; other times it timed out completely. That level of variability isn't acceptable for a production environment.

This behavior is primarily due to a combination of the following:

  • systemd orders service shutdowns such that the web server is stopped before the networking service signals ucarp to gracefully migrate the VIP. Since networking is a service that any web server will obviously depend on, some amount of time passes with the switch directing traffic to an old master that no longer has a program running to answer requests. Not good.
  • Even if the shutdowns were in the correct order, we have no control over when layer 2 switches will let go of the MAC address associated with the old master's interface and issue a new ARP request for the VIP to learn the MAC of the new one. When the new master comes up it sends the aforementioned gratuitous ARP reply, but there's no guarantee as to when the switch will process that reply and put it into its ARP table.

As mentioned previously, depending on your equipment and level of network traffic, you may experience one or the other as the bigger issue, but we need to come up with something that resolves both problems. Regular maintenance of a server should never result in downtime.

The networking service is obviously fundamental: it's ordered early in the system's startup and is one of the last things to stop during shutdown. So we need a way to ensure that the ucarp executable receives SIGTERM before nginx (or httpd or whatever) is stopped, and well before networking is.

To resolve this ordering of systemd units, we'll have to craft our own unit and position it earlier in the shutdown sequence than our web server. There are obviously many ways to do this. As an example, I put the following at /root/srv/ucarp.service on each of my nodes:

[Unit]
Description=ucarp Virtual IP Address Management
After=nginx.service
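# Ordering this unit after nginx at boot means systemd stops it *before* nginx at shutdown,
# so ucarp releases the VIP while nginx is still around to answer any stragglers.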

[Service]
Type=simple
ExecStart=/usr/local/bin/start-ucarp
RemainAfterExit=true
ExecStop=/usr/bin/killall -SIGTERM ucarp
ExecStop=/bin/sleep 10
TimeoutStopSec=30
StandardOutput=journal

[Install]
WantedBy=multi-user.target

The above should be fairly obvious but essentially:

  • After the nginx.service unit has started, systemd executes a bash script at /usr/local/bin/start-ucarp, not caring that the script itself exits (RemainAfterExit=true) as long as it exits successfully.
  • For shutdown (ExecStop) it first issues a killall -SIGTERM ucarp (to signal a graceful shutdown and VIP transition)
  • systemd then pauses for 10 seconds to give the VIP time to move, and will resort to a hard kill of our ucarp processes only if the entire shutdown takes longer than 30 seconds (TimeoutStopSec).

Then on each system I symlink the new service into /etc/systemd/system/multi-user.target.wants so that it loads on boot. Now that we have a system service, we actually have to write the script referenced in the unit so that it spawns ucarp with the appropriate configuration:

#!/bin/bash
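# NOTE: -s must be this node's own "real" address (10.10.10.79 on node01, 10.10.10.80 on node02)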

/usr/sbin/ucarp -i enp0s3 -s 10.10.10.79 -B -z -v 1 -p rascaldev -a 10.10.10.100 -u /usr/local/bin/vip-up -z -d /usr/local/bin/vip-down

Please note that the value of the -s parameter will have to be changed to the “real” IP address for the node’s network interface. The above (from left to right):

  • Establishes the source interface and IP address to use for cluster communication (-i and -s respectively)
  • -B daemonizes ucarp, so this script can spawn multiple VIP groups to manage and still exit (detailed in the next section)
  • -v establishes the VIP identification number. This is how multiple VIP groups are managed. In production setups you would change this and the VIP listed in -a to the appropriate values so that ucarp could differentiate the two groups.
  • -p sets the shared secret (in our case “rascaldev”) to authenticate the messages as being from trusted sources.
  • -u points to the "up" script run when the daemon transitions out of the BACKUP state and into the MASTER state.
  • -d points to the corresponding "down" script run when the daemon leaves the MASTER state, and -z additionally tells ucarp to run that down script when the daemon itself exits (so a graceful shutdown releases the VIP).

The contents of the /usr/local/bin/vip-up script are as follows:

#!/bin/sh
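# ucarp invokes this with the interface name as $1 and the VIP as $2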

/sbin/ip addr add ${2}/24 dev $1

This is pretty straightforward: it simply adds the IP given by ucarp as the second script argument to the device specified by the first argument. In our case there's only a single VIP, the one we're giving ucarp at invocation, so all we have to do is add the IP. If you have additional IPs in this VIP group, though, you should also add an arping -B -S broadcast for each of them to your vip-up script, since ucarp only sends the gratuitous ARP for the IP you specified with -a and you'll have to announce the others manually.
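For illustration, a hypothetical vip-up handling one supplemental address might look like the following. The 10.10.10.101 address is invented for this sketch, and the arping flags mirror the invocation mentioned above (adjust them for whichever arping package you actually have installed):

#!/bin/sh

# main VIP handed to us by ucarp ($1 = interface, $2 = VIP)
/sbin/ip addr add ${2}/24 dev $1

# supplemental address in the same group; ucarp won't gratuitous-ARP this one for us
/sbin/ip addr add 10.10.10.101/24 dev $1
arping -c 3 -i $1 -B -S 10.10.10.101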

The /usr/local/bin/vip-down script contains:

#!/bin/sh
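# ucarp invokes this with the interface name as $1 and the VIP as $2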

/sbin/arptables -I INPUT -d $2 -j DROP

/bin/sleep 2 && /sbin/ip addr del $2/24 dev $1

/sbin/arptables -D INPUT 1

Which is less obvious so let’s break it down:

  • For traffic still destined for the old master to reach its nginx, we need to keep the IP on the network interface a little longer. However, if an ARP request for the VIP happens to come in, the old master might answer it and potentially undo the effect of the gratuitous ARP reply sent by the new master.
  • The arptables command is used to drop any incoming ARP requests for the VIP ($2). Blocking only ARP requests for this VIP on the old master lets us keep servicing TCP/UDP traffic directed at the VIP until all neighbors have the new master's MAC in their ARP tables.
  • We pause for two seconds, continuing to service any last requests that come in on the departing VIP. You may have to play with this number: if you determine that there is any disruption in service during cutover, no matter how small, that's a sign the waiting period isn't long enough. Remember too that any supplemental IP addresses are only announced by the arping in the new master's vip-up script, so they may take a moment longer to become reachable.
  • After the wait is over, we use arptables one last time to delete the rule we just inserted. You'll notice that I delete the rule by index; that's just my habit. The arptables command follows a syntax similar to the iptables command you may already be used to.

So there you have it. Now when one node goes down gracefully, there should be zero disruption in application availability as you browse from page to page.

Managing Multiple VIP Groups

The above vip-up and vip-down scripts should work fine no matter how many virtual IP addresses you need to migrate to an available node. Occasionally, though, you may need to manage VIPs as distinct groups.

As one example, let's say you have 100 VIPs and two HA'd nginx load balancer nodes. If you manage them all as a single group, one nginx instance handles 100% of the traffic while the other node sits completely idle. This is known as an "active-passive" configuration. If you have the money to spare this is usually preferable, since you can be fairly certain that should the passive node be needed, it has the capacity to take on the whole workload.

However, you may not have the money and need to make sure that second load balancer is actually earning its keep. To solve this, you would split those 100 VIPs into two separate VIP groups in ucarp such that, in normal operation, each node owns one group, while should a node go down, the VIPs it's managing migrate to the other node. This is known as an active-active configuration.

Additionally, as a compromise, you might run three load balancers: two active nodes each distributing the load for a different set of 50 VIPs, with the third sitting idle in the back, ready to have VIPs from either one transferred to it. This is called an active-active-passive configuration. It's a bit of a gamble though, as the idle node needs the capacity to handle the traffic of all 100 VIPs should worst come to worst (while each active node only needs to handle its normal 50 and can be sized accordingly).

So how do these scenarios play out in the ucarp configuration? Well, in our final setup (using a systemd service instead of the networking scripts) it's relatively simple: we just modify the /usr/local/bin/start-ucarp bash script to launch multiple instances of ucarp with separate VIP ids and different vip-up and vip-down scripts. For instance:

/usr/sbin/ucarp -i enp0s3 -s 10.10.10.79 -B -z -v 1 -p rascaldev -a 10.10.10.100 -u /usr/local/bin/vip-up -z -d /usr/local/bin/vip-down

/usr/sbin/ucarp -i enp0s3 -s 10.10.10.79 -B -z -v 2 -p rascaldev -a 10.10.10.110 -u /usr/local/bin/other-vip-up -z -d /usr/local/bin/other-vip-down

In the above, the only substantive differences between the two are the VIP id given to each group (-v 1 and -v 2 respectively), the main VIP itself, and the scripts. Since the up/down scripts are called with the main VIP in $2, you could theoretically use the same scripts for both ucarp instances, but in my experience there tend to be additional steps needed for one VIP pool that aren't relevant to the other, and stuffing unneeded steps and checks into an up/down script slows down its execution at the precise moment you want it to run as quickly as possible.

Stability Notes

About that extra interface…

I mentioned before that you might not want to run the high availability communication over the same physical interface as your production traffic. The reason is that if a node misses three advertisements in a row, its cluster members consider it MIA, and if that node is a master it will be replaced by a different master. This can be a problem on a high traffic interface, where advertisements can go missing due to some combination of buffer overruns, dropped packets, or corruption on the wire (garbling the HMAC, for instance).

For this reason, if your application pushes a considerable amount of data over the production interface, it's best to give the cluster communication an interface (and ideally a switch) of its own, with only the other cluster nodes connected. This reduces latency and ensures the master can keep talking to the backup nodes to prove that, yes indeed, Johnny-5 is still alive.
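As a rough sketch of what that looks like with our systemd-based setup, only the ucarp invocation needs to change; the enp0s8 interface and the 192.168.100.0/24 cluster subnet below are invented for this example:

/usr/sbin/ucarp -i enp0s8 -s 192.168.100.79 -B -z -v 1 -p rascaldev -a 10.10.10.100 -u /usr/local/bin/vip-up -z -d /usr/local/bin/vip-down

One caveat: ucarp passes the -i interface to the up/down scripts as $1, so in a split setup like this the scripts should name the production interface (enp0s3) explicitly rather than relying on $1.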

Alternatives to running ip addr add

Earlier in our vip-up and vip-down scripts we were manually invoking ip addr add to add the IP address to our interface. You may run into issues, though, where you need to configure software such as nginx to listen only on a particular VIP and port combination. If the system doesn’t see this IP address on any interface it will fail to bind a listening socket.

To work around this, you may explore the option of keeping all VIPs permanently configured on the production interface and using arptables to drop any incoming (or outgoing) ARP traffic for each VIP. When a particular node becomes master, its vip-up script can merely remove those arptables rules (along with doing the arping broadcast you were likely already doing for supplemental IPs). This approach has potential issues, though. For instance, you would need to ensure the IPs are only added after the arptables rules are in place, or you run the risk of the system broadcasting its MAC while another server is actually in production. It's because of considerations like this that I omitted that approach here.

As with any advice you get from the internet though, trust but verify.

On the subject of long running downloads…

As alluded to before, this HA solution only provides seamless availability between requests. As in: you're on one page and click through to the next, which is served by a different load balancer without you being aware. Alternatively, as in the case of something like a React app, your subsequent asynchronous requests to the backend are handled by a different load balancer than the one that served the initial page.

In either case, the situation that falls through the cracks is long running downloads. Currently this problem still appears to be in the research phase, and there is no production grade tool for transferring established TCP connections to another system. The exception is live migration of VMs, which works by copying the guest's memory (including its kernel state) over the wire, something not normally available outside a hypervisor. There are relatively new features in the Linux kernel, such as TCP_REPAIR, that purport to allow restoring TCP connections on another node, but I have no experience with them and they aren't widely used for this.

If a server fails unexpectedly, long running downloads will fail. That's just a given; the thing you were talking to just ain't there anymore. For graceful maintenance, though, you can architect around the issue. Route all potentially long-running downloads through an HTTP redirection layer (a redirect, not a load-balanced proxy), and when a download server needs to be restarted, temporarily remove it from the redirection layer, wait for in-flight downloads to finish, perform the reboot, and re-join it to the pool. All of which can be automated.
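To make that concrete, the redirection layer can be as simple as an nginx location that 302s clients to a specific download host. A minimal sketch (the dl01.example.com hostname and /downloads/ path are invented for this example); pulling a host out of rotation is then just a matter of changing or templating this one line:

location /downloads/ {
  return 302 http://dl01.example.com$request_uri;
}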

Further Reading