Sunday, March 06, 2016

Load balancers - improving site reliability

You want your online service to be reliable - a service that isn't working is of no use to your users or customers. Yet the components you're using - hardware and software - are themselves unreliable. How do you arrange things so that the overall service is more reliable than the individual components it's made up of?

These are logically two distinct problems. First, given a set of N systems able to provide a service, how do you maintain service if one or more of them fails? Second, given a single service endpoint, how do you make sure it's always available?

The usual solution here involves some form of load balancer: a software or hardware device that takes incoming requests and chooses a system to handle each one, considering as candidates only those systems that are actually working.

(The distinction between hardware and software here is more between prepackaged appliances and DIY. The point about the "hardware" solutions is that you buy a thing, and treat it as a black box with little access to its internals.)

For hardware appliances, most people have heard of F5. Other competitors in this space are Kemp and A10. All are relatively (sometimes eye-wateringly) expensive. Not necessarily bad value, mind, depending on your needs. At the more affordable end of the spectrum sits loadbalancer.org.

Evaluating these recently, one thing I've noticed is a general tendency to move upmarket. They're no longer load balancers; there's a new term here - ADC, or Application Delivery Controller. They may do SSL termination and add features such as simple firewalling, Web Application Firewalls, or Intrusion Detection and Threat Management. While this is clearly meant to add differentiation and keep ahead of the cannibalization of the market, much of the additional functionality simply isn't relevant for me.

Then there is a whole range of open source software solutions that do load balancing. Often these are also reverse proxies.

HAProxy is well known, and very powerful and flexible. It's not just for web traffic; it's very good at handling generic TCP. It's packed with features; my only criticism is that its configuration is rather monolithic.
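
To give a flavour, here's a minimal haproxy.cfg sketch that spreads web traffic across two back end servers, with health checks so that a dead server drops out of the rotation (the names and addresses are invented for illustration):

    defaults
        mode http
        timeout connect 5s
        timeout client  30s
        timeout server  30s

    frontend www
        bind *:80
        default_backend app

    backend app
        balance roundrobin
        # 'check' enables health checks, so failed servers are skipped
        server app1 192.0.2.11:8080 check
        server app2 192.0.2.12:8080 check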

You might think of Nginx as a web server, but it's also an excellent reverse proxy and load balancer. It doesn't quite have the range of functionality of HAProxy, but most people don't need anything that powerful anyway. One thing I like about Nginx is directory-based configuration - drop a configuration fragment into a directory, signal nginx, and you're off. If you're managing a lot of sites behind it, such a configuration mode is a godsend.
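
As a sketch of what that looks like (hostnames and addresses invented), a fragment dropped into a directory such as /etc/nginx/conf.d/ might be no more than this, picked up with an nginx -s reload:

    # one site, one file: proxy example.com to a pool of two back ends
    upstream app {
        server 192.0.2.11:8080;
        server 192.0.2.12:8080;
    }

    server {
        listen 80;
        server_name example.com;

        location / {
            proxy_pass http://app;
        }
    }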

There's an interesting approach used in SNI Proxy. It assumes an incoming HTTPS connection carries an SNI name - the Server Name Indication field in the TLS handshake - picks that out, and uses it to decide where to forward the TCP session. Because routing is done on the SNI name alone, you don't have to put certificates on the proxy host, or get it to decrypt (and possibly re-encrypt) anything.
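
A sniproxy configuration routing on the SNI name might look something like this sketch (hostnames and addresses invented; check the sniproxy documentation for the exact syntax):

    # forward TLS sessions by SNI name, without terminating TLS
    listener 0.0.0.0 443 {
        proto tls
        table https_hosts
    }

    table https_hosts {
        www.example.com 192.0.2.11
        app.example.com 192.0.2.12
    }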

Offering simpler configuration are Pound and Pen. Neither is very keen on virtual hosting configurations. If all your back end servers are the same and you do all the virtual hosting there, then that's fine; but if you need to route to different sets of back end servers depending on the incoming request, they aren't a good choice.
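
To show what "simpler" means, a Pound configuration is little more than a listener and a list of back ends. A rough sketch (addresses invented, and from memory - check the Pound documentation):

    ListenHTTP
        Address 0.0.0.0
        Port    80
    End

    Service
        BackEnd
            Address 192.0.2.11
            Port    8080
        End
        BackEnd
            Address 192.0.2.12
            Port    8080
        End
    End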

For more dynamic configurations there's vulcand, where you put all your configuration into Etcd. If you're into microservices and containers then it's definitely worth a look.
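
The style, as a sketch (identifiers and addresses invented, following vulcand's documented key layout), is to write backends and frontends as etcd keys, and vulcand reconfigures itself on the fly:

    # register a backend, one of its servers, and a frontend routed to it
    etcdctl set /vulcand/backends/b1/backend '{"Type": "http"}'
    etcdctl set /vulcand/backends/b1/servers/srv1 '{"URL": "http://192.0.2.11:8080"}'
    etcdctl set /vulcand/frontends/f1/frontend \
        '{"Type": "http", "BackendId": "b1", "Route": "Path(`/`)"}'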

All the above load balancers assume that they're reliable (or at least stable) compared to the back end services they're proxying. So they give you protection against application or hardware failure, and allow you to manage replacement, upgrades, and general deployment tasks without affecting users. The operational convenience of being able to manage an application independently of its user-facing endpoint can be a huge win.

Achieving availability of the endpoint that customers connect to needs a little extra work. What's to stop the load balancer itself failing?

In terms of the application failing, that should be less of a concern. Compared to a fully-fledged business application, the proxy is fairly simple and usually stateless, so there's less to go wrong and it can be restarted automatically and pretty quickly if and when it fails.
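
If the proxy runs under a supervisor such as systemd, for example, automatic restart is a couple of lines in the unit file (a sketch; the binary path is invented):

    [Service]
    ExecStart=/usr/local/sbin/proxy
    # restart the proxy within a couple of seconds if it ever dies
    Restart=always
    RestartSec=2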

But what if the underlying system goes away? That's what you need to protect against. And what you're really doing here is trying to ensure that the IP address associated with the service is always live. If the host holding it goes away, bring the address up someplace else and carry on.

Ignoring routing tricks and things like VRRP and anycast, some solutions here are:

UCARP is a userland implementation of the Common Address Redundancy Protocol (CARP). Basically, hosts in a group monitor each other. If the host holding the address disappears, another host in the group will bring up a virtual interface with the required IP address. The bringup/teardown is delegated to scripts, allowing you to perform any other steps you might need to as part of the failover.
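
As an example of how that looks in practice (interface, addresses, and password all invented), you run something like this on each host in the group, with scripts to add and remove the shared address:

    # each host bids for virtual address 192.0.2.100 in group (vhid) 1
    ucarp -i eth0 -s 192.0.2.11 -v 1 -p secret -a 192.0.2.100 \
          -u /etc/ucarp/vip-up.sh -d /etc/ucarp/vip-down.sh

    #!/bin/sh
    # /etc/ucarp/vip-up.sh - bring up the shared address on becoming master
    ip addr add 192.0.2.100/24 dev eth0

    #!/bin/sh
    # /etc/ucarp/vip-down.sh - release it on losing mastership
    ip addr del 192.0.2.100/24 dev eth0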

Wackamole, which uses the Spread toolkit, is another implementation of the same idea. It's getting a bit old now and hasn't seen any work for a while.

A newer variation that might be seen as the logical successor to wackamole is vippy, which is built on Node. The downside here is that Node is a moving target, so vippy won't build as is on current versions of Node, and I had trouble building it at all.

As you can see, this is a pretty large subject, and I've probably only scratched the surface. If there are things I've missed, especially if they're relevant to illumos, let me know.

1 comment:

huevo5050 said...

I miss XR,
take a look at: https://crossroads.e-tunity.com/