Linux Clustering - Advanced Concepts

Brad Marshall (

In the previous article in this series, we covered Beowulf systems, clusters of workstations, and MOSIX. This article focuses more on the "business" side of clustering - high availability and virtual servers. We will also discuss some of the issues that need addressing to reduce single points of failure, and some technologies that help with this.

Good examples of virtual servers are LVS (Linux Virtual Server), Squid in accelerated mode, Cisco LocalDirector, or any of the Apache front ends to Java application servers, such as mod_jk with Tomcat. These all work by having a front end server that talks to multiple back end servers.

While there is a single point of failure here (the front end server), the load balancing allows the system to handle much higher levels of requests than would be possible with a single server. Additionally, you can add and remove servers from the pool without affecting clients' ability to get service. This allows upgrades to happen piece by piece, crashed servers to be removed without downtime for the entire system, and increased demand to be met by simply adding more servers.

Linux Virtual Server ( as it currently stands is basically a layer 4 switch. This means it routes packets between the client and servers, without knowing the content of the packets it is routing. The ability for LVS to make decisions based on the content, which is called layer 7 switching, would allow session management or services based on the content.

The front end server, also known as the director, routes packets to the back end servers according to how it is set up. Currently LVS has three methods of routing: VS-NAT, VS-TUN and VS-DR.

VS-NAT works by using Network Address Translation, or NAT. Most people have had experience with NAT in the form of IP masquerading. There are other forms of NAT -- static NAT maps addresses one to one onto other addresses, whereas dynamic NAT translates each address to one taken from a pool (the pool contains fewer addresses than you wish to translate). IP masquerading is just a special form of dynamic NAT - it is a many to one translation.
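The three forms of NAT described above can be contrasted in a few lines of Python. This is only an illustrative sketch: the class names, the addresses and the starting port number are assumptions made for the example, not part of LVS or the kernel's masquerading code.

```python
class StaticNat:
    """One to one: every inside address has its own fixed outside address."""

    def __init__(self, mapping):
        self._map = dict(mapping)

    def translate(self, inside_ip):
        return self._map[inside_ip]


class DynamicNat:
    """Many to few: outside addresses are handed out from a limited pool."""

    def __init__(self, pool):
        self._free = list(pool)
        self._leases = {}

    def translate(self, inside_ip):
        if inside_ip not in self._leases:
            if not self._free:
                raise RuntimeError("address pool exhausted")
            self._leases[inside_ip] = self._free.pop(0)
        return self._leases[inside_ip]


class Masquerade:
    """Many to one: all inside hosts share one address, told apart by port."""

    def __init__(self, public_ip, first_port=61000):  # port is illustrative
        self._public_ip = public_ip
        self._next_port = first_port
        self._ports = {}

    def translate(self, inside_ip, inside_port):
        key = (inside_ip, inside_port)
        if key not in self._ports:
            self._ports[key] = self._next_port
            self._next_port += 1
        return (self._public_ip, self._ports[key])
```

The dynamic case is the only one that can refuse a translation, which is why masquerading, with its single shared address and per-connection ports, is the form most people meet first.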

The VS-NAT code is based on the IP masquerading code, as well as the port forwarding code, and works by matching the destination address and port of the incoming client request against a list of known virtual server services. A real server is chosen from the virtual server rule table, the mapping is recorded, the destination address and port are rewritten, and the packet is forwarded on to its ultimate destination. When reply packets come back, the director rewrites their source address to that of the virtual server and forwards them on. Timeouts or connection terminations simply remove the mapping from the table.
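The bookkeeping described above can be modelled roughly as follows. This is a toy sketch, not the LVS kernel code: `Packet`, `NatDirector` and its method names are hypothetical, packets are reduced to a source and destination (address, port) pair, and real servers are simply picked in turn.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass(frozen=True)
class Packet:
    src: tuple  # (ip, port)
    dst: tuple  # (ip, port)


class NatDirector:
    def __init__(self, services):
        # services: {(virtual_ip, port): [(real_ip, port), ...]}
        self._pools = {vs: cycle(reals) for vs, reals in services.items()}
        self._conns = {}  # (client, virtual service) -> chosen real server

    def inbound(self, pkt):
        """Client request: rewrite the destination to a real server."""
        if pkt.dst not in self._pools:
            return None  # not addressed to a known virtual service
        key = (pkt.src, pkt.dst)
        if key not in self._conns:
            self._conns[key] = next(self._pools[pkt.dst])
        return Packet(src=pkt.src, dst=self._conns[key])

    def outbound(self, pkt):
        """Real server reply: rewrite the source to the virtual service."""
        for (client, vs), real in self._conns.items():
            if client == pkt.dst and real == pkt.src:
                return Packet(src=vs, dst=pkt.dst)
        return None

    def close(self, client, vs):
        """Timeout or connection termination removes the mapping."""
        self._conns.pop((client, vs), None)
```

Because both directions pass through the director, the real servers need no special configuration, but the director handles every reply packet as well as every request.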

VS-TUN works using IPIP encapsulation, or tunnelling - this is a way of encapsulating IP packets inside IP packets, which allows redirection of packets to another destination. It is very similar to VS-NAT except that, instead of masquerading the packets to the real servers, it sends them to the real servers via an IP tunnel. This means the servers can be in physically distributed locations, but they must support IP encapsulation.

In this method of routing, the director accepts packets addressed to the virtual server, chooses a real server based on its scheduling algorithm, and logs the connection in its records. It then encapsulates each packet and forwards it to the chosen real server, which decapsulates the packet, handles the request, and returns the response to the client directly. Note that the real servers need to have a non-ARP device - usually lo - configured with the IP address of the virtual server, so that when the packets arrive they are treated as being destined for the real server.
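To make the encapsulation step concrete, here is a minimal sketch that wraps one IPv4 packet inside another using IP protocol number 4 (IP-in-IP). The function names and addresses are assumptions for illustration; a real director does this in the kernel, and the sketch ignores fragmentation and IP options.

```python
import socket
import struct

IPPROTO_IPIP = 4  # protocol number for IP-in-IP encapsulation
_HEADER = "!BBHHHBBH4s4s"  # 20-byte IPv4 header without options


def checksum(header: bytes) -> int:
    """Standard 16-bit ones' complement checksum over the header."""
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def ipv4_header(src, dst, proto, payload_len, ttl=64):
    fields = [0x45, 0, 20 + payload_len, 0, 0, ttl, proto, 0,
              socket.inet_aton(src), socket.inet_aton(dst)]
    fields[7] = checksum(struct.pack(_HEADER, *fields))
    return struct.pack(_HEADER, *fields)


def encapsulate(inner_packet, director_ip, real_server_ip):
    """Director side: prepend an outer header addressed to the real server."""
    outer = ipv4_header(director_ip, real_server_ip,
                        IPPROTO_IPIP, len(inner_packet))
    return outer + inner_packet


def decapsulate(outer_packet):
    """Real server side: strip the outer header to recover the original."""
    if outer_packet[9] != IPPROTO_IPIP:
        raise ValueError("not an IP-in-IP packet")
    header_len = (outer_packet[0] & 0x0F) * 4
    return outer_packet[header_len:]
```

The inner packet still carries the client's address as source and the virtual server's address as destination, which is why the real server can reply to the client directly once it has unwrapped it.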

VS-DR, or Virtual Server Direct Routing, works similarly to the others - it accepts packets for the virtual server, and then directly routes them to the selected server, which is chosen by some scheduling algorithm. Each of the servers has a non-ARP interface, perhaps lo, that has the virtual server IP address. This allows the director to change the MAC address of the packet to that of the server and retransmit it on the LAN which contains the director and the servers. The server receives the rewritten packet, sees that it is destined for a local address and processes the request, returning the result to the user directly.
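The MAC rewrite can be pictured with a short sketch operating on a raw Ethernet frame. The helper names are hypothetical, and the choice to put the director's own MAC in the source field is an assumption made for the example; the point is only that the IP packet inside the frame is left untouched.

```python
def mac_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' to its 6-byte form."""
    raw = bytes.fromhex(mac.replace(":", ""))
    if len(raw) != 6:
        raise ValueError(f"not a MAC address: {mac}")
    return raw


def direct_route(frame: bytes, director_mac: str, real_server_mac: str) -> bytes:
    """Re-address an Ethernet frame to the chosen real server.

    Ethernet header layout: destination MAC (6 bytes), source MAC (6 bytes),
    EtherType (2 bytes). Everything from byte 12 onwards is kept as-is.
    """
    return mac_bytes(real_server_mac) + mac_bytes(director_mac) + frame[12:]
```

Since only the link-layer header changes, this approach needs the director and the real servers to share a LAN segment, as the text notes.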

As mentioned previously, LVS has a few scheduling algorithms, namely round robin, weighted round robin, least connection and weighted least connection. There is much more to configure to properly set up LVS, and for more information on these and how to actually configure a server for LVS, see their website at
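The four scheduling policies named above can be sketched in a few lines. This is a simplified illustration rather than the LVS implementation: `RealServer` is a hypothetical record, weights are assumed to be positive integers, and the weighted round robin here naively repeats each server `weight` times per cycle instead of interleaving them.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class RealServer:
    name: str
    weight: int = 1   # relative capacity, assumed to be >= 1
    active: int = 0   # connections currently open to this server


def round_robin(servers):
    """Hand out servers in turn, ignoring load and capacity."""
    return cycle(servers)


def weighted_round_robin(servers):
    """Each server appears `weight` times per cycle (naive expansion)."""
    return cycle([s for s in servers for _ in range(s.weight)])


def least_connection(servers):
    """Pick the server with the fewest open connections."""
    return min(servers, key=lambda s: s.active)


def weighted_least_connection(servers):
    """Pick the server with the fewest connections relative to its weight."""
    return min(servers, key=lambda s: s.active / s.weight)
```

With a pool of server "a" (weight 3, 6 connections) and server "b" (weight 1, 3 connections), least connection picks "b", while weighted least connection picks "a" because 6/3 is lower than 3/1.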

On the other side of the clustering coin we have high availability. Where virtual servers work by balancing the load over multiple servers, high availability works by having redundant servers ready to take over when the active server crashes. The aim is to remove single points of failure by either making parts of each computer redundant, or making the whole computer redundant. How far you go in making a server redundant will depend on how much downtime costs you - there are many things you can do to increase reliability, but they all come at a cost.

There are several types of failover schemes in the existing HA market: idle standby, rotating standby, simple failover and mutual takeover. Depending on the vendor they might have different names, but the concepts are fairly similar.

Idle standby works by, as the name suggests, having an idle standby. This means there are one or more servers standing by doing nothing - a standby can be the same as the existing working server, or it can be of lesser spec, as long as it is sufficient to ensure operation at some level. Idle servers are given a priority, and the server with the highest priority takes over when the active server fails.

Rotating standby is similar to idle standby, but the idle servers are not given a priority - it is a simple FIFO (first in, first out) type replacement strategy. This does mean that each server needs to be about equal, as there is no ranking.
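The difference between the two standby schemes comes down to how the replacement is picked, which a small sketch can show. The function names and server names below are made up for illustration, and a higher number is assumed to mean a higher priority.

```python
from collections import deque


def idle_standby_takeover(standbys):
    """standbys: list of (priority, name). The highest priority takes over."""
    chosen = max(standbys, key=lambda entry: entry[0])
    standbys.remove(chosen)
    return chosen[1]


def rotating_standby_takeover(queue: deque):
    """No ranking: whichever standby has waited longest takes over (FIFO)."""
    return queue.popleft()


def rejoin(queue: deque, repaired_server):
    """A repaired server goes to the back of the rotating queue."""
    queue.append(repaired_server)
```

In the rotating case a repaired server does not reclaim its old role, which is why every machine in the rotation has to be roughly equal.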

Simple failover is where the backup server runs some non-critical application, and takes over the critical one when the primary server fails. The backup server does not have to be able to cope with both jobs - it is sufficient for it to just take over the work of the critical application(s) and drop the non-critical ones for the period of the downtime. When the primary server comes back, the backup simply resumes the non-critical applications.

Mutual takeover is where two (or more) servers are configured such that each can take over the others' jobs while still running its own application - if the servers are not powerful enough to run both applications at full speed, the drop in performance must be acceptable while fixes are made.
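As a rough illustration of how work moves in these last two schemes, the sketch below models servers as lists of job names. Everything here is hypothetical: the function names, the job names, and the choice to spread a failed server's jobs evenly over the survivors.

```python
def simple_failover(critical_jobs, backup_jobs):
    """Backup drops its non-critical jobs and runs the critical ones instead."""
    return {"running": list(critical_jobs), "suspended": list(backup_jobs)}


def mutual_takeover(assignments, failed):
    """assignments: {server: [jobs]}. Survivors keep their own jobs and
    share out those of the failed server between them."""
    survivors = {s: list(jobs) for s, jobs in assignments.items() if s != failed}
    if not survivors:
        raise RuntimeError("no surviving server to take over")
    names = sorted(survivors)
    for i, job in enumerate(assignments[failed]):
        survivors[names[i % len(names)]].append(job)
    return survivors
```

In the mutual case every survivor ends up with more work than before, which is where the acceptable drop in performance mentioned above comes from.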

Another simple form of clustering is round robin DNS. This is only applicable for simple connection services, and works by having multiple hosts pointed to by the DNS entry of the service. The problem with this is that all hosts are considered equal - there is no way of distributing requests according to each host's load or capacity - and that some clients will cache the response to the DNS request, thus defeating the clustering. However, for all its faults, using round robin DNS is a cheap way of getting some form of clustering.
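Both the mechanism and the caching problem can be seen in a toy model. This is not a DNS implementation: `RoundRobinZone` and `CachingClient` are hypothetical classes, the host name and addresses are examples, and the client is assumed to cache an answer forever.

```python
from collections import deque


class RoundRobinZone:
    """Returns the address list for a name, rotated by one on every query."""

    def __init__(self, records):
        self._records = {name: deque(addrs) for name, addrs in records.items()}

    def resolve(self, name):
        addrs = self._records[name]
        answer = list(addrs)
        addrs.rotate(-1)  # the next query sees a different first address
        return answer


class CachingClient:
    """Uses the first address of an answer and remembers it indefinitely."""

    def __init__(self, zone):
        self._zone = zone
        self._cache = {}

    def connect_address(self, name):
        if name not in self._cache:
            self._cache[name] = self._zone.resolve(name)[0]
        return self._cache[name]
```

A client that asks the zone directly is spread across the hosts, while a caching client keeps hitting the one host it first learned about, even if that host has since gone down.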

One important thing to remember is that clustering, regardless of what type, is only as good as its implementation. This means it will still require monitoring (and notification), it will still need backups, and it will still need regular maintenance. There are stories about clusters that had all the redundancy in the world but stopped working. This happens because the system slowly decays over time, as hardware will always fail at some stage, and no-one notices because there is no monitoring (and notification) of the system.
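Even a very small check goes a long way here. The sketch below probes each node with a TCP connection and calls a notification hook for every node that does not answer; the function names, the node list and the use of `print` as the default notifier are all assumptions for the example, and a real setup would run this on a schedule and send mail or a page instead.

```python
import socket


def tcp_alive(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_cluster(nodes, probe=tcp_alive, notify=print):
    """nodes: list of (host, port). Notify about, and return, the failed ones."""
    failed = [node for node in nodes if not probe(*node)]
    for host, port in failed:
        notify(f"ALERT: {host}:{port} is not responding")
    return failed
```

The important part is the notification: a standby that has quietly died is only discovered in time if something tells a person about it.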

So as you have seen there are many types of clustering and many ways each type can be used. It is an important technology to understand, as it will let you provide services for load which you never could have coped with before. However, it is not a silver bullet, and has many tricks and traps of its own you need to understand before you can fully utilise it.