How to handle server failure in an n-tier architecture?
Posted by andy on Server Fault
Published on 2012-09-24T16:14:52Z
Tags: high-availability, infrastructure
Imagine I have an n-tier architecture in an auto-scaled cloud environment with, say:
- a load balancer in a failover pair
- reverse proxy tier
- web app tier
- db tier
Each tier needs to connect to the instances in the tier below.
What are the standard ways of connecting tiers so that they are resilient to the failure of nodes in each tier? That is, how does each tier discover the IP addresses of the nodes in the tier below?
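One common answer to the discovery question is a service registry: each node periodically announces itself (a heartbeat), and an entry that stops being renewed expires after a TTL, so the tier above only ever sees nodes that were recently alive. Below is a minimal in-memory sketch, assuming a single registry process. The `Registry` class, its methods, the TTL value, and the addresses are all hypothetical, and a real registry would itself need to be replicated so that it doesn't become a new single point of failure.

```python
import time
from typing import Callable, Dict, List


class Registry:
    """Hypothetical in-memory registry: nodes heartbeat, entries expire after `ttl` seconds."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl
        self.clock = clock  # injectable so the example can use a fake clock
        # tier name -> {node address: time of its last heartbeat}
        self._seen: Dict[str, Dict[str, float]] = {}

    def heartbeat(self, tier: str, address: str) -> None:
        # Registering a new node and renewing an existing one are the same operation.
        self._seen.setdefault(tier, {})[address] = self.clock()

    def live_nodes(self, tier: str) -> List[str]:
        # Only nodes heard from within the last `ttl` seconds count as live.
        now = self.clock()
        return sorted(addr for addr, seen in self._seen.get(tier, {}).items()
                      if now - seen <= self.ttl)


# Usage with a fake clock: 10.0.1.11 stops heartbeating and drops out.
now = [0.0]
reg = Registry(ttl=10, clock=lambda: now[0])
reg.heartbeat("web", "10.0.1.11:8080")
reg.heartbeat("web", "10.0.1.12:8080")
now[0] = 8.0
reg.heartbeat("web", "10.0.1.12:8080")
now[0] = 15.0
print(reg.live_nodes("web"))  # ['10.0.1.12:8080']
```

With this approach a new web app node becomes visible to the tier above as soon as it sends its first heartbeat, and a dead one disappears within one TTL.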
For example, if all reverse proxies should route traffic to all web app nodes, how could they be set up so that they don't send traffic to dead web app nodes, and so that they start sending traffic to new web app nodes as soon as those come online?
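Keeping a proxy away from dead nodes usually comes down to health checks. The proxy probes each backend on a timer and routes only to the backends that passed the latest check. New nodes join the pool but receive traffic only once a probe succeeds. The rough sketch below assumes round-robin balancing. `BackendPool` and the `probe` callable are hypothetical stand-ins for something like an HTTP request to a health endpoint.

```python
import itertools
from typing import Callable, Dict, List


class BackendPool:
    """Hypothetical round-robin pool that skips backends failing their latest health check."""

    def __init__(self, probe: Callable[[str], bool]) -> None:
        self.probe = probe  # stand-in for e.g. an HTTP GET to a health endpoint
        self.healthy: Dict[str, bool] = {}
        self._counter = itertools.count()

    def add(self, backend: str) -> None:
        # New nodes join marked unhealthy and get traffic only after a probe passes.
        self.healthy.setdefault(backend, False)

    def run_checks(self) -> None:
        # A real proxy would call this on a timer, every few seconds.
        for backend in self.healthy:
            self.healthy[backend] = self.probe(backend)

    def pick(self) -> str:
        live: List[str] = sorted(b for b, ok in self.healthy.items() if ok)
        if not live:
            raise RuntimeError("no healthy backends")
        return live[next(self._counter) % len(live)]


down = {"10.0.1.12:8080"}  # pretend this node has crashed
pool = BackendPool(probe=lambda b: b not in down)
for b in ["10.0.1.11:8080", "10.0.1.12:8080", "10.0.1.13:8080"]:
    pool.add(b)
pool.run_checks()
print([pool.pick() for _ in range(4)])  # alternates between .11 and .13
```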
- I could run an agent that updates the configs on all the nodes, but that seems inefficient.
- I could put an LB pair between each pair of adjacent tiers, so that the tier above only needs to connect to the load balancers. But how do I handle the LBs themselves dying? This just seems to shift the problem from tier A needing to know the IPs of all nodes in tier B to all nodes in tier A needing to know the IPs of all the LBs between tiers A and B.
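The agent approach doesn't have to be expensive if the agent rewrites and reloads the proxy config only when the set of live nodes actually changes. The sketch below renders an nginx `upstream` block from a membership list, for example one produced by health checks or a registry. `ConfigAgent`, `render_upstream`, and the upstream name `webapp` are hypothetical. The write-to-disk and reload steps are left as a comment.

```python
from typing import Iterable, Optional


def render_upstream(name: str, backends: Iterable[str]) -> str:
    """Render an nginx `upstream` block listing the given backends."""
    servers = sorted(set(backends))  # sorted so the same membership gives the same text
    if not servers:
        # nginx refuses to load an upstream block that contains no servers.
        raise ValueError("need at least one backend")
    lines = [f"upstream {name} {{"]
    lines += [f"    server {s};" for s in servers]
    lines.append("}")
    return "\n".join(lines) + "\n"


class ConfigAgent:
    """Hypothetical agent that rewrites the proxy config only when membership changes."""

    def __init__(self, upstream_name: str = "webapp") -> None:
        self.upstream_name = upstream_name
        self._last: Optional[str] = None

    def update(self, backends: Iterable[str]) -> Optional[str]:
        text = render_upstream(self.upstream_name, backends)
        if text == self._last:
            return None  # unchanged: no write and no reload needed
        self._last = text
        # A real agent would write `text` to the config file here, then run `nginx -s reload`.
        return text


agent = ConfigAgent()
print(agent.update(["10.0.1.12:8080", "10.0.1.11:8080"]))
print(agent.update(["10.0.1.11:8080", "10.0.1.12:8080"]))  # None: same set of nodes
```

Because the output is sorted and compared, reordering the same nodes never triggers a reload. Only a real membership change does.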
Some applications can implement retry logic when a node they contact in the tier below doesn't respond, but is there any way for middleware to direct traffic only to live nodes in the following tier?
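Retry logic in the caller can be as simple as trying the nodes of the tier below in random order and moving on when one refuses the connection or times out. The sketch assumes the caller already has a node list and a `request` function that raises `ConnectionError` or `TimeoutError` on failure. `call_with_failover` and `fake_request` are hypothetical.

```python
import random
from typing import Callable, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(nodes: Sequence[str], request: Callable[[str], T],
                       rng: Optional[random.Random] = None) -> T:
    """Try each node once, in random order, and return the first successful response."""
    if not nodes:
        raise ValueError("no nodes configured")
    order = list(nodes)
    # Shuffling spreads load and stops every caller from hitting the same dead node first.
    (rng or random.Random()).shuffle(order)
    errors: List[str] = []
    for node in order:
        try:
            return request(node)
        except (ConnectionError, TimeoutError) as exc:
            errors.append(f"{node}: {exc}")  # treat the node as dead for this call
    raise ConnectionError("all nodes failed: " + "; ".join(errors))


def fake_request(node: str) -> str:
    # Hypothetical stand-in for a real network call; 10.0.1.12 is "down".
    if node == "10.0.1.12:8080":
        raise ConnectionError("connection refused")
    return f"200 OK from {node}"


print(call_with_failover(["10.0.1.12:8080", "10.0.1.13:8080"], fake_request))
```

Retrying only on connection-level failures is deliberate. Blindly retrying a request that a node may have partially processed can duplicate non-idempotent operations.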
If I were hosting on AWS, I could use an ELB between tiers, but I want to know how I could achieve the same functionality myself.
I've read (briefly) about Heartbeat and keepalived. Are these relevant here? What are the virtual IPs they talk about, and how are they managed? Are there still single points of failure when using them?
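keepalived implements VRRP. A virtual IP is an address that isn't permanently tied to one machine; the nodes sharing it elect a master that answers for it. The master keeps sending advertisements, and if the backups stop hearing them, the highest-priority surviving backup takes the address over. The toy simulation below models only that election rule. It is a simplification: real VRRP adds a priority-dependent skew to the three-advertisement timeout and moves the address using gratuitous ARP. `Router` and `vip_owner` are hypothetical names.

```python
import ipaddress
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Router:
    address: str        # the node's own IP, used only as a tie-breaker here
    priority: int       # VRRP uses 1-254 for ordinary nodes; higher wins
    last_advert: float  # when this node was last heard advertising


def vip_owner(routers: List[Router], now: float, advert_int: float = 1.0) -> Optional[str]:
    """Return the address of the node that should hold the virtual IP, or None."""
    # Simplified failure detection: silent for 3 advertisement intervals means down.
    alive = [r for r in routers if now - r.last_advert < 3 * advert_int]
    if not alive:
        return None
    # Highest priority wins; ties go to the higher IP address, as in VRRP.
    best = max(alive, key=lambda r: (r.priority, ipaddress.ip_address(r.address)))
    return best.address


lbs = [Router("10.0.0.2", 150, last_advert=10.0), Router("10.0.0.3", 100, last_advert=10.0)]
print(vip_owner(lbs, now=10.5))  # 10.0.0.2
lbs[0].last_advert = 7.0         # the primary load balancer goes silent
print(vip_owner(lbs, now=10.5))  # 10.0.0.3
```

The sketch assumes every node sees the same advertisements. In practice, if the link between the nodes fails while both stay up, each can conclude that it should be master (split brain), so the network path between them is where a single point of failure can still hide.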
© Server Fault or respective owner