Quite often I am presented with the challenge of deploying Windows COTS applications onto the AWS platform with a requirement to take advantage of cloud native patterns like auto-scaling and auto-healing. In this blog post I’m going to describe how I’ve used Auto Scaling Groups, Load Balancers, CloudWatch Alarms and Route 53 to provide a self-healing implementation for a heritage COTS Windows application. This pattern was also extended to use lifecycle hooks to support a Blue/Green deployment with zero downtime.
This pattern works quite nicely for heritage applications which suit an Active/Passive configuration, i.e. the Passive node does not have full application capability or is read only. (I like to use the term “heritage” as it seems to have a softer ring than “legacy”.) When the primary node is unavailable during failure or blue/green upgrade, clients are transparently redirected to the passive node. During an actual failure, the outage lasts less than 2 minutes while automatic failover completes.
The diagram below summarises a number of the key components used in the design. In essence we have two Auto Scaling groups, each with a minimum and maximum of 1, spread across two availability zones. We have a private Route 53 hosted zone (int.aws) to host custom CNAME records which typically point to load balancers. The load balancer is cross-zone; in the example below I’m using a classic load balancer as I’m not doing SSL offload, however it could just as easily be an application load balancer. Route 53 and custom CloudWatch alarms have been utilised to reduce the time required to fail over between nodes and to support separate configuration of primary and secondary nodes.
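To make the "minimum and maximum of 1" idea concrete, here is a minimal sketch of the parameters for the two Auto Scaling groups, shaped as keyword arguments for boto3's `create_auto_scaling_group`. The group names, subnet IDs, ELB names, launch configuration names, the `ELB` health check type and the grace period are all placeholders I have assumed for illustration; nothing is sent to AWS.

```python
def asg_definition(name, subnet_id, elb_name, launch_config):
    """Build keyword arguments for autoscaling.create_auto_scaling_group.

    Min, max and desired capacity are pinned to 1 so that the group never
    scales out; it only replaces its single instance when it fails.
    """
    return {
        "AutoScalingGroupName": name,
        "LaunchConfigurationName": launch_config,
        "MinSize": 1,
        "MaxSize": 1,
        "DesiredCapacity": 1,
        "VPCZoneIdentifier": subnet_id,   # one subnet, so one availability zone
        "LoadBalancerNames": [elb_name],  # classic load balancer in front
        "HealthCheckType": "ELB",         # assumption: replace on ELB health
        "HealthCheckGracePeriod": 300,    # assumption: boot time allowance
    }


# Hypothetical names: one group per availability zone.
groups = [
    asg_definition("app-primary", "subnet-aaaa1111", "app-primary-elb", "app-primary-lc"),
    asg_definition("app-secondary", "subnet-bbbb2222", "app-secondary-elb", "app-secondary-lc"),
]

for group in groups:
    print(group["AutoScalingGroupName"], group["MinSize"], group["MaxSize"])
```

Keeping each server in its own group is what allows the primary and secondary to carry different launch configurations.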
A number of other assumptions:
- A CloudWatch alarm is set to detect when the number of healthy nodes behind the Auto Scaling group (ASG) ELB is less than 1. The current minimum polling interval is 60 seconds.
- Independent server components – can support different configurations, ie primary/secondary config
- Route 53 component (TTL 30 seconds) with a CNAME created with internal DNS (app.corp.com) to point to Route 53 CNAME (dns.master.int.aws). I use
- ASG health checks on TCP port 443 configured (5 second interval, Healthy and Unhealthy thresholds of 2). There is no point in setting these any more granular, as failover is dependent on the CloudWatch alarm interval.
- Single ASG deployed within each availability zone
- ELB still deployed over two availability zones.
- TCP port monitors configured without SSL offload
- No session stickiness configured as there is only a single server behind each ASG/ELB. In failover scenario clients will need to re-authenticate.
- Use pre-baked AMIs to support shortest possible healing times.
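The detection chain in the assumptions above (ELB health check, CloudWatch alarm, Route 53 health check) can be sketched as three parameter sets. These are shaped for boto3's `configure_health_check`, `put_metric_alarm` and `create_health_check` calls, but the sketch only builds dictionaries. The ELB name, alarm name, region, the 2 second timeout and the missing-data handling are assumptions of mine, not values from the original design.

```python
def elb_health_check():
    """HealthCheck structure for a classic ELB: TCP 443, 5 s interval."""
    return {
        "Target": "TCP:443",
        "Interval": 5,
        "Timeout": 2,  # assumption: must be lower than the interval
        "HealthyThreshold": 2,
        "UnhealthyThreshold": 2,
    }


def healthy_host_alarm(elb_name):
    """Alarm that fires when the ELB has fewer than 1 healthy host."""
    return {
        "AlarmName": f"{elb_name}-no-healthy-hosts",  # hypothetical naming
        "Namespace": "AWS/ELB",
        "MetricName": "HealthyHostCount",
        "Dimensions": [{"Name": "LoadBalancerName", "Value": elb_name}],
        "Statistic": "Minimum",
        "Period": 60,                    # the 60 second polling interval
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "breaching",  # assumption: no data means down
    }


def alarm_backed_health_check(alarm_name, region):
    """HealthCheckConfig for a Route 53 health check that follows the alarm."""
    return {
        "Type": "CLOUDWATCH_METRIC",
        "AlarmIdentifier": {"Region": region, "Name": alarm_name},
        "InsufficientDataHealthStatus": "Unhealthy",  # assumption
    }


alarm = healthy_host_alarm("app-primary-elb")
check = alarm_backed_health_check(alarm["AlarmName"], "eu-west-1")
print(alarm["AlarmName"], check["Type"])
```

A health check driven by a CloudWatch alarm suits a private hosted zone, because Route 53's own checkers cannot reach endpoints inside the VPC.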
1. Normal behaviour: client traffic is directed to the Active node in AZ A.
2. Instance fails, and within 60 seconds the CloudWatch alarm is triggered.
3. Route 53 health check is updated and Route 53 updates the DNS record to the passive node. Clients now access the secondary/passive server. Clients may need to re-authenticate if the application requires a stateful session.
4. Auto-healing rebuilds the failed server within AZ A.
5. The primary node now passes the Route 53 health check and so Route 53 updates the DNS record back to the primary node. Clients may need to re-authenticate if the application requires a stateful session.
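The failover sequence above can be sketched as a pair of Route 53 record sets plus a very simplified model of which one gets served. I am assuming a failover routing policy here; the ELB DNS names and health check ID are placeholders, and `answer` is a hypothetical helper that imitates Route 53's behaviour rather than calling it. The `Changes` entries follow the shape expected by boto3's `change_resource_record_sets`.

```python
RECORD_NAME = "dns.master.int.aws."  # the Route 53 CNAME from the design


def failover_change(role, elb_dns_name, health_check_id=None):
    """Build one UPSERT change for a PRIMARY or SECONDARY failover CNAME."""
    record = {
        "Name": RECORD_NAME,
        "Type": "CNAME",
        "SetIdentifier": role.lower(),
        "Failover": role,
        "TTL": 30,
        "ResourceRecords": [{"Value": elb_dns_name}],
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


def answer(changes, primary_healthy):
    """Simplified failover model: serve PRIMARY while its check is healthy."""
    wanted = "PRIMARY" if primary_healthy else "SECONDARY"
    for change in changes:
        record = change["ResourceRecordSet"]
        if record["Failover"] == wanted:
            return record["ResourceRecords"][0]["Value"]
    raise LookupError(f"no {wanted} record defined")


changes = [
    failover_change("PRIMARY", "elb-a.example.internal", "hc-primary-placeholder"),
    failover_change("SECONDARY", "elb-b.example.internal"),
]

# Steps 1, 3 and 5: healthy, primary failed, primary rebuilt.
timeline = [answer(changes, healthy) for healthy in (True, False, True)]
print(timeline)
```

Only the primary record carries a health check in this sketch, so the secondary is always treated as available.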
Secondary Node Failure
If the secondary instance fails, there is no service disruption, as traffic is never actively sent to this node except during primary node failure.
Availability Zone Failure
These behave in a similar manner to instance failure and are dependent upon the CloudWatch alarm being triggered.
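Since both instance and availability zone failures depend on the same alarm, a rough failover budget can be worked out from the figures in the assumptions. This is a back-of-envelope sketch only: it adds the ELB detection time, one alarm period and the DNS TTL, and ignores the time Route 53 takes to act on the alarm state and any client-side DNS caching.

```python
ELB_INTERVAL_S = 5           # health check interval
ELB_UNHEALTHY_THRESHOLD = 2  # failed checks before a node is unhealthy
ALARM_PERIOD_S = 60          # CloudWatch alarm polling interval
DNS_TTL_S = 30               # TTL on the Route 53 record


def failover_budget_seconds(interval, threshold, alarm_period, ttl):
    """Sum the sequential delays between a failure and clients moving over."""
    elb_detection = interval * threshold
    return elb_detection + alarm_period + ttl


budget = failover_budget_seconds(
    ELB_INTERVAL_S, ELB_UNHEALTHY_THRESHOLD, ALARM_PERIOD_S, DNS_TTL_S
)
print(budget)  # 100
```

The alarm period dominates, which is why tightening the ELB health check further would gain very little.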
Blue Green deployments
Blue Green deployments can be achieved using the same failover behaviour described above.
On the left we see the existing release/build of the application stack, whilst on the right is the environment to be built. These are all within the same account and same availability zones, just different CloudFormation stacks. There are two stages: a deploy stage, where the new environment is deployed, and a release stage, where DNS is cut over. No additional build activities are conducted during the release stage.
1. Servers are built as independent components and then baked as AMIs.
2. The server 2 component from the previous build is scaled down.
3. Server 2 is scaled up as part of the deploy stage. The team can now test and validate this release prior to release via the ELB for the second instance. I like to include custom HTTP headers with the server name and specific build number in order to easily identify which server I am hitting; these can be seen in the Chrome debugger or Fiddler.
4. Route 53 DNS is automatically updated to point to the server 2 ELB. There is no service outage.
5. The previous primary instance is terminated, and the primary server is now built within the new stack.
6. Server 1 bootstrap is initiated within the new CloudFormation stack.
7. Route 53 DNS is updated to the CNAME of the ELB in front of the primary node, and normal service resumes in the newly released environment.
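The release sequence can be checked with a small simulation that tracks which servers are running and where DNS points after each step. This is an illustrative model, not deployment tooling: the server names are hypothetical, and the zero-downtime rule is expressed simply as "DNS must always point at a running server".

```python
def run_release():
    """Walk the blue/green steps and record the DNS target after each one."""
    state = {
        "dns": "old-server1",
        "running": {"old-server1", "old-server2"},
    }
    steps = [
        ("scale down old server 2", lambda s: s["running"].discard("old-server2")),
        ("scale up new server 2", lambda s: s["running"].add("new-server2")),
        ("point DNS at new server 2 ELB", lambda s: s.update(dns="new-server2")),
        ("terminate old server 1", lambda s: s["running"].discard("old-server1")),
        ("bootstrap new server 1", lambda s: s["running"].add("new-server1")),
        ("point DNS at new server 1 ELB", lambda s: s.update(dns="new-server1")),
    ]
    history = []
    for name, action in steps:
        action(state)
        # Zero-downtime rule: DNS may never point at a stopped server.
        if state["dns"] not in state["running"]:
            raise RuntimeError(f"outage after step: {name}")
        history.append((name, state["dns"]))
    return history


history = run_release()
for name, dns in history:
    print(f"{name:32} -> {dns}")
```

Reordering the steps, for example terminating the old primary before DNS has moved, makes the simulation raise, which is the point of doing the DNS change first.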