Tuesday, April 02, 2013

Ruminating on Availability and Reliability

High availability is a function of hardware and software combined. To design a highly available infrastructure, we have to make every component highly available, not just the database or app servers. This includes the network switches, SSO servers, power supply, etc.

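The availability of each component is calculated, and we then typically multiply the availabilities of all components together to get the overall availability, usually expressed as a percentage. This multiplication applies when the service needs every component to be up at the same time, so the overall figure can never be higher than that of the weakest component.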
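As a rough illustration of the multiplication described above, here is a minimal Python sketch. The component names and availability figures are made-up assumptions for the example, not measurements, and the calculation assumes every component must be up for the service to work.

```python
from math import prod  # available in Python 3.8+


def serial_availability(availabilities):
    """Overall availability when every component must be up at once.

    Each value is a fraction between 0 and 1 (0.999 means 99.9%).
    """
    return prod(availabilities)


# Hypothetical component availabilities, chosen only for illustration.
components = {
    "network switch": 0.9999,
    "app server": 0.999,
    "database": 0.999,
    "SSO server": 0.9995,
}

overall = serial_availability(components.values())
print(f"Overall availability: {overall * 100:.3f}%")
```

With these assumed figures the overall availability comes out at roughly 99.74%, lower than any single component, which is why one weak link (a switch, a power supply) can undo the effort spent on the rest.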

Common patterns for high availability are: clustering & load-balancing, data replication (near real time), warm standby servers, an effective DR strategy, etc. From an application architecture perspective, availability depends on effective caching, memory management, hardened security mechanisms, etc.

Application downtime occurs not just because of hardware failures; it can also be due to a lack of adequate testing (including unit testing, integration testing, performance testing, etc.). It's also very important to have proper monitoring mechanisms in place to proactively detect failures, performance issues, etc.

So how is availability typically measured? It is expressed as a percentage; for example, 99.9% availability.
To calculate the availability of a component, we need to understand the following two concepts:

Mean Time Between Failures (MTBF): It is defined as the average length of time the application runs before failing. Formula: Total hours run / Number of failures

Mean Time To Recovery (MTTR): It is defined as the average length of time needed to repair and restore service after a failure. Formula: Hours spent on repair / Number of failures

Formula: Availability = (MTBF / (MTBF + MTTR)) × 100
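To make the three formulas concrete, here is a small Python sketch. The function names and the sample figures (8,750 hours of running time, 10 hours of repair, 5 failures) are assumptions picked for the example.

```python
def mtbf(total_hours_run, failure_count):
    """Mean Time Between Failures: average running time per failure."""
    return total_hours_run / failure_count


def mttr(repair_hours, failure_count):
    """Mean Time To Recovery: average repair time per failure."""
    return repair_hours / failure_count


def availability_percent(mtbf_hours, mttr_hours):
    """Availability = (MTBF / (MTBF + MTTR)) * 100."""
    return mtbf_hours / (mtbf_hours + mttr_hours) * 100


# Hypothetical year of operation: 5 failures, 10 hours of repair in total.
failures = 5
mtbf_hours = mtbf(8750, failures)   # 1750 hours between failures
mttr_hours = mttr(10, failures)     # 2 hours to recover each time

availability = availability_percent(mtbf_hours, mttr_hours)
print(f"MTBF={mtbf_hours} h, MTTR={mttr_hours} h, availability={availability:.3f}%")
```

In this made-up case the component lands just under three nines, and halving the recovery time would help about as much as doubling the time between failures.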

Applying these availability percentages to the 8,760 hours in a year gives the following outage figures:

3 nines (99.9% availability) represents about 9 hours (8.76 hours) of service outage in a single year.
4 nines (99.99% availability) comes to about 1 hour (roughly 53 minutes) of outage in a year.
5 nines (99.999% availability) represents only about 5 minutes of outage per year.
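The yearly outage figures can be reproduced with a few lines of Python. This is a minimal sketch that assumes a 365-day year (8,760 hours); the function name is chosen here for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def annual_downtime_hours(availability_percent):
    """Hours of outage per year allowed by a given availability percentage."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for pct in (99.9, 99.99, 99.999):
    hours = annual_downtime_hours(pct)
    print(f"{pct}% -> {hours:.2f} hours ({hours * 60:.1f} minutes) per year")
```

Each extra nine cuts the allowed downtime by a factor of ten, which is why the jump from 4 nines to 5 nines is so expensive to achieve in practice.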