Tuesday, September 11, 2012

Ruminating on SQLServer HA and DR strategy

Recently for one of our customers, we were evaluating the options for configuring SQLServer 2008 for High Availability and Disaster Recovery.

Having successfully used Oracle RAC technology in the past many a times, I was suprised to realize that SQLServer does not have any equivalent to Oracle RAC. There is essentially no concept of load-balancing "read-write" requests between server instances working on a shared database storage system (e.g. SAN, RAID-10 array).

The only near equivalent to the RAC concept is using the SQLServer 2008 peer-to-peer transaction replication with updatable subscriptions. This is horribly complex to configure and maintain with data being replicated to peer servers. Also MS has finally decided not to support this feature anymore.

The clustering techniques of SQLServer 2008 use confusing words such as "active/active" and "active/passive". But in reality, there is no load-balancing of the database requests. The secondary server is typically in stand-by mode (as in active/passive). The 'active/active' concept essentially means that the SQL Servers are accessing two separate databases or a partitioned database. If one server fails, the processing is offloaded to the second machine. So now the second machine is running 2 server instances and this bogs down the resources of the server.

In SQLServer 2012, MS has tried to move closed to the RAC feature. We have a new feature of "AlwaysOn" that enables us to create HA groups and define replication strategies between the nodes in a HA group. The 'secondary' server would always be 'read-only' and can be used for reporting purposes or for 'read' operations. This is far better, as now we can atleast offload our 'read' operations to a different server, but we need to funnel all 'write' operations through the 'primary' server. A good discussion on this topic is available here.

After HA, we moved over to define our DR strategy. We first compared log-shipping vs mirroring. In Log Shipping, the secondary database is marked as 'read-only' and can be used for reporting, but the time taken for replication can be upto 30 mins. Hence it cannot be used if you desire instantaneous failover.
In Mirroring, there is almost instantaneous update to the secondary failover database, but since the database is always in recovery mode, it cannot be used for any purpose.

Again in Mirroring, we have 'synchronous' and 'asynchronous' replication. IMHO, synch replication is a disaster for any OLTP application and should actually never be used. Found a good case study on MS site that details out the strategy for "identity columns" when there is a potential for data-loss using async mirroring. Another option that can be considered for DR is SQL Replication.