The Pros and Cons of Using Database Availability Groups

Guest Post By: Brien M. Posey

Database Availability Groups (DAGs) are Microsoft’s go-to solution for providing high availability for Exchange 2010 (and Exchange 2013) mailbox servers. Even so, it is critically important for administrators to consider whether a DAG is the most appropriate high availability solution for their organization.

The primary advantage of DAGs is high availability for mailbox servers within an Exchange Server organization. DAGs make use of failover clustering, so when a DAG member fails, its active mailbox databases fail over to another DAG member.

At first this behavior likely seems ideal, but depending on an organization’s needs DAGs can leave a lot to be desired. One of the first considerations that administrators must take into account is the fact that DAGs only provide high availability for mailbox databases. This means that administrators must find other ways to protect the other Exchange Server roles and any existing public folder databases. Incidentally, Exchange Server 2013 adds high availability for public folders through DAGs, but DAGs cannot be used to protect any additional Exchange Server components.

In spite of the limitations that were just mentioned, DAGs have historically proven to be an acceptable high availability solution for medium-sized organizations. While it is true that DAGs fail to protect the individual server roles, Exchange stores all of its configuration information in Active Directory, which means that entire servers can be rebuilt by following these steps:

  1. Reset the Active Directory account for the failed server (reset the account, do not delete it).
  2. Install Windows onto a new server and give it the same name as the failed server.
  3. Install any Windows patches or service packs onto the new server that were running on the failed server.
  4. Join the server to the Active Directory domain.
  5. Create an Exchange Server installation DVD that contains the same service pack level that was used on the failed server.
  6. Insert the Exchange installation media that you just created and run Setup /m:RecoverServer

The method outlined above can be used to recreate a failed Exchange Server. The only things that are not recreated using this method are the databases, but databases are protected by DAGs. As such, these two mechanisms provide relatively comprehensive protection against a disaster. Even so, the level of protection afforded by these mechanisms often proves to be inadequate for larger organizations.

One of the reasons for this has to do with the difficulty of rolling a database back to an earlier point in time. Microsoft allows database copies within a DAG to be configured as lagged copies. This means that transaction logs are not replayed into the lagged copy until a configured delay has passed. This lag gives administrators the ability to activate an older version of the database if necessary. The problem is that activating a lagged copy is not an intuitive process. Furthermore, rolling back to a lagged copy always results in data loss, because everything written after the chosen point in time is discarded.
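The trade-off can be illustrated with a simplified model. The sketch below is not how Exchange implements log replay; the function name `activate_lagged_copy`, the three-hour lag, and the hourly log timestamps are all hypothetical values chosen for illustration. It only shows why picking an earlier restore point means discarding the logs written after it.

```python
from datetime import datetime, timedelta


def activate_lagged_copy(log_times, now, replay_lag, restore_point):
    """Split logs into those kept and those lost when activating at restore_point.

    Simplified model: logs older than (now - replay_lag) are assumed to be
    already replayed, so the restore point cannot be earlier than that.
    """
    cutoff = now - replay_lag
    if restore_point < cutoff:
        raise ValueError("logs older than the lag window are already replayed")
    kept = [t for t in log_times if t <= restore_point]
    lost = [t for t in log_times if t > restore_point]
    return kept, lost


# Hypothetical scenario: one log per hour from 06:00 to 11:00, a 3-hour lag,
# and an administrator who rolls back to 10:30.
now = datetime(2013, 1, 10, 12, 0)
logs = [datetime(2013, 1, 10, hour, 0) for hour in range(6, 12)]
kept, lost = activate_lagged_copy(
    logs, now, timedelta(hours=3), datetime(2013, 1, 10, 10, 30)
)
print(f"kept {len(kept)} logs, lost {len(lost)}")
```

In this model the only way to lose nothing is to choose a restore point after the last log, which defeats the purpose of rolling back.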

The other reason why DAGs are not always an adequate solution for larger organizations has to do with the difficulty of providing off-site protection. Exchange Server 2010 supports the creation of stretched DAGs, which are DAGs that span multiple datacenters. Although being able to fail over to an off-site datacenter sounds like a true enterprise-class feature, the reality of the situation is that architectural limitations often prevent organizations from being able to achieve such functionality.

The most common barriers to implementing a stretched DAG are network latency and Active Directory design. Stretched DAGs are only supported on networks with a maximum round trip latency of 500 milliseconds. Additionally, DAGs cannot span multiple Active Directory domains, which means that the domain in which the DAG members reside must span datacenters.

Even if an organization is able to meet the criteria outlined above, it must construct the DAG in a way that will ensure continued functionality both in times of disaster and during minor outages. In order for a DAG to function, it must maintain quorum. This means that a majority (more than half) of the total number of existing DAG members must be functional in order for the DAG to remain online. This requirement is relatively easy to meet in a single datacenter deployment, but is quite challenging in stretched DAG environments.
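The majority rule described above comes down to simple arithmetic. The following is a minimal sketch that counts only DAG members as voters, as this article does; it deliberately ignores any witness server, and the function names `votes_needed` and `has_quorum` are hypothetical.

```python
def votes_needed(total_members: int) -> int:
    """Smallest number of members that forms a majority."""
    return total_members // 2 + 1


def has_quorum(total_members: int, online_members: int) -> bool:
    """True if enough members are online for the DAG to stay up."""
    return online_members >= votes_needed(total_members)


for total in (3, 4, 5, 6):
    needed = votes_needed(total)
    print(f"{total} members: {needed} must stay online, {total - needed} may fail")
```

Note that under this rule a four-member DAG tolerates no more failures than a three-member DAG: in both cases only one member can be lost.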

One of the issues that must be considered when building a stretched DAG is that Exchange cannot tell the difference between a WAN failure and the failure of the Exchange servers on the other side of the WAN link. As such, the primary site must have enough DAG members to maintain quorum even in the event of a WAN failure. Ideally, the primary site should have enough DAG members to retain quorum during a WAN failure and still be able to absorb the failure of at least one member in the primary site.

Another problem with stretched DAGs is that the requirement for the primary site to have enough DAG members to always maintain quorum means that the DAG will never fail over to the remote site automatically, even if the entire primary datacenter is destroyed. The secondary site lacks enough DAG members to achieve quorum without an administrator manually evicting nodes from the DAG.
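A rough way to see this trade-off is to count members per site. The sketch below assumes the simplified majority rule used in this article (DAG members are the only voters, no witness server); the function name `site_outcomes` and the member counts are hypothetical.

```python
def site_outcomes(primary: int, secondary: int) -> dict:
    """Check which site can hold a majority on its own in a two-site DAG."""
    total = primary + secondary
    needed = total // 2 + 1  # smallest majority of all DAG members
    return {
        "votes_needed": needed,
        # WAN link down: primary can only count its own members.
        "primary_survives_wan_failure": primary >= needed,
        # WAN link down and one primary member also failed.
        "primary_survives_wan_plus_one_failure": primary - 1 >= needed,
        # Primary datacenter destroyed: secondary is on its own.
        "secondary_survives_primary_loss": secondary >= needed,
    }


for primary, secondary in ((3, 1), (4, 1), (3, 2)):
    print(primary, secondary, site_outcomes(primary, secondary))
```

Because two sites cannot both hold more than half of the members, any layout in which the primary site survives a WAN failure is also one in which the secondary site cannot reach quorum by itself.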

As you can see, DAGs tend to deliver an acceptable level of functionality in single datacenter environments, but the limitations that are inherent in stretched DAGs make them impractical for use in multi-datacenter deployments. Larger organizations are typically better off implementing other types of redundancy rather than depending on DAGs. One possible solution, for example, is to virtualize an organization’s Exchange servers and then replicate the virtual machines to a standby datacenter. This approach will usually make the process of failing over to an alternate datacenter much simpler and more efficient.

6 thoughts on “The Pros and Cons of Using Database Availability Groups”

  1. Pingback: NeWay Technologies – Weekly Newsletter #25 – January 10, 2013NeWay | NeWay

  2. EAdmin

    Do you really see virtualization as a better solution than the DAG? You would need to marry Exchange monitoring with the virtualization system. Don't you expect that data loss could be higher when you replicate a standby copy to a passive server?
    Also, why do the other roles need to be protected by the DAG when HLB is meant to do that?
    Is it a real problem to span a domain to your disaster site with a DAG?

    It might be good to clarify the differences between failover and disaster. Failover is done within one computer room and is an automatic process. Disaster recovery is more likely a manual operation.

  3. Mccopias

    An outstanding share! I have just forwarded this onto a colleague who had been conducting a little research on this. And he actually bought me breakfast simply because I found it for him… lol. So allow me to reword this…. Thank YOU for the meal!! But yeah, thanx for spending the time to discuss this issue here on your blog.

  4. Isabelle Riley

    Hi Brien,

    Nice Blog!

    My name is Isabelle and I am currently working with http://www.accessguru.com.au
    I just came across your blog and found it helpful with what I was researching.
    We were wondering if we could publish a similar blog on your site?

    Here are some blog ideas that I am currently working on:
    – Benefits of Microsoft Access Development?
    – What you should know before hiring professional access developers
    – Or another topic of your own.

    Looking forward hearing from you.

    Thanks for your time,

    Isabelle

  5. Brandon

    What about datacenter activation control features for a multi site DAG architecture for split brain scenarios? This is missing as it’s meant to resolve issues with network failures.

  6. Stephen Butler

    Misleading in a dozen different directions.

    Yes, DAGs don’t protect hubs or CAS servers. If you want your hub and CAS servers to be redundant, all you have to do is have more than one.

    If you don’t have enough servers for quorum in your backup datacenter, then in the event of a total site collapse in the primary datacenter, you will in fact have to spend a whopping 5 minutes removing a couple primary datacenter servers from the DAG before the backup site will come online. My SLAs can handle that just fine for total site failover. You also didn’t mention anything about the file share witness. If you have even numbers of servers in primary and secondary, and you lose the primary, all you have to do is redirect the cluster to use an FSW in the secondary datacenter (taking maybe 2 minutes to accomplish), and voila it starts.

    Virtualization is a good idea for servers which are comparatively lightly loaded or if space in the datacenter is an issue. From a pure technical standpoint, you are better off architecting your solution to fully leverage your steel, leaving room for overhead and running additional active database copies in the event of a failure. In the real world, virtualized Exchange servers hold fewer mailboxes. They break more often because of the added complexity. They take longer to fix when they do break, especially if Microsoft, EMC, and VMware are all pointing the finger at each other. Virtual teams are powerfully fond of using DRS rules to live migrate Exchange servers in the middle of the day, which causes issues that the messaging team cannot identify because no one is notified the server is being migrated. That says nothing of the fact that Microsoft’s public statement regarding the supportability of live migrations is, at best, convoluted to understand and almost no one interprets it correctly.

    For example … number one use of a Live Migration rule is to remove active loads from an overloaded host … which is to say you are initiating a Live Migration not for administrative purposes but rather for performance reasons … which is unsupported. Virtualizing fully loaded Exchange servers doesn’t gain you anything but space in the NOC and headaches.

    As for the effectiveness of DAG high availability and fault tolerance in stretch DAG scenarios, it works just fine. Just ask any of the Fortune 500 companies who run this way including the one I’m sitting in. It works so reliably that most companies are reducing their backup frequency to once a week or less, and I cannot recall the last time I had any valid reason to activate a lagged copy.

    But … if I do need to activate my lagged copy and I don’t find the procedure to be intuitive, I have this handy thing called Google to look up the article with the procedure explained.

    Punch the following string into your query and voila … it’s the top link. Not challenging.
    “site:microsoft.com exchange how to activate a lagged copy”
