What’s new for Exchange Server 2013 Database Availability Groups?
By: Brien M. Posey
When Microsoft created Exchange Server 2010, they introduced the concept of Database Availability Groups (DAGs). Database Availability Groups are the mechanism that makes it possible for a mailbox database to fail over from one mailbox server to another. In retrospect, Database Availability Groups worked really well for organizations whose operations were confined to a single data center. Although it was possible to stretch a Database Availability Group across multiple data centers, performing site level failovers was anything but simple. Microsoft has made a number of enhancements to Database Availability Groups in Exchange Server 2013. Some of these enhancements are geared toward making site level failovers less complex.
Although site resilience could be achieved using Exchange Server 2010, a number of factors prevented organizations from achieving the level of resilience that they might have liked. For starters, site level resilience was something that had to be planned before Exchange Server 2010 was put into place. One of the reasons for this was that all of the Database Availability Group members had to belong to the same Active Directory domain. This meant that site resilience could only be achieved if the Active Directory domain spanned multiple data centers.
Another major limitation was that Microsoft designed Exchange Server 2010 so that a simple WAN failure would not trigger a site failover. One of the ways that they did this was to require that the failover process be initiated manually. Furthermore, the primary data center had to contain enough Database Availability Group members to allow the site to retain quorum in the event that the WAN link failed. Because of these limitations, there was really no such thing as true site resilience in Exchange Server 2010.
In Exchange Server 2013, it is finally possible to achieve full site resilience, given enough planning. As was the case with Exchange Server 2010, a DAG can only function if it is able to maintain quorum. Maintaining quorum means that a majority of the voters (the DAG members plus, when the number of members is even, the witness server) are online and able to communicate with one another at any given time. Site resilience can be accomplished by placing an equal number of DAG members in each datacenter and then placing a witness server in a remote location that is accessible to each datacenter.
This approach will allow a datacenter level failover in the event of a major outage or a WAN failure. It is worth noting, however, that this approach to site resiliency still does not achieve fully comprehensive protection for mailbox databases, because situations could still occur that cause the DAG to lose quorum. Imagine, for example, that a WAN link failure occurs between two datacenters. In that situation, whichever datacenter is still able to communicate with the witness server will retain quorum. Now, imagine that one of the DAG members in this datacenter were to fail before the WAN link is fixed. The surviving datacenter would no longer hold a majority of the votes, so it would lose quorum, resulting in a DAG failure.
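The quorum arithmetic behind this scenario can be sketched in a few lines of Python. This is only an illustration of simple majority voting, not Exchange or Windows failover clustering code: the function name, the two-members-per-site layout, and the assumption that every voter carries exactly one static vote are all hypothetical choices made for the example.

```python
def has_quorum(total_voters: int, reachable_voters: int) -> bool:
    """Return True when strictly more than half of all voters can communicate."""
    return reachable_voters > total_voters // 2


# Hypothetical layout: two DAG members in each of two datacenters,
# plus one witness server in a third location (five votes in total).
MEMBERS_PER_SITE = 2
TOTAL_VOTERS = 2 * MEMBERS_PER_SITE + 1

# Everything is online.
healthy = has_quorum(TOTAL_VOTERS, TOTAL_VOTERS)

# The WAN link fails: one datacenter still reaches the witness (2 members + witness).
site_with_witness = has_quorum(TOTAL_VOTERS, MEMBERS_PER_SITE + 1)

# The other datacenter sees only its own two members.
isolated_site = has_quorum(TOTAL_VOTERS, MEMBERS_PER_SITE)

# A member then fails in the datacenter that held quorum (1 member + witness).
after_member_failure = has_quorum(TOTAL_VOTERS, (MEMBERS_PER_SITE - 1) + 1)

print("healthy:", healthy)
print("site with witness after WAN failure:", site_with_witness)
print("isolated site after WAN failure:", isolated_site)
print("after an additional member failure:", after_member_failure)
```

With five votes, three are needed for a majority. The datacenter that can reach the witness holds exactly three votes after the WAN failure, so a single additional member failure drops it to two and the DAG loses quorum.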
Another major change that Microsoft has made to DAGs has to do with the way that lagged copies work. Lagged copies are database replicas for which transaction log replay is delayed so as to facilitate point in time recovery.
In Exchange 2013, Microsoft has built some intelligence into lagged copies to detect and correct instances of corruption or low disk space. It is worth noting, however, that in these types of circumstances you could end up losing the lag.
One of the big problems with lagged copies in Exchange 2010 was that transaction logs had to be stored for the full lag period and could grow to a considerable size. As a result, there have been instances in which organizations underestimated the volume of transaction logs that would be stored for lagged copies, resulting in the mailbox server running out of disk space.
Exchange 2013 monitors the available disk space. If the volume containing the transaction logs begins to run short on space, Exchange will initiate an automatic play down, which commits the contents of the transaction logs to the lagged copy so that disk space can be freed on the transaction log volume.
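To see how quickly the log backlog for a lagged copy can grow, here is a rough sizing sketch in Python. It relies on the fact that Exchange 2010 and 2013 transaction log files are 1 MB each; the log generation rate, the lag period, the volume figures, and the free-space threshold are hypothetical values chosen for illustration, and the threshold check is not the logic or the setting that Exchange itself uses.

```python
LOG_FILE_SIZE_MB = 1  # Exchange 2010/2013 transaction log files are 1 MB each


def lagged_log_backlog_gb(logs_per_day: int, lag_days: float) -> float:
    """Estimate the space (in GB) held by logs that cannot yet be replayed."""
    return logs_per_day * lag_days * LOG_FILE_SIZE_MB / 1024


def is_low_on_space(free_gb: float, min_free_gb: float) -> bool:
    """Hypothetical check: flag the volume when free space falls below a threshold."""
    return free_gb < min_free_gb


# Hypothetical workload: 20,000 log files generated per day, seven-day replay lag.
backlog_gb = lagged_log_backlog_gb(logs_per_day=20_000, lag_days=7)
print(f"Estimated log backlog: {backlog_gb:.1f} GB")

# Hypothetical 150 GB log volume with an illustrative 20 GB free-space threshold.
volume_gb = 150
free_gb = volume_gb - backlog_gb
print("Low on space:", is_low_on_space(free_gb, min_free_gb=20))
```

Under these assumed numbers the backlog alone comes to roughly 137 GB, which leaves very little headroom on a 150 GB volume and shows why an underestimated log volume could exhaust disk space.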
Exchange uses a similar log file play down if it detects a corrupt database page. According to Microsoft, however, “Lagged copies aren’t patchable with the ESE single page restore feature. If a lagged copy encounters database page corruption (for example, a -1018 error), it will have to be reseeded (which will lose the lagged aspect of the copy).”
Another change that Microsoft has made to lagged copies is that it is now possible to activate a lagged copy and bring it to a current state, even if the transaction logs are not available. This is due to a new feature called the Safety Net. The Safety Net replaces the transport dumpster. Its job is to store copies of every message that has been successfully delivered to an active mailbox database. If a lagged database copy needs to be activated and the transaction logs are not available, Exchange can use the Safety Net’s contents to bring the database into a current state.
One of the most welcome changes that Microsoft has made to DAGs is that it is now possible to use DAGs to protect your public folders. In Exchange Server 2010, DAGs could only protect mailbox databases, not public folder databases. Public folder databases do not exist in Exchange 2013. Instead, public folders are stored in mailbox databases, which makes it possible to use DAGs to protect public folders.
The most significant changes that Microsoft has made to DAGs include the ability to fail over at the datacenter level, the ability to use DAGs to provide high availability for public folders, and automated maintenance for lagged copies. In addition, Microsoft has built in some minor improvements, such as automatic database reseeding after a storage failure and automated notification in situations in which only a single healthy copy of a database exists.