Monthly Archives: May 2012

Exchange 2010: The mystery of the 9223372036854775766 copy queue…

Database Copy Self-Protection Mode

What would you say if I told you that a copy queue length (CQL) of 9 quintillion log files was not a bad thing? By the way, if you are wondering, that works out to a CQL of 9,223,372,036,854,775,766, a number so big that it cannot be entered into Windows Calculator, and so big that if it represented actual transaction log files, it would require 8 yottabytes of storage, and sadly, as of this year, no storage or network system on the planet has reached even one thousandth of a yottabyte (which would be a zettabyte, by the way) of information.

But fear not, as you don’t need to start collecting all of the storage on the planet to plan for a CQL this large, but it is a real and valid value for a passive copy’s CQL, and in recent weeks, I’ve worked with a few customers who have experienced this.

Background

During regular operations, the Microsoft Exchange Information Store service (store.exe) and the Microsoft Exchange Replication service (msexchangerepl.exe) on DAG members hosting an active database copy write two values to the cluster registry at HKLMClusterExchangeActiveManagerLastLog:

DatabaseGUID with a decimal value representing the last log generated by the active copy
DatabaseGUID_TIME with the system time of when that log file was generated

Here is an example of these entries in the cluster registry:

To decipher these entries, you can use Get-MailboxDatabase to get the GUID for a mailbox database:

[PS] C:Windowssystem32>Get-MailboxDatabase dag-db0 | fl name,*guid*

Name : DAG-DB0
Guid : 2abcac37-1b5d-4b9c-8472-e33c65379698

These values are written to the cluster registry on the server hosting the active copy, and native cluster registry replication is used to propagate this information to all other DAG members. DAG members that host a passive copy of the database use this information (the last log file generated by the active copy) along with information about the last log file replicated to the passive copy, to determine the CQL for the database copy. Thus, it is critical that all DAG members have up-to-date values, as the CQL is used by Active Manager to evaluating whether or not to mount a database in response to a failover.

A Larger-than-Life CQL

In Exchange 2010 SP1, we changed how we determine the CQL. In addition to tracking the last log generated by the active copy, we also track the last time that log file was generated. This was done specifically to prevent situations in which a passive copy is not aware of log files generated by the active copy and makes automount decisions based on stale data.

We use the DATABASEGUID_Time entry for this purpose. If the difference between the timestamp recorded in the cluster registry and the system time on the server hosting the passive copy is off by more than 12 minutes, the Microsoft Exchange Replication service on the server hosting the passive copy places the database copy into a self-protection mode. This is done by setting the CQL for that passive copy to 9223372036854775766. Because a passive copy can be activated and automatically mounted only when its CQL is equal to or less than the configured value for AutoDatabaseMountDial, this has the effect of preventing the passive copy from ever mounting automatically. After all, a value of 9223372036854775766 will always be higher than any possible value for AutoDatabaseMountDial.

Where Did the Time Go?

So why would a condition exist that causes the time discrepancy to be greater than 12 minutes in the first place? This can actually happen for a few reasons:

The Cluster service on the server hosting the active copy might be having a problem writing updates even though the node remains in cluster membership.
The Cluster service on the server hosting the passive copy might be having a problem receiving updates even though they remain in cluster membership.
The Information Store service and Exchange Replication service could be stopped on the server hosting the active copy. (Remember that a copy that is “active” simply signifies the node that owns the copy not the actual mount / dismount state of the database).
A datacenter switchover is being performed and more than 12 minutes have elapsed between the time when the failed DAG members were stopped and when the remote DAG members were activated.

What to Do When CQL Reaches 9223372036854775766?

To recover from this situation, an administrator can simply perform a database switchover. Note, though, that the switchover will need to be performed using the Exchange Management Shell, as the administrator will need to force the move by using multiple switches and the MountDialOverride parameter. Because the following command skips all of the built-in safety checks for a passive copy, it should be used only when you know that the copies to be activated were healthy prior to the copy going into self-protection mode.

Attempted move without overrides:

Move-ActiveMailboxDatabase DAG-DB0 -ActivateOnServer DAG-2

Confirm
Are you sure you want to perform this action?
Moving mailbox database "DAG-DB0" from server "DAG-1.domain.com" to server "DAG-2.domain.com".
[Y] Yes [A] Yes to All [N] No [L] No to All [?] Help (default is "Y"): y

Identity        ActiveServerAtS ActiveServerAtE Status     NumberOfLogsLost   RecoveryPoint MountStatus MountStatus
                tart            nd                                            Objective     AtMoveStart AtMoveEnd
——–        ————— ————— ——     —————-   ————- ———– ———–
DAG-DB0         dag-1           dag-1           Failed                                      Dismounted Dismounted
An Active Manager operation failed. Error The database action failed. Error: An error occurred while trying to validate the specified database copy for possible activation. Error: Database copy ‘DAG-DB0’ on server ‘DAG-2.domain.com’ has a copy queue length of 9223372036854725486 logs, which is too high to enable automatic recovery. You can use the Move-ActiveMailboxDatabase cmdlet with the -SkipLagChecks and -MountDialOverride parameters to move the database with loss. If the database isn’t mounted after successfully running Move-ActiveMailboxDatabase, use the Mount-Database cmdlet to mount the database.. [Database: DAG-DB0, Server: DAG-2.domain.com]
    + CategoryInfo          : InvalidOperation: (DAG-DB0:ADObjectId) [Move-ActiveMailboxDatabase], AmDbActionWrapperException
    + FullyQualifiedErrorId : 3F936D4B,Microsoft.Exchange.Management.SystemConfigurationTasks.MoveActiveMailboxDatabase

Successful move with overrides:

Move-ActiveMailboxDatabase DAG-DB0 -ActivateOnServer DAG-2 -SkipHealthChecks -SkipActiveCopyChecks -SkipClientExperienceChecks -SkipLagChecks -MountDialOverride:BESTEFFORT

At this point, the database can be mounted on the remote server after moving the active copy.

Of course, the ultimate question is: why 9223372036854725486? The value for CQL must be a 64-bit integer that cannot be null; therefore, we chose a value close to maxInt64 that is so large that it prevents a potentially out-of-date copy from being activated.

Part 2: Datacenter Activation Coordination and the File Share Witness

10 Replies

In Part 1 of this series, I provided a high level overview of how Datacenter Activation Coordination (DAC) mode works and how the database mount process on startup is affected when DAC mode is enabled. (http://blogs.technet.com/b/timmcmic/archive/2012/05/21/my-databases-do-not-automatically-mount-after-i-enabled-datacenter-activation-coordination.aspx)

Remember, with DAC mode enabled, different rules apply for mounting databases on startup. The starting DAG member must be able to participate in a cluster that has quorum, and it must be able to communicate with another DAG member that has a DACP value of 1 or be able to communicate with all DAG members listed on the StartedMailboxServers list.

When a datacenter switchover has been performed and DAC mode is enabled, there could exist a condition where the standby datacenter contains a single DAG member (and therefore a single node cluster). This raises two interesting conditions:

A single node cluster always has quorum
There could be a single server on the started servers list

If the primary datacenter were restored to service without connectivity to the standby datacenter, this configuration could result in a split brain condition. To protect against this, we use an independent arbitrator to assist in determining the DACP bit: the boot time of the witness server.

When DAC mode is enabled, a DAG member will record two values in the registry at HKEY_LOCAL_MACHINESoftwareMicrosoftExchangeServerv14ReplayParameters:

BootTimeCookie = the boot time of the DAG member
BootTimeFSWCookie = the boot time of the witness server (which we obtain using WMI)

When DAC mode is enabled, there are different mount-on-startup rules that apply when only a single DAG member remains:

If the bootTimeCookie equals the boot time of the DAG member <and> the bootTimeFSWCookie does not equal the boot time of the witness server, then the DACP bit is set to 1.
If the bootTimeFSWCookie equals the boot time of the witness server <and> the bootTimeCookie does not equal the boot time of the DAG member, then the DACP bit is set to 1.
If the bootTimeFSWCookie equals the boot time of the witness server <and> the bootTimeCookie equals the boot time of the DAG member, then the DACP bit is set to 1.
If the bootTimeFSWCookie is not equal to the boot time of the witness server <and> the bootTimeCookie is not equal to the boot time of the DAG member, then the DACP bit remains at 0.

In the following examples, a two-member DAG was configured, and a datacenter switchover was performed resulting in a single-node cluster. The specific test, with tracing data, is provided for each example.

Example #1

In this example, the Microsoft Exchange Replication service on the single surviving node is restarted. Neither the DAG member itself nor the witness server was restarted. The bootTimeFSWCookie equals the boot time of the witness server <and> the bootTimeCookie equals the boot time of the DAG member, resulting in a DACP bit of 1.

438 00000000 5264 Cluster.Replay ActiveManager GetBootTimeWithWmi: WMI says that the boot time for dc.exchange.msft is 05/27/2012 16:28:12.
439 00000000 5264 Cluster.Replay ActiveManager DetermineAutomountConsensus: checking if the replay service has restarted since the MommyMayIMount bit was set.
455 00000000 5264 Cluster.Replay ActiveManager GetBootTimeWithWmi: WMI says that the boot time for MBX-2.exchange.msft is 05/27/2012 18:02:49.
456 00000000 5264 Cluster.Replay ActiveManager DetermineAutomountConsensus: WMI says the boot time is 05/27/2012 18:02:49, and the boot time when the Mount bit was set was 05/27/2012 18:02:49.
457 00000000 5264 Cluster.Replay ActiveManager DetermineAutomountConsensus found matching boot timestamps, assuming that the replay service has restarted since setting the bit.
458 00000000 5264 Cluster.Replay ActiveManager AllowAutoMount called: Found matching boot timestamps, assuming that the replay service has restarted since setting the bit.
460 00000000 5264 Cluster.Replay ActiveManager RefreshConfigInternal: The Automount consensus is true.

Example #2

In this example, the remaining DAG member was restarted and the witness server remained running. The bootTimeFSWCookie equals the boot time of the witness server <and> the bootTimeCookie does not equal the boot time of the DAG member, resulting in a DACP bit of 1.

85 00000000 2996 Cluster.Replay ActiveManager GetBootTimeWithWmi: WMI says that the boot time for dc.exchange.msft is 05/27/2012 16:28:12.
86 00000000 2996 Cluster.Replay ActiveManager DetermineAutomountConsensus: checking if the replay service has restarted since the MommyMayIMount bit was set.
87 00000000 2996 Cluster.Replay ActiveManager GetBootTimeWithWmi: WMI says that the boot time for MBX-2.exchange.msft is 05/27/2012 19:11:49.
88 00000000 2996 Cluster.Replay ActiveManager DetermineAutomountConsensus: WMI says the boot time is 05/27/2012 19:11:49, and the boot time when the Mount bit was set was 05/27/2012 18:58:31.
89 00000000 2996 Cluster.Replay ActiveManager DetermineAutomountConsensusUnanimity: There is only one node in the cluster — this is not sufficient to allow mounts!
90 00000000 2996 Cluster.Replay ActiveManager DetermineAutomountConsensusForSingleMachine: checking if the file share witness has restarted since the MommyMayIMount bit was set.
91 00000000 2996 Cluster.Replay ActiveManager DetermineAutomountConsensusForSingleMachine: WMI says the boot time for the FSW server is 05/27/2012 16:28:12, and the boot time when the Mount bit was set was 05/27/2012 16:28:12.
92 00000000 2996 Cluster.Replay ActiveManager DetermineAutomountConsensusForSingleMachine found matching boot timestamps, assuming that only this computer has restarted since setting the bit.
93 00000000 2996 Cluster.Replay ActiveManager AllowAutoMount called: Found matching FSW boot timestamps, assuming that only this computer has restarted since setting the bit.
94 00000000 2996 Cluster.Replay ActiveManager GetBootTimeWithWmi: WMI says that the boot time for MBX-2.exchange.msft is 05/27/2012 19:11:49.
95 00000000 2996 Cluster.Replay ActiveManager DetermineAutomountConsensusUnanimity is returning True.
96 00000000 2996 Cluster.Replay ActiveManager RefreshConfigInternal: The Automount consensus is true.

Example #3

In this example, the witness server was rebooted and the Microsoft Exchange Replication service on the DAG member was restarted. The bootTimeFSWCookie does not equal the boot time of the witness server <and> the bootTimeCookie does equal the boot time of the DAG member resulting, in a DACP bit of 1.

263 00000000 1552 Cluster.Replay ActiveManager GetBootTimeWithWmi: WMI says that the boot time for dc.exchange.msft is 05/27/2012 19:36:51.
264 00000000 1552 Cluster.Replay ActiveManager DetermineAutomountConsensus: checking if the replay service has restarted since the MommyMayIMount bit was set.
265 00000000 1552 Cluster.Replay ActiveManager GetBootTimeWithWmi: WMI says that the boot time for MBX-2.exchange.msft is 05/27/2012 19:27:30.
266 00000000 1552 Cluster.Replay ActiveManager DetermineAutomountConsensus: WMI says the boot time is 05/27/2012 19:27:30, and the boot time when the Mount bit was set was 05/27/2012 19:27:30.
267 00000000 1552 Cluster.Replay ActiveManager DetermineAutomountConsensus found matching boot timestamps, assuming that the replay service has restarted since setting the bit.
268 00000000 1552 Cluster.Replay ActiveManager AllowAutoMount called: Found matching boot timestamps, assuming that the replay service has restarted since setting the bit.
269 00000000 1552 Cluster.Replay ActiveManager GetBootTimeWithWmi: WMI says that the boot time for MBX-2.exchange.msft is 05/27/2012 19:27:30.
270 00000000 1552 Cluster.Replay ActiveManager RefreshConfigInternal: The Automount consensus is true.

Example #4

In this last example, both the witness server and the remaining single DAG member were restarted. Thus, the bootTimeFSWCookie does equal the boot time of the witness server <and> the bootTimeCookie does not equal the boot time of the remaining DAG member. As such, the DACP bit remains at 0.

76 00000000 3664 Cluster.Replay ActiveManager GetBootTimeWithWmi: WMI says that the boot time for dc.exchange.msft is 05/27/2012 19:47:49.
77 00000000 3664 Cluster.Replay ActiveManager DetermineAutomountConsensus: checking if the replay service has restarted since the MommyMayIMount bit was set.
78 00000000 3664 Cluster.Replay ActiveManager GetBootTimeWithWmi: WMI says that the boot time for MBX-2.exchange.msft is 05/27/2012 19:55:40.
79 00000000 3664 Cluster.Replay ActiveManager DetermineAutomountConsensus: WMI says the boot time is 05/27/2012 19:55:40, and the boot time when the Mount bit was set was 05/27/2012 19:27:30.
80 00000000 3664 Cluster.Replay ActiveManager DetermineAutomountConsensusUnanimity: There is only one node in the cluster — this is not sufficient to allow mounts!
81 00000000 3664 Cluster.Replay ActiveManager DetermineAutomountConsensusForSingleMachine: checking if the file share witness has restarted since the MommyMayIMount bit was set.
82 00000000 3664 Cluster.Replay ActiveManager DetermineAutomountConsensusForSingleMachine: WMI says the boot time for the FSW server is 05/27/2012 19:47:49, and the boot time when the Mount bit was set was 05/27/2012 19:36:51.
83 00000000 3664 Cluster.Replay ActiveManager DetermineAutomountConsensusUnanimity is returning False.
84 00000000 3664 Cluster.Replay ActiveManager Automount consensus not reached, going to Unknown AM role.

When performing a datacenter switchover where only a single node remains in the cluster supporting the DAG, any reboot that changes both the boot time of the witness server and the boot time of the DAG member will prevent databases from mounting automatically. If the reboots were necessary and valid operations, administrators can force the databases online without causing split brain.

========================================================

Datacenter Activation Coordination Series:

Part 1: My databases do not mount automatically after I enabled Datacenter Activation Coordination (https://aka.ms/F6k65e)
Part 2: Datacenter Activation Coordination and the File Share Witness (https://aka.ms/Wsesft)
Part 3: Datacenter Activation Coordination and the Single Node Cluster (https://aka.ms/N3ktdy)
Part 4: Datacenter Activation Coordination and the Prevention of Split Brain (https://aka.ms/C13ptq)
Part 5: Datacenter Activation Coordination: How do I Force Automount Concensus? (https://aka.ms/T5sgqa)
Part 6: Datacenter Activation Coordination: Who has a say? (https://aka.ms/W51h6n)
Part 7: Datacenter Activation Coordination: When to run start-databaseavailabilitygroup to bring members back into the DAG after a datacenter switchover. (https://aka.ms/Oieqqp)
Part 8: Datacenter Activation Coordination: Stop! In the Name of DAG… (https://aka.ms/Uzogbq)
Part 9: Datacenter Activation Coordination: An error cause a change in the current set of domain controllers (https://aka.ms/Qlt035)

========================================================

Exchange 2010: Using seedingPostponed results in a seeded database…

4 Replies

In Exchange 2010 the add-mailboxdatabasecopy command is utilized to add mailbox database copies to database availability group members. When a copy is first added to a member the database is automatically seeded along with the content index. In some instances administrators desire to add a copy but not immediately invoke database seeding. In order to do this the add-mailboxdatabasecopy command is run with the –seedingPostponed switch.

In some instances administrators have noticed that when adding a database copy with the seedingPostponed switch that the copy is healthy and the database has “seeded”. Let us take a look at how this can happen…

A new database is created on the Exchange 2010 server and mounted. This results in a new log stream and a new edb file. The administrator invokes the add-mailboxdatabasecopy command with the –seedingPostponed switch.

Add-MailboxDatabaseCopy –Identity DB1 -MailboxServer DAG-2 –SeedingPostponed

The command completes successfully. The copy is verified using the get-mailboxdatabasecopystatus command.

[PS] C:>Get-MailboxDatabaseCopyStatus DB1*

Name                                          Status          CopyQueue ReplayQueue LastInspectedLogTime   ContentIndex
                                                              Length    Length                             State
—-                                          ——          ——— ———– ——————–   ————
DB1DAG-1                                     Mounted         0         0                                  Healthy
DB1DAG-2                                     Healthy         0         0           5/7/2012 8:56:09 AM    Healthy

The copy status shows healthy even though the database was not seeded. How did this occur? When a database is mounted for the first time the log sequence is created first – this is to allow us to actually log the creation of the EDB file. When looking at the log records of the first log file you can see a createDB record populated:

[PS] H:DB1 > eseutil /ml .E0400000001.log /v

Extensible Storage Engine Utilities for Microsoft(R) Exchange Server
Version 14.02
Copyright (C) Microsoft Corporation. All Rights Reserved.

Initiating FILE DUMP mode…

      Base name: E04
      Log file: .E0400000001.log
      lGeneration: 1 (0x1)
      Checkpoint: (0x3A,FFFF,FFFF)
      creation time: 05/07/2012 08:39:17
      prev gen time: 00/00/1900 00:00:00
      Format LGVersion: (7.3704.16.2)
      Engine LGVersion: (7.3704.16.2)
      Signature: Create time:05/07/2012 08:39:17 Rand:2179103003 Computer:
      Env SystemPath: h:DB1
      Env LogFilePath: h:DB1
      Env Log Sec size: 512 (matches)
      Env (CircLog,Session,Opentbl,VerPage,Cursors,LogBufs,LogFile,Buffers
          (    off,   1027, 51350, 16384, 51350,   2048,   2048, 29487
      Using Reserved Log File: false
      Circular Logging Flag (current file): off
      Circular Logging Flag (past files): off
      Checkpoint at log creation time: (0x1,8,0)

Last Lgpos: (0x1,A,0)

==================================
Op         # Records     Avg. Size
———————————-
Others         294             3
Begin            0             0
Commit           0             0
Rollback         0             0
Refresh          0             0
MacroOp          0             0
CreateDB         1           113
AttachDB         0             0
DetachDB         0             0
ShutDown         0             0
CreateFDP        0             0
Convert          0             0
Split            0             0
Merge            0             0
Insert           0             0
Replace          0             0
Delete           0             0
UndoInfo         0             0
Delta            0             0
SetExtHdr        0             0
Undo             0             0
EmptyTree        0             0
BeginDT          0             0
PreCommit        0             0
PreRollbk        0             0
FFlushLog        0             0
Convert          0             0
FRollLog         0             0
Split2           0             0
Merge2           0             0
Scrub            0             0
PageMove         0             0
PagePatch        0             0
McroInfo         0             0
ExtendDB         0             0
Ignored          0             0
Ignored          0             0
Ignored          0             0
Ignored          0             0
Ignored          0             0
Ignored          0             0
Ignored          0             0
==================================

Number of database page references: 0

Integrity check passed for log file: .E0400000001.log

Operation completed successfully in 0.343 seconds.

When reviewing the file system on the target node administrators note that logs and an EDB file do exist.

Adding a database copy with seedingPostponed does not result in a copy that is suspended. The replication service acknowledges the copy has been added, determines that the first log file exist and contains the createDB record, and subsequently begins copying log files. As log files are copied they are replayed on the target server, processing the createDB record, resulting in the edb file creation. The database is effectively seeded through shipping and replaying the log sequence.

What happens when the first log file does not exist – for example in situations where a backup has been performed? After adding the copy with seedingPostponed the administrator is presented with a warning verifying that database seeding is required:

WARNING: Replication is suspended for database copy ‘DB1’ because the database copy needs to be seeded.

When reviewing the copy status with get-mailboxdatabasecopystatus the added copy is now suspended:

Name                                          Status          CopyQueue ReplayQueue LastInspectedLogTime   ContentIndex
                                                              Length    Length                             State
—-                                          ——          ——— ———– ——————–   ————
DB1DAG-1                                     Mounted         0         0                                  Healthy
DB1DAG-2                                     FailedAndSus… 63        0                                  Failed

At this time the database can be manually seeded using the update-mailboxdatabasecopycommand.

It is important to remember that seedingPostponed does not always results in a suspended copy. If using seedingPostpostponed with a new database that has the first log file the administrator must manually suspend the copy after adding it to ensure that no logs ship and no database is created on the target until manual seeding is performed.

Exchange 2010: Mapping DAG network states to cluster network states…

1 Reply

In Exchange 2010 after a Database Availability Group (DAG) is created a series of DAG networks is also created. These DAG networks serve to define how Exchange 2010 servers within the DAG will replicate and seed databases. The initial network topology that is established is based off the enumeration of cluster networks.

The cluster service defines two core states for a cluster network and one substate:

Allow cluster network communication on this network (Cluster Use: Internal)
- Allow clients to connect through this network (Cluster Use: Enabled)
Do not allow cluster network communications on this network (Cluster Use: Disabled)

The setting “Allow cluster network communications on this network” signifies this network can be utilized for heartbeat and cluster configuration traffic. When combined with the substate “Allow clients to connect through this network” this signifies that virtual IP addresses created within the cluster are allowed on this network. The setting “Do not allow cluster network communications on this network” signifies that the network is disabled for cluster use.

Exchange 2010 defines three network states associated with a Database Availability Group Network:

MAPIAccessEnabled
ReplicationEnabled
IgnoreNetwork

RunspaceId         : 5fdd66b9-afd7-47b7-b985-369dc8da7ac4
Name               : DAG-MAPI
Description        :
Subnets            : {{10.0.0.0/24,Up}}
Interfaces         : {{DAG-1,Up,10.0.0.20}, {DAG-2,Up,10.0.0.21}}
MapiAccessEnabled : True
ReplicationEnabled : False
IgnoreNetwork      : False
Identity           : DAGDAG-MAPI
IsValid            : True

The network states ReplicationEnabled and IgnoreNetwork can be adjusted by the administrator using the set-DatabaseAvailabilityGroup commandlet. By default any DAG network enumerated is ReplicationEnabled:TRUE and IgnoreNetwork:FALSE.

The network state MapiAccessEnabled is actually determined by the settings of the network adapter on the local server. A network is MapiAccessEnabled when we detect that the “Resgister this connection’s address in DNS” checkbox is enabled on the advanced TCP properties of the network adapter.

Why are DAG network states important? The replication services uses the defined network states to maintain the different cluster network states. Many customers have found that when attempting to modify a cluster network state using Failover Cluster Manager on a server that is a member of a DAG results in the change being reverted. How are the cluster network states mapped to DAG network states? Let us take a look:

MapiAccessEnabled	ReplicationEnabled	IgnoreNetwork	Cluster Use
TRUE	TRUE	FALSE	Enabled
TRUE	FALSE	FALSE	Enabled
FALSE	TRUE	FALSE	Internal
FALSE	FALSE	TRUE	Disabled

Any network that is IgnoreNetwork:TRUE results in a state of disabled regardless of the other settings.

Correctly managing DAG network states ensures that appropriate cluster network states are maintained and that the DAG / Cluster service can function as expected.

Part 1: My databases do not automatically mount after I enabled Datacenter Activation Coordination

13 Replies

Datacenter Activation Coordination (DAC) mode is a property of a Database Availability Group (DAG) that, when enabled, forces starting DAG members to acquire permission in order to mount databases. Administrators can enable DAC mode at any time after the DAG has been created. DAC was designed specifically to handle the following scenario:

You have a DAG extended to two datacenters.
You lose the power to your primary datacenter, which also takes out WAN connectivity between your primary and secondary datacenters.
Because primary datacenter power will be down for a while, you decide to activate your secondary datacenter and you perform a datacenter switchover.
Eventually, power is restored to your primary datacenter, but WAN connectivity between the two datacenters is not yet functional.
The DAG members starting up in the primary datacenter cannot communicate with any of the running DAG members in the secondary datacenter.

In this scenario, the starting DAG members in the primary datacenter have no idea that a datacenter switchover has occurred. They still believe they are responsible for hosting active copies of databases, and without DAC mode, if they have a sufficient number of votes to establish quorum, they would try to mount their active databases. This would result in a bad condition called split brain, which would occur at the database level. In this condition, multiple DAG members that cannot communicate with each other both host an active copy of the same mailbox database. This would be a very unfortunate condition that increases the chances of data loss, and make data recovery challenging and lengthy (albeit possible, but definitely not a situation we would want any customer to be in).

Once DAC mode is enabled, the integrated datacenter switchover tasks (Stop, Restore and Start-DatabaseAvailabilityGroup) are also enabled.

DAC mode works by using a bit stored in memory by Active Manager called the Datacenter Activation Coordination Protocol (DACP). DACP is simply a bit in memory set to either a 1 or a 0. A value of 1 means Active Manager can issue mount requests, and a value of 0 means it cannot.

The starting bit is always 0, and because the bit is held in memory, any time the Microsoft Exchange Replication service (MSExchangeRepl.exe) is stopped and restarted, the bit reverts to 0. In order to change its DACP bit to 1 and be able to mount databases, a starting DAG member needs to either:

Be able to communicate with any other DAG member that has a DACP bit set to 1; or
Be able to communicate with all DAG members that are listed on the StartedMailboxServers list.

If either condition is true, Active Manager on a starting DAG member will issue mount requests for the active databases copies it hosts. If neither condition is true, Active Manager will not issue any mount requests.

In order for the DACP bit to be set to 1 (mount database allowed) the starting DAG member must also be a member of the DAG’s cluster, and the cluster must have quorum.

For a variety of reasons, an administrator may need to shut down all members of a DAG. When starting up a DAG in DAC mode after a complete shutdown, databases may not mount automatically as they would if DAC mode were not enabled. This behavior may sound confusing but it is actuality by design. Let me explain why.

First, let’s view the configuration of a DAG using Get-DatabaseAvailabilityGroup (relevant attributes for this post highlighted in red):

[PS] C:>Get-DatabaseAvailabilityGroup -Identity DAG -status | fl

RunspaceId                             : c0bbcd75-40c8-41cb-8622-3550cd7e0e5e
Name                            : DAG
Servers                         : {DAG-4, DAG-3, DAG-2, DAG-1}
WitnessServer                   : mbx-1.domain.com
WitnessDirectory                : c:DAG-FSW
AlternateWitnessServer                 : mbx-2.domain.com
AlternateWitnessDirectory              : c:DAG-FSW
NetworkCompression                     : Enabled
NetworkEncryption                      : Enabled
DatacenterActivationMode        : DagOnly
StoppedMailboxServers           : {}
StartedMailboxServers           : {DAG-4.domain.com, DAG-2.domain.com, DAG-1.domain.com, DAG-3.domain.com}
DatabaseAvailabilityGroupIpv4Addresses : {10.0.0.24}
DatabaseAvailabilityGroupIpAddresses   : {10.0.0.24}
AllowCrossSiteRpcClientAccess          : False
OperationalServers              : {DAG-1, DAG-2, DAG-4, DAG-3}
PrimaryActiveManager            : DAG-1
ServersInMaintenance                   : {}
ThirdPartyReplication                  : Disabled
ReplicationPort                        : 64327
NetworkNames                           : {DAG-4-iSCSI, DAG-MAPI, DAG-REPL-A, DAG-REPL-B}
WitnessShareInUse               : Primary
AdminDisplayName                       :
ExchangeVersion                        : 0.10 (14.0.100.0)
DistinguishedName                      : CN=DAG,CN=Database Availability Groups,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),CN=Administrative Groups,CN=domain Home,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=home,DC=domain,DC=com
Identity                               : DAG
Guid                                   : 72c87136-6721-46e6-ac43-2ad5f6bd66d2
ObjectCategory                         : domain.com/Configuration/Schema/ms-Exch-MDB-Availability-Group
ObjectClass                            : {top, msExchMDBAvailabilityGroup}
WhenChanged                            : 1/29/2012 4:26:42 PM
WhenCreated                            : 9/19/2009 6:16:52 PM
WhenChangedUTC                         : 1/29/2012 9:26:42 PM
WhenCreatedUTC                         : 9/19/2009 10:16:52 PM
OrganizationId                         :
OriginatingServer                      : DC-5.domain.com
IsValid                                : True

The DAG has 4 members (DAG-1, DAG-2, DAG-3, and DAG-4) and MBX-1 is the witness server for the DAG.

During normal operations, all databases are mounted and available:

[PS] C:>Get-MailboxDatabaseCopyStatus *

Name                                          Status          CopyQueue ReplayQueue LastInspectedLogTime   ContentIndex
                                                              Length    Length                             State
—-                                          ——          ——— ———– ——————–   ————
DAG-1-DB0DAG-1                               Mounted         0         0                                  Healthy
DAG-DB0DAG-1                                 Mounted         0         0                                  Healthy
DAG-DB1DAG-1                                 Mounted         0         0                                  Healthy
DAG-2-DB0DAG-2                               Mounted         0         0                                  Healthy
DAG-DB1DAG-2                                 Healthy         0         0           1/29/2012 4:28:01 PM   Healthy
DAG-DB0DAG-2                                 Healthy         0         0           1/29/2012 4:28:04 PM   Healthy
DAG-DB0DAG-3                                 Healthy         0         617         1/29/2012 4:28:04 PM   Healthy
DAG-DB1DAG-3                                 Healthy         0         373         1/29/2012 4:28:01 PM   Healthy
DAG-DB0DAG-4                                 Healthy         0         2268        1/29/2012 4:28:04 PM   Healthy
DAG-DB1DAG-4                                 Healthy         0         1435        1/29/2012 4:28:01 PM   Healthy

To illustrate the scenario I will shut down all DAG members without manually dismounting or moving any databases. I will leave the witness server online and accessible.

[PS] C:>Get-MailboxDatabaseCopyStatus * | fl name,status

Name : DAG-1-DB0DAG-1
Status : ServiceDown

Name : DAG-DB0DAG-1
Status : ServiceDown

Name : DAG-DB1DAG-1
Status : ServiceDown

Name : DAG-2-DB0DAG-2
Status : ServiceDown

Name : DAG-DB1DAG-2
Status : ServiceDown

Name : DAG-DB0DAG-2
Status : ServiceDown

Name : DAG-DB0DAG-3
Status : ServiceDown

Name : DAG-DB1DAG-3
Status : ServiceDown

Name : DAG-DB0DAG-4
Status : ServiceDown

Name : DAG-DB1DAG-4
Status : ServiceDown

I’ll start by powering on DAG-1. Since DAG-1 and the witness server do not have a sufficient number of votes to achieve quorum (3 votes are necessary for quorum); therefore DAG-1 won’t be able to mount any databases.

Attempts to get the status of the DAG members using get-databaseavailabilitygroup –status fails with an error due to the cluster service not being initialized on the node.

[PS] C:>Get-DatabaseAvailabilityGroup -Identity DAG -Status | fl name,servers,witnessserver,witnessdirectory,datacenteractivationmode,stoppedmailboxservers,startedmailboxservers,operationalservers,primaryactivemanager,witnessshareinuse
A server-side administrative operation has failed. ‘GetDagNetworkConfig’ failed on the server. Error: The NetworkManager has not yet been initialized. Check the event logs to determine the cause. [Server: DAG-1.domain.com]
+ CategoryInfo : NotSpecified: (0:Int32) [Get-DatabaseAvailabilityGroup], DagNetworkRpcServerException
+ FullyQualifiedErrorId : C3C89A48,Microsoft.Exchange.Management.SystemConfigurationTasks.GetDatabaseAvailabilityGroup

Get-mailboxdatabasecopystatus * also reports all databases on DAG-1 as dismounted. All other nodes report service down.

[PS] C:>Get-MailboxDatabaseCopyStatus * | fl name,status

Name : DAG-1-DB0DAG-1
Status : Dismounted

Name : DAG-DB0DAG-1
Status : Dismounted

Name : DAG-DB1DAG-1
Status : Dismounted

Name : DAG-2-DB0DAG-2
Status : ServiceDown

Name : DAG-DB1DAG-2
Status : ServiceDown

Name : DAG-DB0DAG-2
Status : ServiceDown

Name : DAG-DB0DAG-3
Status : ServiceDown

Name : DAG-DB1DAG-3
Status : ServiceDown

Name : DAG-DB0DAG-4
Status : ServiceDown

Name : DAG-DB1DAG-4
Status : ServiceDown

Next, I’ll boot DAG-2. The addition of a second DAG member server allows quorum to be achieved. However, Active Manager on DAG-2 is unable to contact another DAG member that has a DACP bit of 1, and it can’t contact all of the DAG members on the StartedMailboxServers. If DAC mode was not enabled for this DAG, databases would have automatically mounted. But because DAC mode is enabled, the databases do not automatically mount.

Using the failover cluster PowerShell integration (Windows 2008 R2) we can see that two nodes of the cluster show up (indicating quorum was successfully achieved and the nodes successfully formed a cluster).

[PS] C:>Get-Cluster DAG | Get-ClusterNode | fl name,state

Name : dag-1
State : Up

Name : dag-2
State : Up

Name : dag-3
State : Down

Name : dag-4
State : Down

Using get-databaseavailabilitygroup –status will return the same error as previously recorded.

Using get-mailboxdatabasecopystatus * we can confirm that databases remain dismounted on server DAG-1 and copies on server DAG-2 failed.

[PS] C:>Get-MailboxDatabaseCopyStatus * | fl name,status

Name : DAG-1-DB0DAG-1
Status : Dismounted

Name : DAG-DB0DAG-1
Status : Dismounted

Name : DAG-DB1DAG-1
Status : Dismounted

Name : DAG-2-DB0DAG-2
Status : Dismounted

Name : DAG-DB1DAG-2
Status : Failed

Name : DAG-DB0DAG-2
Status : Failed

Name : DAG-DB0DAG-3
Status : ServiceDown

Name : DAG-DB1DAG-3
Status : ServiceDown

Name : DAG-DB0DAG-4
Status : ServiceDown

Name : DAG-DB1DAG-4
Status : ServiceDown

If the administrator attempts to mount a database an error will be displayed that the nodes either do not have quorum or automount consensus has not been reached.

[PS] C:>Mount-Database DAG-DB1
Couldn’t mount the database that you specified. Specified database: DAG-DB1; Error code: An Active Manager operation failed. Error An Active Manager operation encountered an error. To perform this operation, the server must be a member ofa database availability group, and the database availability group must have quorum. Error: Automount consensus not reached.. [Server: DAG-1.home.domain.com].
+ CategoryInfo : InvalidOperation: (DAG-DB1:ADObjectId) [Mount-Database], InvalidOperationException
+ FullyQualifiedErrorId : FE8BAD3C,Microsoft.Exchange.Management.SystemConfigurationTasks.MountDatabase

Next, I’ll boot DAG-3. As with DAG-2, although quorum is achieved, databases will not be automatically mounted. DAG-3 is unable to contact another server with a DACP bit of 1 or all of the servers on the StartedMailboxServers list.

[PS] C:>Get-Cluster DAG | Get-ClusterNode | fl name,state

Name : dag-1
State : Up

Name : dag-2
State : Up

Name : dag-3
State : Up

Name : dag-4
State : Down

Using get-databaseavailabilitygroup –status will return the same error as previously recorded.

Using get-mailboxdatabasecopystatus * we can confirm that databases remain dismounted on server DAG-1 and copies on server DAG-2 failed.

[PS] C:>Get-MailboxDatabaseCopyStatus * | fl name,status

Name : DAG-1-DB0DAG-1
Status : Dismounted

Name : DAG-DB0DAG-1
Status : Dismounted

Name : DAG-DB1DAG-1
Status : Dismounted

Name : DAG-2-DB0DAG-2
Status : Dismounted

Name : DAG-DB1DAG-2
Status : Failed

Name : DAG-DB0DAG-2
Status : Failed

Name : DAG-DB0DAG-3
Status : Failed

Name : DAG-DB1DAG-3
Status : Failed

Name : DAG-DB0DAG-4
Status : ServiceDown

Name : DAG-DB1DAG-4
Status : ServiceDown

If the administrator attempts to mount a database an error will be displayed that the nodes either do not have quorum or automount consensus has not been reached.

Finally, I’ll boot DAG-4

All nodes are a member of a cluster that has quorum.

At this point, all nodes are a member of a cluster that has quorum, and DAG-4 can contact all servers on the StartedMailboxServers list. Therefore, the DACP bit on DAG-4 is set to 1.

DAG-1, DAG-2, and DAG-3 can now contact a server with a DACP bit set to 1, and therefore they set their DACP bit set to 1.

[PS] C:>Get-Cluster DAG | Get-ClusterNode | fl name,state

Name : dag-1
State : Up

Name : dag-2
State : Up

Name : dag-3
State : Up

Name : dag-4
State : Up

Using get-databaseavailabilitygroup –status we can see that the DAG has successfully initialized, all nodes are operational, and a primary active manager has been initialized.

Name                     : DAG
Servers                  : {DAG-4, DAG-3, DAG-2, DAG-1}
WitnessServer            : mbx-1.domain.com
WitnessDirectory         : c:DAG-FSW
DatacenterActivationMode : DagOnly
StoppedMailboxServers    : {}
StartedMailboxServers    : {DAG-3.domain.com, DAG-4.domain.com, DAG-2.domain.com, DAG-1.home.domain.com}
OperationalServers       : {DAG-1, DAG-2, DAG-4, DAG-3}
PrimaryActiveManager     : DAG-1
WitnessShareInUse        : Primary

Using get-mailboxdatabasecopystatus * we can observe that databases have now automatically mounted and copies are healthy.

[PS] C:>Get-MailboxDatabaseCopyStatus * | fl name,status

Name : DAG-1-DB0DAG-1
Status : Mounted

Name : DAG-DB0DAG-1
Status : Mounted

Name : DAG-DB1DAG-1
Status : Mounted

Name : DAG-2-DB0DAG-2
Status : Mounted

Name : DAG-DB1DAG-2
Status : Healthy

Name : DAG-DB0DAG-2
Status : Healthy

Name : DAG-DB0DAG-3
Status : Healthy

Name : DAG-DB1DAG-3
Status : Healthy

Name : DAG-DB0DAG-4
Status : Healthy

Name : DAG-DB1DAG-4
Status : Healthy

As I’ve described above, when a DAG in DAC mode is started after a complete shutdown, databases will not be mountable until all DAG members are up, running, and in communication with each other.

*Special thanks to Scott Schnoll for reviewing and editing content.

========================================================

Datacenter Activation Coordination Series:

========================================================

Verifying log file truncation…

9 Replies

In support we see cases from customers regarding log file truncation. One question that commonly comes up is how do you verify log file truncation actually occurs?

The most common method that administrators use when verifying log file truncation post backup is to review the log file directory and note the count of files within this directory. In general, on servers under load that generate a large number of log files in-between backups, this is a safe method to verify log truncation. If there were many logs in the log directory prior to backup and only a few remain post backup, log file truncation would be considered successful.

In a lab environment or servers that are not under load the directory method of verification may not actually work. Take the following example:

I review the log directory prior to backup and the directory contains 10 logs.
I run a successful full backup.
I review the log directory after backup and the directory now contains 10 logs.

Visually it would appear that no log files truncated at all. In actuality 1 or 2 log files truncated at the same time you generated 1 or 2 more. Therefore there is no change in file count for the directory.

How can I verify that log truncation was successful?

In general I recommend avoiding the directory file count to verify log truncation. A reliable way to verify log file truncation is to record the log sequence both pre and post backup. This can be done using the eseutil /ml command. To dump an entire log sequence the administrator would run eseutil /ml ENN (ENN = log generation prefix). This will dump all log files found in the directory and their order – the output can be piped to a text file for later review. Let us take a look at an example.

Using a command prompt navigate to the log file directory.
Run eseutil /ml ENN. This will provide a list of the log files found in the directory. (Note that the error at the end of the command is expected since the current log file is locked for access).

[PS] P:DAGDAG-DB0DAG-DB0-Logs>eseutil /ml E02

Initiating FILE DUMP mode…

Verifying log files…
Base name: E02

      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A66C.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A66D.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A66E.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A66F.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A670.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A671.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A672.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A673.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A674.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A675.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A676.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A677.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A678.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A679.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A67A.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A67B.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A67C.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A67D.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A67E.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE02.log
                ERROR: Cannot open log file (P:DAGDAG-DB0DAG-DB0-LogsE02.log). Error -1032.

Operation terminated with error -1032 (JET_errFileAccessDenied, Cannot access file, the file is locked or in use) after 1.981 seconds.

Send mail with large attachments to force log file generation – this step is necessary when testing in a lab environment. (See http://blogs.technet.com/b/timmcmic/archive/2012/03/12/exchange-2010-log-truncation-and-checkpoint-at-log-creation-in-a-database-availability-group.aspx for an explanation of why this step is necessary)
Perform a full or incremental backup
Run eseutil /ml ENN. This will provide a list of the log files found in the directory. (Note that the error at the end of the command is expected since the current log file is locked for access).

[PS] P:DAGDAG-DB0DAG-DB0-Logs>eseutil /ml E02

Initiating FILE DUMP mode…

Verifying log files…
Base name: E02

      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A679.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A67A.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A67B.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A67C.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A67D.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A67E.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A67F.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A680.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A681.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A682.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A683.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A684.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A685.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A686.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A687.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A688.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A689.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A68A.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A68B.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A68C.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A68D.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A68E.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A68F.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A690.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A691.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A692.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A693.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A694.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A695.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A696.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A697.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A698.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A699.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A69A.log – OK
      Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A69B.log – OK

Compare the two files to note the first log file referenced in each file.
- Pre-backup: Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A66C.log – OK
- Post-backup: Log file: P:DAGDAG-DB0DAG-DB0-LogsE020000A679.log – OK
Since there is a difference in the first log noted in each output then log truncation can be considered successful.

By using this method you can not only reliably verify log files both pre and post backup but also record diagnostic information that may be helpful when engaging support for potential issues regarding log file truncation.

Backup dates and times are incorrect even though data has been committed to backup media successfully….

TIMMCMIC

Navigating the world of high availability….and occassionally sticking my head in the cloud…

Monthly Archives: May 2012

Exchange 2010: The mystery of the 9223372036854775766 copy queue…

Database Copy Self-Protection Mode

Background

A Larger-than-Life CQL

Where Did the Time Go?

What to Do When CQL Reaches 9223372036854775766?

Part 2: Datacenter Activation Coordination and the File Share Witness

Example #1

Example #2

Example #3

Example #4

Exchange 2010: Using seedingPostponed results in a seeded database…

Exchange 2010: Mapping DAG network states to cluster network states…

Part 1: My databases do not automatically mount after I enabled Datacenter Activation Coordination

Verifying log file truncation…

Backup dates and times are incorrect even though data has been committed to backup media successfully….