Author Archives: TIMMCMIC

Part 2: Datacenter Activation Coordination and the File Share Witness

In Part 1 of this series, I provided a high level overview of how Datacenter Activation Coordination (DAC) mode works and how the database mount process on startup is affected when DAC mode is enabled. (http://blogs.technet.com/b/timmcmic/archive/2012/05/21/my-databases-do-not-automatically-mount-after-i-enabled-datacenter-activation-coordination.aspx)

Remember, with DAC mode enabled, different rules apply for mounting databases on startup. The starting DAG member must be able to participate in a cluster that has quorum, and it must be able to communicate with another DAG member that has a DACP value of 1 or be able to communicate with all DAG members listed on the StartedMailboxServers list.

When a datacenter switchover has been performed and DAC mode is enabled, there could exist a condition where the standby datacenter contains a single DAG member (and therefore a single node cluster). This raises two interesting conditions:

A single node cluster always has quorum
There could be a single server on the started servers list

If the primary datacenter were restored to service without connectivity to the standby datacenter, this configuration could result in a split brain condition. To protect against this, we use an independent arbitrator to assist in determining the DACP bit: the boot time of the witness server.

When DAC mode is enabled, a DAG member will record two values in the registry at HKEY_LOCAL_MACHINE\Software\Microsoft\ExchangeServer\v14\Replay\Parameters:

BootTimeCookie = the boot time of the DAG member
BootTimeFSWCookie = the boot time of the witness server (which we obtain using WMI)

When DAC mode is enabled, there are different mount-on-startup rules that apply when only a single DAG member remains:

If the bootTimeCookie equals the boot time of the DAG member <and> the bootTimeFSWCookie does not equal the boot time of the witness server, then the DACP bit is set to 1.
If the bootTimeFSWCookie equals the boot time of the witness server <and> the bootTimeCookie does not equal the boot time of the DAG member, then the DACP bit is set to 1.
If the bootTimeFSWCookie equals the boot time of the witness server <and> the bootTimeCookie equals the boot time of the DAG member, then the DACP bit is set to 1.
If the bootTimeFSWCookie is not equal to the boot time of the witness server <and> the bootTimeCookie is not equal to the boot time of the DAG member, then the DACP bit remains at 0.
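
The four rules above reduce to a single check: the DACP bit is set to 1 if at least one of the two recorded cookies still matches the current boot time. The following is a minimal sketch of that decision table, not the Replication service's actual code; the function and parameter names are hypothetical, and boot times are compared as plain strings.

```python
def single_node_dacp(boot_cookie, member_boot_time, fsw_cookie, witness_boot_time):
    """Return the DACP bit (1 or 0) for a lone DAG member.

    Hypothetical helper that models the four rules listed above:
    the bit stays at 0 only when BOTH cookies differ from the
    boot times currently reported for the member and the witness.
    """
    member_unchanged = boot_cookie == member_boot_time
    witness_unchanged = fsw_cookie == witness_boot_time
    return 1 if (member_unchanged or witness_unchanged) else 0


# Only the DAG member rebooted, the witness kept running.
print(single_node_dacp("05/27/2012 18:58:31", "05/27/2012 19:11:49",
                       "05/27/2012 16:28:12", "05/27/2012 16:28:12"))

# Both the DAG member and the witness rebooted.
print(single_node_dacp("05/27/2012 19:27:30", "05/27/2012 19:55:40",
                       "05/27/2012 19:36:51", "05/27/2012 19:47:49"))
```

The first call uses a member reboot with an unchanged witness, the second a reboot of both machines; the timestamps are only sample inputs.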

In the following examples, a two-member DAG was configured, and a datacenter switchover was performed resulting in a single-node cluster. The specific test, with tracing data, is provided for each example.

Example #1

In this example, the Microsoft Exchange Replication service on the single surviving node is restarted. Neither the DAG member itself nor the witness server was restarted. The bootTimeFSWCookie equals the boot time of the witness server <and> the bootTimeCookie equals the boot time of the DAG member, resulting in a DACP bit of 1.

438 00000000 5264 Cluster.Replay ActiveManager GetBootTimeWithWmi: WMI says that the boot time for dc.exchange.msft is 05/27/2012 16:28:12.
439 00000000 5264 Cluster.Replay ActiveManager DetermineAutomountConsensus: checking if the replay service has restarted since the MommyMayIMount bit was set.
455 00000000 5264 Cluster.Replay ActiveManager GetBootTimeWithWmi: WMI says that the boot time for MBX-2.exchange.msft is 05/27/2012 18:02:49.
456 00000000 5264 Cluster.Replay ActiveManager DetermineAutomountConsensus: WMI says the boot time is 05/27/2012 18:02:49, and the boot time when the Mount bit was set was 05/27/2012 18:02:49.
457 00000000 5264 Cluster.Replay ActiveManager DetermineAutomountConsensus found matching boot timestamps, assuming that the replay service has restarted since setting the bit.
458 00000000 5264 Cluster.Replay ActiveManager AllowAutoMount called: Found matching boot timestamps, assuming that the replay service has restarted since setting the bit.
460 00000000 5264 Cluster.Replay ActiveManager RefreshConfigInternal: The Automount consensus is true.

Example #2

In this example, the remaining DAG member was restarted and the witness server remained running. The bootTimeFSWCookie equals the boot time of the witness server <and> the bootTimeCookie does not equal the boot time of the DAG member, resulting in a DACP bit of 1.

85 00000000 2996 Cluster.Replay ActiveManager GetBootTimeWithWmi: WMI says that the boot time for dc.exchange.msft is 05/27/2012 16:28:12.
86 00000000 2996 Cluster.Replay ActiveManager DetermineAutomountConsensus: checking if the replay service has restarted since the MommyMayIMount bit was set.
87 00000000 2996 Cluster.Replay ActiveManager GetBootTimeWithWmi: WMI says that the boot time for MBX-2.exchange.msft is 05/27/2012 19:11:49.
88 00000000 2996 Cluster.Replay ActiveManager DetermineAutomountConsensus: WMI says the boot time is 05/27/2012 19:11:49, and the boot time when the Mount bit was set was 05/27/2012 18:58:31.
89 00000000 2996 Cluster.Replay ActiveManager DetermineAutomountConsensusUnanimity: There is only one node in the cluster — this is not sufficient to allow mounts!
90 00000000 2996 Cluster.Replay ActiveManager DetermineAutomountConsensusForSingleMachine: checking if the file share witness has restarted since the MommyMayIMount bit was set.
91 00000000 2996 Cluster.Replay ActiveManager DetermineAutomountConsensusForSingleMachine: WMI says the boot time for the FSW server is 05/27/2012 16:28:12, and the boot time when the Mount bit was set was 05/27/2012 16:28:12.
92 00000000 2996 Cluster.Replay ActiveManager DetermineAutomountConsensusForSingleMachine found matching boot timestamps, assuming that only this computer has restarted since setting the bit.
93 00000000 2996 Cluster.Replay ActiveManager AllowAutoMount called: Found matching FSW boot timestamps, assuming that only this computer has restarted since setting the bit.
94 00000000 2996 Cluster.Replay ActiveManager GetBootTimeWithWmi: WMI says that the boot time for MBX-2.exchange.msft is 05/27/2012 19:11:49.
95 00000000 2996 Cluster.Replay ActiveManager DetermineAutomountConsensusUnanimity is returning True.
96 00000000 2996 Cluster.Replay ActiveManager RefreshConfigInternal: The Automount consensus is true.

Example #3

In this example, the witness server was rebooted and the Microsoft Exchange Replication service on the DAG member was restarted. The bootTimeFSWCookie does not equal the boot time of the witness server <and> the bootTimeCookie does equal the boot time of the DAG member, resulting in a DACP bit of 1.

263 00000000 1552 Cluster.Replay ActiveManager GetBootTimeWithWmi: WMI says that the boot time for dc.exchange.msft is 05/27/2012 19:36:51.
264 00000000 1552 Cluster.Replay ActiveManager DetermineAutomountConsensus: checking if the replay service has restarted since the MommyMayIMount bit was set.
265 00000000 1552 Cluster.Replay ActiveManager GetBootTimeWithWmi: WMI says that the boot time for MBX-2.exchange.msft is 05/27/2012 19:27:30.
266 00000000 1552 Cluster.Replay ActiveManager DetermineAutomountConsensus: WMI says the boot time is 05/27/2012 19:27:30, and the boot time when the Mount bit was set was 05/27/2012 19:27:30.
267 00000000 1552 Cluster.Replay ActiveManager DetermineAutomountConsensus found matching boot timestamps, assuming that the replay service has restarted since setting the bit.
268 00000000 1552 Cluster.Replay ActiveManager AllowAutoMount called: Found matching boot timestamps, assuming that the replay service has restarted since setting the bit.
269 00000000 1552 Cluster.Replay ActiveManager GetBootTimeWithWmi: WMI says that the boot time for MBX-2.exchange.msft is 05/27/2012 19:27:30.
270 00000000 1552 Cluster.Replay ActiveManager RefreshConfigInternal: The Automount consensus is true.

Example #4

In this last example, both the witness server and the remaining single DAG member were restarted. Thus, the bootTimeFSWCookie does not equal the boot time of the witness server <and> the bootTimeCookie does not equal the boot time of the remaining DAG member. As such, the DACP bit remains at 0.

76 00000000 3664 Cluster.Replay ActiveManager GetBootTimeWithWmi: WMI says that the boot time for dc.exchange.msft is 05/27/2012 19:47:49.
77 00000000 3664 Cluster.Replay ActiveManager DetermineAutomountConsensus: checking if the replay service has restarted since the MommyMayIMount bit was set.
78 00000000 3664 Cluster.Replay ActiveManager GetBootTimeWithWmi: WMI says that the boot time for MBX-2.exchange.msft is 05/27/2012 19:55:40.
79 00000000 3664 Cluster.Replay ActiveManager DetermineAutomountConsensus: WMI says the boot time is 05/27/2012 19:55:40, and the boot time when the Mount bit was set was 05/27/2012 19:27:30.
80 00000000 3664 Cluster.Replay ActiveManager DetermineAutomountConsensusUnanimity: There is only one node in the cluster — this is not sufficient to allow mounts!
81 00000000 3664 Cluster.Replay ActiveManager DetermineAutomountConsensusForSingleMachine: checking if the file share witness has restarted since the MommyMayIMount bit was set.
82 00000000 3664 Cluster.Replay ActiveManager DetermineAutomountConsensusForSingleMachine: WMI says the boot time for the FSW server is 05/27/2012 19:47:49, and the boot time when the Mount bit was set was 05/27/2012 19:36:51.
83 00000000 3664 Cluster.Replay ActiveManager DetermineAutomountConsensusUnanimity is returning False.
84 00000000 3664 Cluster.Replay ActiveManager Automount consensus not reached, going to Unknown AM role.

When performing a datacenter switchover where only a single node remains in the cluster supporting the DAG, any reboot that changes both the boot time of the witness server and the boot time of the DAG member will prevent databases from mounting automatically. If the reboots were necessary and valid operations, administrators can force the databases online without causing split brain.

========================================================

Datacenter Activation Coordination Series:

Part 1: My databases do not mount automatically after I enabled Datacenter Activation Coordination (https://aka.ms/F6k65e)
Part 2: Datacenter Activation Coordination and the File Share Witness (https://aka.ms/Wsesft)
Part 3: Datacenter Activation Coordination and the Single Node Cluster (https://aka.ms/N3ktdy)
Part 4: Datacenter Activation Coordination and the Prevention of Split Brain (https://aka.ms/C13ptq)
Part 5: Datacenter Activation Coordination: How do I Force Automount Concensus? (https://aka.ms/T5sgqa)
Part 6: Datacenter Activation Coordination: Who has a say? (https://aka.ms/W51h6n)
Part 7: Datacenter Activation Coordination: When to run start-databaseavailabilitygroup to bring members back into the DAG after a datacenter switchover. (https://aka.ms/Oieqqp)
Part 8: Datacenter Activation Coordination: Stop! In the Name of DAG… (https://aka.ms/Uzogbq)
Part 9: Datacenter Activation Coordination: An error cause a change in the current set of domain controllers (https://aka.ms/Qlt035)

========================================================

Exchange 2010: Using seedingPostponed results in a seeded database…

In Exchange 2010, the Add-MailboxDatabaseCopy cmdlet adds mailbox database copies to database availability group (DAG) members. When a copy is first added to a member, the database is automatically seeded along with the content index. Sometimes an administrator wants to add a copy without immediately seeding the database. To do this, Add-MailboxDatabaseCopy is run with the -SeedingPostponed switch.

Some administrators have noticed that, after adding a database copy with the -SeedingPostponed switch, the copy is healthy and the database has “seeded” anyway. Let us take a look at how this can happen…

A new database is created on the Exchange 2010 server and mounted. This results in a new log stream and a new EDB file. The administrator invokes the Add-MailboxDatabaseCopy cmdlet with the -SeedingPostponed switch.

Add-MailboxDatabaseCopy -Identity DB1 -MailboxServer DAG-2 -SeedingPostponed

The command completes successfully. The copy is verified using the get-mailboxdatabasecopystatus command.

[PS] C:\>Get-MailboxDatabaseCopyStatus DB1*

Name                                          Status          CopyQueue ReplayQueue LastInspectedLogTime   ContentIndex
                                                              Length    Length                             State
----                                          ------          --------- ----------- --------------------   ------------
DB1\DAG-1                                     Mounted         0         0                                  Healthy
DB1\DAG-2                                     Healthy         0         0           5/7/2012 8:56:09 AM    Healthy

The copy status shows healthy even though the database was not seeded. How did this occur? When a database is mounted for the first time the log sequence is created first – this is to allow us to actually log the creation of the EDB file. When looking at the log records of the first log file you can see a createDB record populated:

[PS] H:\DB1>eseutil /ml .\E0400000001.log /v

Extensible Storage Engine Utilities for Microsoft(R) Exchange Server
Version 14.02
Copyright (C) Microsoft Corporation. All Rights Reserved.

Initiating FILE DUMP mode…

      Base name: E04
      Log file: .\E0400000001.log
      lGeneration: 1 (0x1)
      Checkpoint: (0x3A,FFFF,FFFF)
      creation time: 05/07/2012 08:39:17
      prev gen time: 00/00/1900 00:00:00
      Format LGVersion: (7.3704.16.2)
      Engine LGVersion: (7.3704.16.2)
      Signature: Create time:05/07/2012 08:39:17 Rand:2179103003 Computer:
      Env SystemPath: h:\DB1
      Env LogFilePath: h:\DB1
      Env Log Sec size: 512 (matches)
      Env (CircLog,Session,Opentbl,VerPage,Cursors,LogBufs,LogFile,Buffers
          (    off,   1027, 51350, 16384, 51350,   2048,   2048, 29487
      Using Reserved Log File: false
      Circular Logging Flag (current file): off
      Circular Logging Flag (past files): off
      Checkpoint at log creation time: (0x1,8,0)

Last Lgpos: (0x1,A,0)

==================================
Op         # Records     Avg. Size
----------------------------------
Others         294             3
Begin            0             0
Commit           0             0
Rollback         0             0
Refresh          0             0
MacroOp          0             0
CreateDB         1           113
AttachDB         0             0
DetachDB         0             0
ShutDown         0             0
CreateFDP        0             0
Convert          0             0
Split            0             0
Merge            0             0
Insert           0             0
Replace          0             0
Delete           0             0
UndoInfo         0             0
Delta            0             0
SetExtHdr        0             0
Undo             0             0
EmptyTree        0             0
BeginDT          0             0
PreCommit        0             0
PreRollbk        0             0
FFlushLog        0             0
Convert          0             0
FRollLog         0             0
Split2           0             0
Merge2           0             0
Scrub            0             0
PageMove         0             0
PagePatch        0             0
McroInfo         0             0
ExtendDB         0             0
Ignored          0             0
Ignored          0             0
Ignored          0             0
Ignored          0             0
Ignored          0             0
Ignored          0             0
Ignored          0             0
==================================

Number of database page references: 0

Integrity check passed for log file: .\E0400000001.log

Operation completed successfully in 0.343 seconds.

When reviewing the file system on the target node, administrators note that logs and an EDB file do exist.

Adding a database copy with SeedingPostponed does not result in a copy that is suspended. The replication service acknowledges the copy has been added, determines that the first log file exists and contains the CreateDB record, and subsequently begins copying log files. As log files are copied they are replayed on the target server, processing the CreateDB record, resulting in the EDB file creation. The database is effectively seeded through shipping and replaying the log sequence.
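
To check whether a given log file carries the CreateDB record, the operation table in the eseutil /ml /v dump can be read by a script. The snippet below is a rough sketch that assumes the dump has been captured as text in the layout shown earlier; the helper names are hypothetical and it only inspects text, it does not call eseutil itself.

```python
import re

# Matches rows such as "CreateDB         1           113":
# an operation name followed by exactly two integer columns.
_ROW = re.compile(r"^\s*([A-Za-z0-9]+)\s+(\d+)\s+(\d+)\s*$")


def parse_op_counts(dump_text):
    """Collect '# Records' per operation from an eseutil /ml /v dump.

    Operation names that appear more than once in the table
    (for example 'Ignored') are summed.
    """
    counts = {}
    for line in dump_text.splitlines():
        match = _ROW.match(line)
        if match:
            op, records = match.group(1), int(match.group(2))
            counts[op] = counts.get(op, 0) + records
    return counts


def log_creates_database(dump_text):
    """True when the dumped log contains at least one CreateDB record."""
    return parse_op_counts(dump_text).get("CreateDB", 0) > 0


sample = """
Op         # Records     Avg. Size
Others         294             3
CreateDB         1           113
AttachDB         0             0
"""
print(log_creates_database(sample))
```

If the first log generation is still present and this check is true, a copy added with SeedingPostponed can be expected to build its own EDB file through log replay, as described above.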

What happens when the first log file does not exist – for example in situations where a backup has been performed? After adding the copy with seedingPostponed the administrator is presented with a warning verifying that database seeding is required:

WARNING: Replication is suspended for database copy ‘DB1’ because the database copy needs to be seeded.

When reviewing the copy status with get-mailboxdatabasecopystatus the added copy is now suspended:

Name                                          Status          CopyQueue ReplayQueue LastInspectedLogTime   ContentIndex
                                                              Length    Length                             State
----                                          ------          --------- ----------- --------------------   ------------
DB1\DAG-1                                     Mounted         0         0                                  Healthy
DB1\DAG-2                                     FailedAndSus... 63        0                                  Failed

At this time the database can be manually seeded using the Update-MailboxDatabaseCopy command.

It is important to remember that SeedingPostponed does not always result in a suspended copy. If using SeedingPostponed with a new database that still has the first log file, the administrator must manually suspend the copy after adding it to ensure that no logs ship and no database is created on the target until manual seeding is performed.

Exchange 2010: Mapping DAG network states to cluster network states…

In Exchange 2010 after a Database Availability Group (DAG) is created a series of DAG networks is also created. These DAG networks serve to define how Exchange 2010 servers within the DAG will replicate and seed databases. The initial network topology that is established is based off the enumeration of cluster networks.

The cluster service defines two core states for a cluster network and one substate:

Allow cluster network communication on this network (Cluster Use: Internal)
- Allow clients to connect through this network (Cluster Use: Enabled)
Do not allow cluster network communications on this network (Cluster Use: Disabled)

The setting “Allow cluster network communications on this network” signifies this network can be utilized for heartbeat and cluster configuration traffic. When combined with the substate “Allow clients to connect through this network” this signifies that virtual IP addresses created within the cluster are allowed on this network. The setting “Do not allow cluster network communications on this network” signifies that the network is disabled for cluster use.

Exchange 2010 defines three network states associated with a Database Availability Group Network:

MapiAccessEnabled
ReplicationEnabled
IgnoreNetwork

These states appear in the output of Get-DatabaseAvailabilityGroupNetwork, for example:

RunspaceId         : 5fdd66b9-afd7-47b7-b985-369dc8da7ac4
Name               : DAG-MAPI
Description        :
Subnets            : {{10.0.0.0/24,Up}}
Interfaces         : {{DAG-1,Up,10.0.0.20}, {DAG-2,Up,10.0.0.21}}
MapiAccessEnabled  : True
ReplicationEnabled : False
IgnoreNetwork      : False
Identity           : DAG\DAG-MAPI
IsValid            : True

The network states ReplicationEnabled and IgnoreNetwork can be adjusted by the administrator using the Set-DatabaseAvailabilityGroupNetwork cmdlet. By default, any DAG network enumerated is ReplicationEnabled:TRUE and IgnoreNetwork:FALSE.

The network state MapiAccessEnabled is actually determined by the settings of the network adapter on the local server. A network is MapiAccessEnabled when we detect that the “Register this connection’s addresses in DNS” checkbox is enabled on the advanced TCP/IP properties of the network adapter.

Why are DAG network states important? The replication service uses the defined network states to maintain the corresponding cluster network states. Many customers have found that a cluster network state changed with Failover Cluster Manager on a server that is a member of a DAG is subsequently reverted. How are the cluster network states mapped to DAG network states? Let us take a look:

MapiAccessEnabled	ReplicationEnabled	IgnoreNetwork	Cluster Use
TRUE	TRUE	FALSE	Enabled
TRUE	FALSE	FALSE	Enabled
FALSE	TRUE	FALSE	Internal
FALSE	FALSE	TRUE	Disabled

Any network that is IgnoreNetwork:TRUE results in a state of disabled regardless of the other settings.
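
The mapping can be written down as a small lookup. This is only an illustration of the table above with hypothetical names; combinations the table does not list (for example all three values FALSE) return None rather than a guessed state.

```python
# (MapiAccessEnabled, ReplicationEnabled) -> Cluster Use, for networks
# that are not ignored. Only the rows shown in the table are included.
_CLUSTER_USE = {
    (True, True): "Enabled",
    (True, False): "Enabled",
    (False, True): "Internal",
}


def cluster_use(mapi_access_enabled, replication_enabled, ignore_network):
    """Return the cluster network state for a DAG network, or None
    when the combination is not covered by the table above."""
    if ignore_network:
        # IgnoreNetwork:TRUE wins regardless of the other two settings.
        return "Disabled"
    return _CLUSTER_USE.get((mapi_access_enabled, replication_enabled))


print(cluster_use(True, False, False))   # the DAG-MAPI network shown earlier
print(cluster_use(False, True, False))   # a replication-only network
print(cluster_use(True, True, True))     # ignored network
```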

Correctly managing DAG network states ensures that appropriate cluster network states are maintained and that the DAG / Cluster service can function as expected.

Part 1: My databases do not automatically mount after I enabled Datacenter Activation Coordination

Datacenter Activation Coordination (DAC) mode is a property of a Database Availability Group (DAG) that, when enabled, forces starting DAG members to acquire permission in order to mount databases. Administrators can enable DAC mode at any time after the DAG has been created. DAC was designed specifically to handle the following scenario:

You have a DAG extended to two datacenters.
You lose the power to your primary datacenter, which also takes out WAN connectivity between your primary and secondary datacenters.
Because primary datacenter power will be down for a while, you decide to activate your secondary datacenter and you perform a datacenter switchover.
Eventually, power is restored to your primary datacenter, but WAN connectivity between the two datacenters is not yet functional.
The DAG members starting up in the primary datacenter cannot communicate with any of the running DAG members in the secondary datacenter.

In this scenario, the starting DAG members in the primary datacenter have no idea that a datacenter switchover has occurred. They still believe they are responsible for hosting active copies of databases, and without DAC mode, if they have a sufficient number of votes to establish quorum, they would try to mount their active databases. This would result in a bad condition called split brain, which would occur at the database level. In this condition, multiple DAG members that cannot communicate with each other each host an active copy of the same mailbox database. This would be a very unfortunate condition that increases the chances of data loss and makes data recovery challenging and lengthy (possible, but definitely not a situation we would want any customer to be in).

Once DAC mode is enabled, the integrated datacenter switchover tasks (Stop, Restore and Start-DatabaseAvailabilityGroup) are also enabled.

DAC mode works by using a bit stored in memory by Active Manager called the Datacenter Activation Coordination Protocol (DACP). DACP is simply a bit in memory set to either a 1 or a 0. A value of 1 means Active Manager can issue mount requests, and a value of 0 means it cannot.

The starting bit is always 0, and because the bit is held in memory, any time the Microsoft Exchange Replication service (MSExchangeRepl.exe) is stopped and restarted, the bit reverts to 0. In order to change its DACP bit to 1 and be able to mount databases, a starting DAG member needs to either:

Be able to communicate with any other DAG member that has a DACP bit set to 1; or
Be able to communicate with all DAG members that are listed on the StartedMailboxServers list.

If either condition is true, Active Manager on a starting DAG member will issue mount requests for the active databases copies it hosts. If neither condition is true, Active Manager will not issue any mount requests.

In order for the DACP bit to be set to 1 (mount database allowed) the starting DAG member must also be a member of the DAG’s cluster, and the cluster must have quorum.
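
Put together, the general rule can be sketched as follows. This is a simplified model under stated assumptions, not Active Manager's implementation: the function and parameter names are hypothetical, the starting member counts itself as reachable, and the single-node witness check is left out.

```python
def may_set_dacp(local_server, in_cluster_with_quorum,
                 reachable_peer_bits, started_mailbox_servers):
    """Decide whether a starting DAG member may flip its DACP bit to 1.

    reachable_peer_bits maps each peer the member can talk to
    onto that peer's current DACP bit (0 or 1).
    """
    if not in_cluster_with_quorum:
        return False
    # Condition 1: any reachable member already has a DACP bit of 1.
    if any(bit == 1 for bit in reachable_peer_bits.values()):
        return True
    # Condition 2: every server on StartedMailboxServers is reachable.
    reachable = set(reachable_peer_bits) | {local_server}
    return set(started_mailbox_servers) <= reachable


started = {"DAG-1", "DAG-2", "DAG-3", "DAG-4"}
# Two of four members are up and neither holds a DACP bit of 1.
print(may_set_dacp("DAG-2", True, {"DAG-1": 0}, started))
# All four members are up, so the whole started list is reachable.
print(may_set_dacp("DAG-4", True, {"DAG-1": 0, "DAG-2": 0, "DAG-3": 0}, started))
```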

For a variety of reasons, an administrator may need to shut down all members of a DAG. When starting up a DAG in DAC mode after a complete shutdown, databases may not mount automatically as they would if DAC mode were not enabled. This behavior may sound confusing but it is actually by design. Let me explain why.

First, let’s view the configuration of a DAG using Get-DatabaseAvailabilityGroup (the attributes relevant to this post are Name, Servers, WitnessServer, WitnessDirectory, DatacenterActivationMode, StoppedMailboxServers, StartedMailboxServers, OperationalServers, PrimaryActiveManager, and WitnessShareInUse):

[PS] C:\>Get-DatabaseAvailabilityGroup -Identity DAG -status | fl

RunspaceId                             : c0bbcd75-40c8-41cb-8622-3550cd7e0e5e
Name                                   : DAG
Servers                                : {DAG-4, DAG-3, DAG-2, DAG-1}
WitnessServer                          : mbx-1.domain.com
WitnessDirectory                       : c:\DAG-FSW
AlternateWitnessServer                 : mbx-2.domain.com
AlternateWitnessDirectory              : c:\DAG-FSW
NetworkCompression                     : Enabled
NetworkEncryption                      : Enabled
DatacenterActivationMode               : DagOnly
StoppedMailboxServers                  : {}
StartedMailboxServers                  : {DAG-4.domain.com, DAG-2.domain.com, DAG-1.domain.com, DAG-3.domain.com}
DatabaseAvailabilityGroupIpv4Addresses : {10.0.0.24}
DatabaseAvailabilityGroupIpAddresses   : {10.0.0.24}
AllowCrossSiteRpcClientAccess          : False
OperationalServers                     : {DAG-1, DAG-2, DAG-4, DAG-3}
PrimaryActiveManager                   : DAG-1
ServersInMaintenance                   : {}
ThirdPartyReplication                  : Disabled
ReplicationPort                        : 64327
NetworkNames                           : {DAG-4-iSCSI, DAG-MAPI, DAG-REPL-A, DAG-REPL-B}
WitnessShareInUse                      : Primary
AdminDisplayName                       :
ExchangeVersion                        : 0.10 (14.0.100.0)
DistinguishedName                      : CN=DAG,CN=Database Availability Groups,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),CN=Administrative Groups,CN=domain Home,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=home,DC=domain,DC=com
Identity                               : DAG
Guid                                   : 72c87136-6721-46e6-ac43-2ad5f6bd66d2
ObjectCategory                         : domain.com/Configuration/Schema/ms-Exch-MDB-Availability-Group
ObjectClass                            : {top, msExchMDBAvailabilityGroup}
WhenChanged                            : 1/29/2012 4:26:42 PM
WhenCreated                            : 9/19/2009 6:16:52 PM
WhenChangedUTC                         : 1/29/2012 9:26:42 PM
WhenCreatedUTC                         : 9/19/2009 10:16:52 PM
OrganizationId                         :
OriginatingServer                      : DC-5.domain.com
IsValid                                : True

The DAG has 4 members (DAG-1, DAG-2, DAG-3, and DAG-4) and MBX-1 is the witness server for the DAG.

During normal operations, all databases are mounted and available:

[PS] C:\>Get-MailboxDatabaseCopyStatus *

Name                                          Status          CopyQueue ReplayQueue LastInspectedLogTime   ContentIndex
                                                              Length    Length                             State
----                                          ------          --------- ----------- --------------------   ------------
DAG-1-DB0\DAG-1                               Mounted         0         0                                  Healthy
DAG-DB0\DAG-1                                 Mounted         0         0                                  Healthy
DAG-DB1\DAG-1                                 Mounted         0         0                                  Healthy
DAG-2-DB0\DAG-2                               Mounted         0         0                                  Healthy
DAG-DB1\DAG-2                                 Healthy         0         0           1/29/2012 4:28:01 PM   Healthy
DAG-DB0\DAG-2                                 Healthy         0         0           1/29/2012 4:28:04 PM   Healthy
DAG-DB0\DAG-3                                 Healthy         0         617         1/29/2012 4:28:04 PM   Healthy
DAG-DB1\DAG-3                                 Healthy         0         373         1/29/2012 4:28:01 PM   Healthy
DAG-DB0\DAG-4                                 Healthy         0         2268        1/29/2012 4:28:04 PM   Healthy
DAG-DB1\DAG-4                                 Healthy         0         1435        1/29/2012 4:28:01 PM   Healthy

To illustrate the scenario I will shut down all DAG members without manually dismounting or moving any databases. I will leave the witness server online and accessible.

[PS] C:\>Get-MailboxDatabaseCopyStatus * | fl name,status

Name : DAG-1-DB0\DAG-1
Status : ServiceDown

Name : DAG-DB0\DAG-1
Status : ServiceDown

Name : DAG-DB1\DAG-1
Status : ServiceDown

Name : DAG-2-DB0\DAG-2
Status : ServiceDown

Name : DAG-DB1\DAG-2
Status : ServiceDown

Name : DAG-DB0\DAG-2
Status : ServiceDown

Name : DAG-DB0\DAG-3
Status : ServiceDown

Name : DAG-DB1\DAG-3
Status : ServiceDown

Name : DAG-DB0\DAG-4
Status : ServiceDown

Name : DAG-DB1\DAG-4
Status : ServiceDown

I’ll start by powering on DAG-1. DAG-1 and the witness server together do not have a sufficient number of votes to achieve quorum (3 votes are necessary for quorum); therefore, DAG-1 won’t be able to mount any databases.
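
The vote arithmetic behind that statement can be checked with a short sketch. It assumes a simple majority model in which a DAG with an even number of members also counts the witness as one vote; the function names are hypothetical and the model ignores everything else the cluster service takes into account.

```python
def votes_needed(total_votes):
    """A majority of the configured votes."""
    return total_votes // 2 + 1


def has_quorum(members_up, total_members, witness_available=True):
    """Simplified majority check for a DAG's underlying cluster.

    Assumption: the witness contributes a vote only when the DAG
    has an even number of members.
    """
    uses_witness = total_members % 2 == 0
    total_votes = total_members + (1 if uses_witness else 0)
    votes_present = members_up + (1 if uses_witness and witness_available else 0)
    return votes_present >= votes_needed(total_votes)


print(votes_needed(5))      # four members plus the witness
print(has_quorum(1, 4))     # only DAG-1 and the witness are up
print(has_quorum(2, 4))     # a second member has started
```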

Attempts to get the status of the DAG members using Get-DatabaseAvailabilityGroup -Status fail with an error because the cluster service has not initialized on the node.

[PS] C:\>Get-DatabaseAvailabilityGroup -Identity DAG -Status | fl name,servers,witnessserver,witnessdirectory,datacenteractivationmode,stoppedmailboxservers,startedmailboxservers,operationalservers,primaryactivemanager,witnessshareinuse
A server-side administrative operation has failed. ‘GetDagNetworkConfig’ failed on the server. Error: The NetworkManager has not yet been initialized. Check the event logs to determine the cause. [Server: DAG-1.domain.com]
+ CategoryInfo : NotSpecified: (0:Int32) [Get-DatabaseAvailabilityGroup], DagNetworkRpcServerException
+ FullyQualifiedErrorId : C3C89A48,Microsoft.Exchange.Management.SystemConfigurationTasks.GetDatabaseAvailabilityGroup

Get-MailboxDatabaseCopyStatus * reports all databases on DAG-1 as dismounted. All other nodes report ServiceDown.

[PS] C:\>Get-MailboxDatabaseCopyStatus * | fl name,status

Name : DAG-1-DB0\DAG-1
Status : Dismounted

Name : DAG-DB0\DAG-1
Status : Dismounted

Name : DAG-DB1\DAG-1
Status : Dismounted

Name : DAG-2-DB0\DAG-2
Status : ServiceDown

Name : DAG-DB1\DAG-2
Status : ServiceDown

Name : DAG-DB0\DAG-2
Status : ServiceDown

Name : DAG-DB0\DAG-3
Status : ServiceDown

Name : DAG-DB1\DAG-3
Status : ServiceDown

Name : DAG-DB0\DAG-4
Status : ServiceDown

Name : DAG-DB1\DAG-4
Status : ServiceDown

Next, I’ll boot DAG-2. The addition of a second DAG member server allows quorum to be achieved. However, Active Manager on DAG-2 is unable to contact another DAG member that has a DACP bit of 1, and it can’t contact all of the DAG members on the StartedMailboxServers list. If DAC mode were not enabled for this DAG, databases would have automatically mounted. But because DAC mode is enabled, the databases do not automatically mount.

Using the failover cluster PowerShell integration (Windows Server 2008 R2) we can see that two nodes of the cluster show up (indicating quorum was successfully achieved and the nodes successfully formed a cluster).

[PS] C:\>Get-Cluster DAG | Get-ClusterNode | fl name,state

Name : dag-1
State : Up

Name : dag-2
State : Up

Name : dag-3
State : Down

Name : dag-4
State : Down

Using Get-DatabaseAvailabilityGroup -Status will return the same error as previously recorded.

Using get-mailboxdatabasecopystatus * we can confirm that databases remain dismounted on server DAG-1 and copies on server DAG-2 failed.

[PS] C:\>Get-MailboxDatabaseCopyStatus * | fl name,status

Name : DAG-1-DB0\DAG-1
Status : Dismounted

Name : DAG-DB0\DAG-1
Status : Dismounted

Name : DAG-DB1\DAG-1
Status : Dismounted

Name : DAG-2-DB0\DAG-2
Status : Dismounted

Name : DAG-DB1\DAG-2
Status : Failed

Name : DAG-DB0\DAG-2
Status : Failed

Name : DAG-DB0\DAG-3
Status : ServiceDown

Name : DAG-DB1\DAG-3
Status : ServiceDown

Name : DAG-DB0\DAG-4
Status : ServiceDown

Name : DAG-DB1\DAG-4
Status : ServiceDown

If the administrator attempts to mount a database an error will be displayed that the nodes either do not have quorum or automount consensus has not been reached.

[PS] C:\>Mount-Database DAG-DB1
Couldn’t mount the database that you specified. Specified database: DAG-DB1; Error code: An Active Manager operation failed. Error An Active Manager operation encountered an error. To perform this operation, the server must be a member of a database availability group, and the database availability group must have quorum. Error: Automount consensus not reached.. [Server: DAG-1.home.domain.com].
+ CategoryInfo : InvalidOperation: (DAG-DB1:ADObjectId) [Mount-Database], InvalidOperationException
+ FullyQualifiedErrorId : FE8BAD3C,Microsoft.Exchange.Management.SystemConfigurationTasks.MountDatabase

Next, I’ll boot DAG-3. As with DAG-2, although quorum is achieved, databases will not be automatically mounted. DAG-3 is unable to contact another server with a DACP bit of 1 or all of the servers on the StartedMailboxServers list.

[PS] C:\>Get-Cluster DAG | Get-ClusterNode | fl name,state

Name : dag-1
State : Up

Name : dag-2
State : Up

Name : dag-3
State : Up

Name : dag-4
State : Down

Running Get-DatabaseAvailabilityGroup -Status returns the same error as before.

Using Get-MailboxDatabaseCopyStatus * we can confirm that databases remain dismounted on server DAG-1 and that the copies on servers DAG-2 and DAG-3 are failed.

[PS] C:\>Get-MailboxDatabaseCopyStatus * | fl name,status

Name : DAG-1-DB0\DAG-1
Status : Dismounted

Name : DAG-DB0\DAG-1
Status : Dismounted

Name : DAG-DB1\DAG-1
Status : Dismounted

Name : DAG-2-DB0\DAG-2
Status : Dismounted

Name : DAG-DB1\DAG-2
Status : Failed

Name : DAG-DB0\DAG-2
Status : Failed

Name : DAG-DB0\DAG-3
Status : Failed

Name : DAG-DB1\DAG-3
Status : Failed

Name : DAG-DB0\DAG-4
Status : ServiceDown

Name : DAG-DB1\DAG-4
Status : ServiceDown

If the administrator attempts to mount a database an error will be displayed that the nodes either do not have quorum or automount consensus has not been reached.

Finally, I’ll boot DAG-4.

At this point, all nodes are members of a cluster that has quorum, and DAG-4 can contact all servers on the StartedMailboxServers list. Therefore, the DACP bit on DAG-4 is set to 1.

DAG-1, DAG-2, and DAG-3 can now contact a server with a DACP bit set to 1, and therefore they set their own DACP bits to 1.

[PS] C:\>Get-Cluster DAG | Get-ClusterNode | fl name,state

Name : dag-1
State : Up

Name : dag-2
State : Up

Name : dag-3
State : Up

Name : dag-4
State : Up

Using get-databaseavailabilitygroup –status we can see that the DAG has successfully initialized, all nodes are operational, and a primary active manager has been initialized.

Name                     : DAG
Servers                  : {DAG-4, DAG-3, DAG-2, DAG-1}
WitnessServer            : mbx-1.domain.com
WitnessDirectory         : c:\DAG-FSW
DatacenterActivationMode : DagOnly
StoppedMailboxServers    : {}
StartedMailboxServers    : {DAG-3.domain.com, DAG-4.domain.com, DAG-2.domain.com, DAG-1.home.domain.com}
OperationalServers       : {DAG-1, DAG-2, DAG-4, DAG-3}
PrimaryActiveManager     : DAG-1
WitnessShareInUse        : Primary

Using get-mailboxdatabasecopystatus * we can observe that databases have now automatically mounted and copies are healthy.

[PS] C:\>Get-MailboxDatabaseCopyStatus * | fl name,status

Name : DAG-1-DB0\DAG-1
Status : Mounted

Name : DAG-DB0\DAG-1
Status : Mounted

Name : DAG-DB1\DAG-1
Status : Mounted

Name : DAG-2-DB0\DAG-2
Status : Mounted

Name : DAG-DB1\DAG-2
Status : Healthy

Name : DAG-DB0\DAG-2
Status : Healthy

Name : DAG-DB0\DAG-3
Status : Healthy

Name : DAG-DB1\DAG-3
Status : Healthy

Name : DAG-DB0\DAG-4
Status : Healthy

Name : DAG-DB1\DAG-4
Status : Healthy

As I’ve described above, when a DAG in DAC mode is started after a complete shutdown, databases will not be mountable until all DAG members are up, running, and in communication with each other.
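The boot sequence above can be replayed with a small Python model. This is only a sketch of the rules as described in this series, not Exchange code: the function names are hypothetical, the witness is assumed to be reachable the whole time, and the cluster is assumed to use Node and File Share Majority, so three of the five votes are needed for quorum.

```python
# Toy model of the DAC-mode startup rules described in this post.
# All names here are illustrative; none of this is Exchange code.

STARTED_MAILBOX_SERVERS = ["DAG-1", "DAG-2", "DAG-3", "DAG-4"]


def has_quorum(online_nodes, total_nodes, witness_online=True):
    """Node and File Share Majority: more than half of all votes must be present."""
    votes = len(online_nodes) + (1 if witness_online else 0)
    total_votes = total_nodes + 1  # every node plus the witness
    return votes * 2 > total_votes


def can_set_dacp(online_nodes, dacp_bits):
    """Apply the two DACP rules for the members that are currently online."""
    if not has_quorum(online_nodes, len(STARTED_MAILBOX_SERVERS)):
        return False
    peer_has_bit = any(dacp_bits.get(node) == 1 for node in online_nodes)
    all_started_online = set(STARTED_MAILBOX_SERVERS) <= set(online_nodes)
    return peer_has_bit or all_started_online


def simulate_boot(order):
    """Boot members one at a time; report whether databases can mount after each boot."""
    online, dacp_bits, history = [], {}, []
    for node in order:
        online.append(node)
        dacp_bits[node] = 0  # assumption: the DACP bit starts at 0 on startup
        if can_set_dacp(online, dacp_bits):
            for member in online:
                dacp_bits[member] = 1
        history.append((node, all(dacp_bits[m] == 1 for m in online)))
    return history


for node, mountable in simulate_boot(STARTED_MAILBOX_SERVERS):
    print(f"after booting {node}: databases mountable = {mountable}")
```

In this model the databases only become mountable after DAG-4 boots, which matches the walkthrough: quorum arrives with DAG-2, but automount consensus needs every server on the StartedMailboxServers list.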

*Special thanks to Scott Schnoll for reviewing and editing content.


Verifying log file truncation…


In support we see cases from customers regarding log file truncation. One question that commonly comes up is how do you verify log file truncation actually occurs?

The most common method that administrators use when verifying log file truncation post backup is to review the log file directory and note the count of files within this directory. In general, on servers under load that generate a large number of log files in-between backups, this is a safe method to verify log truncation. If there were many logs in the log directory prior to backup and only a few remain post backup, log file truncation would be considered successful.

In a lab environment, or on servers that are not under load, the directory method of verification may not actually work. Take the following example:

I review the log directory prior to backup and the directory contains 10 logs.
I run a successful full backup.
I review the log directory after backup and the directory now contains 10 logs.

Visually it would appear that no log files were truncated at all. In actuality, 1 or 2 log files were truncated at the same time that 1 or 2 new log files were generated. Therefore there is no change in the file count for the directory.

How can I verify that log truncation was successful?

In general I recommend avoiding the directory file count to verify log truncation. A reliable way to verify log file truncation is to record the log sequence both pre and post backup. This can be done using the eseutil /ml command. To dump an entire log sequence the administrator would run eseutil /ml ENN (ENN = log generation prefix). This will dump all log files found in the directory and their order – the output can be piped to a text file for later review. Let us take a look at an example.

Using a command prompt navigate to the log file directory.
Run eseutil /ml ENN. This will provide a list of the log files found in the directory. (Note that the error at the end of the command is expected since the current log file is locked for access).

[PS] P:\DAG\DAG-DB0\DAG-DB0-Logs>eseutil /ml E02

Extensible Storage Engine Utilities for Microsoft(R) Exchange Server
Version 14.02
Copyright (C) Microsoft Corporation. All Rights Reserved.

Initiating FILE DUMP mode...

Verifying log files...
Base name: E02

      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A66C.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A66D.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A66E.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A66F.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A670.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A671.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A672.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A673.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A674.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A675.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A676.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A677.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A678.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A679.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A67A.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A67B.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A67C.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A67D.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A67E.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E02.log
                ERROR: Cannot open log file (P:\DAG\DAG-DB0\DAG-DB0-Logs\E02.log). Error -1032.

Operation terminated with error -1032 (JET_errFileAccessDenied, Cannot access file, the file is locked or in use) after 1.981 seconds.

Send mail with large attachments to force log file generation – this step is necessary when testing in a lab environment. (See http://blogs.technet.com/b/timmcmic/archive/2012/03/12/exchange-2010-log-truncation-and-checkpoint-at-log-creation-in-a-database-availability-group.aspx for an explanation of why this step is necessary)
Perform a full or incremental backup
Run eseutil /ml ENN. This will provide a list of the log files found in the directory. (Note that the error at the end of the command is expected since the current log file is locked for access).

[PS] P:\DAG\DAG-DB0\DAG-DB0-Logs>eseutil /ml E02

Extensible Storage Engine Utilities for Microsoft(R) Exchange Server
Version 14.02
Copyright (C) Microsoft Corporation. All Rights Reserved.

Initiating FILE DUMP mode...

Verifying log files...
Base name: E02

      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A679.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A67A.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A67B.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A67C.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A67D.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A67E.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A67F.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A680.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A681.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A682.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A683.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A684.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A685.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A686.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A687.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A688.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A689.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A68A.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A68B.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A68C.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A68D.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A68E.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A68F.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A690.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A691.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A692.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A693.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A694.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A695.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A696.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A697.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A698.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A699.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A69A.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A69B.log - OK

Compare the two outputs and note the first log file referenced in each.
- Pre-backup: Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A66C.log - OK
- Post-backup: Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A679.log - OK
Because the first log file in the post-backup output (0xA679) is later in the sequence than the first log file in the pre-backup output (0xA66C), log truncation can be considered successful.

By using this method you can not only reliably verify log files both pre and post backup but also record diagnostic information that may be helpful when engaging support for potential issues regarding log file truncation.
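If you save both eseutil /ml outputs to text files, the comparison can be scripted. The following Python sketch is my own illustration, not a Microsoft tool: it assumes log file names of the form ENN followed by eight hexadecimal digits (as in the output above), and the function names are hypothetical.

```python
import re

# Matches names such as E020000A66C.log and captures the 8-digit hex generation.
# The current log (E02.log) has no generation number and is skipped on purpose.
GENERATION = re.compile(r"[Ee][0-9A-Fa-f]{2}([0-9A-Fa-f]{8})\.log")


def first_generation(eseutil_output):
    """Return the lowest log generation listed in an 'eseutil /ml ENN' dump."""
    generations = [int(m.group(1), 16) for m in GENERATION.finditer(eseutil_output)]
    return min(generations) if generations else None


def truncation_occurred(pre_backup, post_backup):
    """True when the oldest log after the backup is newer than the oldest log before it."""
    before = first_generation(pre_backup)
    after = first_generation(post_backup)
    if before is None or after is None:
        raise ValueError("no numbered log files found in one of the dumps")
    return after > before


pre = r"""
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A66C.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A66D.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E02.log
"""
post = r"""
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A679.log - OK
      Log file: P:\DAG\DAG-DB0\DAG-DB0-Logs\E020000A67A.log - OK
"""

print(hex(first_generation(pre)), hex(first_generation(post)))
print("truncated:", truncation_occurred(pre, post))
```

Comparing the lowest generation rather than the file count avoids the lab scenario described earlier, where the count stays the same even though older logs were removed.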

Backup dates and times are incorrect even though data has been committed to backup media successfully….

Backups fail due to consistency check failure…


Last week I had the opportunity to work with a customer who was experiencing issues backing up their Exchange 2010 databases. The issue they experienced is also relevant to Exchange 2007 and Exchange 2003 installations that use VSS-based backups with consistency checking enabled.

After reviewing the logs it was apparent that the VSS process was functioning appropriately. All relevant events regarding the snapshot process were present. In this case the backup job was configured for consistency check, and relevant consistency check events were noted. In almost all backup jobs the following error was present in the logs:

Log Name: Application
Source: Storage Group Consistency Check
Event ID: 403
Task Category: Termination
Level: Error
Keywords: Classic
Description:
Instance: The physical consistency check successfully validated 0 out of xxxxxxxx pages of database ‘DATABASE’. Because some database pages were either not validated or failed validation, the consistency check has been considered unsuccessful.

In general this event would indicate that consistency check encountered an error when scanning the pages of an Exchange database. In most cases this would mean that there is page level corruption in the database such that the validation checks performed by consistency check would fail and the backup would be terminated. This is by design.

In theory corruption of this type would not be present in the environment as configured. The customer was using a Database Availability Group, which has protections built in to self-heal databases from this type of corruption. Replication was healthy and there was no indication that any page corrections had been performed.

If you look at the event in greater detail you will see that it provides the number of pages that were successfully scanned before the issue occurred. When reviewing the application logs it was noted that, on the same database, the failure occurred after scanning a different number of pages each time. For example, one failure occurred after scanning 28000 pages and another after 42456 pages.

At this point when reviewing the system log the following error was noted:

Time:     1/9/2012 12:40:56 PM
ID:       36
Level:    Error
Source: volsnap
Machine: server.company.com
Message: The shadow copies of volume F: were aborted because the shadow copy storage could not grow due to a user imposed limit.

This error implies that, while the snapshot existed, the space allotted for storing differential changes was exhausted and could not be grown. When reviewing vssadmin list shadowstorage it was noted that the maximum shadow storage space assigned to the volume hosting the database was 321 megabytes.

vssadmin list shadowstorage

Shadow Copy Storage association
   For volume: (F:)\\?\Volume{0ecc7a68-be78-4c40-baf6-4d0d3b0b6693}\
   Shadow Copy Storage volume: (H:)\\?\Volume{ed074b1d-b500-465b-a720-d2f733f49761}\
   Used Shadow Copy Storage space: 0 B (0%)
   Allocated Shadow Copy Storage space: 0 B (0%)
   Maximum Shadow Copy Storage space: 321 MB (0%)

This is an extremely small shadow copy storage space. By default the allotted space is generally 10% of volume size. To correct this issue we can utilize the vssadmin command in order to reset the shadow storage space.

vssadmin Resize ShadowStorage /For=F: /On=F: /maxsize=20%

Successfully resized the shadow copy storage association

In our case the inability to continue storing differential changes in the shadow storage space caused the shadow copy to be removed. This subsequently caused the consistency check to fail, resulting in a failure of the backup job. Once the shadow copy storage was allocated an appropriate size, and differential changes could be stored for the entire duration of the backup operation, the backups proceeded successfully.
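To spot an undersized limit like this across several servers, the vssadmin list shadowstorage output can be captured to text and checked. This Python sketch is an illustration only: it assumes the English output format shown above with sizes expressed as B, KB, MB, GB, or TB, the function names are hypothetical, and the 1 GB threshold is an arbitrary example rather than a recommendation.

```python
import re

UNIT_BYTES = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}

# Captures the number and unit from lines such as:
#   Maximum Shadow Copy Storage space: 321 MB (0%)
MAX_LINE = re.compile(
    r"Maximum Shadow Copy Storage space:\s*([\d.,]+)\s*(B|KB|MB|GB|TB)\b"
)


def max_shadow_storage_bytes(vssadmin_output):
    """Return the maximum shadow storage size, in bytes, for each association found."""
    sizes = []
    for number, unit in MAX_LINE.findall(vssadmin_output):
        sizes.append(int(float(number.replace(",", "")) * UNIT_BYTES[unit]))
    return sizes


def undersized(vssadmin_output, minimum_bytes):
    """Return the maximum sizes that fall below the chosen threshold."""
    return [s for s in max_shadow_storage_bytes(vssadmin_output) if s < minimum_bytes]


sample = """
Shadow Copy Storage association
   Used Shadow Copy Storage space: 0 B (0%)
   Allocated Shadow Copy Storage space: 0 B (0%)
   Maximum Shadow Copy Storage space: 321 MB (0%)
"""

print(max_shadow_storage_bytes(sample))
print(undersized(sample, minimum_bytes=1024**3))  # example threshold: 1 GB
```

An association with no fixed limit would not match this pattern and is simply skipped, which is acceptable here because only small fixed limits are of interest.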

Exchange 2010: Log Truncation and Checkpoint At Log Creation in a Database Availability Group


In previous versions of Exchange, when a backup was completed, almost all log files prior to the current log file were truncated from the system. Administrators monitoring the directory would originally see many logs, and after the backup would note that only a few logs remained. In Exchange 2010 Service Pack 1 and later, administrators note that multiple log files remain on the disk after a backup, or that it appears no log files have truncated at all. In many cases this leads to a belief that logs are not truncating successfully or that there is an issue with backups.

Why do we see logs remaining on disk for longer? Exchange 2010 SP1 and later introduce a change in the behavior of log truncation. The change was made to ensure that replicated copies of databases within a database availability group always have the appropriate log files on the source server to complete an incremental resynchronization.

The change to log truncation is the tracking of Checkpoint At Log Creation. Remember that in a database availability group we can expect the checkpoint to be approximately 100 logs (or slightly more) off the current log file – this is known as checkpoint depth. As Exchange creates new log files we stamp into the header of the new log files what log file the checkpoint was pointing at when the current log was created. For example, let us say that log file 0xA679 (42617) was just created as the current ENN.log. We can expect that the checkpoint at log creation value stamped within the header of this log file would be approximately 0xA615 (42517). You can see the checkpoint at log creation value by using eseutil /ml <logfilename> to dump the header of a log file.

[PS] P:\DAG\DAG-DB0\DAG-DB0-Logs>eseutil /ml .\E020000A67E.log

Initiating FILE DUMP mode...

      Base name: E02
      Log file: .\E020000A67E.log
      lGeneration: 42622 (0xA67E)
      Checkpoint: (0xA679,8,0)
      creation time: 03/11/2012 06:00:48
      prev gen time: 03/11/2012 04:01:17
      Format LGVersion: (7.3704.16.2)
      Engine LGVersion: (7.3704.16.2)
      Signature: Create time:05/02/2010 18:04:08 Rand:399094376 Computer:
      Env SystemPath: d:\DAG\DAG-DB0\DAG-DB0-Logs
      Env LogFilePath: d:\DAG\DAG-DB0\DAG-DB0-Logs
      Env Log Sec size: 512 (matches)
      Env (CircLog,Session,Opentbl,VerPage,Cursors,LogBufs,LogFile,Buffers)
          (    off,   1027, 51350, 16384, 51350,   2048,   2048, 29487)
      Using Reserved Log File: false
      Circular Logging Flag (current file): off
      Circular Logging Flag (past files): off
      Checkpoint at log creation time: (0xA679,8,0)
      1 d:\DAG\DAG-DB0\DAG-DB0-Database\DAG-DB0.edb
                 dbtime: 18078306 (0-18078306)
                 objidLast: 2957
                 Signature: Create time:05/02/2010 18:04:08 Rand:399127765 Computer:
                 MaxDbSize: 0 pages
                 Last Attach: (0xA348,9,86)
                 Last Consistent: (0xA346,9,B5)

Last Lgpos: (0xa67e,252,0)

Number of database page references: 770

Integrity check passed for log file: .\E020000A67E.log

Operation completed successfully in 0.265 seconds.

In the previous example the checkpoint at log creation is 0xA679.
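To pull these values out of a saved header dump without reading it by eye, something like the following Python sketch could be used. It is an illustration that assumes the field labels appear exactly as in the output above (lGeneration and Checkpoint at log creation time); the function name is hypothetical.

```python
import re


def checkpoint_lag(header_dump):
    """Return (generation, checkpoint_at_creation, lag) from an eseutil /ml header dump."""
    gen_match = re.search(r"lGeneration:\s*\d+\s*\(0x([0-9A-Fa-f]+)\)", header_dump)
    ckpt_match = re.search(
        r"Checkpoint at log creation time:\s*\(0x([0-9A-Fa-f]+),", header_dump
    )
    if gen_match is None or ckpt_match is None:
        raise ValueError("expected header fields not found")
    generation = int(gen_match.group(1), 16)
    checkpoint = int(ckpt_match.group(1), 16)
    return generation, checkpoint, generation - checkpoint


sample_header = """
      lGeneration: 42622 (0xA67E)
      Checkpoint: (0xA679,8,0)
      Checkpoint at log creation time: (0xA679,8,0)
"""

generation, checkpoint, lag = checkpoint_lag(sample_header)
print(f"generation {generation}, checkpoint at creation {checkpoint}, lag {lag} logs")
```

For the header shown above this reports a lag of 5 logs, which is much smaller than the roughly 100 logs expected on a loaded production server.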

Within a DAG all servers that contain a replicated copy of a database report the maximum log file that is eligible for truncation. These values are reported to the active node which subsequently calculates the maximum log file for truncation. In simplest terms the following process occurs:

Passive copy on Node-2 reports OK to truncate log 0xA679 (42617).
Passive copy on Node-3 reports OK to truncate log 0xA678 (42616).
Passive copy on Node-4 reports OK to truncate log 0xA679 (42617).
The active node determines that the best log file eligible for truncation based on the passive copies is 0xA678 (42616). [This is essentially the minimum of all reported OK logs to truncate.]
The active node then looks at the checkpoint at log creation of 0xA678 (42616) and determines that value is 0xA614 (42516). In this example that would be 100 logs off the best log reported for truncation by the passive copies.
The active node sets the truncation point to be log 0xA614 (42516).
Therefore, after a successful backup, logs prior to 0xA614 (42516) would truncate.

This essentially means that 100 additional logs that would have previously truncated prior to this change do not truncate.
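The arithmetic in the steps above can be sketched in a few lines of Python. This is a simplified illustration of the process as described here, not the Replication service's actual code; the function name and the dictionaries are hypothetical stand-ins for what the copies report and what the log headers contain.

```python
def truncation_point(passive_reports, checkpoint_at_creation):
    """Return the generation below which logs may truncate after a backup.

    passive_reports: highest generation each passive copy reports as OK to truncate.
    checkpoint_at_creation: checkpoint generation stamped in each log's header.
    """
    if not passive_reports:
        raise ValueError("no passive copies reported")
    # The active node takes the minimum of everything the passive copies reported.
    best_reported = min(passive_reports.values())
    # It then steps back to the checkpoint recorded when that log was created.
    return checkpoint_at_creation[best_reported]


reports = {"Node-2": 0xA679, "Node-3": 0xA678, "Node-4": 0xA679}
headers = {0xA678: 0xA614}  # header of log 0xA678 points at checkpoint 0xA614

point = truncation_point(reports, headers)
print(f"truncate logs prior to {point:#x} ({point})")
```

With the example values this yields 0xa614 (42516), the same truncation point reached in the steps above.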

Taking into account checkpoint at log creation administrators can better understand how log files are truncated and why log files remain on disk after a backup that might have in prior versions been truncated.

============================

Update 5/16/2012

Corrected hex conversions in example.

============================

Verifying the file share witness server / directory in use for Exchange 2010


If you’ve read my blog post on file share witness oddities you might be asking yourself “how can I actually verify what file share witness is in use in my environment?”

There are three different ways to verify the witness in use for Exchange 2010.

1) Exchange Management Shell

The Get-DatabaseAvailabilityGroup command returns the settings for the DAG from Active Directory. By adding the -Status switch we also query additional values from the Cluster service and Replication service. These are not queried by default as they can delay the command from returning normal configuration values. Here is an example of the output:

[PS] C:\>Get-DatabaseAvailabilityGroup -Identity DAG -Status | fl

RunspaceId                             : 717eb01d-a17b-4e8e-b018-acf00a0d748d
Name                                   : DAG
Servers                                : {DAG-4, DAG-3, DAG-2, DAG-1}
WitnessServer                          : mbx-1.domain.com
WitnessDirectory                       : c:\DAG-FSW
AlternateWitnessServer                 : mbx-2.domain.com
AlternateWitnessDirectory              : c:\DAG-FSW
NetworkCompression                     : Enabled
NetworkEncryption                      : Enabled
DatacenterActivationMode               : DagOnly
StoppedMailboxServers                  : {}
StartedMailboxServers                  : {DAG-3.domain.com, DAG-4.domain.com, DAG-2.domain.com, DAG-1.domain.com}
DatabaseAvailabilityGroupIpv4Addresses : {10.0.0.24}
DatabaseAvailabilityGroupIpAddresses   : {10.0.0.24}
AllowCrossSiteRpcClientAccess          : False
OperationalServers                     : {DAG-1, DAG-2, DAG-4, DAG-3}
PrimaryActiveManager                   : DAG-1
ServersInMaintenance                   : {}
ThirdPartyReplication                  : Disabled
ReplicationPort                        : 64327
NetworkNames                           : {DAG-4-iSCSI, DAG-MAPI, DAG-REPL-A, DAG-REPL-B}
WitnessShareInUse                      : Primary
AdminDisplayName                       :
ExchangeVersion                        : 0.10 (14.0.100.0)
DistinguishedName                      : CN=DAG,CN=Database Availability Groups,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),CN=Administrative Groups,CN=domain Home,CN=Microsoft Exchange`,CN=Services,CN=Configuration,DC=home,DC=domain,DC=com
Identity                               : DAG
Guid                                   : 72c87136-6721-46e6-ac43-2ad5f6bd66d2
ObjectCategory                         : domain.com/Configuration/Schema/ms-Exch-MDB-Availability-Group
ObjectClass                            : {top, msExchMDBAvailabilityGroup}
WhenChanged                            : 1/29/2012 5:34:25 PM
WhenCreated                            : 9/19/2009 6:16:52 PM
WhenChangedUTC                         : 1/29/2012 10:34:25 PM
WhenCreatedUTC                         : 9/19/2009 10:16:52 PM
OrganizationId                         :
OriginatingServer                      : DC-5.domain.com
IsValid                                : True

In this example you can see the attribute WitnessShareInUse with a value of Primary. This lets the administrator know that the current witness configured for cluster use is the primary file share witness (in this case the witness server and witness directory).

Name                      : DAG
Servers                   : {DAG-4, DAG-3, DAG-2, DAG-1}
WitnessServer             : mbx-1.domain.com
WitnessDirectory          : c:\DAG-FSW
AlternateWitnessServer    : mbx-2.domain.com
AlternateWitnessDirectory : c:\DAG-FSW
OperationalServers        : {DAG-1, DAG-2, DAG-4, DAG-3}
PrimaryActiveManager      : DAG-1
WitnessShareInUse         : Alternate

In this example you can see the attribute WitnessShareInUse with a value of Alternate. This is an example of where the AlternateWitnessServer and AlternateWitnessDirectory are configured for cluster use.

[PS] C:\>Get-DatabaseAvailabilityGroup -Identity DAG -Status | fl name,servers,witnessserver,witnessdirectory,alternatewitnessserver,alternatewitnessdirectory,operationalservers,primaryactivemanager,witnessshareinuse
WARNING: The witness server and directory currently in use by database availability group ‘DAG’ doesn’t match the
configured primary or alternate witness server. This may be due to Active Directory replication latency. If this
condition persists, please use the Set-DatabaseAvailabilityGroup cmdlet to correct the configuration.

In this example you can see the attribute WitnessShareInUse with a value of InvalidConfiguration. There is also a warning displayed indicating that the witness server in use does not match either the primary or alternate witness. This is an indication that the file share witness was modified outside of Exchange and the settings currently in use are not correct. Administrators can correct this by running the set-databaseavailabilitygroup command.
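The three outcomes can be summarized as a simple comparison. The Python sketch below is my own model of the behavior described here, not how Exchange implements it: it assumes a case-insensitive comparison of (server, directory) pairs, and the function name is hypothetical.

```python
def witness_share_in_use(in_use, primary, alternate=None):
    """Classify the witness currently used by the cluster.

    Each argument is a (server, directory) pair, or None when not configured.
    """
    def normalize(witness):
        return tuple(part.lower() for part in witness) if witness else None

    current = normalize(in_use)
    if current is not None and current == normalize(primary):
        return "Primary"
    if current is not None and current == normalize(alternate):
        return "Alternate"
    # Anything else means the cluster is using a witness Exchange does not know about.
    return "InvalidConfiguration"


primary = ("mbx-1.domain.com", r"c:\DAG-FSW")
alternate = ("mbx-2.domain.com", r"c:\DAG-FSW")

print(witness_share_in_use(("MBX-1.domain.com", r"C:\DAG-FSW"), primary, alternate))
print(witness_share_in_use(("mbx-2.domain.com", r"c:\DAG-FSW"), primary, alternate))
print(witness_share_in_use(("other.domain.com", r"c:\FSW"), primary, alternate))
```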

2) Utilize cluster commands

Windows 2008 / Windows 2008 R2

Using the command prompt execute the cluster <DAGNAME> res command. This will output all the resources within the cluster.

[PS] C:\>cluster dag.domain.com res
Listing status for all available resources:

Resource             Group                Node            Status
-------------------- -------------------- --------------- ------
Cluster IP Address   Cluster Group        DAG-1           Online
Cluster Name         Cluster Group        DAG-1           Online
File Share Witness   Cluster Group        DAG-1           Online

The Resource column shows the display name of the File Share Witness resource. With this information you can run the command cluster <DAGNAME> res "file share witness display name" /priv.

[PS] C:\>cluster dag.domain.com res "File Share Witness" /priv

Listing private properties for ‘File Share Witness’:

T  Resource             Name                           Value
-- -------------------- ------------------------------ -----------------------
S  File Share Witness   SharePath                      \\mbx-1.domain.com\DAG.domain.com
D  File Share Witness   ArbitrationDelay               6 (0x6)

This command lists the private properties within cluster associated with the file share witness resource. In our case we are interested in the SharePath. According to this output the current file share witness server is MBX-1.
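If you collect SharePath values from several clusters, splitting the UNC path gives the witness server and share name directly. A minimal Python sketch, assuming the value is a standard \\server\share path; the function name is hypothetical.

```python
def witness_from_share_path(share_path):
    """Split a UNC SharePath value into (server, share)."""
    if not share_path.startswith("\\\\"):
        raise ValueError("not a UNC path: " + share_path)
    server, _, share = share_path[2:].partition("\\")
    if not server or not share:
        raise ValueError("UNC path is missing a server or share: " + share_path)
    return server, share


server, share = witness_from_share_path(r"\\mbx-1.domain.com\DAG.domain.com")
print("witness server:", server)
print("witness share:", share)
```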

Windows 2008 R2

Using PowerShell, import the FailoverClusters module.

[PS] C:\>Import-Module FailoverClusters

Issue the command Get-ClusterQuorum -Cluster <DAGNAME> | fl

[PS] C:\>Get-ClusterQuorum -Cluster DAG.home.e-mcmichael.com | fl

Cluster : DAG
QuorumResource : File Share Witness
QuorumType : NodeAndFileShareMajority

The QuorumResource value is the display name of the file share witness resource. Use the command Get-ClusterResource "Display Name" -Cluster <DAGNAME> | Get-ClusterParameter to list its parameters.

[PS] C:\>Get-ClusterResource "File Share Witness" -Cluster DAG.home.e-mcmichael.com | Get-ClusterParameter

Object                        Name                          Value                         Type
------                        ----                          -----                         ----
File Share Witness            SharePath                     \\mbx-1.domain.c...           String
File Share Witness            ArbitrationDelay              6                             UInt32

3) Failover Cluster Manager

Using Failover Cluster Manager connect to the cluster service. You can connect to either a node or specify the DAG name as the connection point.

Click on the cluster name in the upper left hand corner of the utility.

In the center window information is displayed regarding the cluster configuration.

One piece of information is the “Quorum Configuration”. This will list the type of quorum in use and if a file share witness is configured the server and share name utilized as the witness.

In this example you can see that the cluster is configured for a quorum type of Node and File Share Majority with the file share witness server MBX-1.

Exchange and VSS — My Exchange writer is in a failed retryable state…


In Exchange 2007 and Exchange 2010 many customers are leveraging VSS based backups to retain and protect their Exchange data. By default Exchange provides two different VSS writers that share the same VSS writer ID but are loaded by two different services. The first is the Exchange Information Store VSS writer and the second is the Exchange Replication Service VSS writer. The Information Store writer allows for the backup of active / mounted databases and the replication service writer allows for the backup of passive databases (should a replicated database model be utilized). You can see the writers by running the command VSSADMIN LIST WRITERS from a command prompt.

Here is sample output of VSSAdmin List Writers from a Windows 2008 R2 SP1 server with Exchange 2010 SP1. Note how both writers share the same writer ID within the VSS framework.

Writer name: ‘Microsoft Exchange Replica Writer’
   Writer Id: {76fe1ac4-15f7-4bcd-987e-8e1acb462fb7}
   Writer Instance Id: {17e8df11-a8a2-4ee3-a3fb-e552b7da2d83}
   State: [1] Stable
   Last error: No error

Writer name: ‘Microsoft Exchange Writer’
   Writer Id: {76fe1ac4-15f7-4bcd-987e-8e1acb462fb7}
   Writer Instance Id: {e0ad4b68-8938-4be5-9b88-4c74df2b2d65}
   State: [1] Stable
   Last error: No error

In the course of protecting Exchange servers there may be conditions that cause a backup job to fail. When an Exchange backup job fails the VSS framework aborts the backup and subsequently Exchange clears the backup in progress settings. When a failure is encountered either a single Exchange writer or both Exchange writers may be left in a FAILED RETRYABLE state. We can utilize VSSAdmin List Writers again to query the writer status and see these results. Here is an example showing the Exchange Replication Service writer with a status 8 FAILED last error RETRYABLE.

Now the typical question that comes up at this point is how do I actually deal with an Exchange writer that consistently disallows backups. The answer – restart the service that the writer is associated with and/or fix whatever configuration issue is causing the failures. For example, given the above output I would restart the Exchange Replication Service in an attempt to return the writer to a Stable No Error state. (If it had been the Microsoft Exchange Writer I would have restarted the Exchange Information Store Service.)

The real question though is do I need to deal with a writer that is in a failed state? Unfortunately many administrators find themselves having to deal with a writer in a failed state because their experience is that while the writer is in a failed state subsequent backup jobs fail. If you review the issues carefully, what you’ll find is that the backup jobs are not failing because of a VSS failure; rather, they are failing because a writer was found in a failed state. From an Exchange / VSS perspective this is unexpected –> after all, although the writer is failed, the error is RETRYABLE –> essentially saying “hey…something failed but come on back and try me again…”

Let’s take a look at why this might be happening….

Within the VSS framework there are two states that we are interested in –> the Session State and the Current State. When a VSS session is in progress, and an administrator runs VSSAdmin List Writers, the state that is displayed is the current session state. When the VSS snapshot creation has completed, the current state becomes a session specific state and the status of the most recently completed session is copied to the current state. At this point when the administrator runs VSSAdmin List Writers the state of the most recently completed session is displayed. This is an important distinction –> the SESSION STATE AT THIS POINT REFLECTS THE STATUS OF THE LAST SESSION! The status of the last session does not imply anything in regards to the success <or> failure of future sessions.

Now that we know where VSSAdmin List Writers gets its information we’ll take a look at how the backup process should progress. (I’m going to attempt to present an overly simplified timeline of an expected backup)

The process starts with the VSS requester establishing a VSS session.

After the session is established the VSS requester requests metadata from the VSS framework.

At this point the VSS requester and VSS framework further progress the snapshot process by determining components and preparing the snapshot set.

Once the components and snapshot set have been prepared, the VSS requester issues a PrepareForBackup. This in turn causes the VSS framework to prepare the components for backup.

After PrepareForBackup is called, the individual application-level writers are responsible for the current writer status. The VSS requester is now allowed to call GatherWriterStatus, which in turn should return the current writer status. For example, the current writer status at this stage could be FREEZE / THAW / etc. This is regardless of whether the previous status was FAILED or HEALTHY. This is the status that the VSS requester should be utilizing to make logic decisions at this point.

Once the snapshot is created, the contents can be transferred to the backup media. Once the transfer is complete, the VSS requester can inform the VSS framework that the backup has completed successfully, and the VSS session is subsequently ended.
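The timeline above can be sketched as a small Python simulation that compares two requesters: one that checks writer status before PrepareForBackup and one that checks it afterwards. Everything here is a hypothetical toy (the ToyWriter class, the step names, and the assumption that the writer recovers when it prepares for backup); it is not the real VSS API and only illustrates why the ordering matters.

```python
from dataclasses import dataclass


@dataclass
class ToyWriter:
    # Status left over from the previous session (assumed failed, retryable).
    state: str = "Failed"
    last_error: str = "Retryable error"

    def on_prepare_for_backup(self):
        # Toy assumption: the writer now reports status for the new session.
        self.state = "Stable"
        self.last_error = "No error"

    def status(self):
        return self.state, self.last_error


def run_backup(writer, check_status_before_prepare):
    """Walk the simplified timeline; return (succeeded, list of steps taken)."""
    steps = [
        "establish session",
        "gather writer metadata",
        "determine components and prepare snapshot set",
    ]
    if check_status_before_prepare:
        # Too early: this is still the status of the previous session.
        steps.append("gather writer status")
        if writer.status()[0] == "Failed":
            steps.append("abort: writer found in a failed state")
            return False, steps

    steps.append("prepare for backup")
    writer.on_prepare_for_backup()

    # Expected point to query status: it now reflects the current session.
    steps.append("gather writer status")
    if writer.status()[0] == "Failed":
        steps.append("abort: writer failed in the current session")
        return False, steps

    steps += [
        "create snapshot",
        "transfer contents to backup media",
        "report backup complete",
        "end session",
    ]
    return True, steps


early_ok, early_steps = run_backup(ToyWriter(), check_status_before_prepare=True)
expected_ok, expected_steps = run_backup(ToyWriter(), check_status_before_prepare=False)

print("Status checked before prepare:", early_ok, early_steps[-1])
print("Status checked after prepare: ", expected_ok, expected_steps[-1])
```

In this toy, the requester that checks early gives up on a writer that would have worked, which mirrors the backup job failures described earlier.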

In summary, if the VSS requester is performing operations in the expected order, the writer status should be queried after the framework has received a prepare for backup event. This ensures the writer status reflects that of the CURRENT SESSION IN PROGRESS and not the SESSION STATE OF THE PREVIOUS BACKUP.

The administrator can verify the functionality of the Exchange writer by using the VSHADOW or DISKSHADOW utilities. These utilities follow the workflow outlined above, in which a failed retryable writer is handled successfully. If either of these utilities is successful in creating the backup, and the writer in turn is returned to a healthy state, you might consider following up with the backup vendor to ensure VSS calls are being made appropriately. Microsoft can also assist you in verifying that the calls are made appropriately by assisting with both Exchange and OS VSS tracing.
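For a DISKSHADOW test, one option is to generate a small script file and pass it to the utility with diskshadow /s. The Python sketch below only builds the script text. The volume letters and alias names are placeholders I made up, so substitute the volumes that hold your databases and logs. Also be aware that, as far as I know, a session completed this way can be treated by Exchange as a full backup, which may truncate transaction logs; treat that as an assumption to confirm before running it in production.

```python
import re


def build_diskshadow_script(volumes):
    """Build DiskShadow script text that snapshots the given volumes.

    `volumes` is a list of drive letters such as ["D:", "E:"]; the alias
    names generated here are arbitrary placeholders.
    """
    lines = ["set verbose on", "begin backup"]
    for volume in volumes:
        if not re.fullmatch(r"[A-Za-z]:", volume):
            raise ValueError(f"expected a drive letter like 'D:', got {volume!r}")
        alias = f"vss_test_{volume[0].lower()}"
        lines.append(f"add volume {volume} alias {alias}")
    lines += ["create", "end backup"]
    return "\n".join(lines) + "\n"


# Hypothetical volumes: D: for databases and E: for logs.
script_text = build_diskshadow_script(["D:", "E:"])
print(script_text)
```

After the test completes, running VSSAdmin List Writers again should show whether the writer has returned to a healthy state.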