Category Archives: Uncategorized

Exchange 2010 Database Availability Groups and Disk Sector Sizes

These days, some customers are deploying Exchange databases and log files on advanced format (4K) drives.  Although these drives support a physical sector size of 4096, many vendors are emulating 512 byte sectors in order to maintain backwards compatibility with application and operating systems.  This is known as 512 byte emulation (512e).  Windows 2008 and Windows 2008 R2 support native 512 byte and 512 byte emulated advanced format drives.  Windows 2012 supports drives of all sector sizes.  The sector size presented to applications and the operating system, and how applications respond, directly affects data integrity and performance.

 

For more information on sector sizes see the following links:

 

When deploying an Exchange 2010 Database Availability Group (DAG), the sector sizes of the volumes hosting the databases and log files must be the same across all nodes within the DAG.  This requirement is outlined at http://technet.microsoft.com/en-us/library/ee832792(v=exchg.141).aspx.

 

“Support requires that all copies of a database reside on the same physical disk type. For example, it is not a supported configuration to host one copy of a given database on a 512-byte sector disk and another copy of that same database on a 512e disk. Also be aware that 4-kilobyte (KB) sector disks are not supported for any version of Microsoft Exchange and 512e disks are not supported for any version of Exchange prior to Exchange Server 2010 SP1.”

 

Recently, we have noted that some customers have experienced issues with log file replication and replay as the result of sector size mismatch.  These issues occur when:

 

  • Storage drivers are upgraded resulting in the recognized sector size changing.
  • Storage firmware is upgraded resulting in the recognized sector size changing.
  • New storage is presented or existing storage is replaced with drives of a different sector size.

 

This mismatch can cause one or more database copies in a DAG to fail, as illustrated below. In my example environment, I have a three-member DAG with a single database that resides on a volume labeled Z that is replicated between each member.

 

 

[PS] C:>Get-MailboxDatabaseCopyStatus *

Name                                          Status          CopyQueue ReplayQueue LastInspectedLogTime   ContentIndex
                                                              Length    Length                             State
—-                                          ——          ——— ———– ——————–   ————
SectorTestMBX-1                              Mounted         0         0                                  Healthy
SectorTestMBX-2                              Healthy         0         1           3/19/2013 10:27:50 AM  Healthy
SectorTestMBX-3                              Healthy         0         1           3/19/2013 10:27:50 AM  Healthy

 

If I use FSUTIL to query the Z volume on each DAG member, we can see that the Z volume currently has 512 logical bytes per sector and a 512 physical bytes per sector. Thus, the Z volume is currently seen by the operating system as having a native 512 byte sector size.

 

MBX-1:

C:>fsutil fsinfo ntfsinfo z:
NTFS Volume Serial Number :       0x18d0bc1dd0bbfed6
Version :                         3.1
Number Sectors :                  0x000000000fdfe7ff
Total Clusters :                  0x0000000001fbfcff
Free Clusters  :                  0x0000000001fb842c
Total Reserved :                  0x0000000000000000
Bytes Per Sector  :               512
Bytes Per Physical Sector :       512
Bytes Per Cluster :               4096
Bytes Per FileRecord Segment    : 1024
Clusters Per FileRecord Segment : 0
Mft Valid Data Length :           0x0000000000040000
Mft Start Lcn  :                  0x00000000000c0000
Mft2 Start Lcn :                  0x0000000000000002
Mft Zone Start :                  0x00000000000c0040
Mft Zone End   :                  0x00000000000cc840
RM Identifier:        EF486117-9094-11E2-BF55-00155D006BA1

MBX-3:

C:>fsutil fsinfo ntfsinfo z:
NTFS Volume Serial Number :       0x0ad44aafd44a9d37
Version :                         3.1
Number Sectors :                  0x000000000fdfe7ff
Total Clusters :                  0x0000000001fbfcff
Free Clusters  :                  0x0000000001fad281
Total Reserved :                  0x0000000000000000
Bytes Per Sector  :               512
Bytes Per Physical Sector :       512

Bytes Per Cluster :               4096
Bytes Per FileRecord Segment    : 1024
Clusters Per FileRecord Segment : 0
Mft Valid Data Length :           0x0000000000040000
Mft Start Lcn  :                  0x00000000000c0000
Mft2 Start Lcn :                  0x0000000000000002
Mft Zone Start :                  0x00000000000c0000
Mft Zone End   :                  0x00000000000cc820
RM Identifier:        B9B00E32-90B2-11E2-94E9-00155D006BA3

But what happens if there is a change in the way storage is seen on MBX-3, so that the volume now reflects a 512e sector size.  This can happen when upgrading storage drivers, upgrading firmware, or presenting new storage that implements advanced format storage.

 

C:>fsutil fsinfo ntfsinfo z:
NTFS Volume Serial Number :       0x0ad44aafd44a9d37
Version :                         3.1
Number Sectors :                  0x000000000fdfe7ff
Total Clusters :                  0x0000000001fbfcff
Free Clusters  :                  0x0000000001fad2e7
Total Reserved :                  0x0000000000000000
Bytes Per Sector  :               512
Bytes Per Physical Sector :       4096

Bytes Per Cluster :               4096
Bytes Per FileRecord Segment    : 1024
Clusters Per FileRecord Segment : 0
Mft Valid Data Length :           0x0000000000040000
Mft Start Lcn  :                  0x00000000000c0000
Mft2 Start Lcn :                  0x0000000000000002
Mft Zone Start :                  0x00000000000c0040
Mft Zone End   :                  0x00000000000cc840
RM Identifier:        B9B00E32-90B2-11E2-94E9-00155D006BA3

 

When reviewing the database copy status, notice that the copy assigned to MBX-3 has failed.

 

[PS] C:>Get-MailboxDatabaseCopyStatus *

Name                                          Status          CopyQueue ReplayQueue LastInspectedLogTime   ContentIndex
                                                              Length    Length                             State
—-                                          ——          ——— ———– ——————–   ————
SectorTestMBX-1                              Mounted         0         0                                  Healthy
SectorTestMBX-2                              Healthy         0         0           3/19/2013 11:13:05 AM  Healthy
SectorTestMBX-3                              Failed          0         8           3/19/2013 11:13:05 AM  Healthy

The full details of the copy status of MBX-3 can be reviewed to display the detailed error:

 

[PS] C:>Get-MailboxDatabaseCopyStatus SectorTestMBX-3 | fl

RunspaceId                       : 5f4bb58b-39fb-4e3e-b001-f8445890f80a
Identity                         : SectorTestMBX-3
Name                             : SectorTestMBX-3
DatabaseName                     : SectorTest
Status                           : Failed
MailboxServer                    : MBX-3
ActiveDatabaseCopy               : mbx-1
ActivationSuspended              : False
ActionInitiator                  : Service
ErrorMessage                     : The log copier was unable to continue processing for database ‘SectorTestMBX-3’ bec
                                   ause an error occured on the target server: Continuous replication – block mode has
                                   been terminated. Error: the log file sector size does not match the current volume’s
                                    sector size (-546) [HResult: 0x80131500]. The copier will automatically retry after
                                    a short delay.

ErrorEventId                     : 2152
ExtendedErrorInfo                :
SuspendComment                   :
SinglePageRestore                : 0
ContentIndexState                : Healthy
ContentIndexErrorMessage         :
CopyQueueLength                  : 0
ReplayQueueLength                : 7
LatestAvailableLogTime           : 3/19/2013 11:13:05 AM
LastCopyNotificationedLogTime    : 3/19/2013 11:13:05 AM
LastCopiedLogTime                : 3/19/2013 11:13:05 AM
LastInspectedLogTime             : 3/19/2013 11:13:05 AM
LastReplayedLogTime              : 3/19/2013 10:24:24 AM
LastLogGenerated                 : 53
LastLogCopyNotified              : 53
LastLogCopied                    : 53
LastLogInspected                 : 53
LastLogReplayed                  : 46
LogsReplayedSinceInstanceStart   : 0
LogsCopiedSinceInstanceStart     : 0
LatestFullBackupTime             :
LatestIncrementalBackupTime      :
LatestDifferentialBackupTime     :
LatestCopyBackupTime             :
SnapshotBackup                   :
SnapshotLatestFullBackup         :
SnapshotLatestIncrementalBackup  :
SnapshotLatestDifferentialBackup :
SnapshotLatestCopyBackup         :
LogReplayQueueIncreasing         : False
LogCopyQueueIncreasing           : False
OutstandingDumpsterRequests      : {}
OutgoingConnections              :
IncomingLogCopyingNetwork        :
SeedingNetwork                   :
ActiveCopy                       : False

 

Using ERR we can verify the definition of the –546.

 

D:UtilitiesERR>err -546
# for decimal -546 / hex 0xfffffdde
  JET_errLogSectorSizeMismatch                                   esent98.h
# /* the log file sector size does not match the current
# volume’s sector size */
# 1 matches found for "-546"

 

In addition, the application event log may contain the following entries:

 

Log Name:      Application
Source:        MSExchangeRepl
Date:          3/19/2013 11:14:58 AM
Event ID:      2152
Task Category: Service
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      MBX-3.exchange.msft
Description:
The log copier was unable to continue processing for database ‘SectorTestMBX-3’ because an error occured on the target server: Continuous replication – block mode has been terminated. Error: the log file sector size does not match the current volume’s sector size (-546) [HResult: 0x80131500]. The copier will automatically retry after a short delay.
Event Xml:
<Event xmlns="
http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="MSExchangeRepl" />
    <EventID Qualifiers="49156">2152</EventID>
    <Level>2</Level>
    <Task>1</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2013-03-19T18:14:58.000000000Z" />
    <EventRecordID>2502</EventRecordID>
    <Channel>Application</Channel>
    <Computer>MBX-3.exchange.msft</Computer>
    <Security />
  </System>
  <EventData>
    <Data>SectorTestMBX-3</Data>
    <Data>Continuous replication – block mode has been terminated. Error: the log file sector size does not match the current volume’s sector size (-546) [HResult: 0x80131500].</Data>
  </EventData>
</Event>

Why does this issue occur?  Each log file records in the header the sector size of the disk where a log file was created.  For example, this is the header of a log file on MBX-1 with a native 512 byte sector size:

 

Z:SectorTest>eseutil /ml E0100000001.log

Extensible Storage Engine Utilities for Microsoft(R) Exchange Server
Version 14.02
Copyright (C) Microsoft Corporation. All Rights Reserved.

Initiating FILE DUMP mode…

      Base name: E01
      Log file: E0100000001.log
      lGeneration: 1 (0x1)
      Checkpoint: (0x38,FFFF,FFFF)
      creation time: 03/19/2013 09:40:14
      prev gen time: 00/00/1900 00:00:00
      Format LGVersion: (7.3704.16.2)
      Engine LGVersion: (7.3704.16.2)
      Signature: Create time:03/19/2013 09:40:14 Rand:11019164 Computer:
      Env SystemPath: z:SectorTest
      Env LogFilePath: z:SectorTest
      Env Log Sec size: 512 (matches)
      Env (CircLog,Session,Opentbl,VerPage,Cursors,LogBufs,LogFile,Buffers)
          (    off,   1227,  61350,  16384,  61350,   2048,   2048,  44204)
      Using Reserved Log File: false
      Circular Logging Flag (current file): off
      Circular Logging Flag (past files): off
      Checkpoint at log creation time: (0x1,8,0)

      Last Lgpos: (0x1,A,0)

Number of database page references:  0

Integrity check passed for log file: E0100000001.log

Operation completed successfully in 0.62 seconds.

 

The sector size that is chosen is determined through one of two methods:

 

  • If the log stream is brand new, read the sector size from disk and utilize this sector size.
  • If the log stream already exists, use the sector size of the given log stream.

 

In theory, since the sector size of disks should not be changing across nodes and the sector size of all disks must match, this should not cause a problem.  In our example, and in some customer environments, these sector sizes are actually changing.  Since most of these databases already exist, the existing sector size of the log stream is utilized, which in turn causes a mismatch between DAG members.

 

When a mismatch occurs, the issue only prevents the successful use of block mode replication.  It does not affect file mode replication.  Block mode replication was introduced in Exchange 2010 Service Pack 1.  For more information on block mode replication, see http://technet.microsoft.com/en-us/library/ff625233(v=exchg.141).aspx.

 

Why does this only affect block mode replication?  When a log file is addressed we reference locations within a log file based off a log file position.  The log file position is a combination of the log generation, the sector, and offset within that sector.  For example, in the previous header dump you can see the “Last LGPOS” is (0x1,A,0) – this just happens to be the last log file position within the log.  Let us say we were creating a block for block mode replication within a log file generation 0x1A, sector 8, offset 1 – this would be reflected as an LGPOS of (0x1a,8,1).  When this block is transmitted to a host with an advanced sector size disk, the log position would actually have to be translated.  On an advanced format disk this same log position would be (0x1a,1,1).  As you can see, it could create significant problems if incorrect positions within a log file were written to or read from.

 

How do I go about correcting this condition?  First, ensure that the same sector sizes exist on all disks across all nodes that host Exchange data, and then reset the log stream.  The following steps can show you how to do this with minimal downtime.

 

0)  Ensure that Exchange 2010 Service Pack 2 or later is installed on all DAG members (Exchange 2010 Service Pack 1 and earlier do not support 512e volumes).

 

1)  Disable block mode replication on all hosts.  This step requires restarting the replication service on each node.  This will temporarily cause all copies to fail on passive nodes when the service is restarted on the active node.  When the service is restarted on the passive node only passive copies on that node will enter a failed state.  Databases that are mounted and client connections are not impacted by this activity.  Block mode replication should remain disabled until all steps have been completed on all DAG members.

 

  • Launch registry editor.
  • Navigate to HKLM –> Software –> Microsoft –> ExchangeServer –> V14 –> Replay –> Parameters
  • Right click in the parameters key and select NEW –> DWORD
  • The name for the DWORD is DisableGranularReplication
  • The value for the DWORD is 1

 

image

 

2)  Restart the Microsoft Exchange Replication service on each member.

 

  • Using Powershell issue the following:  Stop-Service MSExchangeRepl
  • Using PowerShell issue the following:  Start-Service MSExchangeRepl

 

PS C:> Stop-Service MSExchangeREPL
PS C:> Start-Service MSExchangeREPL

 

3)  Validate that all copies of databases across DAG members are healthy at this time by running Get-MailboxDatabaseCopyStatus.

 

  • Using the Exchange Management Shell execute get-mailboxdatabasecopystatus.

 

[PS] C:>Get-MailboxDatabaseCopyStatus *

Name                                          Status          CopyQueue ReplayQueue LastInspectedLogTime   ContentIndex
                                                              Length    Length                             State
—-                                          ——          ——— ———– ——————–   ————
SectorTestMBX-1                              Mounted         0         0                                  Healthy
SectorTestMBX-2                              Healthy         0         0           3/19/2013 12:28:34 PM  Healthy
SectorTestMBX-3                              Healthy         0         0           3/19/2013 12:28:34 PM  Healthy

 

4)  Apply the appropriate hotfix for Windows Server 2008 or Windows Server 2008 R2 and Advanced Format Disks.  Windows Server 2012 does not require a hotfix.

 

 

5)  Repeat the procedure that caused the disk sector size to change.  For example, if the issue arose as a result of upgrading drivers and firmware on a host utilize your maintenance mode procedures to complete the driver and firmware upgrade on all hosts.  Note:  If your installation does not allow for you to use the same sector sizes across all DAG members, then the implementation is not supported.

 

6)  Utilize FSUTIL to ensure that the sector sizes match across all hosts for the log and database volumes. 

 

MBX-1:

C:>fsutil fsinfo ntfsinfo z:
NTFS Volume Serial Number :       0x18d0bc1dd0bbfed6
Version :                         3.1
Number Sectors :                  0x000000000fdfe7ff
Total Clusters :                  0x0000000001fbfcff
Free Clusters  :                  0x0000000001fac6e6
Total Reserved :                  0x0000000000000000
Bytes Per Sector  :               512
Bytes Per Physical Sector :       4096

Bytes Per Cluster :               4096
Bytes Per FileRecord Segment    : 1024
Clusters Per FileRecord Segment : 0
Mft Valid Data Length :           0x0000000000040000
Mft Start Lcn  :                  0x00000000000c0000
Mft2 Start Lcn :                  0x0000000000000002
Mft Zone Start :                  0x00000000000c0040
Mft Zone End   :                  0x00000000000cc840
RM Identifier:        EF486117-9094-11E2-BF55-00155D006BA1

 

 

MBX-2

C:>fsutil fsinfo ntfsinfo z:
NTFS Volume Serial Number :       0xfa6a794c6a790723
Version :                         3.1
Number Sectors :                  0x000000000fdfe7ff
Total Clusters :                  0x0000000001fbfcff
Free Clusters  :                  0x0000000001fac86f
Total Reserved :                  0x0000000000000000
Bytes Per Sector  :               512
Bytes Per Physical Sector :       4096

Bytes Per Cluster :               4096
Bytes Per FileRecord Segment    : 1024
Clusters Per FileRecord Segment : 0
Mft Valid Data Length :           0x0000000000040000
Mft Start Lcn  :                  0x00000000000c0000
Mft2 Start Lcn :                  0x0000000000000002
Mft Zone Start :                  0x00000000000c0040
Mft Zone End   :                  0x00000000000cc840
RM Identifier:        5F18A2FC-909E-11E2-8599-00155D006BA2

 

MBX-3

C:>fsutil fsinfo ntfsinfo z:
NTFS Volume Serial Number :       0x0ad44aafd44a9d37
Version :                         3.1
Number Sectors :                  0x000000000fdfe7ff
Total Clusters :                  0x0000000001fbfcff
Free Clusters  :                  0x0000000001fabfd6
Total Reserved :                  0x0000000000000000
Bytes Per Sector  :               512
Bytes Per Physical Sector :       4096

Bytes Per Cluster :               4096
Bytes Per FileRecord Segment    : 1024
Clusters Per FileRecord Segment : 0
Mft Valid Data Length :           0x0000000000040000
Mft Start Lcn  :                  0x00000000000c0000
Mft2 Start Lcn :                  0x0000000000000002
Mft Zone Start :                  0x00000000000c0040
Mft Zone End   :                  0x00000000000cc840
RM Identifier:        B9B00E32-90B2-11E2-94E9-00155D006BA3

 

At this point, the DAG should be stable, and replication should be occurring as expected between databases using file mode. In order to restore block mode replication and fully recognize the new disk sector sizes, the log stream must be reset.  Please note the following about resetting the log stream:

 

  • The log stream must be fully reset on all database copies.
  • All lagged database copies must be replayed to current log.
  • If backups are utilized as a recovery method this will introduce a gap in the log file sequence preventing  a full roll forward recovery from the last backup point.

 

To reset the log file stream the following steps can be utilized:

 

1)  Validate the existence of a replay queue using Get-MailboxDatabaseCopyStatus:

 

[PS] C:>Get-MailboxDatabaseCopyStatus *

Name                                          Status          CopyQueue ReplayQueue LastInspectedLogTime   ContentIndex
                                                              Length    Length                             State
—-                                          ——          ——— ———– ——————–   ————
SectorTestMBX-1                              Mounted         0         0                                  Healthy
SectorTestMBX-2                              Healthy         0         0           3/19/2013 1:34:37 PM   Healthy
SectorTestMBX-3                              Healthy         0         138         3/19/2013 1:34:37 PM   Healthy

 

2)  Set the replay and truncation lag times values to 0 on all database copies.  This will ensure that logs replay to current while allowing the databases to remain online.  In my example MBX-3 is a lagged copy database. When the configuration change is detected, log replay will occur allowing the lagged copy to eventually catch up.  Note that depending on the replay lag time, this could take several hours before proceeding to next steps.

 

[PS] C:>Set-MailboxDatabaseCopy SectorTestMBX-3 -ReplayLagTime 0.0:0:0 -TruncationLagTime 0.0:0:0

 

3)  Validate that the replay queue has caught up and is near zero.

 

[PS] C:>Get-MailboxDatabaseCopyStatus *

Name                                          Status          CopyQueue ReplayQueue LastInspectedLogTime   ContentIndex
                                                              Length    Length                             State
—-                                          ——          ——— ———– ——————–   ————
SectorTestMBX-1                              Mounted         0         0                                  Healthy
SectorTestMBX-2                              Healthy         0         0           3/19/2013 1:34:37 PM   Healthy
SectorTestMBX-3                              Healthy         0         0           3/19/2013 1:34:37 PM   Healthy

4)  Dismount the database.  This activity will cause a client interruption, which will continue until the database can be mounted.

 

[PS] C:>Dismount-Database SectorTest

Confirm
Are you sure you want to perform this action?
Dismounting database "SectorTest". This may result in reduced availability for mailboxes in the database.
[Y] Yes  [A] Yes to All  [N] No  [L] No to All  [?] Help (default is "Y"): y

[PS] C:>Get-MailboxDatabaseCopyStatus SectorTest*

Name                                          Status          CopyQueue ReplayQueue LastInspectedLogTime   ContentIndex
                                                              Length    Length                             State
—-                                          ——          ——— ———– ——————–   ————
SectorTestMBX-1                              Dismounted      0         0                                  Healthy
SectorTestMBX-2                              Healthy         0         0           3/25/2013 5:41:54 AM   Healthy
SectorTestMBX-3                              Healthy         0         0           3/25/2013 5:41:54 AM   Healthy

 

5)  On each DAG member hosting a database copy, open a command prompt and navigate to the log file directory.  Execute eseutil /r ENN to perform a soft recovery.  This steps is necessary to ensure that all log files are played into all copies.

 

Z:SectorTest>eseutil /r e01

Extensible Storage Engine Utilities for Microsoft(R) Exchange Server
Version 14.02
Copyright (C) Microsoft Corporation. All Rights Reserved.

Initiating RECOVERY mode…
    Logfile base name: e01
            Log files: <current directory>
         System files: <current directory>

Performing soft recovery…
                      Restore Status (% complete)

          0    10   20   30   40   50   60   70   80   90  100
          |—-|—-|—-|—-|—-|—-|—-|—-|—-|—-|
          ……………………………………………

Operation completed successfully in 0.203 seconds.

 

6)  On each DAG member hosting a database copy open a command prompt and navigate to the database directory.  Execute eseutil /mh <EDB> against the database to dump the header.  You must validate that the following information is correct on all database copies:

 

  • All copies of the database show in clean shutdown.
  • All copies of the database show the same last detach information.
  • All copies of the database show the same last consistent information.

 

Here is example output of a full /mh dump followed by a comparison of the data across our three sample copies.

 

Z:SectorTest>eseutil /mh SectorTest.edb

Extensible Storage Engine Utilities for Microsoft(R) Exchange Server
Version 14.02
Copyright (C) Microsoft Corporation. All Rights Reserved.

Initiating FILE DUMP mode…
         Database: SectorTest.edb

DATABASE HEADER:
Checksum Information:
Expected Checksum: 0x010f4400
  Actual Checksum: 0x010f4400

Fields:
        File Type: Database
         Checksum: 0x10f4400
   Format ulMagic: 0x89abcdef
   Engine ulMagic: 0x89abcdef
Format ulVersion: 0x620,17
Engine ulVersion: 0x620,17
Created ulVersion: 0x620,17
     DB Signature: Create time:03/19/2013 09:40:15 Rand:11009066 Computer:
         cbDbPage: 32768
           dbtime: 601018 (0x92bba)
            State: Clean Shutdown
     Log Required: 0-0 (0x0-0x0)
    Log Committed: 0-0 (0x0-0x0)
   Log Recovering: 0 (0x0)
  GenMax Creation: 00/00/1900 00:00:00
         Shadowed: Yes
       Last Objid: 3350
     Scrub Dbtime: 0 (0x0)
       Scrub Date: 00/00/1900 00:00:00
     Repair Count: 0
      Repair Date: 00/00/1900 00:00:00
Old Repair Count: 0
  Last Consistent: (0x138,3FB,1A4)  03/19/2013 13:44:11
      Last Attach: (0x111,9,86)  03/19/2013 13:42:29
      Last Detach: (0x138,3FB,1A4)  03/19/2013 13:44:11
             Dbid: 1
    Log Signature: Create time:03/19/2013 09:40:14 Rand:11019164 Computer:
       OS Version: (6.1.7601 SP 1 NLS ffffffff.ffffffff)

Previous Full Backup:
        Log Gen: 0-0 (0x0-0x0)
           Mark: (0x0,0,0)
           Mark: 00/00/1900 00:00:00

Previous Incremental Backup:
        Log Gen: 0-0 (0x0-0x0)
           Mark: (0x0,0,0)
           Mark: 00/00/1900 00:00:00

Previous Copy Backup:
        Log Gen: 0-0 (0x0-0x0)
           Mark: (0x0,0,0)
           Mark: 00/00/1900 00:00:00

Previous Differential Backup:
        Log Gen: 0-0 (0x0-0x0)
           Mark: (0x0,0,0)
           Mark: 00/00/1900 00:00:00

Current Full Backup:
        Log Gen: 0-0 (0x0-0x0)
           Mark: (0x0,0,0)
           Mark: 00/00/1900 00:00:00

Current Shadow copy backup:
        Log Gen: 0-0 (0x0-0x0)
           Mark: (0x0,0,0)
           Mark: 00/00/1900 00:00:00

     cpgUpgrade55Format: 0
    cpgUpgradeFreePages: 0
cpgUpgradeSpaceMapPages: 0

       ECC Fix Success Count: none
   Old ECC Fix Success Count: none
         ECC Fix Error Count: none
     Old ECC Fix Error Count: none
    Bad Checksum Error Count: none
Old bad Checksum Error Count: none

  Last checksum finish Date: 03/19/2013 13:11:36
Current checksum start Date: 00/00/1900 00:00:00
      Current checksum page: 0

Operation completed successfully in 0.47 seconds.

MBX-1:

State: Clean Shutdown

Last Consistent: (0x138,3FB,1A4)  03/19/2013 13:44:11

Last Detach: (0x138,3FB,1A4)  03/19/2013 13:44:11

 

MBX-2:

State: Clean Shutdown

Last Consistent: (0x138,3FB,1A4)  03/19/2013 13:44:12

Last Detach: (0x138,3FB,1A4)  03/19/2013 13:44:12

 

MBX-3:

State: Clean Shutdown

Last Consistent: (0x138,3FB,1A4)  03/19/2013 13:44:13

Last Detach: (0x138,3FB,1A4)  03/19/2013 13:44:13

 

In this case, the values match across all copies so further steps can be performed.  If the values do not match across copies for any reason, do not continue and please contact Microsoft support.

 

7)  Reset the log file generation for the database.  Note: Use Get-MailboxDatabaseCopyStatus to record database locations and status prior to performing this activity.

 

Locate the log file directory for each ACTIVE (DISMOUNTED) database. Remove all log files from this directory first. Failure to remove log files from the ACTIVE (DISMOUNTED) database may result in the Replication service recopying log files, a failure of this procedure, and subsequent need to reseed all database copies. Note:  If log files are located in the same location as the database and catalog data folder, take precautions to not remove the database or the catalog data folder.

In our example MBX-1 hosts the ACTIVE (DISMOUNTED) copy.

[PS] C:>Get-MailboxDatabaseCopyStatus SectorTest*

Name                                          Status          CopyQueue ReplayQueue LastInspectedLogTime   ContentIndex
                                                              Length    Length                             State
—-                                          ——          ——— ———– ——————–   ————
SectorTestMBX-1                              Dismounted      0         0                                  Healthy
SectorTestMBX-2                              Healthy         0         0           3/25/2013 5:41:54 AM   Healthy
SectorTestMBX-3                              Healthy         0         0           3/25/2013 5:41:54 AM   Healthy

Locate the log file directory for each PASSIVE database. Remove all log files from this directory. Failure to remove all log files could result in this procedure failing, and the need to reseed this or all database copies. If log files are located in the same location as the database and catalog data folder take precautions to not remove the database or the catalog data folder.

 

In our example MBX-2 and MBX-3 host the passive database copies.

 

[PS] C:>Get-MailboxDatabaseCopyStatus SectorTest*

Name                                          Status          CopyQueue ReplayQueue LastInspectedLogTime   ContentIndex
                                                              Length    Length                             State
—-                                          ——          ——— ———– ——————–   ————
SectorTestMBX-1                              Dismounted      0         0                                  Healthy
SectorTestMBX-2                              Healthy         0         0           3/25/2013 5:41:54 AM   Healthy
SectorTestMBX-3                              Healthy         0         0           3/25/2013 5:41:54 AM   Healthy

 

8)  Mount the database using Mount-Database <DBNAME>, and verify the mount.

 

[PS] C:>Mount-Database SectorTest

[PS] C:>Get-MailboxDatabaseCopyStatus *

Name                                          Status          CopyQueue ReplayQueue LastInspectedLogTime   ContentIndex
                                                              Length    Length                             State
—-                                          ——          ——— ———– ——————–   ————
SectorTestMBX-1                              Mounted         0         0                                  Healthy
SectorTestMBX-2                              Healthy         0         1           3/25/2013 5:57:28 AM   Healthy
SectorTestMBX-3                              Healthy         0         1           3/25/2013 5:57:28 AM   Healthy

9)  Suspend and resume all passive database copies.

 

Get-mailboxdatabasecopy DBNAME* | suspend-mailboxdatabasecopy  (Note:  The error on suspending the active database copy is expected.)

 

[PS] C:>Get-MailboxDatabaseCopyStatus SectorTest* | Suspend-MailboxDatabaseCopy
The suspend operation can’t proceed because database ‘SectorTest’ on Exchange Mailbox server ‘MBX-1’ is the active mailbox database copy.
    + CategoryInfo          : InvalidOperation: (SectorTestMBX-1:DatabaseCopyIdParameter) [Suspend-MailboxDatabaseCopy], InvalidOperationException
    + FullyQualifiedErrorId : 5083D28B,Microsoft.Exchange.Management.SystemConfigurationTasks.SuspendDatabaseCopy
    + PSComputerName        : mbx-1.exchange.msft

 

Get-mailboxdatabasecopy DBNAME* | resume-mailboxdatabasecopy (Note:  The warning on resuming the active database copy is expected.)

 

[PS] C:>Get-MailboxDatabaseCopyStatus SectorTest* | Resume-MailboxDatabaseCopy
WARNING: The Resume operation won’t have an effect on database replication because database ‘SectorTest’ hosted on server ‘MBX-1’ is the active mailbox database.

 

10)  Validate replication health.

 

[PS] C:>Get-MailboxDatabaseCopyStatus *

Name                                          Status          CopyQueue ReplayQueue LastInspectedLogTime   ContentIndex
                                                              Length    Length                             State
—-                                          ——          ——— ———– ——————–   ————
SectorTestMBX-1                              Mounted         0         0                                  Healthy
SectorTestMBX-2                              Healthy         0         0           3/19/2013 1:56:12 PM   Healthy
SectorTestMBX-3                              Healthy         0         0           3/19/2013 1:56:12 PM   Healthy

 

11)  Using Set-MailboxDatabaseCopy, reconfigure any replay lag or truncation lag time on the database copy.  This example implements a 7 day replay lag time.

 

set-mailboxdatabasecopy –identity SectorTestMBX-3 –replayLagTime 7.0:0:0

 

12)  Repeat the previous steps for all databases in the DAG including those databases that have a single copy.  DO NOT proceed to the next step until all databases have been reset.

 

13)  Enable block mode replication. Using registry editor navigate to HKLM –> Software –> Microsoft –> ExchangeServer –> V14 –> Replay, and then remove the DisableGranularReplication DWORD.

 

14)  Restart the replication service on each DAG member.

 

Stop-Server MSExchangeREPL

Start-Service MSExchangeREPL

15)  Validate database health using Get-MailboxDatabaseCopyStatus.

 

[PS] C:>Get-MailboxDatabaseCopyStatus *

Name                                          Status          CopyQueue ReplayQueue LastInspectedLogTime   ContentIndex
                                                              Length    Length                             State
—-                                          ——          ——— ———– ——————–   ————
SectorTestMBX-1                              Healthy         0         0           3/19/2013 2:25:56 PM   Healthy
SectorTestMBX-2                              Mounted         0         0                                  Healthy
SectorTestMBX-3                              Healthy         0         230         3/19/2013 2:25:56 PM   Healthy

16)  Dump the header of a log file and verify that the new sector size is reflected in the log file stream. To do this, open a command prompt and navigate to the log file directory for the database on the active node. Run eseutil /ml against any log within the directory, and verify that the sector size reflects 4096 and (matches).

 

Z:SectorTest>eseutil /ml E0100000001.log

Extensible Storage Engine Utilities for Microsoft(R) Exchange Server
Version 14.02
Copyright (C) Microsoft Corporation. All Rights Reserved.

Initiating FILE DUMP mode…

      Base name: E01
      Log file: E0100000001.log
      lGeneration: 1 (0x1)
      Checkpoint: (0x17B,FFFF,FFFF)
      creation time: 03/19/2013 13:56:11
      prev gen time: 00/00/1900 00:00:00
      Format LGVersion: (7.3704.16.2)
      Engine LGVersion: (7.3704.16.2)
      Signature: Create time:03/19/2013 13:56:11 Rand:2996669 Computer:
      Env SystemPath: z:SectorTest
      Env LogFilePath: z:SectorTest
      Env Log Sec size: 4096 (matches)
      Env (CircLog,Session,Opentbl,VerPage,Cursors,LogBufs,LogFile,Buffers)
          (    off,   1227,  61350,  16384,  61350,   2048,    256,  44204)
      Using Reserved Log File: false
      Circular Logging Flag (current file): off
      Circular Logging Flag (past files): off
      Checkpoint at log creation time: (0x1,1,0)

      Last Lgpos: (0x1,2,0)

Number of database page references:  0

Integrity check passed for log file: E0100000001.log

Operation completed successfully in 0.250 seconds.

 

If the above steps have been completed successfully, and the log file sequence recognizes a 4096 sector size, then this issue has been resolved.

 

========================================

This guidance was validated in the following configurations:

 

  • Windows 2008 R2 Enterprise with Exchange 2010 Service Pack 2
  • Windows 2008 R2 Enterprise with Exchange 2010 Service Pack 3
  • Windows 2008 SP2 Enterprise with Exchange 2010 Service Pack 3
  • Windows 2010 Datacenter with Exchange 2010 Service Pack 3

========================================

Exchange 2010: StopDagServerMaintenance.ps1 resets server and database suspension states…

In Exchange 2010, when a database availability group (DAG) member needs service, it can be placed into maintenance mode. To do this, Exchange 2010 includes scripts for this process called StartDagServerMaintenance.ps1 and StopDagServerMaintenace.ps1.  For a summary of what these scripts do see this post.

 

Within a DAG, it is not uncommon to have one or more databases or servers blocked from automatic activation by the system. Some customers configure entire servers to be blocked from activation, some block individual copies, and some do a combination of both, based on their business requirements. Administrators using the maintenance mode scripts will find their configured activation blocks reset to the unblocked.  This behavior is not a problem with the scripts; in fact, the scripts are working as designed.

 

Here is an example of a database copy that has had activation suspended:

 

[PS] C:Windowssystem32>Get-MailboxDatabaseCopyStatus DAG-DB0MBX-2 | fl name,activationSuspended

Name                             : DAG-DB0MBX-2
ActivationSuspended              : True

Here is an example of a server that has activation blocked:

 

[PS] C:Windowssystem32>Get-MailboxServer MBX-2 | fl name,databasecopyautoactivationpolicy

Name                             : MBX-2
DatabaseCopyAutoActivationPolicy : Blocked

 

When the administrator executes the stopDagServerMaintenance.ps1 script these states are reset back to their defaults.  Here is an example of the states post StopDagServerMaintenance.ps1:

 

[PS] C:Program FilesMicrosoftExchange ServerV14Scripts>Get-MailboxDatabaseCopyStatus DAG-DB0MBX-2 | fl name,activationSuspended

Name                             : DAG-DB0MBX-2
ActivationSuspended              : False

[PS] C:Program FilesMicrosoftExchange ServerV14Scripts>Get-MailboxServer MBX-2 | fl name,databasecopyautoactivationpolicy

Name                             : MBX-2
DatabaseCopyAutoActivationPolicy : Unrestricted

 

While the maintenance scripts behavior are by design, it can lead to undesirable scenarios, such as lagged database copies being activated. Of course, to eliminate these issues, an administrator can record database and server settings before and after maintenance and reconfigure any reset settings as needed.  To help with this, here is a link to a sample script I created that records database and server activation settings into a csv file which can then be used with the maintenance scripts to adjust the states automatically.

 

http://blogs.technet.com/b/exchange/archive/2013/04/10/preserving-activation-blocks-after-performing-dag-member-maintenance.aspx

 

At runtime, and on the server on which this script is run, two CSV files are created along with a transcript that contains the results of the command executed. The first CSV file contains all database copies assigned to the server and their activation suspension status. For example:

 

#TYPE Selected.Microsoft.Exchange.Management.SystemConfigurationTasks.DatabaseCopyStatusEntry
"Name","ActivationSuspended"
"DAG-DB0DAG-3","True"
"DAG-DB1DAG-3","True"

The second CSV file contains the database copy auto activation policy of the server. For example:

 

#TYPE Selected.Microsoft.Exchange.Data.Directory.Management.MailboxServer
"Name","DatabaseCopyAutoActivationPolicy"
"DAG-3","Blocked"

To use the script the administrator must relax the execution policy on the server to allow for unsigned scripts.  This can be accomplished by using powershell:

 

Set-ExecutionPolicy UNSIGNED

 

Once the execution policy has been adjusted the maintenance procedure is started using the following command:

 

maintenanceWrapper.ps1 –server <SERVERNAME> –action START

 

It is this command that is responsible for creating the CSV files containing the original database states and invoking the StartDagServerMaintenance.ps1 script.

 

When the maintenance work is completed the maintenance procedure can be stopped by using the following command:

 

maintenanceWrapper.ps1 –server <SERVERNAME> –action STOP

It is this command this is responsible for calling the StopDagServerMaintenance.ps1 script and then resetting the database and server activation states from the states recorded in the CSV file.


Give the script a try and see if it makes maintenance mode for activation-blocked servers and databases easier for you. I hope you find this useful, and I welcome any and all feedback.

 

*Special thanks to Scott Schnoll and Abram Jackson for reviewing this content and David Spooner for validating the script.

Part 6: Datacenter Activation Coordination – Who has a say?

I recently worked with a customer that had a three-member database availability group (DAG) that was extended to two sites in a site resilience configuration. During scheduled maintenance in the primary datacenter, the customer encountered an interesting situation. In this case, the customer had two DAG members deployed in their primary datacenter and the third member deployed in a remote datacenter. In addition, Datacenter Activation Coordination (DAC) mode was enabled for their DAG.

 

There was a need to shut down the servers in the primary datacenter.  After completing maintenance tasks the servers in the primary datacenter were powered on.  It was then noted that all of the databases were dismounted on the servers in the primary datacenter.  This was verified with get-mailboxdatabase –status | fl name,mounted:

 

[PS] C:>Get-MailboxDatabase -Status | fl name,mounted

Name    : Mailbox Database 1252068500
Mounted : False

Name    : Mailbox Database 1370762657
Mounted : False

Name    : Mailbox Database 1511135053
Mounted : False

Name    : Mailbox Database 1757981393
Mounted : False

 

So, the administrator issued a mount command, but an error was returned:

 

[PS] C:>Mount-Database "Mailbox Database 1370762657"
Couldn’t mount the database that you specified. Specified database: Mailbox Database 1370762657; Error code: An Active Manager operation failed. Error An Active Manager operation encountered an error. To perform this operation, the server must be a member of a database availability group, and the database availability group must have quorum. Error: Automount consensus not reached.. [Server: MBX-1.exchange.msft].
    + CategoryInfo          : InvalidOperation: (Mailbox Database 1370762657:ADObjectId) [Mount-Database], InvalidOperationException
    + FullyQualifiedErrorId : FE7E9C2B,Microsoft.Exchange.Management.SystemConfigurationTasks.MountDatabase

 

The error indicates that the DAG members must have quorum and automount consensus in order to mount databases.  Because the DAG had DAC mode enabled, in order for automount consensus to be reached:

 

  • The node must be a member of a cluster.
  • The cluster must have quorum.
  • The node must be able to contact another member with a DACP bit set to 1 <or> it must be able to contact all other servers on the started servers list.

 

When the DAG members in the primary datacenter were shut down, the remaining DAG member went into a lost quorum state.  Therefore the DACP bit of the third member changed to 0 in response to a cluster service state change.  When the servers in the primary datacenter restart, they are unable to contact another DAG member with a DACP bit set to 1. Reviewing the properties of the DAG we saw that all three DAG members were on the started Mailbox servers list:

 

[PS] C:>Get-DatabaseAvailabilityGroup -Identity DAG | fl name,startedmailboxservers,stoppedmailboxservers

Name                  : DAG
StartedMailboxServers : {MBX-3.exchange.msft, MBX-2.exchange.msft, MBX-1.exchange.msft}
StoppedMailboxServers : {}

 

It was possible for the servers in the primary datacenter to contact the Microsoft Exchange Replication service on all servers on the started Mailbox servers list.  So why then is automount consensus not reached?

 

When reviewing the status of servers in the cluster, we noted that the server in the remote datacenter was marked as down:

 

Import-Module FailoverClusters

Get-ClusterNode | fl name,state

Name  : mbx-1
State : Up

Name  : mbx-2
State : Up

Name  : mbx-3
State : Down

Why does MBX-3 report a status of down?  Traditionally, when a lost quorum condition is encountered we expect the Cluster service on the servers where quorum was lost to terminate.  Looking at the properties of the Cluster service we see that the default action is to restart when the service terminates.

 

image

 

In reviewing the application log on MBX-3 for events that occurred at the time MBX-1 and MBX-2 were shut down, we saw the following events:

 

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          8/6/2012 10:03:04 AM
Event ID:      1177
Task Category: Quorum Manager
Level:         Critical
Keywords:     
User:          SYSTEM
Computer:      MBX-3.exchange.msft
Description:
The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk.
Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

 

Log Name:      System
Source:        Service Control Manager
Date:          8/6/2012 10:03:04 AM
Event ID:      7036
Task Category: None
Level:         Information
Keywords:      Classic
User:          N/A
Computer:      MBX-3.exchange.msft
Description:
The Cluster Service service entered the stopped state.

 

We saw that the Cluster service acknowledged that there were no longer enough servers to maintain quorum and that the Cluster service entered the stop state gracefully – it did not terminate.  When MBX-1 and MBX-2 were gracefully shutdown in this example they communicate with all servers in the cluster announcing that they will be leaving.  In other words the servers in the primary datacenter did not just unexpectedly disappear, as in the case of a network failure or other catastrophic failure.

 

Since MBX-3 was informed that the other servers were leaving, it then determined that not enough votes would remain to satisfy quorum, and gracefully stopped its Cluster service rather than terminating it.  When MBX-1 and MBX-2 were brought back online they subsequently formed a cluster using their votes (only 2 of 3 votes necessary) and then began the process of determining automount consensus.  Since MBX-3 was a member of a cluster, but did not have its Cluster service started, it had no response to the DACP bit inquiry.  The condition that we must contact all servers on the started servers list of the DAG when no servers advertise a DACP bit of 1 was not met.

 

To resolve this, the administrator simply needs to restart the Cluster service on MBX-3.  This will in most cases result in databases mounting automatically as it allows the criteria for automount consensus to be satisfied and reached.  Here is a sample showing databases automatically mounted after starting the cluster service on MBX-3.

 

PS C:> Get-ClusterNode | fl name,state

Name  : mbx-1
State : Up

Name  : mbx-2
State : Up

Name  : mbx-3
State : Up

[PS] C:>Get-MailboxDatabase -Status | fl name,mounted

Name    : Mailbox Database 1252068500
Mounted : True

Name    : Mailbox Database 1757981393
Mounted : True

Name    : Mailbox Database 1370762657
Mounted : True

Name    : Mailbox Database 1511135053
Mounted : True

To illustrate the difference, here is an example where MBX-1 and MBX-2 were powered off instead of being gracefully shut down.  The events on MBX-3 show that the servers left unexpectedly and the Cluster service was terminated due to a lost quorum condition.

 

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          8/6/2012 12:15:00 PM
Event ID:      1135
Task Category: Node Mgr
Level:         Critical
Keywords:     
User:          SYSTEM
Computer:      MBX-3.exchange.msft
Description:
Cluster node ‘MBX-2’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          8/6/2012 12:15:00 PM
Event ID:      1135
Task Category: Node Mgr
Level:         Critical
Keywords:     
User:          SYSTEM
Computer:      MBX-3.exchange.msft
Description:
Cluster node ‘MBX-1’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          8/6/2012 12:15:00 PM
Event ID:      1177
Task Category: Quorum Manager
Level:         Critical
Keywords:     
User:          SYSTEM
Computer:      MBX-3.exchange.msft
Description:
The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk.
Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

Log Name:      System
Source:        Service Control Manager
Date:          8/6/2012 12:15:00 PM
Event ID:      7036
Task Category: None
Level:         Information
Keywords:      Classic
User:          N/A
Computer:      MBX-3.exchange.msft
Description:
The Cluster Service service entered the stopped state.

Log Name:      System
Source:        Service Control Manager
Date:          8/6/2012 12:15:00 PM
Event ID:      7024
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      MBX-3.exchange.msft
Description:
The Cluster Service service terminated with service-specific error A quorum of cluster nodes was not present to form a cluster..

Log Name:      System
Source:        Service Control Manager
Date:          8/6/2012 12:15:00 PM
Event ID:      7031
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      MBX-3.exchange.msft
Description:
The Cluster Service service terminated unexpectedly.  It has done this 1 time(s).  The following corrective action will be taken in 60000 milliseconds: Restart the service.

Log Name:      System
Source:        Service Control Manager
Date:          8/6/2012 12:16:01 PM
Event ID:      7036
Task Category: None
Level:         Information
Keywords:      Classic
User:          N/A
Computer:      MBX-3.exchange.msft
Description:
The Cluster Service service entered the running state.

Based on the events present, the Cluster service was terminated and the Service Control Manager issued a restart.  When MBX-1 and MBX-2 come back up, MBX-3 will successfully join the cluster.  Thus in this scenario, each DAG member can contact all servers on the started mailbox servers list, receive a response, and automount consensus can be reached, and databases will automatically mount.

 

========================================================

Datacenter Activation Coordination Series:

 

Part 1:  My databases do not mount automatically after I enabled Datacenter Activation Coordination (https://aka.ms/F6k65e)
Part 2:  Datacenter Activation Coordination and the File Share Witness (https://aka.ms/Wsesft)
Part 3:  Datacenter Activation Coordination and the Single Node Cluster (https://aka.ms/N3ktdy)
Part 4:  Datacenter Activation Coordination and the Prevention of Split Brain (https://aka.ms/C13ptq)
Part 5:  Datacenter Activation Coordination:  How do I Force Automount Concensus? (https://aka.ms/T5sgqa)
Part 6:  Datacenter Activation Coordination:  Who has a say?  (https://aka.ms/W51h6n)
Part 7:  Datacenter Activation Coordination:  When to run start-databaseavailabilitygroup to bring members back into the DAG after a datacenter switchover.  (https://aka.ms/Oieqqp)
Part 8:  Datacenter Activation Coordination:  Stop!  In the Name of DAG… (https://aka.ms/Uzogbq)
Part 9:  Datacenter Activation Coordination:  An error cause a change in the current set of domain controllers (https://aka.ms/Qlt035)

========================================================

Part 4: Datacenter Activation Coordination and the Prevention of Split Brain

In Part 1 of this blog post series, I discussed the rules for mounting databases when Datacenter Activation Coordination (DAC) mode is enabled on a Database Availability Group (DAG).  In this post, I want to look at the specific scenarios that apply to the prevention of split brain after a Datacenter Switchover has been performed.

 

In the example below, we have a 4-member DAG that has two members in Datacenter-A and two in Datacenter-B.  There is a single domain controller installed in each datacenter, as well.

 

image

 

Imagine that an outage has occurred in Datacenter-A and the decision to perform a datacenter switchover has been made.  Currently the two remaining Exchange servers are online in Datacenter-B and the cluster is in a lost quorum state.   Using the integrated commands the administrator starts the switchover process by running stop-databaseavailabilitygroup for the DAG members in Datacenter-A, as illustrated below.

 

Stop-DatabaseAvailabilityGroup –identity DAG –activeDirectorySite:Datacenter-A –configurationOnly:$TRUE

 

The results of this command can be verified with get-databaseavailabilitygroup –identity DAGNAME | fl name,StartedMailboxServers,StoppedMailboxServers.  As expected, servers in Datacenter-B remain on the started servers list while servers in Datacenter-A are on the stopped servers list.

 

The administrator then stops the Cluster service on the surviving DAG members in Datacenter-B in preparation for the restore-databaseavailabilitygroup command.

 

Stop-Service CLUSSVC <or> net stop CLUSSVC

The final step is to run restore-databaseavailabilitygroup.  This task forces the Cluster services online on the remaining DAG members, evicts the DAG members on the stopped servers list, and configures the appropriate quorum model.

 

Restore-DatabaseAvailabilityGroup –identity DAGNAME –activeDirectorySite:Datacenter-B

 

At this point, the administrator might perform additional procedures for database activation, restoration of mail flow, and restoration of client access.

 

After several hours, the DAG members in Datacenter-A come back online but without network connectivity between Datacenter-A and Datacenter-B.  The cluster services on the DAG members in Datacenter-A will start and a cluster will be formed.  This is normal and expected, as two of the four members exist in Datacenter-A along with the witness server, thereby providing the 3 votes necessary to establish quorum.

 

You can verify the membership of the cluster as seen from a member in Datacenter-A:

 

Import-Module FailoverClusters

Get-ClusterNode | fl name,state

Name  : mbx-1
State : Up

Name  : mbx-2
State : Up

Name  : mbx-3
State : Down

Name  : mbx-4
State : Down

You can also verify the membership of the cluster as seen from a member in Datacenter-B:

 

Import-Module FailoverClusters

Get-ClusterNode | fl name,state

Name : mbx-3
State : Up

Name : mbx-4
State : Up

These outputs are both accurate when considering the steps that were taken during the datacenter switchover process.  We know that the restore-databaseavailabilitygroup command successfully forced the remaining members online in Datacenter-B and evicted members from Datacenter-A.  We also know that the members in Datacenter-A have not been able to establish communications with the members in Datacenter-B and they are therefore unaware that any cluster membership changes have occurred.

 

The stop-databaseavailabilitygroup cmdlet updated the stopped servers list on a domain controller in Datacenter-B, but a domain controller in Datacenter-A was not accessible at the time that command was issued.  Using the get-databaseavailabilitygroup command, we can verify the started and stopped servers list on the domain controller in Datacenter-A.

 

[PS] C:>Get-DatabaseAvailabilityGroup -Identity DAG | fl name,startedmailboxservers,stoppedmailboxservers

Name                  : DAG
StartedMailboxServers : {MBX-2.exchange.msft, MBX-1.exchange.msft, MBX-3.exchange.msft, MBX-4.exchange.msft}
StoppedMailboxServers : {}

 

Comparing the two lists, note that according to a domain controller in Datacenter-A, all servers are started and no servers are stopped. According to a domain controller in Datacenter-B, two of the servers are started and two of the servers are stopped.

 

Even though the members in Datacenter-A established quorum, we can verify with get-mailboxdatabase –status | fl name,mounted that they did not mount their databases:

 

[PS] C:>Get-MailboxDatabase -Status | fl name,mounted

Name    : Mailbox Database 1252068500
Mounted : False

Name    : Mailbox Database 1757981393
Mounted : False

Name    : Mailbox Database 1370762657
Mounted :

WARNING: Exchange can’t connect to the Information Store service on server MBX-4.exchange.msft. Make sure that the
service is running and that there is network connectivity to the server.
Name    : Mailbox Database 1511135053
Mounted :

WARNING: Exchange can’t connect to the Information Store service on server MBX-3.exchange.msft. Make sure that the
service is running and that there is network connectivity to the server
.

 

So why didn’t the databases mount in Datacenter-A, even though the members had quorum?

 

DAC mode works by using a bit stored in memory by Active Manager called the Datacenter Activation Coordination Protocol (DACP). DACP is simply a bit in memory set to either a 1 or a 0. A value of 1 means Active Manager can issue mount requests, and a value of 0 means it cannot.

 

The starting bit is always 0, and because the bit is held in memory, any time the Microsoft Exchange Replication service (MSExchangeRepl.exe) is stopped and restarted, the bit reverts to 0.  In the example of a lost data center the bit is set to 0 when the servers power on and the replication service initializes.   In order to change its DACP bit to 1 and be able to mount databases, a starting DAG member needs to either:

 

  • Be able to communicate with any other DAG member that has a DACP bit set to 1; or
  • Be able to communicate with all DAG members that are listed on the StartedMailboxServers list.

 

If either condition is true, Active Manager on a starting DAG member will issue mount requests for the active databases copies it hosts. If neither condition is true, Active Manager will not issue any mount requests.

 

In order for the DACP bit to be set to 1 (mount database allowed) the starting DAG member must also be a member of the DAG’s cluster, and the cluster must have quorum.

 

In this example MBX-1 can contact MBX-2 but no other members of the DAG.  MBX-2 does not have its DACP bit set to 1 and MBX-1 cannot contact all servers on the started servers list because AD has not replicated the updated started servers list from Datacenter-B and therefore all nodes in the DAG appear on the started servers list.

 

By enforcing the logic of contacting another member with a DACP bit set to 1 or contacting all servers on the started servers list, a split brain condition is prevented even when a quorum of nodes exist and the cluster service functions.

 

========================================================

Datacenter Activation Coordination Series:

 

Part 1:  My databases do not mount automatically after I enabled Datacenter Activation Coordination (https://aka.ms/F6k65e)
Part 2:  Datacenter Activation Coordination and the File Share Witness (https://aka.ms/Wsesft)
Part 3:  Datacenter Activation Coordination and the Single Node Cluster (https://aka.ms/N3ktdy)
Part 4:  Datacenter Activation Coordination and the Prevention of Split Brain (https://aka.ms/C13ptq)
Part 5:  Datacenter Activation Coordination:  How do I Force Automount Concensus? (https://aka.ms/T5sgqa)
Part 6:  Datacenter Activation Coordination:  Who has a say?  (https://aka.ms/W51h6n)
Part 7:  Datacenter Activation Coordination:  When to run start-databaseavailabilitygroup to bring members back into the DAG after a datacenter switchover.  (https://aka.ms/Oieqqp)
Part 8:  Datacenter Activation Coordination:  Stop!  In the Name of DAG… (https://aka.ms/Uzogbq)
Part 9:  Datacenter Activation Coordination:  An error cause a change in the current set of domain controllers (https://aka.ms/Qlt035)

========================================================

Part 3: Datacenter Activation Coordination and the Single Node Cluster

In part 2 of this series, I provided a high level overview of how the witness server for a database availability group (DAG) is used with Datacenter Activation Coordination (DAC) mode in the scenario when only a single member remains after a datacenter switchover.  The witness server, and in particular its boot time, are very important when the system determines whether or not automount consensus has been reached.

 

When a datacenter switchover is performed and a single DAG member remains in the recovered datacenter, the restore-databaseavailabilitygroup cmdlet configures the cluster with a Node and File Share Majority quorum.  The witness server in use is the alternate witness server configured for the DAG or specified when running restore-databaseavailabilitygroup.  You can use get-databaseavailabilitygroup –status | fl name,*witness* to verify the witness in use, as illustrated below.

 

C:>Get-DatabaseAvailabilityGroup -Identity DAG -Status | fl name,*witness*
Name                      : DAG
WitnessServer             : dc-1.exchange.msft
WitnessDirectory          : C:DAGFileShareWitnessesDAG.exchange.msft
AlternateWitnessServer    : dc-2.exchange.msft
AlternateWitnessDirectory : C:DAGFileShareWitnessesDAG.exchange.msft
WitnessShareInUse      : Alternate

This configuration, although automatic, may seem strange at first.  In general, a Windows failover cluster only leverages the Node and File Share Majority quorum model when the cluster contains an even number of members.  Although this configuration is normal and expected in this case, it can raise some concerns when administrators use Failover Cluster Manager to verify the status of the remaining cluster.  Even more confusing is the warning on the cluster summary page that states that a failure of a node to access the defined file share witness may result in cluster instability.

 

image

 

The main concern for DAGs is the risk of losing the witness server (either by failure or reboot) but still have the Exchange environment stay running.  Fortunately, in this configuration, when a single DAG member exists and the cluster is configured to use the Node and File Share Majority quorum model, the status of the witness server is ignored.  Thus, even if the witness server is lost for any reason, the Exchange environment will continue to function.  The decision to use this quorum model is partially based on the importance of the witness server in determining automount consensus in a single surviving member configuration.  By configuring the quorum model as node and file share majority, we can leverage the cluster’s ability to monitor the witness server, and to accurately report status of this machine.

 

Thus, it is expected that after a datacenter switchover, DAGs with a single surviving member will leverage the Node and File Share Majority quorum model.

 

========================================================

Datacenter Activation Coordination Series:

 

Part 1:  My databases do not mount automatically after I enabled Datacenter Activation Coordination (https://aka.ms/F6k65e)
Part 2:  Datacenter Activation Coordination and the File Share Witness (https://aka.ms/Wsesft)
Part 3:  Datacenter Activation Coordination and the Single Node Cluster (https://aka.ms/N3ktdy)
Part 4:  Datacenter Activation Coordination and the Prevention of Split Brain (https://aka.ms/C13ptq)
Part 5:  Datacenter Activation Coordination:  How do I Force Automount Concensus? (https://aka.ms/T5sgqa)
Part 6:  Datacenter Activation Coordination:  Who has a say?  (https://aka.ms/W51h6n)
Part 7:  Datacenter Activation Coordination:  When to run start-databaseavailabilitygroup to bring members back into the DAG after a datacenter switchover.  (https://aka.ms/Oieqqp)
Part 8:  Datacenter Activation Coordination:  Stop!  In the Name of DAG… (https://aka.ms/Uzogbq)
Part 9:  Datacenter Activation Coordination:  An error cause a change in the current set of domain controllers (https://aka.ms/Qlt035)

========================================================

Part 5: Datacenter Activation Coordination: How do I force automount consensus?

In order for any database availability group (DAG) that has Datacenter Activation Coordination (DAC) mode to be able to mount any databases, the DAG itself much reach automount consensus. In the following blog posts, I discuss and outline situations where automount consensus may not be achieved, resulting in dismounted databases.  A DAG in DAC mode which cannot achieve automount consensus will not automatically mount databases, and administrators are blocked from manually mounting them, as well.

 

Let’s look at an example to illustrate this scenario.  In this example, I have a 4-member DAG comprised of servers DAG-1, DAG-2, DAG-3, and DAG-4

 

image

 

DAG-4 experiences a hardware failure, goes down, and is scheduled for replacement.  Although the server is offline, it is still a member of the DAG and it’s underlying cluster.

 

image

 

Next, a power failure occurs resulting in DAG-1, DAG-2, and DAG-3 going offline.

 

image

 

After power is restored, DAG-1, DAG-2, and DAG-3 are booted up.

 

image

 

The cluster service is successfully initialized and a cluster is formed.  You can verify the nodes are operational by running

 

Windows 2008:

PS C:> cluster node
Listing status for all available nodes:

Node           Node ID Status
————– ——- ———————
DAG-4                1 Down
DAG-3                2 Up
DAG-2                3 Up
DAG-1                4 Up

 

 

Windows 2008 R2:

PS C:> Import-Module FailoverClusters

PS C:> Get-ClusterNode | fl name,state

Name  : DAG-1
State : Up

Name  : DAG-2
State : Up

Name  : DAG-3
State : Up

Name  : DAG-4
State : Down

 

Using the Get-MailboxDatabase cmdlet, we can see that, even though the servers are up and the cluster is running, no databases are mounted:

 

[PS] C:>Get-MailboxDatabase -Status | fl name,mounted

Name    : Mailbox Database 1252068500
Mounted : False

Name    : Mailbox Database 1757981393
Mounted : False

Name    : Mailbox Database 1370762657
Mounted : False

Name    : Mailbox Database 1511135053
Mounted : False

 

Attempting to mount a database (with or without the –Force parameter) results in the following error:

 

[PS] C:>Mount-Database "Mailbox Database 1252068500"
Couldn’t mount the database that you specified. Specified database: Mailbox Database 1252068500; Error code: An Active Manager operation failed. Error An Active Manager operation encountered an error. To perform this operation, the server must be a member of a database availability group, and the database availability group must have quorum. Error: Automount consensus not reached.. [Server: DAG-1.exchange.msft].
    + CategoryInfo          : InvalidOperation: (Mailbox Database 1252068500:ADObjectId) [Mount-Database], InvalidOperationException
    + FullyQualifiedErrorId : FD5FDFE8,Microsoft.Exchange.Management.SystemConfigurationTasks.MountDatabase

So what needs to be done to achieve automount consensus and mount the databases?  It’s simple: the administrator can run Start-DatabaseAvailabilityGroup.  The cmdlet can be run using the –MailboxServer parameter to specify one of the mailbox servers that is functional.  After the cmdlet is run, the specified server will have its DACP bit set to 1.  The other servers in the DAG will see that the DACP bit on the specified server is set to 1 and this will cause them to set their DACP bits to 1.  And once all of the DAG members have DACP set to 1, automount consensus is reached, and the databases can be mounted. Here is an example:

 

[PS] C:>Start-DatabaseAvailabilityGroup -Identity DAG -MailboxServer MBX-1

[PS] C:>Get-MailboxDatabase -Status | fl name,mounted

Name    : Mailbox Database 1252068500
Mounted : True

Name    : Mailbox Database 1757981393
Mounted : True

Name    : Mailbox Database 1370762657
Mounted : True

Name    : Mailbox Database 1511135053
Mounted : True

 

Start-DatabaseAvailabilityGroup can also be used in scenarios where the remaining survivor of a DAG failure is a single member that is using the witness server boot time to achieve automount consensus.  In this case the cmdlet forces the member’s DACP bit to 1 and updates the boot times in the registry for future restarts.

 

As a side note of caution, administrators should be careful when using Start-DatabaseAvailabilityGroup.  It is possible that a split brain condition could be created by using this cmdlet to start the wrong DAG members, or it by using it at the wrong stage of a datacenter switchover. To ensure your datacenter switchovers go smoothly, I recommend using the recently released Datacenter Switchover Tool.

 

========================================================

Datacenter Activation Coordination Series:

 

Part 1:  My databases do not mount automatically after I enabled Datacenter Activation Coordination (https://aka.ms/F6k65e)
Part 2:  Datacenter Activation Coordination and the File Share Witness (https://aka.ms/Wsesft)
Part 3:  Datacenter Activation Coordination and the Single Node Cluster (https://aka.ms/N3ktdy)
Part 4:  Datacenter Activation Coordination and the Prevention of Split Brain (https://aka.ms/C13ptq)
Part 5:  Datacenter Activation Coordination:  How do I Force Automount Concensus? (https://aka.ms/T5sgqa)
Part 6:  Datacenter Activation Coordination:  Who has a say?  (https://aka.ms/W51h6n)
Part 7:  Datacenter Activation Coordination:  When to run start-databaseavailabilitygroup to bring members back into the DAG after a datacenter switchover.  (https://aka.ms/Oieqqp)
Part 8:  Datacenter Activation Coordination:  Stop!  In the Name of DAG… (https://aka.ms/Uzogbq)
Part 9:  Datacenter Activation Coordination:  An error cause a change in the current set of domain controllers (https://aka.ms/Qlt035)

========================================================

Exchange 2010: Adding database copies to databases with a single copy results in copy status failed after seeding is completed…

In recent weeks, I’ve worked with customers that have experienced database copies that go into a failed state immediately after adding a copy.  In all cases the following circumstances were noted:

 

  • The database previously only had a single copy.
  • The single copy database had a backup performed on it <or>
  • The single copy database utilized circular logging.

 

Let us explore the first scenario – single copy databases that have had a full backup performed. 

 

Using eseutil /k ENN I examined the log files that were present in the log directory.  Here is a sample output:

Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000001.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000002.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000003.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000004.log – OK

Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000020.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000021.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000022.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000023.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000024.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000025.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000026.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000027.log – OK

Using Window Server Backup, a full backup was performed of the single copy database.  This results in log truncation of the log stream when the backup completes successfully.  This is noted by the log truncation event present in the application log.

 

Log Name:      Application
Source:        ESE
Date:          7/1/2012 11:59:18 AM
Event ID:      224
Task Category: ShadowCopy
Level:         Information
Keywords:      Classic
User:          N/A
Computer:      MBX-1.exchange.msft
Description:
Information Store (3080) Mailbox Database 0958497998: Deleting log files C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000001.log to C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000028.log.
 

For more information, click http://www.microsoft.com/contentredirect.asp.

 

Using eseutil /k ENN the log sequence after backup was verified.  Here is the sample output:

 

Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000029.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E000000002A.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E000000002B.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E000000002C.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E000000002D.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E000000002E.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E000000002F.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000030.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000031.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000032.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000033.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000034.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000035.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000036.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000037.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000038.log – OK

 

As confirmed by the event and verification of the log sequence log truncation occurred successfully.

 

Using the Exchange Management Shell a copy of the database was added to the other DAG member.  The Exchange Management Shell confirms that the database seeding was successful.  The following is a sample copy status:

 

[PS] C:>Get-MailboxDatabaseCopyStatus "Mailbox Database 0958497998MBX-2" | fl

RunspaceId                       : 14523823-d7fe-4e42-9558-9395f418e6d0
Identity                         : Mailbox Database 0958497998MBX-2
Name                             : Mailbox Database 0958497998MBX-2
DatabaseName                     : Mailbox Database 0958497998
Status                           : Failed
MailboxServer                    : MBX-2
ActiveDatabaseCopy               : mbx-1
ActivationSuspended              : False
ActionInitiator                  : Service
ErrorMessage                     : The required log file 32 for Mailbox Database 0958497998MBX-2 is missing on the act
                                   ive copy. If you removed the log file, please replace it. If the log file is lost, t
                                   he database copy will need to be reseeded using Update-MailboxDatabaseCopy.

ErrorEventId                     : 2059
ExtendedErrorInfo                :
SuspendComment                   :
SinglePageRestore                : 0
ContentIndexState                : Healthy
ContentIndexErrorMessage         :
CopyQueueLength                  : 72
ReplayQueueLength                : 0
LatestAvailableLogTime           : 7/1/2012 12:09:32 PM
LastCopyNotificationedLogTime    : 7/1/2012 12:09:32 PM
LastCopiedLogTime                :
LastInspectedLogTime             :
LastReplayedLogTime              :
LastLogGenerated                 : 72
LastLogCopyNotified              : 72
LastLogCopied                    : 0
LastLogInspected                 : 0
LastLogReplayed                  : 71
LogsReplayedSinceInstanceStart   : 0
LogsCopiedSinceInstanceStart     : 0
LatestFullBackupTime             :
LatestIncrementalBackupTime      :
LatestDifferentialBackupTime     :
LatestCopyBackupTime             :
SnapshotBackup                   :
SnapshotLatestFullBackup         :
SnapshotLatestIncrementalBackup  :
SnapshotLatestDifferentialBackup :
SnapshotLatestCopyBackup         :
LogReplayQueueIncreasing         : False
LogCopyQueueIncreasing           : False
OutstandingDumpsterRequests      : {}
OutgoingConnections              :
IncomingLogCopyingNetwork        :
SeedingNetwork                   :
ActiveCopy                       : False

The application log on the server on which the copy was added also has the following event:

 

Log Name:      Application
Source:        MSExchangeRepl
Date:          7/1/2012 12:09:35 PM
Event ID:      2059
Task Category: Service
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      MBX-2.exchange.msft
Description:
The required log file 32 for Mailbox Database 0958497998MBX-2 is missing on the active copy. If you removed the log file, please replace it. If the log file is lost, the database copy will need to be reseeded using Update-MailboxDatabaseCopy.

 

Note:  The log file name displayed in this event is a decimal value.  The value is converted from decimal to hexadecimal in order to identify the log file.  Log file names are in hexadecimal format.

 

If the seed is attempted again using update-mailboxdatabasecopy the same error is displayed.  In this case log file 32 (0x20) is necessary to satisfy the seeding request and the lack of the log file will prevent the seed from being successful.  Why is this log file necessary and why is it missing from the active copy?  We’ll answer that question a little later…

 

Let us explore the second scenario – single copy databases that is using circular logging.

 

In this case there is a database with a single copy.  This database has had circular logging enabled. 

 

Using eseutil /k I examined the log files that were present in the log file directory.  Here is a sample output:

 

Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000049.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E000000004A.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E000000004B.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E000000004C.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E000000004D.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E000000004E.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E000000004F.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000050.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000051.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000052.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000053.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000054.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000055.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000056.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000057.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000058.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000059.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E000000005A.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E000000005B.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E000000005C.log – OK

 

At this point circular logging was disabled, and the database was dismounted and then mounted.  This is a requirement for databases with a single copy before adding an additional copy.

 

Using eseutil /k I examined the log files that were present in the directory prior to adding the copy.  Here is a sample output:

 

Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E000000005D.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E000000005E.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E000000005F.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000060.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000061.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000062.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000063.log – OK

Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E000000007C.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E000000007D.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E000000007E.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E000000007F.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000080.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000081.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000082.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000083.log – OK
Log file: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E0000000084.log – OK

Using the Exchange Management Shell a copy of the database was added to the other DAG member.  Exchange Management Shell confirms that the database seeding was successful. The following is a sample copy status:

 

[PS] C:>Get-MailboxDatabaseCopyStatus "Mailbox Database 0958497998MBX-2" | fl

RunspaceId                       : 14523823-d7fe-4e42-9558-9395f418e6d0
Identity                         : Mailbox Database 0958497998MBX-2
Name                             : Mailbox Database 0958497998MBX-2
DatabaseName                     : Mailbox Database 0958497998
Status                           : Failed
MailboxServer                    : MBX-2
ActiveDatabaseCopy               : mbx-1
ActivationSuspended              : False
ActionInitiator                  : Service
ErrorMessage                     : The required log file 73 for Mailbox Database 0958497998MBX-2 is missing on the act
                                   ive copy. If you removed the log file, please replace it. If the log file is lost, t
                                   he database copy will need to be reseeded using Update-MailboxDatabaseCopy.

ErrorEventId                     : 2059
ExtendedErrorInfo                :
SuspendComment                   :
SinglePageRestore                : 0
ContentIndexState                : Healthy
ContentIndexErrorMessage         :
CopyQueueLength                  : 72
ReplayQueueLength                : 0
LatestAvailableLogTime           : 7/1/2012 2:12:32 PM
LastCopyNotificationedLogTime    : 7/1/2012 2:12:32 PM
LastCopiedLogTime                :
LastInspectedLogTime             :
LastReplayedLogTime              :
LastLogGenerated                 : 132
LastLogCopyNotified              : 131
LastLogCopied                    : 0
LastLogInspected                 : 0
LastLogReplayed                  : 131
LogsReplayedSinceInstanceStart   : 0
LogsCopiedSinceInstanceStart     : 0
LatestFullBackupTime             :
LatestIncrementalBackupTime      :
LatestDifferentialBackupTime     :
LatestCopyBackupTime             :
SnapshotBackup                   :
SnapshotLatestFullBackup         :
SnapshotLatestIncrementalBackup  :
SnapshotLatestDifferentialBackup :
SnapshotLatestCopyBackup         :
LogReplayQueueIncreasing         : False
LogCopyQueueIncreasing           : False
OutstandingDumpsterRequests      : {}
OutgoingConnections              :
IncomingLogCopyingNetwork        :
SeedingNetwork                   :
ActiveCopy                       : False

The application log on the member where the copy was added also has the following event:

 

Log Name:      Application
Source:        MSExchangeRepl
Date:          7/1/2012 2:12:35 PM
Event ID:      2059
Task Category: Service
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      MBX-2.exchange.msft
Description:
The required log file 73 for Mailbox Database 0958497998MBX-2 is missing on the active copy. If you removed the log file, please replace it. If the log file is lost, the database copy will need to be reseeded using Update-MailboxDatabaseCopy.

 

If the seed is attempted again using update-mailboxdatabasecopy the same error is displayed. In this case log file 73 (0x49) is necessary to satisfy the seeding request and the lack of the log file will prevent the seed from being successful.

 

Why does this issue occur?

When a database copy is created and subsequently seeded between nodes the Replication services establish a global truncation point.  The global truncation point represents the log files that are safe to truncate across all copies of a given database that exist within the DAG.  This process takes into account a feature introduced in Exchange 2010 SP1 known as checkpoint at log creation.  For an example of how checkpoint at log creation works in Exchange 2010 refer to http://blogs.technet.com/b/timmcmic/archive/2012/03/12/exchange-2010-log-truncation-and-checkpoint-at-log-creation-in-a-database-availability-group.aspx.

 

The Replication service establishes the global truncation point based on the checkpoint at log creation of the lowest order log file on the active copy.  Running eseutil /ml on the lowest order log file on the active copy shows the following header information:

 

Extensible Storage Engine Utilities for Microsoft(R) Exchange Server
Version 14.02
Copyright (C) Microsoft Corporation. All Rights Reserved.

Initiating FILE DUMP mode…

      Base name: E00
      Log file: E00000001E6.log
      lGeneration: 486 (0x1E6)
      Checkpoint: (0x360,80,0)
      creation time: 08/05/2012 06:25:24
      prev gen time: 08/05/2012 06:25:24
      Format LGVersion: (7.3704.16.2)
      Engine LGVersion: (7.3704.16.2)
      Signature: Create time:07/02/2012 08:31:36 Rand:2868451 Computer:
      Env SystemPath: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998
      Env LogFilePath: C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998
      Env Log Sec size: 4096 (matches)
      Env (CircLog,Session,Opentbl,VerPage,Cursors,LogBufs,LogFile,Buffers)
          (    off,   1227,  61350,  16384,  61350,   2048,    256,  29487)
      Using Reserved Log File: false
      Circular Logging Flag (current file): off
      Circular Logging Flag (past files): on
      Checkpoint at log creation time: (0x186,1,0)
      1 C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998Mailbox Database 0958497998.edb
                 dbtime: 964150 (0-964150)
                 objidLast: 4617
                 Signature: Create time:07/02/2012 08:31:36 Rand:2859643 Computer:
                 MaxDbSize: 0 pages
                 Last Attach: (0x16D,1,270)
                 Last Consistent: (0x16C,1,1F)

      Last Lgpos: (0x1e6,FF,0)

Number of database page references:  922

Integrity check passed for log file: E00000001E6.log

Operation completed successfully in 0.78 seconds.

 

In this case the checkpoint at log creation when the log file 0x1E6 was created was in log file 0x186.  Therefore the global truncation point is established at 0x186 (390).  The event establishing the global truncation point is present in the active copy’s High Availability TruncationDebug log:

 

Log Name:      Microsoft-Exchange-HighAvailability/TruncationDebug
Source:        Microsoft-Exchange-HighAvailability
Date:          8/5/2012 7:07:04 AM
Event ID:      224
Task Category: TruncationSource
Level:         Information
Keywords:     
User:          SYSTEM
Computer:      MBX-1.exchange.msft
Description:
Global truncation point for database ’46f2fd7f-4a8e-4c56-81ac-51707746dfe2′ calculated ‘390’ with the value from server ‘390’.


When the seeding of the database file is successful the Replication service on the passive copy establishes a log shipping process.  At this time the global truncation point is read to determine where log file copying must start.

 

37683 00000000 1268 Cluster.Replay LogTruncater RequestGlobalTruncationCoordination for db 46f2fd7f-4a8e-4c56-81ac-51707746dfe2 notified our lowest is 0x234, learned the global truncation is 0x186

 

The Replication service establishes that the first log file that must be copied is the checkpoint at log creation of the lowest log file found on the active copy or in simpler terms the log file identified as the global truncation point.

 

37686 01E6E81B 1268 Cluster.Replay ReplicaInstance Log copying will start from generation 390

 

The Replication service then queries the source Replication service to determine the range of log files present on the active copy.

 

37758 009BAD9B 1268 Cluster.Replay LogCopyClient LogCopyClient:QueryLogRange: 0x1e6 .. 37a

 

At this time it is determined that the necessary log files do not exist on the active copy and the copy status is set to failed even though the database seeding was successful.

 

37760 00000000 1268 Cluster.Replay ReplayApi FormatMessage( 3221489675 ) was successful. Message: The required log file 390 for Mailbox Database 0958497998MBX-2 is missing on the active copy. If you removed the log file, please replace it. If the log file is lost, the database copy will need to be reseeded using Update-MailboxDatabaseCopy.
.

Where did this log file go?  If you review the backup events from the successful backup of the single copy database the following event is noted:

 

Log Name:      Application
Source:        ESE
Date:          8/5/2012 6:37:32 AM
Event ID:      224
Task Category: ShadowCopy
Level:         Information
Keywords:      Classic
User:          N/A
Computer:      MBX-1.exchange.msft
Description:
Information Store (3172) Mailbox Database 0958497998: Deleting log files C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E000000016E.log to C:Program FilesMicrosoftExchange ServerV14MailboxMailbox Database 0958497998E00000001E5.log.
 

 

In this case the log file necessary for a successful seed operation was removed as a result of a successful full backup.  Remember that when a database has a single copy log truncation as a result of a full backup, incremental backup, or circular logging is under control of the Information Store service.  Whenever a database has multiple copies the Replication service handles these tasks.  In this instance the Information Store service utilizes different rules for log truncation than the Replication service does.  The Information Store service does not account for the global truncation point when performing log truncation.

 

This behavior cannot be altered.

 

How does one work around this issue?

 

Unfortunately once a database is converted from single copy to multiple copies and a seed operation has been completed the target database cannot be utilized.  The process must be started over again with a few modifications.

 

First the database copy should be added to the remote node with the seedingPostponed flag set to true.  This will update Active Directory with the additional database copy and will attempt to inform the target Replication service that a copy was added. 

 

add-mailboxdatabasecopy “DBName” –mailboxServer “TargetServer” –seedingPostponed:$TRUE

 

Second circular logging should be enabled on the database.  This does not require the database to be dismounted and mounted since the database now has multiple copies and when using continuous replication circular logging this flag can be set dynamically. 

 

set-mailboxdatabase “DBNAME” –circularLogging:$TRUE

 

It is very important at this step that sufficient time exists for the circular logging flag to replicate through Active Directory and for the target Replication service to acknowledge that circular logging is enabled.  If subsequent steps are attempted prior to all nodes acknowledging the change the missing log file failure condition will be encountered.

 

Third the database must be seeded to the passive node.

 

update-mailboxdatabasecopy “DBNameServer” –deleteExistingFiles:$TRUE

 

Lastly the health of the copy can be verified with get-mailboxdatabsecopystatus.  Here is an example of a successful seeding operation of our example database that experienced this issue and utilized this process:

 

[PS] C:Windowssystem32>Get-MailboxDatabaseCopyStatus

Name                                          Status          CopyQueue ReplayQueue LastInspectedLogTime   ContentIndex
                                                              Length    Length                             State
—-                                          ——          ——— ———– ——————–   ————
Mailbox Database 1608324257MBX-2             Mounted         0         0                                  Healthy
Mailbox Database 0958497998MBX-2             Healthy         0         0           8/5/2012 7:47:08 AM    Healthy

Why does this process work?

 

At the end of a seed operation where circular logging is enabled the replication service simply begins log copying at the first log file found.  It is not concerned with maintaining the same log states between nodes as it is when circular logging is not enabled.  Therefore enabling circular logging in this situation allows us to bypass this problem.

 

Once seeding is successful circular logging can be dynamically disabled using the set-mailboxdatabase command.

 

set-mailboxdatabase “DBNAME” –circularLogging:$FALSE

 

When circular logging is disabled a backup should be taken if desired.  When circular logging was utilized by this process log files were truncated off of both copies leaving previous backups taken unable to be utilized for roll forward recovery.  The next backup taken would become the recovery point for roll forward recovery should a restoration be necessary.  Until the backup is taken the database availability group should be sufficient to provide primary data protection.

Exchange 2010: The mystery of the 9223372036854775766 copy queue…

Database Copy Self-Protection Mode

 

What would you say if I told you that a copy queue length (CQL) of 9 quintillion log files was not a bad thing? By the way, if you are wondering, that works out to a CQL of 9,223,372,036,854,775,766, a number so big that it cannot be entered into Windows Calculator, and so big that if it represented actual transaction log files, it would require 8 yottabytes of storage, and sadly, as of this year, no storage or network system on the planet has reached even one thousandth of a yottabyte (which would be a zettabyte, by the way) of information.

 

But fear not, as you don’t need to start collecting all of the storage on the planet to plan for a CQL this large, but it is a real and valid value for a passive copy’s CQL, and in recent weeks, I’ve worked with a few customers who have experienced this.

 

Background

 

During regular operations, the Microsoft Exchange Information Store service (store.exe) and the Microsoft Exchange Replication service (msexchangerepl.exe) on DAG members hosting an active database copy write two values to the cluster registry at HKLMClusterExchangeActiveManagerLastLog:

 

  • DatabaseGUID with a decimal value representing the last log generated by the active copy
  • DatabaseGUID_TIME with the system time of when that log file was generated

 

Here is an example of these entries in the cluster registry:

 

image

 

To decipher these entries, you can use Get-MailboxDatabase to get the GUID for a mailbox database:

 

[PS] C:Windowssystem32>Get-MailboxDatabase dag-db0 | fl name,*guid*

Name : DAG-DB0
Guid : 2abcac37-1b5d-4b9c-8472-e33c65379698

These values are written to the cluster registry on the server hosting the active copy, and native cluster registry replication is used to propagate this information to all other DAG members. DAG members that host a passive copy of the database use this information (the last log file generated by the active copy) along with information about the last log file replicated to the passive copy, to determine the CQL for the database copy. Thus, it is critical that all DAG members have up-to-date values, as the CQL is used by Active Manager to evaluating whether or not to mount a database in response to a failover.

 

A Larger-than-Life CQL

 

In Exchange 2010 SP1, we changed how we determine the CQL. In addition to tracking the last log generated by the active copy, we also track the last time that log file was generated. This was done specifically to prevent situations in which a passive copy is not aware of log files generated by the active copy and makes automount decisions based on stale data.

 

We use the DATABASEGUID_Time entry for this purpose. If the difference between the timestamp recorded in the cluster registry and the system time on the server hosting the passive copy is off by more than 12 minutes, the Microsoft Exchange Replication service on the server hosting the passive copy places the database copy into a self-protection mode. This is done by setting the CQL for that passive copy to 9223372036854775766. Because a passive copy can be activated and automatically mounted only when its CQL is equal to or less than the configured value for AutoDatabaseMountDial, this has the effect of preventing the passive copy from ever mounting automatically. After all, a value of 9223372036854775766 will always be higher than any possible value for AutoDatabaseMountDial.

 

Where Did the Time Go?

 

So why would a condition exist that causes the time discrepancy to be greater than 12 minutes in the first place? This can actually happen for a few reasons:

 

  • The Cluster service on the server hosting the active copy might be having a problem writing updates even though the node remains in cluster membership.
  • The Cluster service on the server hosting the passive copy might be having a problem receiving updates even though they remain in cluster membership.
  • The Information Store service and Exchange Replication service could be stopped on the server hosting the active copy. (Remember that a copy that is “active” simply signifies the node that owns the copy not the actual mount / dismount state of the database).
  • A datacenter switchover is being performed and more than 12 minutes have elapsed between the time when the failed DAG members were stopped and when the remote DAG members were activated.

 

What to Do When CQL Reaches 9223372036854775766?

 

To recover from this situation, an administrator can simply perform a database switchover. Note, though, that the switchover will need to be performed using the Exchange Management Shell, as the administrator will need to force the move by using multiple switches and the MountDialOverride parameter. Because the following command skips all of the built-in safety checks for a passive copy, it should be used only when you know that the copies to be activated were healthy prior to the copy going into self-protection mode.

 

Attempted move without overrides:

 

Move-ActiveMailboxDatabase DAG-DB0 -ActivateOnServer DAG-2

Confirm
Are you sure you want to perform this action?
Moving mailbox database "DAG-DB0" from server "DAG-1.domain.com" to server "DAG-2.domain.com".
[Y] Yes  [A] Yes to All  [N] No  [L] No to All  [?] Help (default is "Y"): y

Identity        ActiveServerAtS ActiveServerAtE Status     NumberOfLogsLost   RecoveryPoint MountStatus MountStatus
                tart            nd                                            Objective     AtMoveStart AtMoveEnd
——–        ————— ————— ——     —————-   ————- ———– ———–
DAG-DB0         dag-1           dag-1           Failed                                      Dismounted  Dismounted
An Active Manager operation failed. Error The database action failed. Error: An error occurred while trying to validate the specified database copy for possible activation. Error: Database copy ‘DAG-DB0’ on server ‘DAG-2.domain.com’ has a copy queue length of 9223372036854725486 logs, which is too high to enable automatic recovery. You can use the Move-ActiveMailboxDatabase cmdlet with the -SkipLagChecks and -MountDialOverride parameters to move the database with loss. If the database isn’t mounted after successfully running Move-ActiveMailboxDatabase, use the Mount-Database cmdlet to mount the database.. [Database: DAG-DB0, Server: DAG-2.domain.com]
    + CategoryInfo          : InvalidOperation: (DAG-DB0:ADObjectId) [Move-ActiveMailboxDatabase], AmDbActionWrapperException
    + FullyQualifiedErrorId : 3F936D4B,Microsoft.Exchange.Management.SystemConfigurationTasks.MoveActiveMailboxDatabase

 

Successful move with overrides:

 

Move-ActiveMailboxDatabase DAG-DB0 -ActivateOnServer DAG-2 -SkipHealthChecks -SkipActiveCopyChecks -SkipClientExperienceChecks -SkipLagChecks -MountDialOverride:BESTEFFORT

Confirm
Are you sure you want to perform this action?
Moving mailbox database "DAG-DB0" from server "DAG-1.domain.com" to server "DAG-2.domain.com".
[Y] Yes  [A] Yes to All  [N] No  [L] No to All  [?] Help (default is "Y"): y

Identity        ActiveServerAtS ActiveServerAtE Status     NumberOfLogsLost   RecoveryPoint MountStatus MountStatus
                tart            nd                                            Objective     AtMoveStart AtMoveEnd
——–        ————— ————— ——     —————-   ————- ———– ———–
DAG-DB0         dag-1           dag-2           Succeeded  922337203685472… 5/29/2012 … Dismounted  Dismounted

 

At this point, the database can be mounted on the remote server after moving the active copy.

Of course, the ultimate question is: why 9223372036854725486? The value for CQL must be a 64-bit integer that cannot be null; therefore, we chose a value close to maxInt64 that is so large that it prevents a potentially out-of-date copy from being activated.