Monthly Archives: March 2012

Backups fail due to consistency check failure…

Last week I had the opportunity to work with a customer who was experiencing issues backing up their Exchange 2010 databases.  The issue they experienced though is relevant to both Exchange 2007 and Exchange 2003 installations (that leverage VSS based backups and consistency checking enabled).

 

After reviewing the logs it was apparent that the VSS process was functioning appropriately.  All relevant events regarding the snapshot process were present.  In this case the backup job was configured for consistency check, and relevant consistency check events were noted.  In almost all backup jobs the following error was present in the logs:

 

Log Name: Application
Source: Storage Group Consistency Check
Event ID: 403
Task Category: Termination
Level: Error
Keywords: Classic
Description:
Instance: The physical consistency check successfully validated 0 out of xxxxxxxx pages of database ‘DATABASE’. Because some database pages were either not validated or failed validation, the consistency check has been considered unsuccessful.

 

In general this event would indicate that consistency check encountered an error when scanning the pages of an Exchange database.  In most cases this would mean that there is page level corruption in the database such that the validation checks performed by consistency check would fail and the backup would be terminated.  This is by design.

 

In theory corruption of this type would not be present in the environment configured.  The customer was utilizing a Database Availability Group which has protections in it to self heal databases from this type of corruption.  Replication was healthy and there were no indication that any page corrections were performed.

 

If you look at the event in greater detail you will see that it provides the number of pages that were successfully scanned before the issue occurred.  When reviewing the application logs it was noted that on the same database the failure occurred after scanning a different number of pages.  For example, in one failure the failure occurred after scanning 28000 pages and another failure 42456 pages.

 

At this point when reviewing the system log the following error was noted:

 

Time:     1/9/2012 12:40:56 PM
ID:       36
Level:    Error
Source: volsnap
Machine:  server.company.com
Message:  The shadow copies of volume F: were aborted because the shadow copy storage could not grow due to a user imposed limit.

 

This error would imply that while attempting to store differential changes while the snapshot existed the allotted snapshot storage space was exhausted and could not be grown.  When reviewing vssadmin list shadowstorage it was noted that the shadow storage space assigned to the volume hosting the database was 321 megabytes.

 

vssadmin list shadowstorage

Shadow Copy Storage association
   For volume: (F:)\?Volume{0ecc7a68-be78-4c40-baf6-4d0d3b0b6693}
   Shadow Copy Storage volume: (H:)\?Volume{ed074b1d-b500-465b-a720-d2f733f49761}
   Used Shadow Copy Storage space: 0 B (0%)
   Allocated Shadow Copy Storage space: 0 B (0%)
   Maximum Shadow Copy Storage space: 321 MB (0%)

 

This is an extremely small shadow copy storage space.  By default the allotted space is generally 10% of volume size.  To correct this issue we can utilize the vssadmin command in order to reset the shadow storage space.

 

vssadmin Resize ShadowStorage /For=F: /On=F: /maxsize=20%

Successfully resized the shadow copy storage association

In our case the in-ability to continue to store differential changes in the shadow storage space caused the shadow copy to be removed.  This subsequently caused consistency check to fail resulting in a failure of the backup job.  Once the shadow copy storage was was allocated to an appropriate size, and differential changes could be successfully stored for the entire duration of the backup operation, the backups proceeded successfully.

Exchange 2010: Log Truncation and Checkpoint At Log Creation in a Database Availability Group

In previous versions of Exchange, when a backup was completed, almost all log files prior to the current log file were truncated from the system.  Administrators monitoring the directory would originally see many logs, and post backup note that only a few logs remained.  In Exchange 2010 Service Pack 1 and later administrators note that multiple log files remain on the disk post backup or the appearance that no log files have truncated at all.  In many cases this leads to a belief that logs are actually not truncating successfully or that there is an issue with backups.

 

Why do we see logs remaining on disk for longer?  Exchange 2010 SP1 and newer introduces a change in the behavior of log truncation.  The changes were taken to ensure that replicated copies of databases within a database availability group always had the appropriate log files on the source server to complete an incremental resynchronization. 

 

The change to log truncation is the tracking of Checkpoint At Log Creation.  Remember that in a database availability group we can expect the checkpoint to be approximately 100 logs (or slightly more) off the current log file – this is known as checkpoint depth.  As Exchange creates new log files we stamp into the header of the new log files what log file the checkpoint was pointing at when the current log was created.  For example, let us say that log file 0xA679 (42617) was just created as the current ENN.log.  We can expect that the checkpoint at log creation value stamped within the header of this log file would be approximately 0xA16 (42517).  You can see the checkpoint at log creation value by using eseutil /ml <logfilename> to dump the header of a log file.

 

[PS] P:DAGDAG-DB0DAG-DB0-Logs>eseutil /ml .E020000A67E.log

Extensible Storage Engine Utilities for Microsoft(R) Exchange Server
Version 14.02
Copyright (C) Microsoft Corporation. All Rights Reserved.

Initiating FILE DUMP mode…

      Base name: E02
      Log file: .E020000A67E.log
      lGeneration: 42622 (0xA67E)
      Checkpoint: (0xA679,8,0)
      creation time: 03/11/2012 06:00:48
      prev gen time: 03/11/2012 04:01:17
      Format LGVersion: (7.3704.16.2)
      Engine LGVersion: (7.3704.16.2)
      Signature: Create time:05/02/2010 18:04:08 Rand:399094376 Computer:
      Env SystemPath: d:DAGDAG-DB0DAG-DB0-Logs
      Env LogFilePath: d:DAGDAG-DB0DAG-DB0-Logs
      Env Log Sec size: 512 (matches)
      Env (CircLog,Session,Opentbl,VerPage,Cursors,LogBufs,LogFile,Buffers)
          (    off,   1027,  51350,  16384,  51350,   2048,   2048,  29487)
      Using Reserved Log File: false
      Circular Logging Flag (current file): off
      Circular Logging Flag (past files): off
      Checkpoint at log creation time: (0xA679,8,0)
      1 d:DAGDAG-DB0DAG-DB0-DatabaseDAG-DB0.edb
                 dbtime: 18078306 (0-18078306)
                 objidLast: 2957
                 Signature: Create time:05/02/2010 18:04:08 Rand:399127765 Computer:
                 MaxDbSize: 0 pages
                 Last Attach: (0xA348,9,86)
                 Last Consistent: (0xA346,9,B5)

      Last Lgpos: (0xa67e,252,0)

Number of database page references:  770

Integrity check passed for log file: .E020000A67E.log

Operation completed successfully in 0.265 seconds.

 

In the previous example the checkpoint at log creation is 0xA679.

 

Within a DAG all servers that contain a replicated copy of a database report the maximum log file that is eligible for truncation.  These values are reported to the active node which subsequently calculates the maximum log file for truncation.  In simplest terms the following process occurs:

 

  • Passive copy on Node-2 reports OK to truncate log 0xA679 (42617).
  • Passive copy on Node-3 reports OK to truncate log file 0xA678 (42616)
  • Passive copy on Node-4 reports ok to truncate log 0xA679 (42617).
  • The active node determines that the best log file eligible for truncation based on the passive copies is 0xA678 (42616).  [This is essentially the minimum of all reported OK logs to truncate.]
  • The active node then looks at the checkpoint at log creation of 0xA678 (42616) and determines that value is 0xA614 (42516).  In this example that would be 100 logs off the best log reported for truncation of the passive copies.
  • The active node sets the truncation point to be log 0xA614 (42516).
  • Therefore after a successful backup logs prior to 0x614 (42516). would truncate.

 

This essentially means that 100 additional logs that would have previously truncated prior to this change do not truncate.

 

Taking into account checkpoint at log creation administrators can better understand how log files are truncated and why log files remain on disk after a backup that might have in prior versions been truncated.

 

============================

Update 5/16/2012

Corrected hex conversions in example.

============================

Verifying the file share witness server / directory in use for Exchange 2010

If you’ve read my blog post on file share witness oddities you might be asking yourself “how can I actually verify what file share witness is in use in my environment?” 

 

There are three different ways to verify the witness in use for Exchange 2010.

 

1)  Exchange Management Shell

 

The Get-databaseavailabilitygroup –status command returns all the settings for the DAG from Active Directory.  By adding the –status switch we also query additional values from the Cluster service and Replication service.  These are not queried by default as they can delay the command from returning normal configuration values.  Here is an example of the output:

 

[PS] C:>Get-DatabaseAvailabilityGroup -Identity DAG -Status | fl

RunspaceId                             : 717eb01d-a17b-4e8e-b018-acf00a0d748d
Name                                   : DAG
Servers                                : {DAG-4, DAG-3, DAG-2, DAG-1}
WitnessServer                          : mbx-1.domain.com
WitnessDirectory                       : c:DAG-FSW
AlternateWitnessServer                 : mbx-2.domain.com
AlternateWitnessDirectory              : c:DAG-FSW

NetworkCompression                     : Enabled
NetworkEncryption                      : Enabled
DatacenterActivationMode               : DagOnly
StoppedMailboxServers                  : {}
StartedMailboxServers                  : {DAG-3.domain.com, DAG-4.domain.com, DAG-2.domain.com, DAG-1.domain.com}
DatabaseAvailabilityGroupIpv4Addresses : {10.0.0.24}
DatabaseAvailabilityGroupIpAddresses   : {10.0.0.24}
AllowCrossSiteRpcClientAccess          : False
OperationalServers                     : {DAG-1, DAG-2, DAG-4, DAG-3}
PrimaryActiveManager                   : DAG-1

ServersInMaintenance                   : {}
ThirdPartyReplication                  : Disabled
ReplicationPort                        : 64327
NetworkNames                           : {DAG-4-iSCSI, DAG-MAPI, DAG-REPL-A, DAG-REPL-B}
WitnessShareInUse                      : Primary
AdminDisplayName                       :
ExchangeVersion                        : 0.10 (14.0.100.0)
DistinguishedName                      : CN=DAG,CN=Database Availability Groups,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),CN=Administrative Groups,CN=domain Home,CN=Microsoft Exchange`,CN=Services,CN=Configuration,DC=home,DC=domain,DC=com
Identity                               : DAG
Guid                                   : 72c87136-6721-46e6-ac43-2ad5f6bd66d2
ObjectCategory                         : domain.com/Configuration/Schema/ms-Exch-MDB-Availability-Group
ObjectClass                            : {top, msExchMDBAvailabilityGroup}
WhenChanged                            : 1/29/2012 5:34:25 PM
WhenCreated                            : 9/19/2009 6:16:52 PM
WhenChangedUTC                         : 1/29/2012 10:34:25 PM
WhenCreatedUTC                         : 9/19/2009 10:16:52 PM
OrganizationId                         :
OriginatingServer                      : DC-5.domain.com
IsValid                                : True

In this example you can see the attribute WitnessShareInUse with a value of Primary.  This lets the administrator know that the current witness configured for cluster use is the primary file share witness (in this case the witness server and witness directory).

 

[PS] C:>Get-DatabaseAvailabilityGroup -Identity DAG -Status | fl name,servers,witnessserver,witnessdirectory,alternatew
itnessserver,alternatewitnessdirectory,operationalservers,primaryactivemanager,witnessshareinuse

Name                      : DAG
Servers                   : {DAG-4, DAG-3, DAG-2, DAG-1}
WitnessServer             : mbx-1.domain.com
WitnessDirectory          : c:DAG-FSW
AlternateWitnessServer    : mbx-2.domain.com
AlternateWitnessDirectory : c:DAG-FSW
OperationalServers        : {DAG-1, DAG-2, DAG-4, DAG-3}
PrimaryActiveManager      : DAG-1
WitnessShareInUse         : Alternate

In this example you can see the attribute WitnessShareInUse with a value of Alternate.  This is an example of where the AlternateWitnessServer and AlternateWitnessDirectory are configured for cluster use.

 

[PS] C:>Get-DatabaseAvailabilityGroup -Identity DAG -Status | fl name,servers,witnessserver,witnessdirectory,alternatew
itnessserver,alternatewitnessdirectory,operationalservers,primaryactivemanager,witnessshareinuse
WARNING: The witness server and directory currently in use by database availability group ‘DAG’ doesn’t match the
configured primary or alternate witness server. This may be due to Active Directory replication latency. If this
condition persists, please use the Set-DatabaseAvailabilityGroup cmdlet to correct the configuration.

 

Name                      : DAG
Servers                   : {DAG-4, DAG-3, DAG-2, DAG-1}
WitnessServer             : mbx-1.domain.com
WitnessDirectory          : c:DAG-FSW
AlternateWitnessServer    : mbx-2.domain.com
AlternateWitnessDirectory : c:DAG-FSW
OperationalServers        : {DAG-1, DAG-2, DAG-4, DAG-3}
PrimaryActiveManager      : DAG-1
WitnessShareInUse         : InvalidConfiguration

In this example you can see the attribute WitnessShareInUse with a value of InvalidConfiguration. There is also a warning displayed indicating that the witness server in use does not match either the primary or alternate witness.   This is an indication that the file share witness was modified outside of Exchange and the settings currently in use are not correct.  Administrators can correct this by running the set-databaseavailabilitygroup command.

 

 

2)  Utilize cluster commands

 

Windows 2008 / Windows 2008 R2

 

Using the command prompt execute the cluster <DAGNAME> res command.  This will output all the resources within the cluster.

 

[PS] C:>cluster dag.domain.com res
Listing status for all available resources:

Resource             Group                Node            Status
——————– ——————– ————— ——
Cluster IP Address   Cluster Group        DAG-1           Online
Cluster Name         Cluster Group        DAG-1           Online
File Share Witness   Cluster Group        DAG-1           Online

Highlighted in red you can see the display name of the File Share Witness resource.  With this information you can run the command cluster <DAGNAME> res “file share witness display name” /priv

 

[PS] C:>cluster dag.domain.com res "File Share Witness" /priv

Listing private properties for ‘File Share Witness’:

T  Resource             Name                           Value
— ——————– —————————— ———————–
S  File Share Witness   SharePath                      \mbx-1.domain.comDAG.domain.com
D  File Share Witness   ArbitrationDelay               6 (0x6)

This command lists the private properties within cluster associated with the file share witness resource.  In our case we are interested in the SharePath.  According to this output the current file share witness server is MBX-1.

 

Windows 2008 R2

 

Using powershell import the FailoverClusters modules.

 

[PS] C:>Import-Module FailoverClusters

 

Issue the command Get-ClusterQuorum –cluster <DAGNAME> | fl

 

[PS] C:>Get-ClusterQuorum -Cluster DAG.home.e-mcmichael.com | fl

Cluster        : DAG
QuorumResource : File Share Witness
QuorumType     : NodeAndFileShareMajority

Highlighted in red is the display name of the file share witness resource.  Using the command Get-ClusterResource “Display Name” –cluster <DAGNAME> | Get-ClusterParameter

 

[PS] C:>Get-ClusterResource "File Share Witness" -Cluster DAG.home.e-mcmichael.com | Get-ClusterParameter

Object                        Name                          Value                         Type
——                        —-                          —–                         —-
File Share Witness            SharePath                     \mbx-1.domain.c… String
File Share Witness            ArbitrationDelay              6                             UInt32

This command lists the private properties within cluster associated with the file share witness resource. In our case we are interested in the SharePath. According to this output the current file share witness server is MBX-1.

 

 

3)  Failover Cluster Manager

 

Using Failover Cluster Manager connect to the cluster service.  You can connect to either a node or specify the DAG name as the connection point.   

Click on the cluster name in the upper left hand corner of the utility.

In the center window information is displayed regarding the cluster configuration.

One piece of information is the “Quorum Configuration”.  This will list the type of quorum in use and if a file share witness is configured the server and share name utilized as the witness.

 

image

 

In this example you can see that the cluster is configured for a quorum type of Node and File Share Majority with the file share witness server MBX-1.

Exchange and VSS — My Exchange writer is in a failed retryable state…

In Exchange 2007 and Exchange 2010 many customers are leveraging VSS based backups to retain and protect their Exchange data.  By default Exchange provides two different VSS writers that share the same VSS writer ID but are loaded by two different services.  The first is the Exchange Information Store VSS writer and the second is the Exchange Replication Service VSS writer.  The Information Store writer allows for the backup of active / mounted databases and the replication service writer allows for the backup of passive databases (should a replicated database model be utilized).  You can see the writers by running the command VSSADMIN LIST WRITERS from a command prompt.

 

Here is a sample put of a VSSAdmin List Writers from a Windows 2008 R2 SP1 server with Exchange 2010 SP1.  Note how both writers share the same writer ID within the VSS framework.

 

Writer name: ‘Microsoft Exchange Replica Writer’
   Writer Id: {76fe1ac4-15f7-4bcd-987e-8e1acb462fb7}
   Writer Instance Id: {17e8df11-a8a2-4ee3-a3fb-e552b7da2d83}
   State: [1] Stable
   Last error: No error

 

Writer name: ‘Microsoft Exchange Writer’
   Writer Id: {76fe1ac4-15f7-4bcd-987e-8e1acb462fb7}
   Writer Instance Id: {e0ad4b68-8938-4be5-9b88-4c74df2b2d65}
   State: [1] Stable
   Last error: No error

In the course of protecting Exchange servers there maybe conditions that cause a backup job to fail.  When an Exchange backup job fails the VSS framework aborts the backup and subsequently Exchange clears the backup in progress settings.  When a failure is encountered either a single Exchange writer or both Exchange writers maybe left in a FAILED RETRYABLE state.  We can utilize VSSAdmin List Writers again to query the writer status and see these results.  Here is an example showing the Exchange Replication Service writer with a status 8 FAILED last error RETRYABLE.

 

Writer name: ‘Microsoft Exchange Replica Writer’
   Writer Id: {76fe1ac4-15f7-4bcd-987e-8e1acb462fb7}
   Writer Instance Id: {17e8df11-a8a2-4ee3-a3fb-e552b7da2d83}
   State: [8] Failed
   Last error: Retryable error

 

Writer name: ‘Microsoft Exchange Writer’
   Writer Id: {76fe1ac4-15f7-4bcd-987e-8e1acb462fb7}
   Writer Instance Id: {e0ad4b68-8938-4be5-9b88-4c74df2b2d65}
   State: [1] Stable
   Last error: No error

 

Now the typical question that comes up at this point is how do I actually deal with an Exchange writer that consistently disallows backups.  The answer – restart the service that the writer was associated with and/or fix whatever configuration issue is causing the failures.  For example, given the above output I would restart the Exchange Replication Service in an attempt to return the writer to a Stable No Error state.  (If it would have been the Microsoft Exchange Writer I would have restarted the Exchange Information Store Service).

The real question though is do I need to deal with a writer that is in a failed state?  Unfortunately many administrators find themselves having to deal with a writer in a failed state because their experience is that while the writer is in a failed state subsequent backup jobs fail.  If reviewing the issues carefully what you’ll find is that the backup jobs are not failing because of a VSS failure but rather they are failing because a writer was found in a failed state.  From an Exchange / VSS perspective this is unexpected –> after all although the writer is failed the error is RETRYABLE –> essentially saying “hey…something failed but come on back and try me again…”

 

Let’s take a look at why this might be happening….

 

Within the VSS framework there are two states that we are interested in –> the Session State and the Current State.  When a VSS session is in progress, and an administrator runs VSSAdmin List Writers, the state that is displayed is the current session state.  When the VSS snapshot creation has completed, the current state becomes a session specific state and the status of the most recently completed session is copied to the current state.  At this point when the administrator runs VSSAdmin List Writers the state of the most recently completed session is displayed.  This is an important distinction  –>  the SESSION STATE AT THIS POINT REFLECTS THE STATUS OF THE LAST SESSION!  The status of the last session does not imply anything in regards to the success <or> failure of future sessions.

Now that we know where VSSAdmin List Writers gets its information we’ll take a look at how the backup process should progress.  (I’m going to attempt to present an overly simplified timeline of an expected backup)

The process starts with the VSS requester establishing a VSS session. 

 

image

 

After the session is established the VSS requester requests metadata from the VSS framework.

 

image

 

At this point the VSS request and VSS framework further progress the snap shot process by determining components and preparing the snapshot set.

 

image

 

Once the components and snapshot sets have been prepared the VSS requester issues a PrepareForBackup.  This in turns causes the VSS framework to prepare the components for backup.

 

image

 

After prepare backup is called the individual application level writers are now responsible for current writer status.  The VSS requester is now allowed to call GatherWriterStatus.  This call in turn should return the current writer status.  For example, current writer status at this stage could be FREEZE / THAW / etc.  This is regardless of if the previous status was FAILED or HEALTHY.  This is the status that the VSS requester should be utilizing to make logic decisions at this point.

 

image

 

Once the snapshot is created the contents can then be transferred to the backup media.  Once the transfer is complete, the VSS requester can inform the VSS framework that a backup has completed successfully and subsequently the VSS session ended.

 

image

 

In summary if the VSS requester is performing operations in an order that is expected, the writer status should be queried after the framework has received a prepare for backup event.  This will ensure the writer status reflects that of the CURRENT SESSION IN PROGRESS and not the SESSION STATE OF THE PREVIOUS BACKUP.

 

The administrator can verify the functionality of the Exchange writer by utilizing the VSHADOW or DISKSHADOW utilities.  These utilities utilize the workflow outlined in the successful handling of a failed retryable writer case.  If either of these utilities are successful in creating the backup, and the writer in turn is returned to a healthy state you might consider following up with the backup vendor to ensure VSS calls are being made appropriately.  Microsoft can also assist you in verifying the calls are made appropriately through assisting with both Exchange and OS VSS tracing.