Author Archives: TIMMCMIC

Inconsistent results when enabling standby continuous replication (SCR) in Exchange 2007 SP1.

A common question I receive from customers is why they see inconsistent results when they enable a storage group in Exchange 2007 SP1 for standby continuous replication.  Usually the conversation focuses on why replication instances initially show as failed and then become healthy soon after, or why the replication service reports that databases are not configured for standby continuous replication even though the command ran successfully.

 

Standby continuous replication was introduced in Exchange 2007 SP1 as a way to replicate databases from any mailbox role source to an independent mailbox role target.  Most commonly customers implement this technology as part of a broader site resiliency plan.  Information regarding standby continuous replication can be found here:  http://technet.microsoft.com/en-us/library/bb676502.aspx.

 

The command used to enable standby continuous replication is enable-storagegroupcopy -identity <storagegroup> -standbymachine <target>.  More information on this cmdlet can be found here:  http://technet.microsoft.com/en-us/library/bb123684.aspx.

 

When the enable-storagegroupcopy command is used, an attribute on the storage group is updated in Active Directory.  The attribute is msExchStandbyCopyMachines.  This is a multi-valued attribute, reflecting that a database can be replicated to multiple SCR targets.  When the command runs successfully, the target name is written to the attribute, along with values representing TruncationLagTime and ReplayLagTime.  Here is a sample LDP dump of a storage group enabled for SCR.

 

========================================

Expanding base ‘CN=2008-MBX1-SG1,CN=InformationStore,CN=2008-MBX1,CN=Servers,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),CN=Administrative Groups,CN=Exchange,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=exchange,DC=msft’…
Getting 1 entries:
Dn: CN=2008-MBX1-SG1,CN=InformationStore,CN=2008-MBX1,CN=Servers,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),CN=Administrative Groups,CN=Exchange,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=exchange,DC=msft
cn: 2008-MBX1-SG1;
distinguishedName: CN=2008-MBX1-SG1,CN=InformationStore,CN=2008-MBX1,CN=Servers,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),CN=Administrative Groups,CN=Exchange,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=exchange,DC=msft;
dSCorePropagationData: 0x0 = (  );
instanceType: 0x4 = ( WRITE );
msExchESEParamBaseName: E00;
msExchESEParamCheckpointDepthMax: 20971520;
msExchESEParamCircularLog: 0;
msExchESEParamCommitDefault: 0;
msExchESEParamCopyLogFilePath: f:\2008-MBX1\2008-MBX1-SG1-Logs-LCR;
msExchESEParamCopySystemPath: e:\2008-MBX1\2008-MBX1-SG1-System-LCR;
msExchESEParamDbExtensionSize: 256;
msExchESEParamEnableIndexChecking: TRUE;
msExchESEParamEnableOnlineDefrag: TRUE;
msExchESEParamEventSource: MSExchangeIS;
msExchESEParamLogFilePath: d:\2008-MBX1\2008-MBX1-SG1-Logs;
msExchESEParamLogFileSize: 1024;
msExchESEParamPageFragment: 8;
msExchESEParamPageTempDBMin: 0;
msExchESEParamSystemPath: d:\2008-MBX1\2008-MBX1-SG1-System;
msExchESEParamZeroDatabaseDuringBackup: 0;
msExchHasLocalCopy: 1;
msExchMinAdminVersion: -2147453113;
msExchStandbyCopyMachines: 2008-MBX2.exchange.msft;1;1.00:00:00;00:00:00;
msExchVersion: 4535486012416;
name: 2008-MBX1-SG1;
objectCategory: CN=ms-Exch-Storage-Group,CN=Schema,CN=Configuration,DC=exchange,DC=msft;
objectClass (3): top; container; msExchStorageGroup;
objectGUID: 7dd4c453-9052-43c6-9e18-845f8e616520;
showInAdvancedViewOnly: TRUE;
systemFlags: 0x40000000 = ( CONFIG_ALLOW_RENAME );
uSNChanged: 57771;
uSNCreated: 33269;
whenChanged: 9/17/2008 4:55:12 PM Eastern Standard Time;
whenCreated: 9/15/2008 6:11:32 PM Eastern Standard Time;

———–

========================================
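To make the msExchStandbyCopyMachines value in the dump easier to read, here is a small Python sketch that splits it into its parts.  Python is used only for illustration.  The class and function names are my own, and the mapping of the third field to ReplayLagTime and the fourth to TruncationLagTime is an assumption, based on the SCR defaults of a 24 hour replay lag and no truncation lag.  The meaning of the second field is not covered in this post, so it is kept as a raw string.

```python
from dataclasses import dataclass
from datetime import timedelta


def parse_timespan(text: str) -> timedelta:
    """Parse a [d.]hh:mm:ss string such as '1.00:00:00' or '00:00:00'.

    Fractional seconds are not handled in this sketch.
    """
    hours_part, minutes, seconds = text.split(":")
    days = 0
    if "." in hours_part:
        day_part, hours_part = hours_part.split(".", 1)
        days = int(day_part)
    return timedelta(days=days, hours=int(hours_part),
                     minutes=int(minutes), seconds=int(seconds))


@dataclass
class StandbyCopyEntry:
    target: str                # FQDN of the SCR target
    second_field: str          # meaning not documented here; kept as-is
    replay_lag: timedelta      # ASSUMED: third field
    truncation_lag: timedelta  # ASSUMED: fourth field


def parse_standby_copy_value(value: str) -> StandbyCopyEntry:
    """Split one value of the multi-valued attribute into its fields."""
    fields = [f for f in value.strip().split(";") if f]
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {value!r}")
    target, second, third, fourth = fields
    return StandbyCopyEntry(target, second,
                            parse_timespan(third), parse_timespan(fourth))


entry = parse_standby_copy_value(
    "2008-MBX2.exchange.msft;1;1.00:00:00;00:00:00;")
print(entry.target, entry.replay_lag, entry.truncation_lag)
```

Because the attribute is multi-valued, a storage group with several SCR targets would simply produce one such entry per value.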

The enable-storagegroupcopy command does not interact directly with the replication service to start a new replication instance.  Internal to each replication service is a configuration update process.  When the configuration update process runs, the replication service reads Active Directory to determine which database instances need to be replicated.  A list is generated and compared to the instances the replication service is already running.  When a new replication instance is found, the replication service spawns the instance.  When an existing instance is no longer configured for replication, the replication service destroys that instance.
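The compare step described above is essentially a set difference.  The following Python sketch is only my illustration of that idea, not the replication service's actual code; the function name and the storage group names are hypothetical.

```python
def plan_instance_changes(configured, running):
    """Compare what the directory says should replicate with what is running.

    configured: instance names read from the directory (hypothetical input)
    running:    instance names the service currently hosts
    """
    configured, running = set(configured), set(running)
    return {
        # configured but not yet running: spawn these instances
        "start": sorted(configured - running),
        # running but no longer configured: destroy these instances
        "stop": sorted(running - configured),
    }


plan = plan_instance_changes(
    configured=["2008-MBX1-SG1", "2008-MBX1-SG2"],
    running=["2008-MBX1-SG2", "2008-MBX1-SG3"],
)
print(plan)
```

The important point is that nothing happens until the next pass of this comparison, which is why the timing of the update process matters so much.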

For standby continuous replication, the configuration update process runs every 30 seconds on the source and every 3 minutes on the target.

On the source machine, when the configuration update process runs and determines that a database on the source has been enabled for SCR the replication service will create the file shares necessary for the target to access the source and replicate logs.

On the target machine, when the configuration update process runs and determines that a database is enabled for SCR and replicated to that machine, the instance is added to the replication service and the replication service begins the process of copying logs etc.

This is where customers start to experience inconsistent results.  Standby continuous replication depends on reading an Active Directory attribute, and therefore also on the time it takes that attribute to replicate to a domain controller used by the source and a domain controller used by the target.  Until the attribute has replicated to both locations, and the configuration update process has run in both locations, standby continuous replication is not fully enabled for the storage group.
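A rough way to reason about the wait is to add the Active Directory replication delay for each side to that side's configuration update interval and take the larger of the two.  The Python sketch below does that arithmetic.  It is a simplification under my own assumptions: the replication delay is treated as a fixed upper bound, and each side is assumed to wait a full update interval in the worst case.  The function name is hypothetical.

```python
SOURCE_UPDATE_INTERVAL_S = 30    # configuration update interval on the source
TARGET_UPDATE_INTERVAL_S = 180   # configuration update interval on the target


def worst_case_seconds(ad_delay_to_source_s, ad_delay_to_target_s):
    """Estimate the worst-case time until each side has acted on the change.

    The delay for a side is 0 when the command was run against a domain
    controller in that side's own site.
    """
    source_ready = ad_delay_to_source_s + SOURCE_UPDATE_INTERVAL_S
    target_ready = ad_delay_to_target_s + TARGET_UPDATE_INTERVAL_S
    return {
        "source": source_ready,
        "target": target_ready,
        "fully_enabled": max(source_ready, target_ready),
    }


# Command run in the target's site, 15 minutes (900 s) of AD delay to the source.
print(worst_case_seconds(ad_delay_to_source_s=900, ad_delay_to_target_s=0))
# Command run in the source's site, 15 minutes of AD delay to the target.
print(worst_case_seconds(ad_delay_to_source_s=0, ad_delay_to_target_s=900))
```

In the first case the target is ready within 3 minutes but the source may take about 15.5 minutes, and in the second the source is ready within 30 seconds but the target may take about 18 minutes.  That gap is the window in which the status looks wrong.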

I have found three scenarios that show inconsistent results.

Example #1:

In this example we have a standalone mailbox server in SiteA.  The SCR target for this standalone mailbox server is in SiteB.  SiteA and SiteB are separate Active Directory sites with a 15 minute replication delay between them.  On the SCR target machine the administrator runs enable-storagegroupcopy -standbymachine, and the command completes successfully.  After a few minutes get-storagegroupcopystatus -standbymachine is run, and all replicated storage groups appear in a FAILED state.

 

Name           SummaryCopyStatus  CopyQueueLength  ReplayQueueLength  LastInspectedLogTime
----           -----------------  ---------------  -----------------  --------------------
2008-MBX1-SG1  Failed             0                0
2008-MBX1-SG2  Failed             0                0
                       

 

A review of the source shows that the shares necessary to replicate log files were not created.  At this time the admin steps away for 30 minutes and comes back to check replication again with get-storagegroupcopystatus -standbymachine.  It is noted that all storage groups appear healthy, and that the shares necessary to copy logs exist on the source.

 

Name           SummaryCopyStatus  CopyQueueLength  ReplayQueueLength  LastInspectedLogTime
----           -----------------  ---------------  -----------------  --------------------
2008-MBX1-SG1  Healthy            5                25
2008-MBX1-SG2  Healthy            1                50
                       

 

The behavior here is by design.  When the administrator enabled SCR on the target machine, the command stamped the msExchStandbyCopyMachines attribute on a domain controller in SiteB.  Within a 3 minute window the replication service on the target machine runs the configuration update process, detects the new replication instance, and starts attempting to copy logs.  The attribute, however, has not yet replicated to a domain controller in SiteA, so the replication service in SiteA does not know to create the shares needed to service replication.  This results in the replication instances being marked FAILED.  After 30 minutes, Active Directory replication has had time to occur, and the configuration update process on the source has run, detected the new replication instance, and created the shares.  At this point the replication service on the target can access the logs on the source, and the replication instances are marked HEALTHY.  (Note that the same example applies to a single copy cluster (SCC) source.)

 

Example #2:

 

In this example we have a standalone mailbox server in SiteA.  The SCR target for this standalone mailbox server is in SiteB.  SiteA and SiteB are separate Active Directory sites with a 15 minute replication delay between them.  On the SCR source machine the administrator runs enable-storagegroupcopy -standbymachine, and the command completes successfully.  After a few minutes get-storagegroupcopystatus -standbymachine is run, and all replicated storage groups appear in a NOTCONFIGURED state.

 

Name           SummaryCopyStatus  CopyQueueLength  ReplayQueueLength  LastInspectedLogTime
----           -----------------  ---------------  -----------------  --------------------
2008-MBX1-SG1  NotConfigured      0                0
2008-MBX1-SG2  NotConfigured      0                0
                       

 

A review of the source shows that the shares necessary to replicate log files are created.  At this time the admin steps away for 30 minutes and comes back to check replication again with get-storagegroupcopystatus -standbymachine.  It is noted that all storage groups appear healthy.

 

Name           SummaryCopyStatus  CopyQueueLength  ReplayQueueLength  LastInspectedLogTime
----           -----------------  ---------------  -----------------  --------------------
2008-MBX1-SG1  Healthy            5                25
2008-MBX1-SG2  Healthy            1                50
                       

 

The behavior here is by design.  When the administrator enabled SCR on the source machine, the command stamped the msExchStandbyCopyMachines attribute on a domain controller in SiteA.  Within a 30 second window the replication service on the source machine runs the configuration update process, detects the new replication instance, and creates the shares on the source.  The attribute, however, has not yet replicated to a domain controller in SiteB, so the replication service in SiteB is not aware of the replication instances and responds NotConfigured when queried for status.  After 30 minutes, Active Directory replication has had time to occur, and the configuration update process on the target has run, detected the new replication instances, and started the replication process.  At this point the replication service is aware of the instances and responds with a healthy status when queried.  (Note that the same example applies to a single copy cluster (SCC) source.)

 

Example #3:

 

In this example we have a cluster continuous replication (CCR) source in SiteA.  The SCR target for this CCR source is located in SiteB.  SiteA and SiteB are separate Active Directory sites with a 15 minute replication delay between them.  On the SCR target machine, the administrator runs enable-storagegroupcopy -standbymachine, and the command completes successfully.  After a few minutes get-storagegroupcopystatus -standbymachine is run, and all replicated storage groups appear HEALTHY.

 

Name           SummaryCopyStatus  CopyQueueLength  ReplayQueueLength  LastInspectedLogTime
----           -----------------  ---------------  -----------------  --------------------
2008-MBX1-SG1  Healthy            5                25
2008-MBX1-SG2  Healthy            1                50
                       

 

This is different from the example outlined in Example #1.  In this example the source is a CCR cluster.  In order for a CCR cluster to replicate log files between the two source nodes, the shares must exist.  Since the shares already exist, we only have to wait for the replication service configuration update process to run on the target machine.  AD replication here is not a factor when a target domain controller is used for the enable-storagegroupcopy command.
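The three examples can be condensed into a small decision function.  This Python sketch is only my summary of the behavior described in this post; the function and parameter names are hypothetical.  It returns the status I would expect get-storagegroupcopystatus to show on the target in the window before Active Directory replication has completed.  The case of a CCR source with the command run in the source site is not one of the three examples, so the result for it is my inference from Example #2.

```python
def interim_status(command_site, source_shares_exist):
    """Expected interim status on the SCR target before AD replication completes.

    command_site:        "source" or "target", the site whose domain
                         controller received the attribute change
    source_shares_exist: True when the log shares already exist on the
                         source (for example a CCR source)
    """
    if command_site not in ("source", "target"):
        raise ValueError("command_site must be 'source' or 'target'")
    if command_site == "source":
        # Target has not seen the attribute yet (Example #2).
        return "NotConfigured"
    if source_shares_exist:
        # Target knows about the instance and the shares are there (Example #3).
        return "Healthy"
    # Target tries to copy logs but the source has no shares yet (Example #1).
    return "Failed"


for site, shares in [("target", False), ("source", False), ("target", True)]:
    print(site, shares, interim_status(site, shares))
```

In every case the final state after replication and the update passes is Healthy; only the interim status differs.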

 

This information should help administrators explain some of the results of the SCR process and make decisions on where enabling should be performed.

Exchange 2007 SP1 CCR- Windows 2008 Clusters – File Share Witness (FSW) failures.

Exchange 2007 SP1 Cluster Continuous Replication (CCR) clusters built on Windows 2008 most commonly use the quorum model "Node Majority with File Share Witness".

 

In the past few weeks administrators have noticed that the file share witness resource is sometimes in a failed state.  Under normal circumstances this is not an issue, since both nodes of the cluster are available (meaning that there are two votes) and the cluster service stays running.  Where this becomes an issue is when the file share witness resource is failed and a node of the cluster is also unavailable.

 

Some background….

 

In Windows 2003 the file share witness is a private property of the "majority node set" quorum resource.  If the file share witness was unavailable, an event would be logged but the resource would remain online.  (Reference http://support.microsoft.com/?kbid=921181 for information on the file share witness on Windows 2003).  The file share witness is only used when necessary to maintain quorum for the cluster.  If the witness was still unavailable at that time, the cluster would lose quorum and the cluster service would be terminated.

 

In Windows 2008, when using the node majority with file share witness quorum model, the file share witness resource is enumerated as an actual resource.  It can be seen in failover cluster manager -> cluster core resources.

 

[Figure: Cluster core resources.]

 

Now that the file share witness exists as a resource, cluster can do additional health checking.  One of the health checks ensures that the file share witness folder is online and accessible.  If the FSW folder is not accessible, cluster will fail the FSW resource and attempt to bring the FSW online on the other node.  If the FSW folder remains unavailable, cluster will fail the resource there as well.  By default, cluster will attempt to restart any failed resource every 60 minutes (1 hour).  If the resource continues to fail to come online, it will remain failed on that node until an administrator intervenes or the resource can be brought online during one of the 60 minute retries.

 

[Figure: Failed file share witness resource in cluster core resources.]

 

As in Windows 2003, the file share witness in Windows 2008 is only accessed when it is necessary for the cluster to maintain quorum.  If the file share witness resource is in a failed state when the FSW becomes necessary to maintain quorum, Windows 2008 will not attempt to access the file share witness, resulting in a lost quorum state and termination of the cluster service.

 

Impact to Exchange…

 

Where this seems to impact Exchange administrators the most is during patch management.  During patch management, patches are applied to the hub transport server owning the file share which supports the FSW resource in the cluster.  When the server is rebooted, the file share witness is no longer available.  Cluster performs status checking, and determines that the FSW is not available.  At this point the Core Cluster Group, containing the FSW, is failed over to the second node.  If the hub transport server is available, the file share witness will come online.  Most commonly the hub transport server is not available resulting in the file share witness failing, and remaining failed, on the second node.  If left alone for 60 minutes, the resource will be automatically restarted and by this time the hub transport server will be available. 

 

In my experience the issue arises during that 60 minutes.  It is possible that a reboot or loss of a cluster node could cause the cluster to lose quorum.  If the file share witness resource is failed, and I reboot the node not owning the cluster core resources, only one vote remains in the cluster.  Since cluster requires a majority of votes to be available, this results in the termination of the cluster service on the remaining node.
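The vote arithmetic for a two node CCR cluster with a file share witness (three votes in total) can be sketched in a few lines of Python.  This is only an illustration of the majority rule described above; the function names are my own.

```python
TOTAL_VOTES = 3  # two cluster nodes plus the file share witness


def votes_online(nodes_online, witness_usable):
    """Count available votes: one per running node, one for a usable witness."""
    return nodes_online + (1 if witness_usable else 0)


def has_quorum(available_votes, total_votes=TOTAL_VOTES):
    """Quorum requires a strict majority of all votes."""
    return available_votes > total_votes // 2


# Both nodes up, witness failed: 2 of 3 votes, cluster stays up.
print(has_quorum(votes_online(2, witness_usable=False)))
# One node down, witness healthy: 2 of 3 votes, cluster stays up.
print(has_quorum(votes_online(1, witness_usable=True)))
# One node down while the witness resource is failed: 1 of 3 votes, quorum lost.
print(has_quorum(votes_online(1, witness_usable=False)))
```

The last case is the one that matters during the 60 minute window: a failed witness resource counts as an unusable vote even if the share itself has come back.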

 

It is important that administrators understand this difference between Windows 2003 and Windows 2008, and account for it in how they manage their cluster.

 

What can be done to alleviate this condition?

 

As you have already read, cluster automatically attempts to restart failed resources every 60 minutes.  This interval is the default setting for all resources.  The rationale here is that if a resource is failing, the administrator should have the opportunity to troubleshoot, identify, and correct the issue causing the resource to fail.  From a monitoring standpoint, here is the entire process:  monitoring software bubbles up the alert, helpdesk notifies the admin, the admin accesses the machine, the admin reviews the logs, and the admin takes appropriate action(s).  On the other hand, this process may not work because the admin may be unavailable, in which case the cluster will still try to self heal if no administrator intervention is taken.  So essentially the first method of alleviating this condition is to understand the defaults, how the solution operates, and what to look for before rebooting nodes (all cluster core resources healthy), and to make no configuration changes.

 

On the properties of each resource is the retry interval for restarting failed resources.  The minimum value allowed here is 15 minutes.  This would cause the cluster, in the event that the resource is failed, to be more aggressive in attempting to bring the resource online.  From an Exchange 2-node perspective, this would limit the failure window to 15 minutes versus 60 minutes (assuming the witness is available after 15 minutes).  To change this value:

 

1)  Open the Failover Cluster Manager and connect to the cluster.

2)  Select the cluster name at the top of the left hand pane.

 


 

3)  In the center pane of the MMC, expand Cluster Core Resources.

 


 

4)  Get the properties of the File Share Witness (\\Path\Share) resource.

5)  Select the Policies tab on the resource.

 


 

6)  On the Policies tab, you will see the option "If all the restart attempts fail, begin restarting again after the specified period (hh:mm)" with a default of 01:00 (1 hour / 60 minutes).  Here you could adjust this value to 15 minutes using the input box.  If a change is made, select apply -> ok to exit properties.

 


 

Another method would be to issue an online command to the group, for example through a script.  After the reboot, a command similar to the following could be issued from a server with Failover Cluster Manager installed:  cluster.exe "cluster name" group "Cluster Group" /online.  (Note:  "Cluster Group" is the name of the group holding the cluster core resources; only the cluster name has to be changed, to the FQDN of the cluster management name.)  For example, cluster.exe 2008-Cluster3.exchange.msft group "Cluster Group" /online.  If this command is run manually, the following output is returned in the command window:

 

Bringing resource group ‘Cluster Group’ online…

Group                Node            Status
-------------------- --------------- ------
Cluster Group        2008-Node5      Online
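If you want to wrap the command above in a script, one possible shape is shown below in Python.  This is a sketch under assumptions: it only assembles the same cluster.exe command line shown above and hands it to the operating system, it assumes cluster.exe is on the path of the machine running it, and the function names are hypothetical.  It makes no claim about exit codes, so the caller should inspect the captured output.

```python
import subprocess


def build_online_command(cluster_fqdn, group_name="Cluster Group"):
    """Assemble: cluster.exe <cluster fqdn> group "<group name>" /online"""
    return ["cluster.exe", cluster_fqdn, "group", group_name, "/online"]


def bring_group_online(cluster_fqdn, group_name="Cluster Group"):
    """Run the command and return the completed process for inspection."""
    return subprocess.run(
        build_online_command(cluster_fqdn, group_name),
        capture_output=True,
        text=True,
    )


# Show the command that would be run for the example cluster in this post.
print(build_online_command("2008-Cluster3.exchange.msft"))
```

Passing the arguments as a list avoids quoting problems with the space in "Cluster Group".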

 

DO NOT use any other distributed method, such as distributed file systems, to host the file share witness.

 

What will be done to correct this condition?

 

We have worked with the Windows Product Team to bring this behavior to their attention.  We are working on a possible design change in the behavior of the file share witness on Windows 2008.  As this progresses I will continue to update this blog with more information.

 

Relevant Event IDs…

 

  • System Log – Event ID 1562 – Indicates that the file share witness is unavailable.

 

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          1/7/2009 8:07:24 AM
Event ID:      1562
Task Category: File Share Witness Resource
Level:         Warning
Keywords:     
User:          SYSTEM
Computer:      2008-Node1.exchange.msft
Description:
File share witness resource ‘File Share Witness (\\2008-dc1\MNS_FSW_2008-Cluster1)’ failed a periodic health check on file share ‘\\2008-dc1\MNS_FSW_2008-Cluster1’. Please ensure that file share ‘\\2008-dc1\MNS_FSW_2008-Cluster1’ exists and is accessible by the cluster.

 

  • System Log – Event ID 1069 – Indicates the file share witness resource is in failed state.

 

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          1/7/2009 8:07:24 AM
Event ID:      1069
Task Category: Resource Control Manager
Level:         Error
Keywords:     
User:          SYSTEM
Computer:      2008-Node1.exchange.msft
Description:
Cluster resource ‘File Share Witness (\\2008-dc1\MNS_FSW_2008-Cluster1)’ in clustered service or application ‘Cluster Group’ failed.

 

  • System Log – Event ID 1564 – Indicates that the cluster cannot access the file share witness directory.

 

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          1/7/2009 8:07:25 AM
Event ID:      1564
Task Category: File Share Witness Resource
Level:         Critical
Keywords:     
User:          SYSTEM
Computer:      2008-Node1.exchange.msft
Description:
File share witness resource ‘File Share Witness (\\2008-dc1\MNS_FSW_2008-Cluster1)’ failed to arbitrate for the file share ‘\\2008-dc1\MNS_FSW_2008-Cluster1’. Please ensure that file share ‘\\2008-dc1\MNS_FSW_2008-Cluster1’ exists and is accessible by the cluster.

 

  • System Log – Event ID 1205 – Indicates that the cluster core resources group ("Cluster Group") is not completely online or offline due to a failure of the file share witness resource.

 

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          1/7/2009 8:07:25 AM
Event ID:      1205
Task Category: Resource Control Manager
Level:         Error
Keywords:     
User:          SYSTEM
Computer:      2008-Node1.exchange.msft
Description:
The Cluster service failed to bring clustered service or application ‘Cluster Group’ completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered service or application.
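For monitoring purposes, the four events above can be kept in a small lookup table.  The Python sketch below simply restates the event IDs, levels, and meanings listed in this post; the names are hypothetical, and how event IDs are collected from the System log is left out.  Note that events 1069 and 1205 describe a resource or group failure in general, so the description text still needs to be checked for the file share witness.

```python
# Event ID -> (level, meaning), as listed in this post.
FSW_EVENTS = {
    1562: ("Warning", "file share witness failed a periodic health check"),
    1069: ("Error", "file share witness resource is in a failed state"),
    1564: ("Critical", "cluster cannot access the file share witness directory"),
    1205: ("Error", "Cluster Group is not completely online or offline"),
}


def describe_event(event_id):
    """Return 'Level: meaning' for a known event ID, or None if not listed."""
    entry = FSW_EVENTS.get(event_id)
    if entry is None:
        return None
    level, meaning = entry
    return f"{level}: {meaning}"


def relevant_events(event_ids):
    """Keep only the event IDs from the list above, in their original order."""
    return [e for e in event_ids if e in FSW_EVENTS]


print(describe_event(1564))
print(relevant_events([7036, 1562, 1069, 4624]))
```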

 

*Thanks to Chuck Timon, Sr Support Escalation Engineer, Platforms CSS for assisting in reviewing and modifying this information.

 

=======================================

Updated Wednesday – 08/19/09

Jeff Guillet – a Microsoft Windows MVP – has posted a sample batch file for starting the FSW and moving the cluster core resources group to a desired node.  The instructions also include how to schedule the batch file as a startup script.

http://www.expta.com/2009/06/failure-of-fsw-causes-cluster-group-to.html

=======================================

 

=======================================

Updated Wednesday 12/14/2011

I failed to update this blog post previously with a Windows hotfix that corrects this behavior and makes the workarounds unnecessary.

http://blogs.technet.com/b/timmcmic/archive/2010/02/15/kb978790-update-to-windows-2008-to-change-the-failure-behavior-of-the-file-share-witness-quorum-resource.aspx

=======================================

The Windows Cluster service encountered an error during function OpenCluster.

When users attempt to use Exchange Management Shell (PowerShell) to perform management tasks against an Exchange 2007 SP1 Cluster on Windows 2008, certain functions return "The Windows Cluster service encountered an error during function OpenCluster".

 

In the following example I am using a new account I created in Active Directory.  This account is a member of Domain Administrators and a member of Exchange Organization Administrators.  The user account is logging on to a machine where the Exchange 2007 SP1 RU5 management tools are installed, as well as the Windows 2008 Failover Cluster Management tool.

 

Using the account, I attempt to create a mailbox on the clustered exchange instance.  Below is a copy of the verbose output of this command:

 

[PS] C:\Windows\System32>New-Mailbox -Name Test -Database 2008-MBX3\2008-MBX3-SG1-DB1 -UserPrincipalName test@exchange.msft -Verbose

cmdlet New-Mailbox at command pipeline position 1
Supply values for the following parameters:
Password:

VERBOSE: New-Mailbox : Beginning processing.
VERBOSE: New-Mailbox : Searching objects "exchange.msft/Users" of type "ExchangeOrganizationalUnit" under the root "$null".
VERBOSE: New-Mailbox : Previous operation run on domain controller ‘2008-DC1.exchange.msft’.
VERBOSE: New-Mailbox : Searching objects "2008-MBX3\2008-MBX3-SG1-DB1" of type "MailboxDatabase" under the root "$null".
VERBOSE: New-Mailbox : Previous operation run on domain controller ‘2008-DC1.exchange.msft’.
VERBOSE: New-Mailbox : Administrator Active Directory session settings are: View Entire Forest: ‘False’, Default Scope: ‘exchange.msft’, Configuration Domain Controller: ‘2008-DC1.exchange.msft’,
VERBOSE: New-Mailbox : Searching objects of type "ADRecipient" with filter "(|((SamAccountName Equal test)))", scope "SubTree" under the root "exchange.msft".
VERBOSE: New-Mailbox : Previous operation run on domain controller ‘2008-DC1.exchange.msft’.
VERBOSE: New-Mailbox : Searching objects of type "ADRecipient" with filter "(|((Alias Equal test)))", scope "SubTree" under the root "$null".
VERBOSE: New-Mailbox : Previous operation run on global catalog server ‘2008-DC2.exchange.msft’.
VERBOSE: New-Mailbox : Applying RUS policy to the given recipient "exchange.msft/Users/Test" with the home domain controller "$null".
New-Mailbox : The Windows Cluster service encountered an error during function OpenCluster:.
At line:1 char:12
+ New-Mailbox  <<<< -Name Test -Database 2008-MBX3\2008-MBX3-SG1-DB1 -UserPrincipalName test@exchange.msft -Verbose
VERBOSE: New-Mailbox : Ending processing.

 

The information here is not overly helpful in actually determining the reason why the open cluster failed.  To further determine the reason we can dump the message associated with the exception.

 

To get the exception information, type the following in the Exchange Management Shell:  $error[0].exception.stacktrace.  The output of this command is:

 

[PS] C:\Windows\System32>$error[0].exception.stacktrace
   at Microsoft.Exchange.Common.ExCluster.GetActiveCmsOnNode(String nodeName)
   at Microsoft.Exchange.Data.Directory.NativeHelpers.GetLocalComputerFqdn(Boolean throwOnException)
   at Microsoft.Exchange.Data.Directory.SystemConfiguration.ADSystemConfigurationSession.ReadLocalServer()
   at Microsoft.Exchange.Data.Directory.Recipient.RecipientUpdateService.FindE12RusServer()
   at Microsoft.Exchange.Data.Directory.Recipient.RecipientUpdateService.LocateServer()
   at Microsoft.Exchange.Configuration.Tasks.RecipientTaskHelper.ApplyRusPolicy(RecipientUpdateService rus, ADRecipientSession recipientSession, ADRecipient recipient, PolicyType[] policyTypes, TaskVerboseLoggingDelegate logHandler, TaskWarningLoggingDelegate writeWarning)
   at Microsoft.Exchange.Configuration.Tasks.RecipientTaskHelper.ApplyRusPolicy(ADSystemConfigurationSession configurationSession, ADRecipientSession recipientSession, ADRecipient recipient, Fqdn domainController, String serverName, TaskVerboseLoggingDelegate logHandler, TaskWarningLoggingDelegate writeWarning)
   at Microsoft.Exchange.Configuration.Tasks.RecipientTaskHelper.ApplyRusPolicy(ADSystemConfigurationSession configurationSession, ADRecipientSession recipientSession, ADRecipient recipient, Fqdn domainController, String serverName, TaskVerboseLoggingDelegate logHandler, TaskErrorLoggingDelegate writeError, TaskErrorLoggingDelegate throwTerminatingError, TaskWarningLoggingDelegate writeWarning)
   at Microsoft.Exchange.Configuration.Tasks.NewRecipientObjectTask`1.PrepareDataObject()
   at Microsoft.Exchange.Configuration.Tasks.SetTaskBase`1.InternalValidate()
   at Microsoft.Exchange.Management.RecipientTasks.NewUserBase.InternalValidate()
   at Microsoft.Exchange.Configuration.Tasks.Task.ProcessRecord()

 

The stack information, although helpful, is still not showing us exactly why we were unable to complete our open cluster call.  To further determine the reason we can dump the inner exception. 

 

To get the inner exception information, type the following in the Exchange Management Shell:  $error[0].exception.innerexception.  The output of this command is:

 

[PS] C:\Windows\System32>$error[0].exception.innerexception
Access is denied

 

Between the stack trace above and the inner exception, we can reasonably determine that we are receiving access denied when we attempt to open a connection to the cluster.  (Note:  It did require slightly more code review than just looking at the above to determine this…but for now we’ll agree that that’s what this information says).

 

On the surface there does not appear to be any reason to receive an access denied.  The account in question is a Domain Administrator and an Exchange Organization Administrator.  One important note, though, is that the account is NOT the built-in administrator.  Since the user is a domain administrator, domain administrators are members of local administrators, and local administrators have full control of a Windows 2008 cluster for management purposes, it appears we should not be receiving an access denied.  To view the permissions assigned to a cluster:

 

  • Open Failover Cluster Management.  If necessary, select manage a cluster from the right task pane and connect to a cluster.
  • Right click on the cluster name in the left task pane, select properties.

 


 

  • Select the cluster permissions tab.

 


 

The reason that the user account is receiving access denied is User Account Control (UAC).  Earlier in the post I pointed out that the account we were using was not the built-in administrator account.  When we use the built-in administrator account, UAC is disabled by default.  Without UAC enabled, administrator groups are included in the user's security token.  In this case we are using a Domain Administrator that is not the built-in administrator.  UAC is enabled by default for this user, so administrator groups are not automatically added to the user's security token (unless running elevated).  Therefore, when this user makes the open cluster call, the cluster returns access denied.
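A script that wraps cluster-aware tasks can check up front whether it is running elevated, rather than failing later with access denied.  The Python sketch below is one way to do that and is only an illustration; the function name is my own.  On Windows it calls the shell32 function IsUserAnAdmin, which reports whether the current token carries administrator membership.  The fallback branch for other platforms exists only so the sketch runs anywhere.

```python
import ctypes
import os
import sys


def is_elevated() -> bool:
    """Return True when the current process has administrator rights."""
    if sys.platform == "win32":
        # Reports False for a UAC-filtered (non-elevated) administrator token.
        return bool(ctypes.windll.shell32.IsUserAnAdmin())
    # Non-Windows fallback so the sketch is runnable elsewhere.
    return os.geteuid() == 0


if is_elevated():
    print("Running elevated; cluster calls should not be blocked by UAC.")
else:
    print("Not elevated; relaunch the shell with Run As Administrator.")
```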

 

In order for this account to be able to complete cluster functions, they will have to run Exchange Management Shell elevated.  There are two ways to handle this:

 

  • Right click on the Exchange Management Shell shortcut and select Run As Administrator.

 


 

  • Get the properties of the Exchange Management Shell shortcut.  On the Shortcut tab, select the Advanced button.

 

 


 

  • Select the "Run as administrator" checkbox and press the OK button.  (Note that you will be prompted for administrator rights and to save the changes.)

 


 

The next time the Exchange Management Shell is launched, you will receive a dialog from User Account Control requesting permission to continue.  Select continue to proceed.

 


 

Attempt to run the same command as before.  This time the command should end successfully.

 

[PS] C:\Windows\System32>New-Mailbox -Name Test -Database 2008-MBX3\2008-MBX3-SG1-DB1 -UserPrincipalName test@exchange.msft -Verbose

cmdlet New-Mailbox at command pipeline position 1
Supply values for the following parameters:
Password:

VERBOSE: New-Mailbox : Beginning processing.
VERBOSE: New-Mailbox : Searching objects "exchange.msft/Users" of type "ExchangeOrganizationalUnit" under the root "$null".
VERBOSE: New-Mailbox : Previous operation run on domain controller ‘2008-DC1.exchange.msft’.
VERBOSE: New-Mailbox : Searching objects "2008-MBX3\2008-MBX3-SG1-DB1" of type "MailboxDatabase" under the root "$null".
VERBOSE: New-Mailbox : Previous operation run on domain controller ‘2008-DC1.exchange.msft’.
VERBOSE: New-Mailbox : Administrator Active Directory session settings are:  View Entire Forest: ‘False’, Default Scope: ‘exchange.msft’, Configuration Domain Controller: ‘2008-DC1.exchange.msft’,
VERBOSE: New-Mailbox : Searching objects of type "ADRecipient" with filter "(|((SamAccountName Equal test)))", scope "SubTree" under the root "exchange.msft".
VERBOSE: New-Mailbox : Previous operation run on domain controller ‘2008-DC1.exchange.msft’.
VERBOSE: New-Mailbox : Searching objects of type "ADRecipient" with filter "(|((Alias Equal test)))", scope "SubTree" under the root "$null".
VERBOSE: New-Mailbox : Previous operation run on global catalog server ‘2008-DC2.exchange.msft’.
VERBOSE: New-Mailbox : Applying RUS policy to the given recipient "exchange.msft/Users/Test" with the home domain controller "$null".
VERBOSE: New-Mailbox : The RUS server that will apply policies on the specified recipient is "2008-MBX3.exchange.msft".
VERBOSE: New-Mailbox : Processing object "exchange.msft/Users/Test".
VERBOSE: New-Mailbox : Searching objects of type "ADRecipient" with filter "(&((|((UserPrincipalName Equal test@exchange.msft)))(Id NotEqual exchange.msft/Users/Test)))", scope "SubTree" under the root "$null".
VERBOSE: New-Mailbox : Previous operation run on global catalog server ‘2008-DC2.exchange.msft’.
VERBOSE: New-Mailbox : Searching objects of type "ADRecipient" with filter "(&((|((SamAccountName Equal test)))(Id NotEqual exchange.msft/Users/Test)))", scope "SubTree" under the root "exchange.msft".
VERBOSE: New-Mailbox : Previous operation run on domain controller ‘2008-DC1.exchange.msft’.
VERBOSE: Creating Mailbox "Test" on Database "2008-MBX3\2008-MBX3-SG1-DB1" with UserPrincipalName "test@exchange.msft", Organizational Unit "exchange.msft/Users".
VERBOSE: New-Mailbox : Saving object "exchange.msft/Users/Test" of type "ADUser" and state "New".
VERBOSE: New-Mailbox : Previous operation run on domain controller ‘2008-DC1.exchange.msft’.
VERBOSE: New-Mailbox : Setting password for the created user "exchange.msft/Users/Test".
VERBOSE: New-Mailbox : The properties changed are: "{ PasswordLastSetRaw=’-1′, UserAccountControl=’NormalAccount’ }".
VERBOSE: New-Mailbox : Saving object "exchange.msft/Users/Test" of type "ADUser" and state "Changed".
VERBOSE: New-Mailbox : Previous operation run on domain controller ‘2008-DC1.exchange.msft’.
VERBOSE: New-Mailbox : Reading new object "exchange.msft/Users/Test" of type "ADUser".
VERBOSE: New-Mailbox : Previous operation run on domain controller ‘2008-DC1.exchange.msft’.

Name                      Alias                ServerName       ProhibitSendQuota
----                      -----                ----------       -----------------
Test                      test                 2008-mbx3        unlimited
     

VERBOSE: New-Mailbox : Ending processing.

 

(I want to thank Brad Hughes and Ben Winzenz for their contributions to this post.)

Exchange Replication Service (Exchange 2007 SP1) and Windows 2008 Clusters

If you are reading this blog post I also suggest you read the following white paper:  "White Paper:  Continuous Replication Deep Dive", which can be found at http://technet.microsoft.com/en-us/library/cc535020.aspx.  This white paper does a fantastic job of covering the operations of the replication service.

 

Windows 2008 and Windows 2008 clustering introduce changes to the operating system which affect the operation of the Exchange Replication Service.  Specifically, there are two items of interest.  The first is the ability to programmatically create a share scoped to a specific host name.  The second is the integration of file sharing operations with the cluster service when shared storage is used.  (These same concepts also affect Exchange Offline Address List Generation; see Dave Goldman’s blog for more info on these topics at http://blogs.msdn.com/dgoldman/archive/2008/12/11/fix-for-oab-generation-failing-on-ccr-and-scc-clusters.aspx)

 

Let’s take a look at how this affects Exchange 2007 SP1 Cluster Continuous Replication (CCR) clusters.

 

With Exchange 2007 SP1 CCR clusters there are no shared storage devices.  File shares that are created, which are necessary for the replication service to replicate log files, are created on the host.

 

In Windows 2003 clustering, when a file share was created on a local node it was accessible by any name that resolved to that node.  (If the name did not exist as either the node name or a virtual network name in the cluster, it would be necessary to disable strict name checking in order for the host to accept the request, see http://support.microsoft.com/kb/281308.  This could happen when using a hosts file to map a different name or when using a CNAME DNS record to map to a host.)  For example, if I create a share named Files on the host, I could access the share at \\NodeName\Files, or at \\CMSName\Files if the CMS (clustered mailbox server) was active on that node.  In addition, if I used enable-continuousreplicationhostname to create replication networks to replicate logs on an alternate network interface, I could also access the share at \\ContinuousReplicationHostName\Files.

 

In the example below on Windows 2003, on the root of C, I created a share called Files and placed two text files in the share.  There is a cluster network name that runs on the node where I created the share.  From another machine in the environment, you can see that the share can be accessed at both \\NodeName and \\CMSName.

 

image 

image

 

In Windows 2008, if I created the same share on the local node, it would only be accessible at \\NodeName\Files.  If attempting to access the share at \\CMSName\Files, an error is displayed.  (In this example the node name is 2008-Node1 and the CMS name is 2008-MBX3.)

 

image

 

image

 

On Windows 2008, in Server Manager -> Roles -> File Services -> Share and Storage Management, the Files share appears in the list of available shares.

image

 

If you get the properties of the share, you can see that it’s specifically scoped to the Node Name (review the share path).

image

 

In a default configuration for CCR, scoped shares do not have any impact on how the replication service replicates log files.  The main impact comes when using continuous replication host names.  In order for this to function, the replication service has to specifically create the share and have it scoped at both the NodeName and CMSName.

 

When using continuous replication hostnames, if you review the Share and Storage Management console, you will see two (or more depending on how many continuous replication host names exist) instances of the shares for each storage group. 

 

image

 

Each instance of the share uses the same share name and the same physical path on storage.  The difference is in the properties of each of these shares.  When reviewing the properties, you will see one share specifically scoped to the node name, the other share specifically scoped to the continuous replication host name.  (In this example 2008-MBX3-ReplC is the continuous replication host name.)

 

image
*Example of share scoped to continuous replication host name.

 

image

*Example of share scoped to node name.

 

By creating scoped shares at both endpoints, the replication service is able to access logs using both the NodeName and ContinuousReplicationHostName.
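The difference between unscoped and scoped shares can be sketched with a small lookup table.  This is a toy model only, not the operating system's implementation: the `ShareTable` class and its methods are hypothetical, and the `C:\Files` path simply mirrors the Files example used earlier.

```python
class ShareTable:
    """Toy model of a host's share list, keyed by (scope name, share name)."""

    ANY = "*"  # unscoped: answers under any name that resolves to the host

    def __init__(self):
        self._shares = {}

    def add(self, share, path, scope=ANY):
        """Register a share, optionally scoped to a single server name."""
        self._shares[(scope.lower(), share.lower())] = path

    def resolve(self, unc):
        """Return the local path behind \\\\Host\\Share, or None if not reachable."""
        parts = unc.lstrip("\\").split("\\")
        if len(parts) < 2:
            raise ValueError("expected a UNC path of the form \\\\Host\\Share")
        host, share = parts[0].lower(), parts[1].lower()
        # A share scoped to this exact name wins; otherwise try an unscoped share.
        if (host, share) in self._shares:
            return self._shares[(host, share)]
        return self._shares.get((self.ANY, share))


# Windows 2003 style: one unscoped share, reachable under every name.
w2003 = ShareTable()
w2003.add("Files", r"C:\Files")

# Windows 2008 style: the share is scoped to the node name only ...
w2008 = ShareTable()
w2008.add("Files", r"C:\Files", scope="2008-Node1")
print(w2008.resolve(r"\\2008-Node1\Files"))
print(w2008.resolve(r"\\2008-MBX3-ReplC\Files"))

# ... until a second instance is scoped to the continuous replication host name.
w2008.add("Files", r"C:\Files", scope="2008-MBX3-ReplC")
print(w2008.resolve(r"\\2008-MBX3-ReplC\Files"))
```

Both scoped instances share the same name and the same physical path, which matches what Share and Storage Management displays; only the scope differs.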

 

Let’s take a look at how this affects Exchange 2007 SP1 Single Copy Clusters (SCC).

 

With Exchange 2007 SP1 SCC, all databases and log files reside on a shared storage device.  This requires that the shares necessary for replication be created against folders that exist on shared volumes.  In this instance, there is no local replication activity or replication between nodes of the cluster.  The replication service is only used against a single copy cluster when the SCC cluster is acting as a standby continuous replication source.

 

With Windows 2003 the replication service would create the shares necessary for SCR replication to occur.  These shares, by default, were available at both \\NodeName and \\CMSName.  There was no tight integration between the sharing functions of the operating system and cluster where shared storage was concerned.  (Remember that an SCR target replicates log files from an SCC SCR source at \\CMSName\StorageGroupGuid$.)

 

In Windows 2008, the cluster service is more aware of shares created on shared storage.  By default, when a share is created on a shared disk, the cluster service will automatically intercept that share and scope it only to the virtual name associated with the client access point that owns the disk.  This happens when manually creating a share through the operating system or programmatically creating a share (regardless of the endpoint passed into the sharing function).  Let’s take a look at an example.

 

In this example I created an empty service or application.  In the empty service or application, I created a new client access point.  I created a new shared disk on my SAN, added the disk to available storage, and then moved it to the client access point.  In my case the disk is the H volume (Cluster Disk 9).

 

image

 

On the node that owns the Empty-Group with the client access point and disk, through the operating system, I created a folder on the H drive and shared it.  After completing the sharing wizard, you can see the share is immediately scoped to the name used in the client access point.  In this example my client access point name is EMPTY-GROUP, so the share is available at \\Empty-Group\Files.

 

image

 

When reviewing Share and Storage Management, you will see the share.  Its properties also show the share scoped to the client access point name.

image

 

image

 

In addition to the above information, in the cluster administrator, in the service or application that I created, a new FileServer resource is created.  The file server resource is dependent on the physical disk that the share resides on and on the client access point name.  The share that was created is also viewable in the cluster administrator.

 

image

 

How is this handled programmatically?

 

A common method to create shares was the SHARE_INFO_502 structure (http://msdn.microsoft.com/en-us/library/bb525410(VS.85).aspx).  When this structure is used on a Windows 2008 cluster, the share is automatically created against the node name (as long as the shared folder does not reside on a shared disk).  If the folder that is being shared resides on a shared disk, cluster automatically intercepts the sharing request and scopes the share to the client access point that owns the shared disk resource.

 

A new sharing method was introduced with Windows 2008.  This is the SHARE_INFO_503 structure (http://msdn.microsoft.com/en-us/library/cc462916(VS.85).aspx).  This structure allows the programmer to specify the server name as part of the sharing call.  Here is an excerpt from the MSDN page.

shi503_servername

A pointer to a string that specifies the DNS or NetBIOS name of the remote server on which the shared resource resides. A value of "*" indicates no configured server name.

When using this sharing structure, the programmer can specify to create the share against the node name, the CMS name (in the Exchange case), or both.  The only exception is when the folder to be shared resides on a shared disk.  Cluster will intercept this sharing call and allow the share to only be scoped to the client access point that owns the physical disk resource.
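The scoping rules described in this section can be summarized in a few lines.  This is a hypothetical sketch of the decision only; it does not call any Windows API, and the function name, its parameters, and the drive-letter lookup are all assumptions made for illustration.

```python
def effective_scope(path, node_name, shared_disks, requested_servername=None):
    """Return the name a new share ends up scoped to (toy model).

    path                 -- local folder being shared, e.g. r"H:\\Files"
    node_name            -- name of the node making the sharing call
    shared_disks         -- dict of drive letter -> client access point name
    requested_servername -- server name passed by the caller (the 503-style
                            call); None models a 502-style call, which has
                            no server name field
    """
    drive = path[:1].upper()
    owner = shared_disks.get(drive)
    if owner is not None:
        # Folder lives on a shared disk: cluster intercepts the call and
        # scopes the share to the client access point that owns the disk.
        return owner
    # Local disk: honor the requested name, or fall back to the node name.
    return requested_servername or node_name


disks = {"H": "EMPTY-GROUP"}
print(effective_scope(r"C:\Files", "2008-Node1", disks))
print(effective_scope(r"C:\Files", "2008-Node1", disks, "2008-MBX3"))
print(effective_scope(r"H:\Files", "2008-Node1", disks, "2008-Node1"))
```

The last call shows the exception: the caller asked for the node name, but because H is a shared disk the share is scoped to the client access point instead.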

 

*Note:  When creating shares on a shared storage device in Windows 2008 you should install KB 955733.

 

Exchange 2007 SP1 RU5 – Error regarding replication between computers in different domains when using standby continuous replication (SCR).

Exchange 2007 SP1 RU5 introduces an error into the enable-storagegroupcopy -standbymachine commandlet when attempting to enable SCR on a storage group.

The error is only present when there is a parent domain -> child domain active directory domain structure.  The issue does not occur if the active directory domain / forest is flat.

When attempting to use the enable-storagegroupcopy -standbymachine <scrTarget> commandlet, the following error is returned:

Enable-StorageGroupCopy:  Standby continuous replication is not supported between computers in different Active Directory domains.  The target node is in domain <child.parent.com>, which is different from the source domain <parent.com>.

This error is correct if the machines are actually in different domains, as SCR targets must be in the same domain as their sources.  In this case the error is being thrown incorrectly when the machines are actually members of the same domain.
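As a quick sanity check before troubleshooting further, the domain portion of the source and target names can be compared.  The sketch below is an approximation under a stated assumption: it treats the DNS suffix of each fully qualified name as the machine's Active Directory domain, which is usually but not always the case.  The function names and host names are hypothetical.

```python
def dns_domain(fqdn):
    """Return everything after the first label of a fully qualified name."""
    _host, _dot, domain = fqdn.rstrip(".").partition(".")
    return domain.lower()


def same_domain(source_fqdn, target_fqdn):
    """True when both names carry the same, non-empty DNS suffix."""
    source = dns_domain(source_fqdn)
    return source != "" and source == dns_domain(target_fqdn)


# Hypothetical host names in the parent/child layout described above.
print(same_domain("mbx1.child.parent.com", "scr1.child.parent.com"))
print(same_domain("mbx1.parent.com", "scr1.child.parent.com"))
```

If the two machines really are in the same domain and the cmdlet still returns the error, the RU5 issue described here is the likely cause.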

To correct this issue:

1)  Verify that all machines are actually members of the same domain.

 

2)  From the machine where the commandlet is being run, uninstall RU5.  (Note:  If any of the machines are involved in replication, and RU5 is removed, it needs to be reinstalled after replication is established.  All machines involved in replication should run the same release rollup update).

 

or

 

3)  From a machine with just the management tools installed, with any RU prior to RU5, run the enable-storagegroupcopy -standbymachine commandlet.  (This would be preferred as it does not involve uninstalling and re-installing any already applied RUs.)  If any of the servers involved are cluster servers, the command must be run from a management tools workstation that is the same version.  For example, if Windows 2008 is the operating system hosting Exchange clustering, then the tools workstation must also run Windows 2008 with the Exchange 2007 SP1 management tools and the Windows 2008 failover cluster manager.

 

The issue is scheduled to be corrected in Exchange 2007 SP1 RU8.

 

***UPDATE:  As of today this has been rescheduled for Exchange 2007 SP1 RU7***

Windows 2008 / Exchange 2007 SP1 – ESE 522 errors on CCR passive or SCR target machine.

There are some users that are experiencing errors on CCR (cluster continuous replication) passive nodes or SCR (standby continuous replication) target machines. 

Here is an example error:

Event ID     : 522
Raw Event ID : 522
Record Nr.   : 3965
Category     : General
Source       : ESE
Type         : Error
Generated    : 6/4/2008 4:48:58 PM
Written      : 6/4/2008 4:48:58 PM
Machine      : server.domain.com
Message      : Microsoft.Exchange.Cluster.ReplayService (7012) Log Verifier e0a 31573001: An attempt to open the device name "\\source\share$" containing "\\source\share$" failed with system error 5 (0x00000005): "Access is denied. ".  The operation will fail with error -1032 (0xfffffbf8).

For more information, click http://www.microsoft.com/contentredirect.asp.

The error -1032 (0xfffffbf8) translates to JET_errFileAccessDenied.
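The event shows the same error in two forms, a signed decimal value and a 32-bit hexadecimal value.  A minimal sketch of the conversion between the two, using hypothetical helper names, is:

```python
def to_hex32(value):
    """Format a signed 32-bit error code as the hex form shown in the event."""
    return f"0x{value & 0xFFFFFFFF:08x}"


def to_signed32(value):
    """Interpret a 32-bit hex error code as a signed integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


print(to_hex32(-1032))          # ESE error from the event
print(to_signed32(0xFFFFFBF8))  # back to the decimal form
print(to_hex32(5))              # Win32 system error 5
```

This is handy when searching for an error, since some references list the decimal value and others the hexadecimal value.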

The errors most commonly occur when:

1)  Replication is paused and resumed automatically to accommodate a backup operation.

2)  Replication is paused and resumed using suspend-storagegroupcopy and resume-storagegroupcopy (or through the GUI equivalents in the Exchange Management Console).

3)  The replication service is restarted or the entire node / target is restarted.

4)  Databases are mounted and dismounted using dismount-mailboxdatabase and mount-mailboxdatabase (or through the GUI equivalents in Exchange Management Console).

5)  The CMS (clustered mailbox server) is stopped and restarted using stop-clusteredmailboxserver and start-clusteredmailboxserver (or through the GUI equivalents in Exchange Management Console).

If you manually review the log file directories on the passive or target machines they are relatively in sync with the source machine. 

Likewise, when using any replication health check including get-storagegroupcopystatus, get-storagegroupcopystatus -standbymachine, and test-replicationhealth no errors are displayed.

Here is an example get-storagegroupcopystatus on a two node Exchange 2007 SP1 / Windows 2008 CCR cluster.

 

Name            SummaryCopyStatus  CopyQueueLength  ReplayQueueLength  LastInspectedLogTime
----            -----------------  ---------------  -----------------  --------------------
2008-MBX3-SG1   Healthy            0                0                  12/21/200…
2008-MBX3-SG2   Healthy            0                0                  12/21/200…

 

Here is an example of test-replicationHealth

                         

Server      Check                   Result  Error
------      -----                   ------  -----
2008-NODE1  PassiveNodeUp           Passed
2008-NODE1  ClusterNetwork          Passed
2008-NODE1  QuorumGroup             Passed
2008-NODE1  FileShareQuorum         Passed
2008-NODE1  CmsGroup                Passed
2008-NODE1  NodePaused              Passed
2008-NODE1  DnsRegistrationStatus   Passed
2008-NODE1  ReplayService           Passed
2008-NODE1  DBMountedFailover       Passed
2008-NODE1  SGCopySuspended         Passed
2008-NODE1  SGCopyFailed            Passed
2008-NODE1  SGInitializing          Passed
2008-NODE1  SGCopyQueueLength       Passed
2008-NODE1  SGReplayQueueLength     Passed

 

The error is caused by an invalid function call from the replication service to the operating system.  This causes the operating system to respond with access denied, and the replication service to respond by logging the ESE 522 event.

This issue is scheduled to be corrected in Exchange 2007 SP1 RU7.  There is currently no incremental update for the issue.

The issue can be considered benign if:

1)  Visual inspection of the log file directories shows that replication is occurring.

2)  Get-storagegroupcopystatus or get-storagegroupcopystatus -standbymachine shows all storage groups in healthy state (some storage groups may be in an initializing state, in which case logs will need to be generated to determine status).

3)  Test-replicationhealth when run on the passive node or target server shows all tests passed.
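The three conditions above can be combined into a simple triage rule.  The sketch below is illustrative only: the function name, the result strings, and the input shapes are assumptions, and the inputs are expected to be collected by hand from the log directories, get-storagegroupcopystatus, and test-replicationhealth.

```python
def classify_522(logs_replicating, copy_status, health_checks):
    """Classify an ESE 522 event as benign, undetermined, or investigate.

    logs_replicating -- True if the log directories show replication occurring
    copy_status      -- dict of storage group name -> SummaryCopyStatus
    health_checks    -- dict of check name -> result from test-replicationhealth
    """
    if not logs_replicating:
        return "investigate"
    if any(result != "Passed" for result in health_checks.values()):
        return "investigate"
    states = set(copy_status.values())
    if not states:
        return "undetermined"
    if states <= {"Healthy"}:
        return "benign"
    if states <= {"Healthy", "Initializing"}:
        # Logs must be generated before an initializing copy shows its status.
        return "undetermined"
    return "investigate"


status = {"2008-MBX3-SG1": "Healthy", "2008-MBX3-SG2": "Healthy"}
checks = {"ReplayService": "Passed", "SGCopyFailed": "Passed"}
print(classify_522(True, status, checks))
```

With the sample output shown earlier in this post (all storage groups healthy and all checks passed), the rule classifies the event as benign.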