Monthly Archives: January 2009

Get-storagegroupcopystatus = Initializing

A question that I see a lot, and one that causes some confusion out there, is why get-storagegroupcopystatus reports a status of initializing for replicated storage groups.  This status can occur when using any form of replication (LCR = local continuous replication / CCR = cluster continuous replication / SCR = standby continuous replication).

 

Some history…

 

In Exchange 2007 RTM there was no initializing status.  When the replication service was restarted for any reason, the replication service would display a status of healthy for each storage group.  The issue here is that the storage groups may not have actually been healthy.  It was possible that two database copies had diverged, but this would not be detected until the first log on the source was generated, copied, inspected, and compared to the passive database instance.  While waiting for this to happen, the administrator was given a false sense that all was well, because all instances were assumed healthy by default.

 

An example of how this is not the most desired approach can be seen with CCR installations.  In this configuration we’ll assume there is a reason that the passive database copy was diverged from the active copy.  When the customer issues a move-clusteredmailboxserver, we are able to move the instance between the two nodes.  Since the replication service reported the instances as healthy, nothing prevented us from moving the resources between the two nodes.

 

The change…

 

In Exchange 2007 SP1 we made a change to the replication service so that it now displays a status of initializing.  What initializing is telling the administrator is that:

 

  • The replication service has not received a notification to copy a log.
  • The replication service has not yet copied a log.
  • The replication service has not yet inspected a log.
  • The replication service has not placed a log out for replay and determined divergence information.

 

Once the replication service has completed these steps, it changes the instance from initializing to either healthy or failed.  Here is an example of get-storagegroupcopystatus showing initializing status:

 

Name            SummaryCopyStatus  CopyQueueLength  ReplayQueueLength  LastInspectedLogTime
----            -----------------  ---------------  -----------------  --------------------
2008-MBX3-SG1   Initializing       0                0
2008-MBX3-SG2   Initializing       0                0
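
The four criteria listed earlier can be read as a simple gate: an instance stays in initializing until each step has happened at least once, and only then reports healthy or failed. The Python sketch below is purely illustrative. The class and field names are hypothetical and are not taken from the replication service itself, and a real instance can be marked failed for reasons other than divergence.

```python
from dataclasses import dataclass


@dataclass
class ReplicationInstance:
    """Hypothetical model of one storage group copy (names are illustrative)."""
    notified: bool = False            # received a notification to copy a log
    copied: bool = False              # copied at least one log
    inspected: bool = False           # inspected at least one log
    divergence_checked: bool = False  # placed a log for replay and checked divergence
    diverged: bool = False            # outcome of that divergence check

    def summary_status(self) -> str:
        steps = (self.notified, self.copied, self.inspected, self.divergence_checked)
        if not all(steps):
            # Any missing step keeps the instance in the initializing state.
            return "Initializing"
        return "Failed" if self.diverged else "Healthy"


fresh = ReplicationInstance()
print(fresh.summary_status())
```

A freshly restarted instance has completed none of the steps, which is why every storage group shows initializing right after the replication service starts.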

 

Where I most commonly hear this question is in test labs where activities like move-clusteredmailboxserver do not work.  When storage groups are in an initializing state, cmdlets report that the action cannot be taken because replication is failed.  It really is not about replication being failed but rather that replication is in a state other than healthy.  Other conditions where we see the initializing state are after VSS backups of the passive database copies, or when a replication instance is suspended and resumed.

 

Here is an example of a move-clusteredmailboxserver command results where storage group copy status is initializing:

 

Move-ClusteredMailboxServer : Continuous replication is in a failed, seeding, or suspended state on ‘2008-MBX3-SG1’. Move-clusteredMailboxServer cannot be performed if one or more of the server’s storage group copies are in failed, seeding or suspended states.
At line:1 char:27
+ Move-ClusteredMailboxServer <<<<
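
Because the move is refused whenever any copy is in a state other than healthy, it can help to check the statuses before attempting it. The sketch below assumes the status of each storage group has already been collected into a dictionary (how that is done is outside the scope of this example); the function name is hypothetical.

```python
def blocking_copies(statuses):
    """Return the storage groups whose copy status would block a move."""
    return sorted(
        name for name, status in statuses.items()
        if status.strip().lower() != "healthy"
    )


# Example input; values mirror the SummaryCopyStatus column.
statuses = {"2008-MBX3-SG1": "Initializing", "2008-MBX3-SG2": "Healthy"}
blocked = blocking_copies(statuses)
if blocked:
    print("Do not move yet; not healthy:", ", ".join(blocked))
else:
    print("All copies healthy.")
```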

 

To change replication to a healthy state I usually do one of two steps:

 

  1. Dismount and re-mount all databases -> this should hopefully cause log roll to occur and the replication service to replicate a log.
  2. Create test mailboxes in each store and send mail to them.  Mail flow will create log files.  If the mail flow is not significant enough to roll the logs, automatic log roll will occur at a later time even though the log file is not full.

 

Monitoring…

 

There are two reliable ways to monitor continuous replication status for initializing state. 

 

The first method is to run get-storagegroupcopystatus.

 

The second method is to use performance monitor and the "MSExchange Replication" performance object.  Under this object is a counter named "Initializing".  If the counter displays a value of 1 then the storage group instance is in the initializing state.  If the counter displays a value of 0 then the storage group is not in the initializing state.  Below is an example of performance monitor showing databases moving from Initializing to Healthy.

 

image

 

Information on monitoring continuous replication can be found here -> http://technet.microsoft.com/en-us/library/bb629521.aspx

Cluster stability issues with Exchange 2007 SP1 RU5 Single Copy Clusters (SCC) enabled for Standby Continuous Replication (SCR) on Windows 2008

In recent weeks I’ve worked several cases where Exchange 2007 SP1 Single Copy Clusters with storage groups enabled for Standby Continuous Replication on Windows 2008 have had performance and network connectivity issues.  These issues have manifested themselves with the following symptoms:

 

  • Outlook clients appear to have no issues connecting to or using the Exchange instances.
  • Access to the host machine console is slow.  RDP may connect unreliably.
  • Management tools may fail to connect; for example, event viewer and failover cluster manager will fail to connect.
  • Get-storagegroupcopystatus -standbymachine may report failed replication instances between source and target.
  • Review of share management under server management shows that SCR shares are being deleted and recreated on the source cluster.

 

In each case the following was common to all the situations I worked.

 

  • Operating system is Windows 2008
  • Source cluster is an Exchange 2007 SP1 RU5 Single Copy Cluster (SCC)
  • One or all storage groups are enabled for Standby Continuous Replication (SCR)
  • Network interface drivers were older than July 2008.
  • Network teaming was used on the public cluster interfaces and configured for Fault Tolerance with Load Balancing
  • If standby continuous replication was disabled on any enabled storage group, cluster stability could be maintained.

 

The following was performed to resolve the stability issues.  Note:  All steps are considered part of the solution.

 

1)  Upgrade network interface drivers to a revision July 2008 or newer.

 

2)  Reconfigure all network teams for Fault Tolerance Only.

 

3)  Install KB 955733 to all clustered nodes.

 

This KB article corrects known issues with status codes returned by the Windows operating system for certain function calls.  Note that when downloading the fix it is marked as Vista x64, although the fix is for Windows 2008 x64.

 

4)  Open a case with Microsoft CSS and request Exchange interim update KB957834.  (It is recommended that customers upgrade to Exchange 2007 SP1 RU5 and deploy the RU5 IU.)  The article for KB 955733, referenced in step 3, can be found at:  http://support.microsoft.com/kb/955733

 

This update is available for Exchange 2007 SP1 RU4 and Exchange 2007 SP1 RU5 at this time.  This update corrects issues with share endpoint checking for servicing SCR.  For more information regarding the replication service and shares see http://blogs.technet.com/timmcmic/archive/2008/12/23/exchange-replication-service-exchange-2007-sp1-and-windows-2008-clusters.aspx

Inconsistent results when enabling standby continuous replication (SCR) in Exchange 2007 SP1.

A common question that I receive from customers is why do I experience inconsistent results when I enable a storage group in Exchange 2007 SP1 for standby continuous replication.  Usually the conversation focuses on why replication instances initially show failed and then soon after go healthy, or why the replication service reports that databases are not configured for standby continuous replication even though the command was run and successful.

 

Standby continuous replication was introduced in Exchange 2007 SP1 as a way to replicate databases from any mailbox role source to an independent mailbox role target.  Most commonly customers implement this technology as part of a broader site resiliency plan.  Information regarding standby continuous replication can be found here:  http://technet.microsoft.com/en-us/library/bb676502.aspx.

 

The command that is used to enable standby continuous replication is enable-storagegroupcopy -identity <storagegroup> -standbymachine <target>.   More information on this cmdlet can be found here:  http://technet.microsoft.com/en-us/library/bb123684.aspx.

 

When the enable-storagegroupcopy command is used, an attribute on the storage group is updated in Active Directory.  The attribute is msExchStandbyCopyMachines.  This is a multi-valued attribute, reflecting that a database can be replicated to multiple SCR targets.  When the command is successfully run, the target name used is populated in the attribute, along with values representing TruncationLagTime and ReplayLagTime.  Here is a sample LDP dump of a storage group enabled for SCR.

 

========================================

Expanding base ‘CN=2008-MBX1-SG1,CN=InformationStore,CN=2008-MBX1,CN=Servers,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),CN=Administrative Groups,CN=Exchange,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=exchange,DC=msft’…
Getting 1 entries:
Dn: CN=2008-MBX1-SG1,CN=InformationStore,CN=2008-MBX1,CN=Servers,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),CN=Administrative Groups,CN=Exchange,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=exchange,DC=msft
cn: 2008-MBX1-SG1;
distinguishedName: CN=2008-MBX1-SG1,CN=InformationStore,CN=2008-MBX1,CN=Servers,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),CN=Administrative Groups,CN=Exchange,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=exchange,DC=msft;
dSCorePropagationData: 0x0 = (  );
instanceType: 0x4 = ( WRITE );
msExchESEParamBaseName: E00;
msExchESEParamCheckpointDepthMax: 20971520;
msExchESEParamCircularLog: 0;
msExchESEParamCommitDefault: 0;
msExchESEParamCopyLogFilePath: f:\2008-MBX1\2008-MBX1-SG1-Logs-LCR;
msExchESEParamCopySystemPath: e:\2008-MBX1\2008-MBX1-SG1-System-LCR;
msExchESEParamDbExtensionSize: 256;
msExchESEParamEnableIndexChecking: TRUE;
msExchESEParamEnableOnlineDefrag: TRUE;
msExchESEParamEventSource: MSExchangeIS;
msExchESEParamLogFilePath: d:\2008-MBX1\2008-MBX1-SG1-Logs;
msExchESEParamLogFileSize: 1024;
msExchESEParamPageFragment: 8;
msExchESEParamPageTempDBMin: 0;
msExchESEParamSystemPath: d:\2008-MBX1\2008-MBX1-SG1-System;
msExchESEParamZeroDatabaseDuringBackup: 0;
msExchHasLocalCopy: 1;
msExchMinAdminVersion: -2147453113;
msExchStandbyCopyMachines: 2008-MBX2.exchange.msft;1;1.00:00:00;00:00:00;
msExchVersion: 4535486012416;
name: 2008-MBX1-SG1;
objectCategory: CN=ms-Exch-Storage-Group,CN=Schema,CN=Configuration,DC=exchange,DC=msft;
objectClass (3): top; container; msExchStorageGroup;
objectGUID: 7dd4c453-9052-43c6-9e18-845f8e616520;
showInAdvancedViewOnly: TRUE;
systemFlags: 0x40000000 = ( CONFIG_ALLOW_RENAME );
uSNChanged: 57771;
uSNCreated: 33269;
whenChanged: 9/17/2008 4:55:12 PM Eastern Standard Time;
whenCreated: 9/15/2008 6:11:32 PM Eastern Standard Time;

———–

========================================
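
To make the msExchStandbyCopyMachines value in the dump easier to read, here is a small Python sketch that splits one value into its parts. The field order is an assumption: the text only says the value carries the target name plus ReplayLagTime and TruncationLagTime, so which of the two time fields is which is a guess, and the meaning of the second field ("1") is not stated at all. The function names are hypothetical.

```python
from datetime import timedelta


def parse_timespan(text):
    """Parse a TimeSpan-style string such as '1.00:00:00' or '00:00:00'.

    Fractional seconds are not handled in this sketch.
    """
    days = 0
    if "." in text.split(":")[0]:
        day_part, text = text.split(".", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def parse_standby_copy_machine(value):
    """Split one attribute value into named fields.

    ASSUMED order: target; unknown field; replay lag; truncation lag.
    """
    fields = [field for field in value.split(";") if field]
    target, unknown, replay_lag, truncation_lag = fields
    return {
        "target": target,
        "unknown_field": unknown,
        "replay_lag": parse_timespan(replay_lag),
        "truncation_lag": parse_timespan(truncation_lag),
    }


entry = parse_standby_copy_machine("2008-MBX2.exchange.msft;1;1.00:00:00;00:00:00;")
print(entry["target"], entry["replay_lag"], entry["truncation_lag"])
```

Because the attribute is multi-valued, a storage group with several SCR targets would simply carry several such values, each parsed the same way.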

The enable-storagegroupcopy command does not interact directly with the replication service in order to start a new replication instance.  Internal to each replication service is a configuration update process.  When the configuration update process runs, the replication service determines by reading the active directory which database instances need to be replicated.  A list is generated and compared to the instances the replication service is already running.  When a new replication instance is found, the replication service will spawn the instance.  When an instance already exists and is no longer replicated, the replication service will destroy that instance. 
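
The compare step described above amounts to a set difference between what the directory says should be replicated and what the service is currently running. A minimal Python sketch, with a hypothetical function name and an invented third storage group for illustration, might look like this; it is not the replication service's actual code.

```python
def reconcile(configured, running):
    """Compare what the directory lists with what is currently running."""
    configured, running = set(configured), set(running)
    to_start = sorted(configured - running)  # newly enabled instances
    to_stop = sorted(running - configured)   # instances no longer replicated
    return to_start, to_stop


to_start, to_stop = reconcile(
    configured=["2008-MBX1-SG1", "2008-MBX1-SG2"],
    running=["2008-MBX1-SG2", "2008-MBX1-SG3"],  # SG3 is a made-up example
)
print("start:", to_start)
print("stop:", to_stop)
```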

For standby continuous replication the configuration update process runs on the source every 30 seconds – on the target every 3 minutes.

On the source machine, when the configuration update process runs and determines that a database on the source has been enabled for SCR the replication service will create the file shares necessary for the target to access the source and replicate logs.

On the target machine, when the configuration update process runs and determines that a database is enabled for SCR and replicated to that machine, the instance is added to the replication service and the replication service begins the process of copying logs etc.

This is where customers start to experience inconsistent results.  Standby continuous replication is dependent on reading an Active Directory attribute, and it is also dependent on the time it takes that attribute to replicate to a domain controller in the source site and a domain controller in the target site.  Until that attribute replicates to both locations, and the configuration update process runs in both locations, standby continuous replication will not be fully enabled for this storage group.

Here are three examples that I have found that show inconsistent results.

Example #1:

In this example we have a standalone mailbox server in SiteA.  The SCR target for this standalone mailbox server is in SiteB.  SiteA and SiteB are different Active Directory sites with a 15 minute replication delay between them.  On the SCR target machine the administrator runs enable-storagegroupcopy -standbymachine, and the command completes successfully.  After a few minutes the get-storagegroupcopystatus -standbymachine command is run and it is noted that all replicated storage groups appear in a FAILED state.

 

Name            SummaryCopyStatus  CopyQueueLength  ReplayQueueLength  LastInspectedLogTime
----            -----------------  ---------------  -----------------  --------------------
2008-MBX1-SG1   Failed             0                0
2008-MBX1-SG2   Failed             0                0
                       

 

A review of the source shows that the shares necessary to replicate log files were not created.  At this time the admin steps away for 30 minutes and comes back to check replication again with get-storagegroupcopystatus -standbymachine.  It is noted that all storage groups appear healthy, and that the shares necessary to copy logs exist on the source.

 

Name            SummaryCopyStatus  CopyQueueLength  ReplayQueueLength  LastInspectedLogTime
----            -----------------  ---------------  -----------------  --------------------
2008-MBX1-SG1   Healthy            5                25
2008-MBX1-SG2   Healthy            1                50
                       

 

The behavior here is by design.  When the administrator enabled SCR on the target machine it stamped the msExchStandbyCopyMachines attribute on the domain controller in SiteB.  Within a 3 minute window the replication service on the target machine runs the configuration update process.  The new replication instance is detected and the replication service starts to attempt to copy logs.  The attribute though has not replicated to a domain controller in SiteA, therefore the replication service in SiteA does not know to create the shares necessary to service replication.  This results in the replication instances being marked FAILED.  After waiting 30 minutes, active directory replication has had time to occur and the configuration update process on the source has run, the new replication instance detected, and the shares created.  At this point the replication service can now access the logs on the source and the replication instances are marked HEALTHY.  (Note that the same example applies to a single copy cluster [scc] source.)

 

Example #2:

 

In this example we have a standalone mailbox server in SiteA.  The SCR target for this standalone mailbox server is in SiteB.  SiteA and SiteB are different Active Directory sites with a 15 minute replication delay between them.  On the SCR source machine the administrator runs enable-storagegroupcopy -standbymachine, and the command completes successfully.  After a few minutes the get-storagegroupcopystatus -standbymachine command is run and it is noted that all replicated storage groups appear in a NOTCONFIGURED state.

 

Name            SummaryCopyStatus  CopyQueueLength  ReplayQueueLength  LastInspectedLogTime
----            -----------------  ---------------  -----------------  --------------------
2008-MBX1-SG1   NotConfigured      0                0
2008-MBX1-SG2   NotConfigured      0                0
                       

 

A review of the source shows that the shares necessary to replicate log files are created.  At this time the admin steps away for 30 minutes and comes back to check replication again with get-storagegroupcopystatus -standbymachine.  It is noted that all storage groups appear healthy.

 

Name            SummaryCopyStatus  CopyQueueLength  ReplayQueueLength  LastInspectedLogTime
----            -----------------  ---------------  -----------------  --------------------
2008-MBX1-SG1   Healthy            5                25
2008-MBX1-SG2   Healthy            1                50
                       

 

The behavior here is by design.  When the administrator enabled SCR on the source machine it stamped the msExchStandbyCopyMachines attributes on the domain controller in SiteA.  Within a 30 second window the replication service on the source machine runs the configuration update process.  The new replication instance is detected, and the replication service creates the shares on the source.  The attribute though has not replicated to a domain controller in SiteB, therefore the replication service in SiteB is not aware of the replication instances and responds NotConfigured when queried for status.  After waiting 30 minutes, active directory replication has had time to occur and the configuration update process on the target has run, the new replication instances detected, and the replication process started.  At this point the replication service is aware of the instances, and responds with a healthy status when queried.  (Note that the same example applies to a single copy cluster [scc] source.)

 

Example #3:

 

In this example we have a cluster continuous replication source in SiteA.  The SCR target for this CCR source is located in SiteB.  SiteA and SiteB are different Active Directory sites with a 15 minute replication delay between them.  On the SCR target machine, the administrator runs enable-storagegroupcopy -standbymachine, and the command completes successfully.  After a few minutes the get-storagegroupcopystatus -standbymachine command is run and it is noted that all replicated storage groups appear HEALTHY.

 

Name            SummaryCopyStatus  CopyQueueLength  ReplayQueueLength  LastInspectedLogTime
----            -----------------  ---------------  -----------------  --------------------
2008-MBX1-SG1   Healthy            5                25
2008-MBX1-SG2   Healthy            1                50
                       

 

This is different from the example outlined in Example #1.  In this example the source is a CCR cluster.  In order for a CCR cluster to replicate log files between the two source nodes, the shares must exist.  Since the shares already exist, we only have to wait for the replication service configuration update process to run on the target machine.  AD replication here is not a factor when a target domain controller is used for the enable-storagegroupcopy command.
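
The three examples can be summarized with a rough timing model. The Python sketch below is a simplification and not how the replication service reports status: it uses the intervals given in the text (30 seconds on the source, 3 minutes on the target, a 15 minute Active Directory replication delay) as worst-case waits, and it ignores other states such as initializing or seeding. The function and parameter names are hypothetical.

```python
def scr_status(minutes_since_enable, enabled_on, source_is_ccr=False,
               ad_delay=15.0, source_poll=0.5, target_poll=3.0):
    """Worst-case status seen on the target, under the simplified model.

    enabled_on: "source" or "target", the site whose domain controller
    received the enable-storagegroupcopy change. All times are in minutes.
    """
    if enabled_on not in ("source", "target"):
        raise ValueError("enabled_on must be 'source' or 'target'")
    # Time until each side's replication service has read the attribute.
    source_aware = source_poll + (ad_delay if enabled_on == "target" else 0.0)
    target_aware = target_poll + (ad_delay if enabled_on == "source" else 0.0)
    # A CCR source already has its shares; otherwise the source must create them.
    shares_ready = 0.0 if source_is_ccr else source_aware
    if minutes_since_enable < target_aware:
        return "NotConfigured"
    if minutes_since_enable < shares_ready:
        return "Failed"
    return "Healthy"


print(scr_status(5, "target"))                      # Example #1, a few minutes in
print(scr_status(5, "source"))                      # Example #2, a few minutes in
print(scr_status(5, "target", source_is_ccr=True))  # Example #3, a few minutes in
print(scr_status(30, "target"))                     # Example #1, 30 minutes later
```

Under these assumptions, the shortest wait for a non-CCR source comes from whichever side sees the attribute last, which is why the site where the command is run changes what the administrator sees in the first few minutes.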

 

This information should help administrators explain some of the results of the SCR process and make decisions on where enabling should be performed.

Exchange 2007 SP1 CCR- Windows 2008 Clusters – File Share Witness (FSW) failures.

Exchange 2007 SP1 Cluster Continuous Replication (CCR) clusters built on Windows 2008 most commonly use the quorum model "Node Majority with File Share Witness".

 

In the past few weeks administrators have noticed that the file share witness resource is sometimes in a failed state.  Under normal circumstances this is not an issue, since both nodes of the cluster are available (meaning that there are two votes) and the cluster services stay running.  Where this becomes an issue is when the file share witness resource is failed and a node of the cluster is also unavailable.

 

Some background….

 

In Windows 2003 the file share witness is a private property of the "majority node set" quorum resource.  If the file share witness was unavailable, an event would be logged but the resource would continue to remain online.  (Reference http://support.microsoft.com/?kbid=921181 for information on the file share witness on Windows 2003).  The file share witness resource is only used when necessary to maintain quorum for the cluster.  If at that time the witness was still unavailable, the cluster would lose quorum and the cluster service would be terminated.

 

In Windows 2008, when using the node majority with file share witness quorum model, the file share witness resource is enumerated as an actual resource.  It can be seen in failover cluster manager -> cluster core resources.

 

image

*Cluster core resources.

 

Now that the file share witness exists as a resource, the cluster can do additional health checking.  One of the health checks is to ensure that the file share witness folder is online and accessible.  In the event that the FSW folder is not accessible, the cluster will fail the FSW resource and attempt to bring the FSW online on the other node.  If the FSW folder continues to remain unavailable, the cluster will fail the resource.  By default, the cluster will attempt to restart any failed resource every 60 minutes (1 hour).  If the resource continues to fail to come online, it will remain on the node it failed on until an administrator intervenes or the resource can be brought online during one of the 60 minute intervals.

 

image

*Failed file share witness resource in cluster core resources.

 

Just like Windows 2003 the file share witness in Windows 2008 is only accessed when it is necessary for the cluster to maintain quorum.  If the file share witness resource is in a failed state, and the use of the FSW becomes necessary to maintain quorum, Windows 2008 will not attempt to access the file share witness resulting in a lost quorum state and termination of cluster services. 

 

Impact to Exchange…

 

Where this seems to impact Exchange administrators the most is during patch management.  During patch management, patches are applied to the hub transport server owning the file share which supports the FSW resource in the cluster.  When the server is rebooted, the file share witness is no longer available.  Cluster performs status checking, and determines that the FSW is not available.  At this point the Core Cluster Group, containing the FSW, is failed over to the second node.  If the hub transport server is available, the file share witness will come online.  Most commonly the hub transport server is not available resulting in the file share witness failing, and remaining failed, on the second node.  If left alone for 60 minutes, the resource will be automatically restarted and by this time the hub transport server will be available. 

 

In my experience the issue arises during that 60 minutes.  It is possible that a reboot or loss of a cluster node could cause the cluster to lose quorum.  If the file share witness resource is failed, and I reboot the node not owning the cluster core resources, this leaves one vote in the cluster.  Since the cluster requires a majority of votes to be available, this results in the termination of the cluster service on the remaining node.
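
The vote arithmetic behind this can be written out in a few lines. The Python sketch below assumes the two-node "Node Majority with File Share Witness" model described here, where each node and the witness resource contribute one vote; the function name is hypothetical.

```python
def has_quorum(nodes_up, total_nodes=2, witness_online=True):
    """Majority check: every node plus the witness resource holds one vote."""
    total_votes = total_nodes + 1
    votes_present = nodes_up + (1 if witness_online else 0)
    # A strict majority of all votes is required to keep the cluster running.
    return votes_present > total_votes // 2


print(has_quorum(nodes_up=2, witness_online=False))  # witness failed, both nodes up
print(has_quorum(nodes_up=1, witness_online=True))   # one node down, witness online
print(has_quorum(nodes_up=1, witness_online=False))  # witness failed and one node down
```

Only the last case drops below a majority (one vote out of three), which is the situation that terminates the cluster service on the remaining node.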

 

It is important that administrators understand this difference between Windows 2003 and Windows 2008, and account for it in how they manage their cluster.

 

What can be done to alleviate this condition?

 

As you have already read, the cluster automatically attempts to restart failed resources every 60 minutes.  This time limit is the default setting for all resources.  The rationale here is that if a resource is failing, the administrator should have the opportunity to troubleshoot, identify, and correct the issue causing the resource to fail.  From a monitoring standpoint, here is the entire process: monitoring software bubbles up the alert, helpdesk notifies the admin, the admin accesses the machine, the admin reviews the logs, and the admin takes appropriate action(s).  On the other hand, this process may not work because the admin may be unavailable, in which case the cluster will still try to self heal if no administrator intervention is taken.  So essentially the first method of alleviating this condition is to understand the defaults, how the solution operates, and what to look for before rebooting nodes (all cluster core resources healthy), and to make no configuration changes.

 

On the properties of each resource is the retry interval to restart failed resources.  The minimum value that is allowed here is 15 minutes.  This would cause the cluster, in the event that the resource is failed, to be more aggressive in attempting to bring the resource online.  From an Exchange 2-node perspective, this would limit the failure window to 15 minutes versus 60 minutes (assuming the witness is available after 15 minutes).  To change this value:

 

1)  Open the Failover Cluster Manager and connect to the cluster.

2)  Select the cluster name at the top of the left hand pane.

 

image

 

3)  In the center pane of the MMC, expand Cluster Core Resources.

 

image

 

4)  Get the properties of the File Share Witness (\\Path\Share) resource.

5)  Select the Policies tab on the resource.

 

image

 

6)  On the policies tab, you will see an option "If all the restart attempts fail, begin restarting again after the specified period (hh:mm)" with a default of 01:00 (1 hour / 60 minutes).  Here you could adjust this value to 15 minutes using the input box.  If a change is made, select apply -> ok to exit properties.

 

image

 

Another method would be to issue an online command to the group, for example through a script.  Post reboot it would be possible to issue a command, from the server with the failover cluster manager installed, similar to this:  cluster.exe "cluster name" group "Cluster Group" /online [Note: "Cluster Group" is the name of the group holding the cluster core resources; only the cluster name has to be changed, to the FQDN of the cluster management name.]  For example, cluster.exe 2008-Cluster3.exchange.msft group "Cluster Group" /online.  If this command is run manually the following output would be returned in the command window:

 

Bringing resource group ‘Cluster Group’ online…

Group                Node            Status
-------------------- --------------- ------
Cluster Group        2008-Node5      Online
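
If the command is to be scripted, a thin wrapper can build the same command line and run it. The Python sketch below only reuses the cluster.exe syntax shown above. The function names are hypothetical, and treating an exit code of 0 as success is an assumption that should be verified in your environment.

```python
import subprocess


def online_command(cluster_fqdn, group="Cluster Group"):
    """Build the cluster.exe argument list described in the text."""
    return ["cluster.exe", cluster_fqdn, "group", group, "/online"]


def bring_group_online(cluster_fqdn, group="Cluster Group"):
    """Run the command; assumes exit code 0 means the group came online."""
    result = subprocess.run(online_command(cluster_fqdn, group),
                            capture_output=True, text=True)
    print(result.stdout)
    return result.returncode == 0


print(" ".join(online_command("2008-Cluster3.exchange.msft")))
```

Calling bring_group_online("2008-Cluster3.exchange.msft") only makes sense on a machine where cluster.exe is installed; elsewhere the call would raise FileNotFoundError.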

 

DO NOT use any other distributed method, such as distributed file systems, to host the file share witness.

 

What will be done to correct this condition?

 

We have worked with the Windows Product Team to bring this behavior to their attention.  We are working on a possible design change in the behavior of the file share witness on Windows 2008.  As this progresses I will continue to update this blog with more information.

 

Relevant Event IDs…

 

  • System Log – Event ID 1562 – Indicates that the file share witness is unavailable.

 

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          1/7/2009 8:07:24 AM
Event ID:      1562
Task Category: File Share Witness Resource
Level:         Warning
Keywords:     
User:          SYSTEM
Computer:      2008-Node1.exchange.msft
Description:
File share witness resource ‘File Share Witness (\\2008-dc1\MNS_FSW_2008-Cluster1)’ failed a periodic health check on file share ‘\\2008-dc1\MNS_FSW_2008-Cluster1’. Please ensure that file share ‘\\2008-dc1\MNS_FSW_2008-Cluster1’ exists and is accessible by the cluster.

 

  • System Log – Event ID 1069 – Indicates the file share witness resource is in failed state.

 

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          1/7/2009 8:07:24 AM
Event ID:      1069
Task Category: Resource Control Manager
Level:         Error
Keywords:     
User:          SYSTEM
Computer:      2008-Node1.exchange.msft
Description:
Cluster resource ‘File Share Witness (\\2008-dc1\MNS_FSW_2008-Cluster1)’ in clustered service or application ‘Cluster Group’ failed.

 

  • System Log – Event ID 1564 – Indicates that the cluster cannot access the file share witness directory.

 

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          1/7/2009 8:07:25 AM
Event ID:      1564
Task Category: File Share Witness Resource
Level:         Critical
Keywords:     
User:          SYSTEM
Computer:      2008-Node1.exchange.msft
Description:
File share witness resource ‘File Share Witness (\\2008-dc1\MNS_FSW_2008-Cluster1)’ failed to arbitrate for the file share ‘\\2008-dc1\MNS_FSW_2008-Cluster1’. Please ensure that file share ‘\\2008-dc1\MNS_FSW_2008-Cluster1’ exists and is accessible by the cluster.

 

  • System Log – Event ID 1205 – Indicates that the cluster core resources group ("Cluster Group") is not completely online or offline due to a failure of the file share witness resource.

 

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          1/7/2009 8:07:25 AM
Event ID:      1205
Task Category: Resource Control Manager
Level:         Error
Keywords:     
User:          SYSTEM
Computer:      2008-Node1.exchange.msft
Description:
The Cluster service failed to bring clustered service or application ‘Cluster Group’ completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered service or application.

 

*Thanks to Chuck Timon, Sr Support Escalation Engineer, Platforms CSS for assisting in reviewing and modifying this information.

 

=======================================

Updated Wednesday – 08/19/09

Jeff Guillet – a Microsoft Windows MVP – has posted a sample batch file for starting the FSW and moving the cluster core resources group to a desired node.  The instructions also include how to schedule the batch file as a startup script.

http://www.expta.com/2009/06/failure-of-fsw-causes-cluster-group-to.html

=======================================

 

=======================================

Updated Wednesday 12/14/2011

I failed to update this blog post previously with a Windows hotfix that corrects this behavior and makes workarounds unnecessary.

http://blogs.technet.com/b/timmcmic/archive/2010/02/15/kb978790-update-to-windows-2008-to-change-the-failure-behavior-of-the-file-share-witness-quorum-resource.aspx

=======================================

The Windows Cluster service encountered an error during function OpenCluster.

When users attempt to use Exchange Management Shell (PowerShell) to perform management tasks against an Exchange 2007 SP1 Cluster on Windows 2008, certain functions return "The Windows Cluster service encountered an error during function OpenCluster".

 

In the following example I am using a new account I created in active directory.  This account is a member of Domain Administrators and a member of Exchange Organization Administrators.  The user account is logging onto a machine where the Exchange 2007 SP1 RU5 management tools are installed as well as the Windows 2008 Failover Cluster Management tool.

 

Using the account, I attempt to create a mailbox on the clustered exchange instance.  Below is a copy of the verbose output of this command:

 

[PS] C:\Windows\System32>New-Mailbox -Name Test -Database 2008-MBX32008-MBX3-SG1-DB1 -UserPrincipalName test@exchange.msft -Verbose

cmdlet New-Mailbox at command pipeline position 1
Supply values for the following parameters:
Password:

VERBOSE: New-Mailbox : Beginning processing.
VERBOSE: New-Mailbox : Searching objects "exchange.msft/Users" of type "ExchangeOrganizationalUnit" under the root "$null".
VERBOSE: New-Mailbox : Previous operation run on domain controller ‘2008-DC1.exchange.msft’.
VERBOSE: New-Mailbox : Searching objects "2008-MBX32008-MBX3-SG1-DB1" of type "MailboxDatabase" under the root "$null".
VERBOSE: New-Mailbox : Previous operation run on domain controller ‘2008-DC1.exchange.msft’.
VERBOSE: New-Mailbox : Administrator Active Directory session settings are: View Entire Forest: ‘False’, Default Scope: ‘exchange.msft’, Configuration Domain Controller: ‘2008-DC1.exchange.msft’,
VERBOSE: New-Mailbox : Searching objects of type "ADRecipient" with filter "(|((SamAccountName Equal test)))", scope "SubTree" under the root "exchange.msft".
VERBOSE: New-Mailbox : Previous operation run on domain controller ‘2008-DC1.exchange.msft’.
VERBOSE: New-Mailbox : Searching objects of type "ADRecipient" with filter "(|((Alias Equal test)))", scope "SubTree" under the root "$null".
VERBOSE: New-Mailbox : Previous operation run on global catalog server ‘2008-DC2.exchange.msft’.
VERBOSE: New-Mailbox : Applying RUS policy to the given recipient "exchange.msft/Users/Test" with the home domain controller "$null".
New-Mailbox : The Windows Cluster service encountered an error during function OpenCluster:.
At line:1 char:12
+ New-Mailbox  <<<< -Name Test -Database 2008-MBX32008-MBX3-SG1-DB1 -UserPrincipalName test@exchange.msft -Verbose
VERBOSE: New-Mailbox : Ending processing.

 

The information here is not very helpful in determining why the open cluster call failed.  To investigate further, we can dump the stack trace associated with the exception.

 

To get the exception information, type the following in the Exchange Management Shell:  $error[0].exception.stacktrace.  The output of this command is:

 

[PS] C:\Windows\System32>$error[0].exception.stacktrace
   at Microsoft.Exchange.Common.ExCluster.GetActiveCmsOnNode(String nodeName)
   at Microsoft.Exchange.Data.Directory.NativeHelpers.GetLocalComputerFqdn(Boolean throwOnException)
   at Microsoft.Exchange.Data.Directory.SystemConfiguration.ADSystemConfigurationSession.ReadLocalServer()
   at Microsoft.Exchange.Data.Directory.Recipient.RecipientUpdateService.FindE12RusServer()
   at Microsoft.Exchange.Data.Directory.Recipient.RecipientUpdateService.LocateServer()
   at Microsoft.Exchange.Configuration.Tasks.RecipientTaskHelper.ApplyRusPolicy(RecipientUpdateService rus, ADRecipientSession recipientSession, ADRecipient recipient, PolicyType[] policyTypes, TaskVerboseLoggingDelegate logHandler, TaskWarningLoggingDelegate writeWarning)
   at Microsoft.Exchange.Configuration.Tasks.RecipientTaskHelper.ApplyRusPolicy(ADSystemConfigurationSession configurationSession, ADRecipientSession recipientSession, ADRecipient recipient, Fqdn domainController, String serverName, TaskVerboseLoggingDelegate logHandler, TaskWarningLoggingDelegate writeWarning)
   at Microsoft.Exchange.Configuration.Tasks.RecipientTaskHelper.ApplyRusPolicy(ADSystemConfigurationSession configurationSession, ADRecipientSession recipientSession, ADRecipient recipient, Fqdn domainController, String serverName, TaskVerboseLoggingDelegate logHandler, TaskErrorLoggingDelegate writeError, TaskErrorLoggingDelegate throwTerminatingError, TaskWarningLoggingDelegate writeWarning)
   at Microsoft.Exchange.Configuration.Tasks.NewRecipientObjectTask`1.PrepareDataObject()
   at Microsoft.Exchange.Configuration.Tasks.SetTaskBase`1.InternalValidate()
   at Microsoft.Exchange.Management.RecipientTasks.NewUserBase.InternalValidate()
   at Microsoft.Exchange.Configuration.Tasks.Task.ProcessRecord()

 

The stack trace, although helpful, still does not show exactly why we were unable to complete our open cluster call.  To investigate further, we can dump the inner exception.

 

To get the inner exception information, type the following in the Exchange Management Shell:  $error[0].exception.innerexception.  The output of this command is:

 

[PS] C:\Windows\System32>$error[0].exception.innerexception
Access is denied

 

Between the stack trace above and the inner exception, we can reasonably determine that we are receiving access denied when we attempt to open a connection to the cluster.  (Note:  It did require slightly more code review than just looking at the above to determine this…but for now we’ll agree that that’s what this information says).
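Rather than dumping the stack trace and the inner exception as two separate steps, the whole exception chain can be printed in one pass.  The following is only a minimal sketch: Get-ExceptionChain is a hypothetical helper name (it is not an Exchange cmdlet), and it assumes the failed command is still the most recent entry in $error.

```powershell
# Get-ExceptionChain is a hypothetical helper name, not an Exchange cmdlet.
# It walks an exception and each nested InnerException, outermost first.
function Get-ExceptionChain {
    param($Exception)

    $current = $Exception
    $depth = 0
    while ($current -ne $null) {
        # One line per level: depth, .NET type name, and message.
        "{0}: [{1}] {2}" -f $depth, $current.GetType().FullName, $current.Message
        $current = $current.InnerException
        $depth++
    }
}

# Usage, in the same session as the failed command:
#   Get-ExceptionChain $error[0].Exception
```

The last line printed is the innermost exception, which in the example above is where the "Access is denied" message appears.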

 

On the surface, there does not appear to be any reason for us to receive an access denied.  The account in question is a Domain Administrator and an Exchange Organization Administrator.  One important note, though, is that the account is NOT the built-in administrator.  Since the user is a domain administrator, domain administrators are members of local administrators, and local administrators have full control of a Windows 2008 cluster for management purposes, it appears we should not be receiving an access denied.  To view the permissions assigned to a cluster:

 

  • Open Failover Cluster Management.  If necessary, select Manage a cluster from the right task pane and connect to a cluster.
  • Right-click the cluster name in the left task pane and select Properties.

 

  • Select the cluster permissions tab.

 

The user account receives access denied because of User Account Control (UAC).  Earlier in the post I pointed out that the account we were using was not the built-in administrator account.  When we use the built-in administrator account, UAC is disabled by default.  Without UAC enabled, administrator groups are included in the user’s security token.  In this case we are using a Domain Administrator that is not the built-in administrator.  UAC is enabled by default for this user, so administrator groups are not automatically added to the user’s security token (unless running elevated).  Therefore, when this user makes the open cluster call, the cluster returns access denied.
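To confirm from inside a shell whether the current session is running elevated, the .NET WindowsPrincipal class can be queried.  The following is a minimal sketch rather than anything Exchange provides: Test-IsElevated is a hypothetical helper name, and it only reports whether the local Administrators group is active in the current token.

```powershell
# Test-IsElevated is a hypothetical helper name, not an Exchange or Windows cmdlet.
# It returns $true when the built-in Administrators group is active in the
# current token, which with UAC enabled is only the case in an elevated session.
function Test-IsElevated {
    $identity  = [System.Security.Principal.WindowsIdentity]::GetCurrent()
    $principal = New-Object System.Security.Principal.WindowsPrincipal($identity)
    $principal.IsInRole([System.Security.Principal.WindowsBuiltInRole]::Administrator)
}

# Usage:
#   if (-not (Test-IsElevated)) { Write-Warning "This session is not elevated." }
```

If this returns False for an account that is a member of the local Administrators group, the session is running with the filtered token described above.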

 

For this account to be able to complete cluster functions, the Exchange Management Shell must be run elevated.  There are two ways to handle this:

 

  • Right-click the Exchange Management Shell shortcut and select Run as administrator.

 

  • Open the properties of the Exchange Management Shell shortcut.  On the Shortcut tab, select the Advanced button.

 

 

  • Select the "Run as administrator" checkbox and press the OK button.  (Note that you will be prompted for administrator rights and to save the changes.)

 

The next time the Exchange Management Shell is launched, you will receive a dialog from User Account Control requesting permission to continue.  Select Continue to proceed.

 

Run the same command as before.  This time the command should complete successfully.

 

[PS] C:\Windows\System32>New-Mailbox -Name Test -Database 2008-MBX32008-MBX3-SG1-DB1 -UserPrincipalName test@exchange.msft -Verbose

cmdlet New-Mailbox at command pipeline position 1
Supply values for the following parameters:
Password:

VERBOSE: New-Mailbox : Beginning processing.
VERBOSE: New-Mailbox : Searching objects "exchange.msft/Users" of type "ExchangeOrganizationalUnit" under the root "$null".
VERBOSE: New-Mailbox : Previous operation run on domain controller ‘2008-DC1.exchange.msft’.
VERBOSE: New-Mailbox : Searching objects "2008-MBX32008-MBX3-SG1-DB1" of type "MailboxDatabase" under the root "$null".
VERBOSE: New-Mailbox : Previous operation run on domain controller ‘2008-DC1.exchange.msft’.
VERBOSE: New-Mailbox : Administrator Active Directory session settings are:  View Entire Forest: ‘False’, Default Scope: ‘exchange.msft’, Configuration Domain Controller: ‘2008-DC1.exchange.msft’,
VERBOSE: New-Mailbox : Searching objects of type "ADRecipient" with filter "(|((SamAccountName Equal test)))", scope "SubTree" under the root "exchange.msft".
VERBOSE: New-Mailbox : Previous operation run on domain controller ‘2008-DC1.exchange.msft’.
VERBOSE: New-Mailbox : Searching objects of type "ADRecipient" with filter "(|((Alias Equal test)))", scope "SubTree" under the root "$null".
VERBOSE: New-Mailbox : Previous operation run on global catalog server ‘2008-DC2.exchange.msft’.
VERBOSE: New-Mailbox : Applying RUS policy to the given recipient "exchange.msft/Users/Test" with the home domain controller "$null".
VERBOSE: New-Mailbox : The RUS server that will apply policies on the specified recipient is "2008-MBX3.exchange.msft".
VERBOSE: New-Mailbox : Processing object "exchange.msft/Users/Test".
VERBOSE: New-Mailbox : Searching objects of type "ADRecipient" with filter "(&((|((UserPrincipalName Equal test@exchange.msft)))(Id NotEqual exchange.msft/Users/Test)))", scope "SubTree" under the root "$null".
VERBOSE: New-Mailbox : Previous operation run on global catalog server ‘2008-DC2.exchange.msft’.
VERBOSE: New-Mailbox : Searching objects of type "ADRecipient" with filter "(&((|((SamAccountName Equal test)))(Id NotEqual exchange.msft/Users/Test)))", scope "SubTree" under the root "exchange.msft".
VERBOSE: New-Mailbox : Previous operation run on domain controller ‘2008-DC1.exchange.msft’.
VERBOSE: Creating Mailbox "Test" on Database "2008-MBX32008-MBX3-SG1-DB1" with UserPrincipalName "test@exchange.msft", Organizational Unit "exchange.msft/Users".
VERBOSE: New-Mailbox : Saving object "exchange.msft/Users/Test" of type "ADUser" and state "New".
VERBOSE: New-Mailbox : Previous operation run on domain controller ‘2008-DC1.exchange.msft’.
VERBOSE: New-Mailbox : Setting password for the created user "exchange.msft/Users/Test".
VERBOSE: New-Mailbox : The properties changed are: "{ PasswordLastSetRaw=‘-1’, UserAccountControl=‘NormalAccount’ }".
VERBOSE: New-Mailbox : Saving object "exchange.msft/Users/Test" of type "ADUser" and state "Changed".
VERBOSE: New-Mailbox : Previous operation run on domain controller ‘2008-DC1.exchange.msft’.
VERBOSE: New-Mailbox : Reading new object "exchange.msft/Users/Test" of type "ADUser".
VERBOSE: New-Mailbox : Previous operation run on domain controller ‘2008-DC1.exchange.msft’.

Name                      Alias                ServerName       ProhibitSendQuo
                                                                ta
----                      -----                ----------       ---------------
Test                      test                 2008-mbx3        unlimited
     

VERBOSE: New-Mailbox : Ending processing.

 

(I want to thank Brad Hughes and Ben Winzenz for their contributions to this post.)