Running setup.com /clearLocalCMS on a Windows 2008 cluster disables the machine account (VCO) associated with the CMS name.

The setup.com /clearLocalCMS command purges the clustered configuration for Exchange resources without making any changes to the Active Directory configuration of the Exchange instance.  This command is commonly used to clean the cluster configuration on the source Exchange cluster prior to running setup.com /recoverCMS, or to clear the clustered resources from the target Exchange cluster (for example, standby continuous replication using a single node cluster target).

The setup.com /clearLocalCMS command removes the Exchange resources and, if possible, deletes the remaining clustered group.  If the cluster is a single copy cluster, the physical disk resources are retained and the Exchange CMS group is renamed.

On Windows 2003, deleting clustered resources does not affect the status of the AD machine account object associated with the CMS name.  Whether the deletion is processed programmatically or through Cluster Administrator, the machine account associated with the CMS remains enabled in Active Directory.

On Windows 2008, deleting clustered resources does affect the status of the AD machine account object associated with the CMS name (machine account = VCO, or Virtual Computer Object).  Whether the deletion is processed programmatically or through Failover Cluster Management, the VCO associated with the CMS becomes disabled in Active Directory.

When no CMS is online using this machine account, this is generally not an issue.  There are times, though, when a CMS is online and servicing clients, and setup.com /clearLocalCMS is run on another cluster that originally owned the same CMS.  In that case, administrators need to take a manual step to re-enable the VCO to ensure that the online CMS continues to function properly.

Additional steps to re-enable the machine account are documented at http://technet.microsoft.com/en-us/library/bb738150.aspx.

“Because the cluster nodes are running Windows Server 2008, after you run Setup /ClearLocalCMS, the virtual computer object (VCO) will be disabled. To re-enable the VCO, click Start, point to All Programs, point to Administrative Tools, and then click Active Directory Users and Computers. Expand the domain, expand Computers, right-click the CMS VCO, and then click Enable Account.”

Here are sample LDP dumps showing the enabled account and disabled account.

  • Enabled VCO:

Expanding base ‘CN=TEST-CLUSTER,CN=Computers,DC=exchange,DC=msft’…
Getting 1 entries:
Dn: CN=TEST-CLUSTER,CN=Computers,DC=exchange,DC=msft
accountExpires: 9223372036854775807 (never);
badPasswordTime: 0 (never);
badPwdCount: 0;
cn: TEST-CLUSTER;
codePage: 0;
countryCode: 0;
distinguishedName: CN=TEST-CLUSTER,CN=Computers,DC=exchange,DC=msft;
dNSHostName: TEST-CLUSTER.exchange.msft;
dSCorePropagationData: 0x0 = (  );
instanceType: 0x4 = ( WRITE );
isCriticalSystemObject: FALSE;
lastLogoff: 0 (never);
lastLogon: 0 (never);
localPolicyFlags: 0;
logonCount: 0;
name: TEST-CLUSTER;
objectCategory: CN=Computer,CN=Schema,CN=Configuration,DC=exchange,DC=msft;
objectClass (5): top; person; organizationalPerson; user; computer;
objectGUID: 90c511fa-6dcc-4357-8d2e-3762bfc62ce2;
objectSid: S-1-5-21-3347541649-2078682762-2984813736-1143;
primaryGroupID: 515 = ( GROUP_RID_COMPUTERS );
pwdLastSet: 3/15/2009 10:42:37 AM Eastern Daylight Time;
sAMAccountName: TEST-CLUSTER$;
sAMAccountType: 805306369 = ( MACHINE_ACCOUNT );
servicePrincipalName (6): MSServerClusterMgmtAPI/TEST-CLUSTER.exchange.msft; MSServerClusterMgmtAPI/TEST-CLUSTER; MSClusterVirtualServer/TEST-CLUSTER.exchange.msft; MSClusterVirtualServer/TEST-CLUSTER; HOST/TEST-CLUSTER.exchange.msft; HOST/TEST-CLUSTER;
userAccountControl: 0x1020 = ( PASSWD_NOTREQD | WORKSTATION_TRUST_ACCOUNT );
uSNChanged: 671913;
uSNCreated: 671898;
whenChanged: 3/15/2009 10:42:38 AM Eastern Daylight Time;
whenCreated: 3/15/2009 10:40:46 AM Eastern Daylight Time;

———–

 

  • Disabled VCO after removing the cluster resources.

Expanding base ‘CN=TEST-CLUSTER,CN=Computers,DC=exchange,DC=msft’…
Getting 1 entries:
Dn: CN=TEST-CLUSTER,CN=Computers,DC=exchange,DC=msft
accountExpires: 9223372036854775807 (never);
badPasswordTime: 0 (never);
badPwdCount: 0;
cn: TEST-CLUSTER;
codePage: 0;
countryCode: 0;
distinguishedName: CN=TEST-CLUSTER,CN=Computers,DC=exchange,DC=msft;
dNSHostName: TEST-CLUSTER.exchange.msft;
dSCorePropagationData: 0x0 = (  );
instanceType: 0x4 = ( WRITE );
isCriticalSystemObject: FALSE;
lastLogoff: 0 (never);
lastLogon: 0 (never);
localPolicyFlags: 0;
logonCount: 0;
name: TEST-CLUSTER;
objectCategory: CN=Computer,CN=Schema,CN=Configuration,DC=exchange,DC=msft;
objectClass (5): top; person; organizationalPerson; user; computer;
objectGUID: 90c511fa-6dcc-4357-8d2e-3762bfc62ce2;
objectSid: S-1-5-21-3347541649-2078682762-2984813736-1143;
primaryGroupID: 515 = ( GROUP_RID_COMPUTERS );
pwdLastSet: 3/15/2009 10:42:37 AM Eastern Daylight Time;
sAMAccountName: TEST-CLUSTER$;
sAMAccountType: 805306369 = ( MACHINE_ACCOUNT );
servicePrincipalName (6): MSServerClusterMgmtAPI/TEST-CLUSTER.exchange.msft; MSServerClusterMgmtAPI/TEST-CLUSTER; MSClusterVirtualServer/TEST-CLUSTER.exchange.msft; MSClusterVirtualServer/TEST-CLUSTER; HOST/TEST-CLUSTER.exchange.msft; HOST/TEST-CLUSTER;
userAccountControl: 0x1022 = ( ACCOUNTDISABLE | PASSWD_NOTREQD | WORKSTATION_TRUST_ACCOUNT );
uSNChanged: 671916;
uSNCreated: 671898;
whenChanged: 3/15/2009 10:45:02 AM Eastern Daylight Time;
whenCreated: 3/15/2009 10:40:46 AM Eastern Daylight Time;
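
The only meaningful difference between the two dumps is the userAccountControl value (0x1020 versus 0x1022).  As a minimal sketch, the following Python decodes those two values.  The flag table covers only the three bits that appear in the dumps above, and the function names are my own for illustration; they are not part of LDP or any Exchange tool.

```python
# Only the userAccountControl bits that appear in the dumps above.
UAC_FLAGS = {
    0x0002: "ACCOUNTDISABLE",
    0x0020: "PASSWD_NOTREQD",
    0x1000: "WORKSTATION_TRUST_ACCOUNT",
}


def decode_uac(value):
    """Return the names of the known flags set in a userAccountControl value."""
    return [name for bit, name in sorted(UAC_FLAGS.items()) if value & bit]


def is_disabled(value):
    """True when the ACCOUNTDISABLE bit (0x2) is set."""
    return bool(value & 0x0002)


for label, uac in (("enabled VCO", 0x1020), ("disabled VCO", 0x1022)):
    print(label, hex(uac), decode_uac(uac), "disabled:", is_disabled(uac))
```

Clearing the 0x2 bit is what the Enable Account action in Active Directory Users and Computers amounts to for this attribute.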

———–

Exchange 2007 /newCMS or /recoverCMS fails when installing on Windows 2003 clusters.

When running Exchange 2007 /newCMS or /recoverCMS on a Windows 2003 clustered node, the following error may be noted in the setup command.

[ERROR] The computer account ‘<CMSName>’ was created on the domain controller
‘\\PDCEmulator.domain.com’, but has not replicated to the desired domain controller (LocalDC.domain.com) after waiting approximately 60 seconds. Please wait for the account to replicate and re-run setup /newcms.

When running Exchange 2007 /newCMS or /recoverCMS, you cannot specify a domain controller for use.  This results in Exchange setup choosing a domain controller that is in the same Active Directory site where Exchange is being installed.

By default, the Windows 2003 cluster service ALWAYS creates Kerberos-enabled machine accounts on the PDC emulator when the PDC emulator is available, regardless of the Active Directory site membership of the nodes.

This error generally results from one of two situations:

  • Exchange nodes are in Active Directory Site B, the PDC Emulator is in Active Directory Site A.
    • In this case the cluster service created the machine account in ADSiteA.
    • Exchange setup, using a domain controller in ADSiteB, has not seen the machine account because it has not replicated to the chosen domain controller in ADSiteB.
    • Depending on AD Site Links for replication, it could be a significant time until the two AD Sites replicate and converge.
    • This error will continue until the two sites are replicated and the machine account can be found on the domain controller that Exchange is utilizing.
  • Exchange nodes are in Active Directory Site A, the PDC Emulator is in Active Directory Site A.
    • In this case the Active Directory site may be flat, and contain several domain controllers.
    • Here the error is thrown because of intra-site replication latency between domain controllers that Exchange is utilizing and the PDC emulator.

When this failure occurs, you can expect a second failure during the setup process even after the AD sites have replicated.  If you are monitoring Cluster Administrator, you will see that the core Exchange resources are created without issue.  When setup goes to create and bring online the resource for the first database instance, the following error is thrown:

[ERROR] Setup cannot continue because the RPC server is unavailable. This could be
due to DNS information for clustered mailbox server ‘<CMSName>’ has not finished
replicating. Run Setup again after DNS replication has completed. You can verify
that DNS replication has completed by running "nslookup <CMSName>".

At a casual glance it would appear that you are having a name resolution issue that is preventing setup from continuing, and that may actually be the case.  Usually, though, when this error occurs after the previously mentioned error, it is actually the result of an access denied response to a local RPC call to create the first database instance.  Let’s explore why this happens and take a look at some supporting information.

When the first error regarding replication is encountered, re-running the setup command will fail until replication occurs.  If the installer waits, or replication is forced, the setup command will find the machine account available on the local domain controller and will be allowed to continue.  Here are sample LDP dumps of the initial account creation on the PDC emulator and of the account as replicated to the local domain controller.

  • LDP dump of machine account on PDC emulator.

Expanding base ‘CN=Mail,CN=Computers,DC=domain,DC=com’…
Result <0>: (null)
Matched DNs:
Getting 1 entries:
>> Dn: CN=Mail,CN=Computers,DC=domain,DC=com
5> objectClass: top; person; organizationalPerson; user; computer;
1> cn: Mail;
1> distinguishedName: CN=Mail,CN=Computers,DC=domain,DC=com;
1> instanceType: 0x4 = ( IT_WRITE );
1> whenCreated: 03/26/2008 15:30:41 Eastern Standard Time Eastern Daylight Time;
1> whenChanged: 03/26/2008 15:30:41 Eastern Standard Time Eastern Daylight Time;
1> uSNCreated: 20888;
1> uSNChanged: 20893;
1> name: Mail;
1> objectGUID: a5d09e3e-2d7d-4e3d-8bf9-30b423ead057;
1> userAccountControl: 0x1020 = ( UF_PASSWD_NOTREQD | UF_WORKSTATION_TRUST_ACCOUNT
);
1> badPwdCount: 0;
1> codePage: 0;
1> countryCode: 0;
1> badPasswordTime: 01/01/1601 00:00:00 UNC ;
1> lastLogoff: 01/01/1601 00:00:00 UNC ;
1> lastLogon: 01/01/1601 00:00:00 UNC ;
1> localPolicyFlags: 0;
1> pwdLastSet: 03/26/2008 15:30:41 Eastern Standard Time Eastern Daylight Time;
1> primaryGroupID: 515;
1> objectSid: S-1-5-21-2075556647-3556310751-2339872061-1111;
1> accountExpires: 09/14/30828 02:48:05 UNC ;
1> logonCount: 0;
1> sAMAccountName: MAIL$;
1> sAMAccountType: 805306369;
1> objectCategory: CN=Computer,CN=Schema,CN=Configuration,DC=domain,DC=com;
1> isCriticalSystemObject: FALSE;
2> dSCorePropagationData: 03/26/2008 15:52:12 Eastern Standard Time Eastern
Daylight Time; 30650/29691/8424 20284:77:2544 UNC;

  • LDP dump of replicated machine account to the local domain controller.

Expanding base ‘CN=Mail,CN=Computers,DC=domain,DC=com’…
Result <0>: (null)
Matched DNs:
Getting 1 entries:
>> Dn: CN=Mail,CN=Computers,DC=domain,DC=com
5> objectClass: top; person; organizationalPerson; user; computer;
1> cn: Mail;
1> distinguishedName: CN=Mail,CN=Computers,DC=domain,DC=com;
1> instanceType: 0x4 = ( IT_WRITE );
1> whenCreated: 03/26/2008 15:30:41 Eastern Standard Time Eastern Daylight Time;
1> whenChanged: 03/26/2008 15:30:50 Eastern Standard Time Eastern Daylight Time;
1> uSNCreated: 16994;
1> uSNChanged: 16994;
1> name: Mail;
1> objectGUID: a5d09e3e-2d7d-4e3d-8bf9-30b423ead057;
1> userAccountControl: 0x1020 = ( UF_PASSWD_NOTREQD | UF_WORKSTATION_TRUST_ACCOUNT
);
1> codePage: 0;
1> countryCode: 0;
1> localPolicyFlags: 0;
1> pwdLastSet: 03/26/2008 15:30:41 Eastern Standard Time Eastern Daylight Time;
1> primaryGroupID: 515;
1> objectSid: S-1-5-21-2075556647-3556310751-2339872061-1111;
1> accountExpires: 09/14/30828 02:48:05 UNC ;
1> sAMAccountName: MAIL$;
1> sAMAccountType: 805306369;
1> objectCategory: CN=Computer,CN=Schema,CN=Configuration,DC=domain,DC=com;
1> isCriticalSystemObject: FALSE;
2> dSCorePropagationData: 03/26/2008 15:52:48 Eastern Standard Time Eastern
Daylight Time; 30650/29691/8424 20876:77:2544 UNC;

If you are looking at these and saying they are almost exactly the same, you are correct.  Please note the following are the same:

  • whenCreated: 03/26/2008 15:30:41 Eastern Standard Time Eastern Daylight Time;
  • objectGUID: a5d09e3e-2d7d-4e3d-8bf9-30b423ead057
  • pwdLastSet: 03/26/2008 15:30:41 Eastern Standard Time Eastern Daylight Time

In this specific case, we’re interested in pwdLastSet.

After replicating the machine account to the local domain controller, you can re-run the setup.com /newCMS or setup.com /recoverCMS commands.  When you do, the command will now continue and no longer throw a replication error.  If you are watching Cluster Administrator, you will notice that the network name and IP address created by the first attempt are deleted and a new network name and IP address are created.  This is where the circumstances for the second failure are introduced.

When the network name is deleted and recreated, cluster goes back to the PDC emulator and hijacks the account that was previously created there.  When it does, the Kerberos password on the account is updated.  This results in a different Kerberos password on the PDC emulator (and synced into cluster) than what exists on the local domain controller.  You can verify this with LDP.

  • LDP dump on PDC emulator after replicating and rerunning setup.

Expanding base ‘CN=Mail,CN=Computers,DC=domain,DC=com’…
Result <0>: (null)
Matched DNs:
Getting 1 entries:
>> Dn: CN=Mail,CN=Computers,DC=domain,DC=com
5> objectClass: top; person; organizationalPerson; user; computer;
1> cn: Mail;
1> distinguishedName: CN=Mail,CN=Computers,DC=domain,DC=com;
1> instanceType: 0x4 = ( IT_WRITE );
1> whenCreated: 03/26/2008 15:30:41 Eastern Standard Time Eastern Daylight Time;
1> whenChanged: 03/30/2008 10:35:42 Eastern Standard Time Eastern Daylight Time;
1> uSNCreated: 16994;
1> uSNChanged: 24911;
1> name: Mail;
1> objectGUID: a5d09e3e-2d7d-4e3d-8bf9-30b423ead057;
1> userAccountControl: 0x1020 = ( UF_PASSWD_NOTREQD | UF_WORKSTATION_TRUST_ACCOUNT
);
1> codePage: 0;
1> countryCode: 0;
1> localPolicyFlags: 0;
1> pwdLastSet: 03/30/2008 10:34:44 Eastern Standard Time Eastern Daylight Time;
1> primaryGroupID: 515;
1> objectSid: S-1-5-21-2075556647-3556310751-2339872061-1111;
1> accountExpires: 09/14/30828 02:48:05 UNC ;
1> sAMAccountName: MAIL$;
1> sAMAccountType: 805306369;
1> dNSHostName: MAIL.domain.com;
8> servicePrincipalName: exchangeMDB/Mail.domain.com; exchangeMDB/Mail;
exchangeRFR/Mail.domain.com; exchangeRFR/Mail;
MSClusterVirtualServer/Mail.domain.com; MSClusterVirtualServer/MAIL;
HOST/Mail.domain.com; HOST/MAIL;
1> objectCategory: CN=Computer,CN=Schema,CN=Configuration,DC=domain,DC=com;
1> isCriticalSystemObject: FALSE;
2> dSCorePropagationData: 03/26/2008 15:52:48 Eastern Standard Time Eastern
Daylight Time; 30650/29691/8424 20876:77:2544 UNC;

  • LDP dump on local domain controller after replicating and rerunning setup.

Expanding base ‘CN=Mail,CN=Computers,DC=domain,DC=com’…
Result <0>: (null)
Matched DNs:
Getting 1 entries:
>> Dn: CN=Mail,CN=Computers,DC=domain,DC=com
5> objectClass: top; person; organizationalPerson; user; computer;
1> cn: Mail;
1> distinguishedName: CN=Mail,CN=Computers,DC=domain,DC=com;
1> instanceType: 0x4 = ( IT_WRITE );
1> whenCreated: 03/26/2008 15:30:41 Eastern Standard Time Eastern Daylight Time;
1> whenChanged: 03/26/2008 15:30:50 Eastern Standard Time Eastern Daylight Time;
1> uSNCreated: 16994;
1> uSNChanged: 16994;
1> name: Mail;
1> objectGUID: a5d09e3e-2d7d-4e3d-8bf9-30b423ead057;
1> userAccountControl: 0x1020 = ( UF_PASSWD_NOTREQD | UF_WORKSTATION_TRUST_ACCOUNT
);
1> codePage: 0;
1> countryCode: 0;
1> localPolicyFlags: 0;
1> pwdLastSet: 03/26/2008 15:30:41 Eastern Standard Time Eastern Daylight Time;
1> primaryGroupID: 515;
1> objectSid: S-1-5-21-2075556647-3556310751-2339872061-1111;
1> accountExpires: 09/14/30828 02:48:05 UNC ;
1> sAMAccountName: MAIL$;
1> sAMAccountType: 805306369;
1> dNSHostName: MAIL.domain.com;
8> servicePrincipalName: exchangeMDB/Mail.domain.com; exchangeMDB/Mail;
exchangeRFR/Mail.domain.com; exchangeRFR/Mail;
MSClusterVirtualServer/Mail.domain.com; MSClusterVirtualServer/MAIL;
HOST/Mail.domain.com; HOST/MAIL;
1> objectCategory: CN=Computer,CN=Schema,CN=Configuration,DC=domain,DC=com;
1> isCriticalSystemObject: FALSE;
2> dSCorePropagationData: 03/26/2008 15:52:48 Eastern Standard Time Eastern
Daylight Time; 30650/29691/8424 20876:77:2544 UNC;

Taking a look at our comparison points from before:

  • whenCreated
    • SiteA DC – 03/26/2008 15:30:41 Eastern Standard Time Eastern Daylight Time
    • SiteB DC – 03/26/2008 15:30:41 Eastern Standard Time Eastern Daylight Time
  • objectGUID
    • SiteA DC – a5d09e3e-2d7d-4e3d-8bf9-30b423ead057
    • SiteB DC – a5d09e3e-2d7d-4e3d-8bf9-30b423ead057
  • pwdLastSet
    • SiteA DC – pwdLastSet: 03/30/2008 10:34:44 Eastern Standard Time Eastern Daylight Time
    • SiteB DC – pwdLastSet: 03/26/2008 15:30:41 Eastern Standard Time Eastern Daylight Time

As you can clearly see, the object is the same (based on objectGUID), but the Kerberos password stored on the DC in SiteA and the DC in SiteB is not the same.  The password on SiteA is newer than the one on SiteB.  This makes sense: when cluster Kerberos-enables the new network name, it does not delete and recreate the computer account, hence the whenCreated time and objectGUID did not change.  It does, however, hijack the existing account and update the Kerberos password, resulting in pwdLastSet getting updated.  Therefore, when running setup a second time, we fail with the RPC / DNS error (which is really access denied).
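
If you save the LDP output from each domain controller, the same comparison can be scripted.  The following is only a rough sketch: it assumes the "N> name: value;" line layout shown in the dumps above, uses three lines copied from those dumps as sample input, and the helper names are hypothetical.

```python
def parse_ldp_dump(text):
    """Parse 'N> name: value;' lines from an LDP dump into a dict."""
    attrs = {}
    for line in text.splitlines():
        line = line.strip()
        if "> " not in line or ": " not in line:
            continue  # skip headers and wrapped continuation lines
        _, rest = line.split("> ", 1)
        name, value = rest.split(": ", 1)
        attrs[name] = value.rstrip("; ")
    return attrs


def differing(first, second, names):
    """Return the attribute names whose values differ between two dumps."""
    return [n for n in names if first.get(n) != second.get(n)]


# Lines copied from the SiteA (PDC emulator) and SiteB (local DC) dumps above.
SITE_A = """\
1> whenCreated: 03/26/2008 15:30:41 Eastern Standard Time Eastern Daylight Time;
1> objectGUID: a5d09e3e-2d7d-4e3d-8bf9-30b423ead057;
1> pwdLastSet: 03/30/2008 10:34:44 Eastern Standard Time Eastern Daylight Time;
"""
SITE_B = """\
1> whenCreated: 03/26/2008 15:30:41 Eastern Standard Time Eastern Daylight Time;
1> objectGUID: a5d09e3e-2d7d-4e3d-8bf9-30b423ead057;
1> pwdLastSet: 03/26/2008 15:30:41 Eastern Standard Time Eastern Daylight Time;
"""

points = ["whenCreated", "objectGUID", "pwdLastSet"]
print(differing(parse_ldp_dump(SITE_A), parse_ldp_dump(SITE_B), points))
```

When only pwdLastSet is reported as different, you are most likely looking at the condition described here: the same object on both domain controllers, with a password change that has not yet replicated.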

If you again replicate domain controllers, or allow time for replication to complete naturally, the Kerberos passwords will sync between domain controllers.  This time when re-running setup we are watermarked at a different location, and the database instance is successfully created and brought online.

The basic moral of this story is that when the PDC emulator is not in the same Active Directory site as the Exchange nodes, or you have a flat AD site with multiple domain controllers (with some intra-site replication latency), you can expect that Exchange setup will fail at least two times when creating the clustered resources.

There are three potential ways to work around this issue.

  • Use Windows 2008 instead of Windows 2003.
    • Windows 2008 will use local domain controllers to Kerberos enable objects and update existing objects.
    • Note:  If you are in a flat AD site you may still run into this issue due to intra-site AD replication latencies.
  • Shut the PDC emulator down while running setup.
    • By shutting the PDC emulator down you make it unavailable for cluster service use. 
    • When the cluster service cannot find the PDC emulator, it reverts back to using a local domain controller to Kerberos enable objects and update existing objects.
  • Virtually make the PDC emulator unavailable by putting a host file entry in place.
    • This by far is the easiest solution since it can be controlled from the node where setup is being run.
    • In the host file, place an entry for the PDC emulator using a completely unresponsive IP address.
    • Here is an example host file:

# Copyright (c) 1993-2006 Microsoft Corp.
#
# This is a sample HOSTS file used by Microsoft TCP/IP for Windows.
#
# This file contains the mappings of IP addresses to host names. Each
# entry should be kept on an individual line. The IP address should
# be placed in the first column followed by the corresponding host name.
# The IP address and the host name should be separated by at least one
# space.
#
# Additionally, comments (such as these) may be inserted on individual
# lines or following the machine name denoted by a ‘#’ symbol.
#
# For example:
#
#      102.54.94.97     rhino.acme.com          # source server
#       38.25.63.10     x.acme.com              # x client host

168.0.0.1    pdcEmulator.company.com
168.0.0.1    pdcEmulator

    • The IP address used is completely unavailable / not responsive on the network.
    • Entries are made for both the FQDN and the NetBIOS name of the PDC emulator.
    • When the PDC emulator is made unresponsive, cluster will revert back to using a local domain controller to Kerberos enable objects and update existing objects.
    • You may notice that the network name resource takes longer to come online.  It will eventually come online if there are no other circumstances preventing it from doing so.
    • You need to remove the host file entry when setup has completed successfully.
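
As a small illustration of the two entries described above, this Python sketch builds the host file lines for a given PDC emulator.  It assumes the NetBIOS name matches the first label of the FQDN, which is common but not guaranteed, and hosts_entries is a hypothetical helper, not an existing tool.

```python
def hosts_entries(pdc_fqdn, dead_ip):
    """Return host file lines for the FQDN and the short (NetBIOS-style) name.

    Assumption: the NetBIOS name equals the first label of the FQDN.
    """
    short_name = pdc_fqdn.split(".", 1)[0]
    return [f"{dead_ip}    {pdc_fqdn}", f"{dead_ip}    {short_name}"]


# Values taken from the example host file above.
for entry in hosts_entries("pdcEmulator.company.com", "168.0.0.1"):
    print(entry)
```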

A lossy failover causes duplicate mails to be delivered to clients from the hub transport dumpster when using Exchange 2007 SP1 Cluster Continuous Replication (CCR).

In recent weeks I have worked with some customers who have experienced duplicate mails in the Outlook client after a CCR cluster experienced a lossy failover.

A lossy failover occurs when the Exchange resources move between nodes and logs exist on the source that could not be copied to the target.  Depending on settings, the actual loss is incurred when a failover occurs and one of the following happens: the databases are automatically mounted based on the availability setting, the administrator forces the databases online, or the forceDatabaseMountAfter time period expires.

The following events can be noted during a lossy failover:

Log Name:      Application
Source:        MSExchangeIS
Date:          2/23/2009 9:24:17 AM
Event ID:      9796
Task Category: General
Level:         Warning
Keywords:      Classic
User:          N/A
Computer:      2008-Node6.exchange.msft
Description:
Database "2008-MBX5-SG2\2008-MBX5-SG2-DB1" has been subject to a lossy failover. The database may be patched if the Information Store detects it is necessary.

Log Name:      Application
Source:        MSExchangeRepl
Date:          2/23/2009 9:24:24 AM
Event ID:      2099
Task Category: Service
Level:         Information
Keywords:      Classic
User:          N/A
Computer:      2008-Node6.exchange.msft
Description:
The Microsoft Exchange Replication Service requested that Hub Transport server 2008-DC1 resubmit messages between time periods 2/22/2009 8:15:10 PM (UTC) and 2/23/2009 6:24:10 PM (UTC).

The hub transport dumpster is designed to attempt to re-deliver messages that may have been lost during a lossy failover.  A dumpster is maintained on the hub transport server for each CCR or LCR enabled storage group.  The dumpster has a user-configured maximum size and a user-configured maximum retention time (see references below).  In terms of how it operates, the dumpster is a FIFO queue of messages.  Let’s look at an example:

My dumpster size is 5 meg.  My dumpster retention time is 7 days.  I have a single storage group with a single mailbox.  I send a 2 meg message.  This 2 meg message is committed to the hub transport dumpster queue.  I then send another 2 meg message – this message is committed to the hub transport dumpster queue.  Finally I finish by sending another 2 meg message – in this case the oldest message in the queue is popped off so that the next 2 meg message can be committed to the hub transport dumpster.  Using this method you can see that messages will recycle through the hub transport dumpster with the first in messages becoming the first out in order to accommodate new messages.

Another example is where messages are moved out due to time expiring.  My dumpster size is 5 meg.  My dumpster retention time is 2 days.  I have a single storage group with a single mailbox.  I send a 2 meg message on Thursday at noon.  I then send another 2 meg message on Friday at noon.  No other messages are sent to this mailbox.  At this point both messages will be committed to the queue since the dumpster has not reached a full condition.  Saturday at noon the message I sent on Thursday is automatically removed from the queue since it has expired based on my maximum retention time.
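
The two examples above can be sketched as a toy model in Python.  This is not how Exchange implements the dumpster; it only illustrates the first in, first out behavior under a size limit and a retention limit.  The Dumpster class, its method names, and the use of whole days as the time unit are assumptions made for the illustration.

```python
from collections import deque


class Dumpster:
    """Toy model of one storage group's transport dumpster (not Exchange code)."""

    def __init__(self, max_size_mb, retention_days):
        self.max_size_mb = max_size_mb
        self.retention_days = retention_days
        self.queue = deque()  # (message_id, size_mb, day_received), oldest first

    def used_mb(self):
        return sum(size for _, size, _ in self.queue)

    def expire(self, today):
        """Drop messages that have been held for the full retention period."""
        while self.queue and today - self.queue[0][2] >= self.retention_days:
            self.queue.popleft()

    def add(self, message_id, size_mb, today):
        """Commit a message, evicting the oldest entries until it fits."""
        self.expire(today)
        while self.queue and self.used_mb() + size_mb > self.max_size_mb:
            self.queue.popleft()
        self.queue.append((message_id, size_mb, today))

    def ids(self):
        return [message_id for message_id, _, _ in self.queue]


# Example 1: 5 MB limit, 7 day retention, three 2 MB messages.
by_size = Dumpster(max_size_mb=5, retention_days=7)
for name in ("first", "second", "third"):
    by_size.add(name, 2, today=0)
print(by_size.ids())  # the oldest message was evicted to make room

# Example 2: 5 MB limit, 2 day retention; Thursday = day 0, Friday = day 1.
by_age = Dumpster(max_size_mb=5, retention_days=2)
by_age.add("thursday", 2, today=0)
by_age.add("friday", 2, today=1)
by_age.expire(today=2)  # Saturday at noon
print(by_age.ids())
```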

When a lossy failover occurs the hub transport servers, when requested, will flush the entire contents of their dumpster queues back into transport.  This causes the hub transport server to evaluate the messages as if they were just received, and begin the delivery process.  In order to prevent duplicates, each MAILBOX server maintains a list of message ids based on storage group.  Each message that is received is evaluated against this table.  If the message id is found, the message is turfed (by the store, as opposed to being turfed on the transport server) so that no duplicate occurs.  If the message id is not found, the message is re-delivered.  Please note again that this table is based on storage group.

Where we run into potential issues is when a move mailbox operation occurs.  When a user is moved between stores or between servers, the duplicate tracking table is not updated.  Since this table is not updated, when the messages come from the hub transport server each is evaluated as a new message, resulting in duplicate message delivery.  Let’s take a look at an example:

I have a user that is on ServerA\StorageGroupA\MailboxDatabaseA.  ServerA is an Exchange 2007 SP1 CCR cluster.  Due to a power outage of the primary node, the Exchange resources are failed from ServerA\NodeA to ServerA\NodeB.  In the process 3 logs are lost.  Availability is set to “best availability”, so the databases mount automatically.  The cluster informs the hub transport servers of a lossy failover and the dumpsters are flushed.  The user does not receive any duplicate messages.  This is by design: the duplicate tracking table exists for this storage group, is populated with entries for this mailbox, and successfully turfs any potential duplicates for this user.

Let’s take a look at another example:

I have a user that is on ServerA\StorageGroupA\MailboxDatabaseA.  ServerA is an Exchange 2007 SP1 CCR cluster.  On Monday I move this user from ServerA to ServerB.  The user is now located in ServerB\StorageGroupB\MailboxDatabaseB.  On Wednesday, due to a power outage, ServerA experiences a lossy failover between nodes.  The hub transport dumpsters are flushed.  When this happens messages destined to this user are re-evaluated, and re-routed to ServerB (where the user now exists).  The duplicate tracking table is consulted, but there are no matches for the messages originating from the dumpster (remember that a move mailbox does not update the duplicate tracking table).  Since there are no matches, all messages for this user that were submitted from the dumpster are re-delivered as new to the user.  This appears to the user as duplicate messages.
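
To make the difference between the two examples concrete, here is a toy Python model of per-storage-group duplicate tracking.  It is a sketch of the behavior as described in this post, not of the store's actual implementation, and the MailboxServer class and its names are hypothetical.

```python
class MailboxServer:
    """Toy model: one set of delivered message ids per storage group."""

    def __init__(self):
        self.seen = {}  # storage group name -> set of message ids

    def deliver(self, storage_group, message_id):
        """Turf the message if its id is already tracked, else deliver it."""
        ids = self.seen.setdefault(storage_group, set())
        if message_id in ids:
            return "turfed"
        ids.add(message_id)
        return "delivered"


server_a = MailboxServer()
server_b = MailboxServer()

# Original delivery while the mailbox lives on ServerA.
first = server_a.deliver("StorageGroupA", "msg-1")

# Example 1: dumpster resubmission to the same storage group is turfed.
same_sg = server_a.deliver("StorageGroupA", "msg-1")

# Example 2: after a move to ServerB, nothing is tracked for msg-1 there,
# so the resubmitted copy is delivered and the user sees a duplicate.
after_move = server_b.deliver("StorageGroupB", "msg-1")

print(first, same_sg, after_move)
```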

As of today this behavior is by design.  From an administrator standpoint there is nothing that can be done to mitigate this issue outside of ensuring that the Maximum size per storage group and maximum retention time for the hub transport dumpster are configured appropriately for your organization.

For more information see the following:

http://technet.microsoft.com/en-us/library/bb676352.aspx

http://technet.microsoft.com/en-us/library/bb288910.aspx

http://technet.microsoft.com/en-us/library/aa997963.aspx

http://msexchangeteam.com/archive/2007/01/17/432237.aspx

Using Standby Continuous Replication in both single node cluster implementation and database portability implementation when only a single machine is available.

Last week I received an interesting question from one of our Exchange MVPs. 

“Can I use a single node cluster as an SCR target for both recovering the entire cluster or recovering just a single database?”

My original thought on this was no…when implementing SCR you have to make a choice.  Are you implementing a single node cluster target where you would recover the entire cluster, or are you implementing a standalone mailbox server target where you would use database portability?

The more I thought about this, the more I determined that YES, there is a way to have your cake and eat it too when you only have a single machine to act as the SCR target.  Let me explain…

Since we are talking about using a single node cluster as the SCR target that means that we have some form of cluster source, either SCC or CCR. 

In the event that you were to lose the entire source cluster, you would activate the SCR target by:

  • Running restore-storagegroupcopy –standbymachine <NODE>
  • Running setup.com /recoverCMS /cmsName:<NAME> /cmsIPAddress:<IP>
  • Mounting the databases.

Now the question becomes: how can I use this single node cluster to recover just a single database, i.e. using the database portability recovery method?

In this case the first thing we need to do is identify the database or databases that we wish to recover to the SCR target.  These databases would have to be dismounted on the source server if they were not already.  We would then run restore-storagegroupcopy –standbymachine on these database instances.

The databases that have been ported to this machine are not in a “clean shutdown” state.  At this time I recommend that we bring the databases to a clean shutdown state.  To perform this operation:

  • Open a command prompt.
  • Run the following command –> eseutil /r <LOGPREFIX> /l <LOG PATH> /s <CHECKPOINT PATH> /d <DATABASE PATH>

Assume the following:

  • The log files and checkpoint file are located at x:\SG3
  • The database files are located at y:\SG3
  • The log prefix is E02 (the first 3 characters of a log file name).
  • The command would be eseutil /r E02 /l “x:\SG3” /s “x:\SG3” /d “y:\SG3”

If you had to use the –force switch with restore-storagegroupcopy, this means that not all logs could be copied over.  In this case you would run eseutil /r E02 /l “x:\SG3” /s “x:\SG3” /d “y:\SG3” /a (note the addition of /a) given the above example.  Some situations where you would have to use –force might include network failures preventing the copy from proceeding successfully, corrupted log files on the source, or missing log files / database on the source.
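
If you script this step, the command line can be assembled from the same four values.  The Python below is a minimal sketch that only builds the argument list and does not run eseutil.  The function name and the allow_lost_logs parameter are hypothetical, and the paths are the example paths from above, written here as x:\SG3 and y:\SG3.

```python
def eseutil_recovery_args(log_prefix, log_path, checkpoint_path, db_path,
                          allow_lost_logs=False):
    """Build the eseutil /r argument list described above.

    allow_lost_logs adds /a, for the case where -force was needed and
    not all logs could be copied.
    """
    args = ["eseutil", "/r", log_prefix,
            "/l", log_path, "/s", checkpoint_path, "/d", db_path]
    if allow_lost_logs:
        args.append("/a")
    return args


normal = eseutil_recovery_args("E02", r"x:\SG3", r"x:\SG3", r"y:\SG3")
forced = eseutil_recovery_args("E02", r"x:\SG3", r"x:\SG3", r"y:\SG3",
                               allow_lost_logs=True)
print(" ".join(normal))
print(" ".join(forced))
```

Keeping the arguments as a list rather than a single string avoids quoting problems if the result is later handed to something like subprocess.run on the SCR target.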

When this is completed, all SCR replication to the target must be disabled.  The reason for this is that an active CMS (we’ll get to this later) cannot also be a target for SCR replication.  To disable SCR in bulk, consider running get-storagegroup –server <SOURCE> | disable-storagegroupcopy –standbymachine <TARGET>.  Because this command modifies an AD attribute, we will need to take some time here to allow for AD replication to converge and for the replication services to detect this change and discontinue the replication instances.  (Note:  The disable command will fail on the storage groups where you ran restore-storagegroupcopy –standbymachine.  When running restore-storagegroupcopy –standbymachine on a database enabled for SCR, it is automatically disabled for SCR when the restore command completes successfully.)

So at this point what we have accomplished is to have all databases and logs necessary for recovery of the desired database instances present on the SCR target and all other replica instances disabled to that same target.  We can now continue with the recovery.

Since the SCR target in this case is running the passive node installation of Exchange, it cannot be made into a standalone mailbox server without uninstalling Exchange.  In this case that’s fine, we are going to create a single node active CMS.  In order to create the active CMS, we will run setup.com /newCMS /cmsName:<NAME> /cmsIPAddress:<IP>.  This setup command gives us a single node CCR cluster.  Please note that cluster type does not matter here since we are assuming that we will always be using a SINGLE NODE cluster.  In essence what we have done here is created a new Exchange server in the environment.

To take stock of where we are at now…we now have a single node cluster with X number of databases that are fully recovered and in a clean shutdown state.  We now need to mount these databases.

To mount these databases:

  • Create a storage group for each database that was restored.  DO NOT use the same paths as the restored databases.
  • Create a new mailbox database in each of these storage groups to correspond to the restored databases.  DO NOT use the same database paths as the restored databases.  DO NOT mount the databases (thus creating a blank database).
    • If the mount database option is accidentally selected, the database must be dismounted before continuing.
  • Run move-storagegrouppath -identity <NewSG> -LogFolderPath:<RestoredLogPath> -SystemFolderPath:<RestoredCheckpointPath> -configurationOnly:$TRUE
  • Run move-databasepath -identity <NewDB> -edbFilePath:"<RestoredDBPath\DBName.edb>" -configurationOnly:$TRUE.  [You will need to use the same *.edb name as the database on disk]
  • Run get-mailboxdatabase -server <NewCMS> | set-mailboxdatabase -AllowFileRestore:$TRUE
  • Run get-mailboxdatabase -server <NewCMS> | mount-database

This should mount the databases that were restored.
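When there are many databases, it can help to generate the per-database commands from a list rather than typing each one.  The following is a minimal sketch in Python that only builds the command text for the steps above; it does not run anything against Exchange.  The class and function names are hypothetical, the paths in the usage example are placeholders, and the -LogFolderPath / -SystemFolderPath parameter names reflect my understanding of the Exchange 2007 move-storagegrouppath syntax, so confirm them with get-help before use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RestoredDatabase:
    """Paths for one restored database; every value is supplied by the caller."""
    new_sg: str        # identity of the new, empty storage group
    new_db: str        # identity of the new, never-mounted database
    log_path: str      # folder holding the restored log files
    system_path: str   # folder holding the restored checkpoint file
    edb_file: str      # full path to the restored .edb file


def per_database_commands(db: RestoredDatabase) -> List[str]:
    """Build the two configuration-only path updates for one database."""
    return [
        f'move-storagegrouppath -identity "{db.new_sg}" '
        f'-LogFolderPath:"{db.log_path}" -SystemFolderPath:"{db.system_path}" '
        f'-configurationOnly:$TRUE',
        f'move-databasepath -identity "{db.new_db}" '
        f'-edbFilePath:"{db.edb_file}" -configurationOnly:$TRUE',
    ]


def server_commands(new_cms: str) -> List[str]:
    """Build the two server-wide commands that run once, after all path updates."""
    return [
        f'get-mailboxdatabase -server "{new_cms}" | set-mailboxdatabase -AllowFileRestore:$TRUE',
        f'get-mailboxdatabase -server "{new_cms}" | mount-database',
    ]


def build_script(new_cms: str, databases: List[RestoredDatabase]) -> List[str]:
    """Return every command in the order the steps above list them."""
    lines: List[str] = []
    for db in databases:
        lines.extend(per_database_commands(db))
    lines.extend(server_commands(new_cms))
    return lines


if __name__ == "__main__":
    # Placeholder names and paths, for illustration only.
    example = RestoredDatabase(
        new_sg=r"NEWCMS\SG1",
        new_db=r"NEWCMS\SG1\DB1",
        log_path=r"f:\SG1-Logs",
        system_path=r"e:\SG1-System",
        edb_file=r"g:\SG1-DB\DB1.edb",
    )
    print("\n".join(build_script("NEWCMS", [example])))
```

Review the generated text before pasting it into the management shell; the sketch does no validation of the paths it is given.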

The finishing steps are to rehome the users to the restored databases.  To perform this step:

Get-Mailbox -Database:"<OriginalServer>\<SGName>\<DatabaseName>" | where {$_.ObjectClass -NotMatch '(SystemAttendantMailbox|ExOleDbSystemMailbox)'} | Move-Mailbox -ConfigurationOnly -TargetDatabase:"<NewCMS>\<NewSG>\<RestoredDatabaseName>"

Please note the following:

  • When using /recoverCMS the source cluster can be cleared and its nodes re-subscribed as an SCR target.
  • If following these instructions to implement database portability, the original clustered nodes that are running the other databases are ineligible to be SCR targets since there is an active CMS.  Moving back to the original cluster could possibly mean moving all mailboxes back using the move mailbox process.
  • Using this approach to recover a single database means disabling replication for all other databases.  This may have an impact on your overall disaster recovery design and planning.
  • Depending on the replayLagTime of the SCR target there could be a large delta of logs to be recovered before the database can be mounted.  Planning for activation must include the time to copy delta logs as well as replay all outstanding log files.
  • If using a cluster source, consider using CCR for maintaining database availability and using SCR for disaster recovery.
  • Using move-mailbox -configurationOnly causes multiple changes to be committed to the active directory as well as client profile changes.  Please consider this also in the overall time to recovery.

*Thanks to Channing Heffney and Ross Smith for reviewing my information.

**********************************

Update 2/15/2010

Updated ESEUTIL /r command to reflect correct storage group paths.

**********************************

Restore-StorageGroupCopy –standbymachine requires use of –force in order to complete successfully when activating a Standby Continuous Replication target in Exchange 2007 SP1.

When you attempt to run restore-storagegroupcopy on an SCR target, you may receive the following error (note that this is with verbose output):

[PS] C:\Windows\System32>Restore-StorageGroupCopy MBX-4\MBX-4-SG1 -StandbyMachine MBX-3 -Verbose > output.txt
VERBOSE: Restore-StorageGroupCopy : Beginning processing.
VERBOSE: Restore-StorageGroupCopy : Searching objects "MBX-4\MBX-4-SG1" of type "StorageGroup" under the root "$null".
VERBOSE: Restore-StorageGroupCopy : Previous operation run on domain controller ‘DC-3.domain.com’.
VERBOSE: Restore-StorageGroupCopy : Processing object "MBX-4\MBX-4-SG1".
VERBOSE: Restoring Storage Group Copy "MBX-4\MBX-4-SG1".
WARNING: Failed to copy remaining log files (through Exx.log) during
Restore-StorageGroupCopy operation for storage group copy (MBX-4-SG1). Failure reason: Unable to move a log file from the inspector directory to the log directory for storage group 9b5be25a-1c2b-48a0-9087-2819d2887001|Standby. Old path: f:\MBX-4\MBX-4-SG1-Logs\E00.log. New path: f:\MBX-4\MBX-4-SG1-Logs\IgnoredLogs\E00OutofDate\2009-02-17T14-42-53E00.log.
.
Restore-StorageGroupCopy : Failed to copy the last log files for storage group ‘MBX-4-SG1’. Use the Force option if you want to restore the storage group despite the data loss.
At line:1 char:25
+ Restore-StorageGroupCopy  <<<< MBX-4\MBX-4-SG1 -StandbyMachine MBX-3 -Verbose
VERBOSE: Restore-StorageGroupCopy : Ending processing.

When reviewing the application log of the server, you will note the following events:

Log Name:      Application
Source:        MSExchangeRepl
Date:          2/17/2009 9:42:53 AM
Event ID:      2013
Task Category: Service
Level:         Information
Keywords:      Classic
User:          N/A
Computer:      MBX-3.domain.com
Description:
The replication instance for storage group MBX-4\MBX-4-SG1 found an invalid copy of log file f:\MBX-4\MBX-4-SG1-Logs\inspector\E00.log. The log file has been moved to f:\MBX-4\MBX-4-SG1-Logs\IgnoredLogs\InspectionFailed\2009-02-17T14-42-53E00.log. Reason: The log file has failed inspection.  It is inappropriate for replay..

Log Name:      Application
Source:        MSExchangeRepl
Date:          2/17/2009 9:42:53 AM
Event ID:      2089
Task Category: Action
Level:         Information
Keywords:      Classic
User:          N/A
Computer:      MBX-3.domain.com
Description:
The Restore-StorageGroupCopy operation on MBX-4\MBX-4-SG1 failed to complete with error code Microsoft.Exchange.Management.Tasks.RestoreSCRFailedToCopyLastLog: Failed to copy the last log files for storage group ‘MBX-4-SG1’. Use the Force option if you want to restore the storage group despite the data loss..

In the output of the management shell command the user is advised to use the -Force option to complete this operation.  There are certain circumstances where this is necessary, but this is not one of them.  Here the source is fully available, so all logs not yet replicated should be available for copy when running restore-storagegroupcopy.

The issue here arises from a bug.  We will attempt to copy the last log up to 6 times.  In this case, regardless of success or failure, we continue copying the log until we reach the maximum number of attempts and then display a failure.  In actuality the command completed successfully.  Let’s take a look at that.

If you review the log file directory on the SCR target, you will see that the ENN.log is actually present in the log file folder.

[Screenshot: log file directory on the SCR target]

Using eseutil /ml <log>, we can dump the header information for the log file on the SCR TARGET log directory.

Extensible Storage Engine Utilities for Microsoft(R) Exchange Server

Version 08.01

Copyright (C) Microsoft Corporation. All Rights Reserved.

Initiating FILE DUMP mode…

      Base name: e00
      Log file: e00.log
      lGeneration: 2959 (0xB8F)
      Checkpoint: NOT AVAILABLE
      creation time: 02/17/2009 09:26:41
      prev gen time: 02/17/2009 09:26:40
      Format LGVersion: (7.3704.12)
      Engine LGVersion: (7.3704.12)
     
Signature: Create time:01/08/2009 10:39:04 Rand:47957569 Computer:
      Env SystemPath: e:\MBX-4\MBX-4-SG1-System
      Env LogFilePath: f:\MBX-4\MBX-4-SG1-Logs
      Env Log Sec size: 512
      Env (CircLog,Session,Opentbl,VerPage,Cursors,LogBufs,LogFile,Buffers)
          (    off,    552,  27600,  15960,  27600,   2048,   2048,2000000000)
      Using Reserved Log File: false
      Circular Logging Flag (current file): off
      Circular Logging Flag (past files): off

      Last Lgpos: (0xb8f,8,30)

Integrity check passed for log file: e00.log

Operation completed successfully in 0.63 seconds.

The most important item in this output to us is the signature. 

Using eseutil /ml <log>, we can dump the header information for the log file on the SCR SOURCE log directory.

Extensible Storage Engine Utilities for Microsoft(R) Exchange Server

Version 08.01

Copyright (C) Microsoft Corporation. All Rights Reserved.

Initiating FILE DUMP mode…

      Base name: e00
      Log file: e00.log
      lGeneration: 2959 (0xB8F)
      Checkpoint: NOT AVAILABLE
      creation time: 02/17/2009 09:26:41
      prev gen time: 02/17/2009 09:26:40
      Format LGVersion: (7.3704.12)
      Engine LGVersion: (7.3704.12)
     
Signature: Create time:01/08/2009 10:39:04 Rand:47957569 Computer:
      Env SystemPath: e:\MBX-4\MBX-4-SG1-System
      Env LogFilePath: f:\MBX-4\MBX-4-SG1-Logs
      Env Log Sec size: 512
      Env (CircLog,Session,Opentbl,VerPage,Cursors,LogBufs,LogFile,Buffers)
          (    off,    552,  27600,  15960,  27600,   2048,   2048,2000000000)
      Using Reserved Log File: false
      Circular Logging Flag (current file): off
      Circular Logging Flag (past files): off

      Last Lgpos: (0xb8f,8,30)

Integrity check passed for log file: e00.log

Operation completed successfully in 0.63 seconds.

The signature is what we can use to make the most direct comparison.  Based on this information, we can confirm that the log file that was copied and placed in the SCR TARGET log directory matches the log from the SCR SOURCE log directory.  The command was actually successful.
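If you have several storage groups to check, the comparison can be scripted.  The following is a minimal sketch in Python, assuming you have saved the eseutil /ml output for the source log and the target log as text.  It compares only the lGeneration and Signature lines shown in the dumps above; the function names are hypothetical and this is not an official verification tool.

```python
import re
from typing import Dict


def parse_log_header(dump: str) -> Dict[str, object]:
    """Pull the generation and signature out of `eseutil /ml` text output."""
    fields: Dict[str, object] = {}
    gen = re.search(r"lGeneration:\s*(\d+)", dump)
    sig = re.search(r"Signature:\s*(.+)", dump)
    if gen:
        fields["generation"] = int(gen.group(1))
    if sig:
        # Collapse runs of whitespace so formatting differences do not matter.
        fields["signature"] = " ".join(sig.group(1).split())
    return fields


def same_log(source_dump: str, target_dump: str) -> bool:
    """True only when both dumps expose a generation and signature and they match."""
    src = parse_log_header(source_dump)
    tgt = parse_log_header(target_dump)
    complete = "generation" in src and "signature" in src
    return complete and src == tgt


if __name__ == "__main__":
    sample = (
        "      lGeneration: 2959 (0xB8F)\n"
        "Signature: Create time:01/08/2009 10:39:04 Rand:47957569 Computer:\n"
    )
    print(same_log(sample, sample))
```

A matching result only says the two headers agree on these fields; it is still worth confirming that each dump reports that the integrity check passed.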

As I indicated above, we will attempt to copy the log file up to 6 times (max retries).  If you look in the management shell text, you will see that the log was moved to “f:\MBX-4\MBX-4-SG1-Logs\IgnoredLogs\E00OutofDate\2009-02-17T14-42-53E00.log”, in this case the E00OutofDate directory inside IgnoredLogs.

[Screenshot: contents of the E00OutofDate directory]

If you review the folder you will see that the time stamps of the logs are all the same and correspond to the attempt to run restore-storagegroupcopy –standbymachine.  If you run eseutil /ml against one of the log headers, you can again compare the signatures and verify these logs both match the log successfully copied to the SCR TARGET directory and the log on the SCR SOURCE directory.

Extensible Storage Engine Utilities for Microsoft(R) Exchange Server

Version 08.01

Copyright (C) Microsoft Corporation. All Rights Reserved.

Initiating FILE DUMP mode…

      Base name: 200
      Log file: 2009-02-17T14-41-58E00.log
      lGeneration: 2959 (0xB8F)
      Checkpoint: NOT AVAILABLE
      creation time: 02/17/2009 09:26:41
      prev gen time: 02/17/2009 09:26:40
      Format LGVersion: (7.3704.12)
      Engine LGVersion: (7.3704.12)
     
Signature: Create time:01/08/2009 10:39:04 Rand:47957569 Computer:
      Env SystemPath: e:\MBX-4\MBX-4-SG1-System
      Env LogFilePath: f:\MBX-4\MBX-4-SG1-Logs
      Env Log Sec size: 512
      Env (CircLog,Session,Opentbl,VerPage,Cursors,LogBufs,LogFile,Buffers)
          (    off,    552,  27600,  15960,  27600,   2048,   2048,2000000000)
      Using Reserved Log File: false
      Circular Logging Flag (current file): off
      Circular Logging Flag (past files): off

      Last Lgpos: (0xb8f,8,30)

Integrity check passed for log file: e00.log

Operation completed successfully in 0.63 seconds.

Because of this issue, repeat attempts to run restore-storagegroupcopy -standbymachine will return the same error.  There are a few things that can be done to correct this issue:

  • Use the information in this blog to manually verify that the log copy was successful.  If successful, run restore-storagegroupcopy -standbymachine <name> -force.
  • Contact Microsoft Customer Support Services.  There are interim updates currently available for both Exchange 2007 SP1 RU5 and Exchange 2007 SP1 RU6.

This issue is corrected in Exchange 2007 SP1 RU7.

This issue affects only Exchange 2007 SP1 Standby Continuous Replication, regardless of the type of source.

(Note:  If you elect to receive an interim update, the IU only needs to be applied on the SCR target machine.)

Permissions recommended for the CNO (Cluster Name Object) in Windows 2008 for Exchange 2007 SP1 setup operations.

In Windows 2003, when the cluster attempted to create or modify Kerberos-enabled machine accounts, it did so by leveraging the rights assigned to the cluster service account.  The Windows 2003 cluster service used this domain account as its logon account at service startup.

In Windows 2008, when the cluster attempts to create or modify Kerberos-enabled machine accounts, it does so by leveraging the machine account associated with the name of the cluster (this is the Cluster Name Object, or CNO).  The Windows 2008 cluster service now starts under “Local System”.

When the CNO does not have rights to join machine accounts to the domain, or to modify existing machine accounts, Exchange setup will fail after programmatically creating the network name resource and attempting to bring it online.

This situation most commonly occurs when running:

1)  Setup.com /newCMS /cmsName:<NAME> /cmsIPv4Address:<IP>

2)  Setup.com /recoverCMS /cmsName:<NAME> /cmsIPv4Address:<IP>

3)  Enable-ContinuousReplicationHostName

The following errors may be noted during setup where the network name failed to come online due to this issue:

"Cluster Common Failure Exception: Failed to bring cluster resource Network name (<NAME>) in cluster group <NAME> online.The group or resource is not in the correct state to perform the requested operation. (Exception from HRESULT:0x8007139f)"

Error 0x8007139f translates to:

ERROR_INVALID_STATE
# The group or resource is not in the correct state to
# perform the requested operation.
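
For readers who want to see where that translation comes from, here is a small sketch in Python that splits an HRESULT into its standard fields (severity bit, 11-bit facility, 16-bit code).  Facility 7 is FACILITY_WIN32, so the low 16 bits are an ordinary Win32 error number, and 5023 is ERROR_INVALID_STATE.  The function name is hypothetical.

```python
from typing import Dict


def decode_hresult(hresult: int) -> Dict[str, int]:
    """Split a 32-bit HRESULT into severity, facility, and code."""
    return {
        "failure": (hresult >> 31) & 0x1,     # 1 means the call failed
        "facility": (hresult >> 16) & 0x7FF,  # 7 is FACILITY_WIN32
        "code": hresult & 0xFFFF,             # Win32 error number when facility is 7
    }


if __name__ == "__main__":
    fields = decode_hresult(0x8007139F)
    print(fields)  # {'failure': 1, 'facility': 7, 'code': 5023}
```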

In the application and system logs, the following events may be noted:

Log Name: Application
Source: MSExchangeRepl
Date: 10/24/2008 2:17:15 PM
Event ID: 107
Task Category: Action
Level: Error
Keywords: Classic
User: N/A
Computer: <NAME>.domain.com
Description:
The New-ClusteredMailboxServer operation failed for server <NAME>

Log Name: Application
Source: MSExchangeSetup
Date: 10/24/2008 2:17:15 PM
Event ID: 1002
Task Category: Microsoft Exchange Setup
Level: Error
Keywords: Classic
User: N/A
Computer: <NAME>.domain.com
Description:
Exchange Server component Clustered Mailbox Server failed.
Error: Error:
Cluster Common Failure Exception: Failed to bring cluster resource Network Name (<NAME>) in cluster group <NAME> online. The event log may contain more details. Cluster Common Failure Exception: The group or resource is not in the correct state to perform the requested operation. (Exception from HRESULT: 0x8007139F)

Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 10/24/2008 2:17:13 PM
Event ID: 1194
Task Category: Network Name Resource
Level: Error
Keywords:
User: SYSTEM
Computer: <NAME>.domain.com
Description:
Cluster network name resource ‘Network Name (<NAME>)’ failed to create its
associated computer object in domain ‘domain.com’ for the following reason: Unable to create computer account. The text for the associated error code is: Access is denied.

To correct this situation, I recommend the following when creating Windows 2008 clusters.  (These steps assume the cluster service on the nodes has not already been configured):

  • Using Active Directory Users and Computers showing advanced features:
    • In the appropriate container create a new machine account to correspond to the name of the cluster – this will be the cluster name object or CNO.
    • In the appropriate container create a new machine account for the Exchange name – this will be the CMS or clustered mailbox server name.
  • Once the machine accounts are created, the necessary permissions should be updated:
    • Get the properties of the CMS computer account.
    • Select the security tab.
    • Select add.
      • Select the object types button – change the scope to just computer accounts.
      • In the search field, type the name of the CNO machine account and press check names.
      • Press OK once the machine account is found.
    • In the group or user names box, find the machine account just added.
      • Assign the FULL CONTROL right to this machine account.
  • Complete the process by disabling both the CNO account and the CMS account.
  • Allow time for AD replication.

If the cluster services have already been configured you can skip the step of creating an account for the CNO and disabling the CNO account since this account should already exist in the active directory.

When these steps are completed you should be able to establish the cluster services and begin the Exchange installation.

If you are using Standby Continuous Replication (SCR) and the target is a single node cluster you will follow the same instructions with the exception of:

  • Create two CNO accounts, one for each cluster.
  • Add both CNO accounts with full control to the same CMS account.
  • Disable all accounts created.

Updating permissions for the additional CNO ensures that the standby cluster CNO has the appropriate rights when running setup.com /recoverCMS.

If you are using continuous replication hostnames with cluster continuous replication clusters you will follow the same process outlined above to pre-stage your machine accounts associated with the replication names and add the CNO account with full control.  The only CNO account that requires permissions is that of the cluster hosting the replication host names – SCR target cluster CNOs do not require permissions to these names.

By pre-staging machine accounts and establishing the appropriate security contexts you can help prevent errors during Exchange setup and commandlet operations.

Errors with Test-ReplicationHealth when using a multi-subnet Windows 2008 Cluster.

Windows 2008 supports having clustered nodes that are installed into different network subnets.  For Exchange 2007 SP1 this becomes a configuration used for cluster continuous replication clusters.

When troubleshooting and monitoring clusters, customers will often use the test-replicationhealth commandlet to determine the status of replication between two clustered nodes in a cluster continuous replication solution.  When the cluster itself is multi-subnet, the following warning is returned during testing:

Server       Check                   Result    Error
------       -----                   ------    -----
2008-NODE6   ClusterNetwork          WARNING   Warnings:
                                               Network 'Cluster Network 3' used for client connectivity is up but node '2008-Node5' does not have a Network Interface Card configured on it. Check that a NIC is configured for this network and is enabled.
                                               Network 'Cluster Network 1' used for client connectivity is up but node '2008-Node6' does not have a Network Interface Card configured on it. Check that a NIC is configured for this network and is enabled.
2008-NODE6   QuorumGroup             Passed
2008-NODE6   FileShareQuorum         Passed
2008-NODE6   CmsGroup                Passed
2008-NODE6   NodePaused              Passed
2008-NODE6   DnsRegistrationStatus   Passed
2008-NODE6   ReplayService           Passed
2008-NODE6   DBMountedFailover       Passed
2008-NODE6   SGCopySuspended         Passed
2008-NODE6   SGCopyFailed            Passed
2008-NODE6   SGInitializing          Passed
2008-NODE6   SGCopyQueueLength       Passed
2008-NODE6   SGReplayQueueLength     Passed

The issue is a bug in the way that test-replicationhealth handles cluster networks.  In failover cluster management, under networks, you will see a “network” enumerated for each subnet.  Each of these networks shows all of the cluster interfaces that reside on that network.  In a single subnet cluster, these networks would generally have two interfaces, one for each node.  In a multi-subnet cluster, each network has only a single interface, belonging to the one node that resides on that subnet.
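
To make the difference concrete, here is a small model in Python.  It is only an illustration of the behavior described above and of the warning text; it is not the actual logic inside test-replicationhealth, which I have not reproduced here.  The function names are hypothetical, and the network and node names come from the sample output.

```python
from typing import Dict, List, Set


def per_network_check(networks: Dict[str, Set[str]], nodes: List[str]) -> List[str]:
    """Warn whenever any node lacks an interface on any client network."""
    warnings = []
    for name in sorted(networks):
        for node in nodes:
            if node not in networks[name]:
                warnings.append(
                    f"Network '{name}' is up but node '{node}' has no interface on it"
                )
    return warnings


def any_network_check(networks: Dict[str, Set[str]], nodes: List[str]) -> List[str]:
    """Warn only when a node has no interface on any client network at all."""
    covered: Set[str] = set()
    for members in networks.values():
        covered |= members
    return [f"Node '{n}' has no client interface" for n in nodes if n not in covered]


if __name__ == "__main__":
    nodes = ["2008-Node5", "2008-Node6"]
    # Multi-subnet: each network holds the interface of exactly one node.
    multi_subnet = {
        "Cluster Network 1": {"2008-Node5"},
        "Cluster Network 3": {"2008-Node6"},
    }
    print(per_network_check(multi_subnet, nodes))  # two warnings, as in the output above
    print(any_network_check(multi_subnet, nodes))  # no warnings
```

The first function expects every node on every network, which holds for a single subnet but produces the two warnings shown for a healthy multi-subnet cluster.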

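To view a network and its interfaces, select networks from the left hand pane:

[Screenshot: networks listed in the left hand pane]

Here is an example of a multi-subnet cluster (showing a single interface per network):

[Screenshot: multi-subnet cluster network with a single interface]

Here is an example of a single subnet cluster (showing multiple interfaces per network):

[Screenshot: single subnet cluster network with multiple interfaces]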

At this time this issue is not scheduled to be corrected in any Exchange 2007 release.

 

*Updates:

10/19/2009 – Updated to reflect fix release.

Recommendations for enabling a two node Standby Continuous Replication target based on a Single Copy Cluster (Exchange 2007 SP1)

In this blog post I will assume there exists a source cluster that consists of a two node Exchange 2007 SP1 Single Copy Cluster (SCC) hosted on either Windows 2003 or Windows 2008. 

Standby Continuous Replication (SCR) was designed, in this type of deployment, to have a target that is a single node cluster.  Recently I’ve received several requests on how this could be extended to a two node cluster functioning as the SCR target.

Single copy clusters make having a two node target more complicated because of having to deal with the shared storage.  It is a requirement of SCR that the same drive letters / paths used for databases, logs, and system files on the source also exist on the target.  Also, we must take into consideration the fact that the storage necessary for replication can only be owned by a single node (shared nothing cluster model), and therefore only one node of the target cluster can be subscribed as the SCR target.

If you desire to have a two node SCR target, consider making the following configuration changes to assist in ensuring that the physical disk resources are owned on the correct node.

Windows cluster allows administrators to specify, on the properties of clustered groups, a list of preferred owners.  The preferred owners list on an Exchange cluster is generally cosmetic.  When preferred owners is combined with a failback policy, the settings become more than cosmetic.  A preferred owners list allows the administrator to establish the list of nodes, in order, that they prefer the group be hosted on when nodes are available.  When combined with a failback policy, the preferred owners list tells the cluster where and when to move the group automatically when specific nodes are available.  The preferred owners list and failback policy are evaluated any time cluster membership changes, for example, when rebooting a node that is a member of the cluster.  Let’s look at a few examples of this as it applies to our SCR target.

Example #1:

I have a two node SCR target with a group configured to hold my physical disk resources.  I have set a preferred owners list of NodeA then NodeB and a failback policy of immediate.  The group is currently owned on NodeA.  At patch management time I apply the necessary hotfixes to NodeB and reboot the server.  When NodeB has successfully rejoined the cluster, I then apply the patches to NodeA and reboot.  The disk group automatically moves from NodeA to NodeB.  When NodeA successfully rejoins the cluster, the disk group automatically moves back to NodeA.  Replication can now successfully resume since the underlying storage necessary for replication is present on NodeA, and NodeA is subscribed as the SCR target.  In this instance cluster membership changed during the reboot causing the cluster to evaluate the preferred owners list and failback policy and take actions as defined.

Example #2:

I have a two node SCR target with a group configured to hold my physical disk resources.  I have set a preferred owners list of NodeA then NodeB and a failback policy of immediate.  The group is currently owned on NodeA.  NodeA experiences a blue screen condition due to a faulty storage driver.  The disk group automatically moves from NodeA to NodeB.  When NodeA automatically reboots and successfully rejoins the cluster, the disk group automatically moves back to NodeA.  Replication can now successfully resume since the underlying storage necessary for replication is present on NodeA, and NodeA is subscribed as the SCR target.  In this instance cluster membership changed during the reboot causing the cluster to evaluate the preferred owners list and failback policy and take actions as defined.

Example #3:

I have a two node SCR target with a group configured to hold my physical disk resources.  I have set a preferred owners list of NodeA then NodeB and a failback policy of immediate.  At patch management time I apply the necessary hotfixes to NodeB and reboot the server.  When NodeB has successfully rejoined the cluster, I launch failover cluster management and manually move the disk group from NodeA to NodeB.  I then apply the patches to NodeA and reboot the server.  When NodeA successfully rejoins the cluster, the disk group automatically moves back to NodeA.  Replication can now successfully resume since the underlying storage necessary for replication is present on NodeA, and NodeA is subscribed as the SCR target.  In this instance cluster membership changed during the reboot causing the cluster to evaluate the preferred owners list and failback policy and take actions as defined.

Example #4:

I have a two node SCR target with a group configured to hold my physical disk resources.  I have set a preferred owners list of NodeA then NodeB and a failback policy of immediate.  An administrator, using failover cluster management, moves the disk group from NodeA to NodeB.  The group is not moved back.  Replication will enter a failed state for all instances since the storage necessary for replication to function is no longer present on the node subscribed to SCR.  Alerting informs the administrator there is an issue.  It is determined that the disk group is owned on the wrong node, and it is manually moved back to NodeA.  Soon after, replication successfully resumes since the underlying storage necessary for replication is present on NodeA, and NodeA is subscribed as the SCR target.  In this instance cluster membership did NOT change, so the preferred owners list and failback policy were not applied.
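
The four examples can be summarized with a toy model.  The sketch below is Python and is a deliberate simplification: it treats failback as immediate, ignores failback windows and resource failures, and uses a hypothetical DiskGroup class that exists only for this illustration.  The key point it captures is that ownership is re-evaluated when cluster membership changes and not when an administrator moves the group by hand.

```python
from typing import List, Optional, Set


class DiskGroup:
    """Toy model of a clustered group with preferred owners and immediate failback."""

    def __init__(self, preferred_owners: List[str], owner: str, failback: bool = True):
        self.preferred_owners = preferred_owners
        self.owner: Optional[str] = owner
        self.failback = failback

    def manual_move(self, node: str) -> None:
        # An administrative move changes the owner but is not a membership
        # change, so the preferred owners list is not consulted.
        self.owner = node

    def on_membership_change(self, up_nodes: Set[str]) -> None:
        # Preferred owners that are currently up, in preference order.
        candidates = [n for n in self.preferred_owners if n in up_nodes]
        if not candidates:
            self.owner = None             # nowhere to host the group
        elif self.owner not in up_nodes or self.failback:
            self.owner = candidates[0]    # failover, or failback to the most preferred node


if __name__ == "__main__":
    group = DiskGroup(["NodeA", "NodeB"], owner="NodeA")
    group.on_membership_change({"NodeB"})            # NodeA reboots or blue screens
    print(group.owner)                               # NodeB
    group.on_membership_change({"NodeA", "NodeB"})   # NodeA rejoins
    print(group.owner)                               # NodeA
    group.manual_move("NodeB")                       # Example #4: nothing moves it back
    print(group.owner)                               # NodeB
```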

Establishing the disk group, Preferred Owner, and Failback Policy in Windows 2003

Use the following steps to establish the disk group, preferred owners list, and a failback policy in Windows 2003.

  • Launch cluster administrator and connect to the SCR target cluster.
  • Under Groups in the left hand pane, create a new cluster group.

    [Screenshot: creating a new cluster group]

    • Name the group as appropriate.
    • By default disks found on a shared bus have physical disk resources created for them in default groups (ie Group0, Group1, Group2).  Move physical disk resources from the default groups into the new group created.
    • If mount points are being used, physical disk resources must be created manually.  Please ensure that all mount point disks are created and made dependent on the lettered volume hosting them (refer to http://support.microsoft.com/default.aspx?scid=kb;[LN];280297).
  • Right click on the new group – select properties.
  • On the general tab is the preferred owners list.  Using the modify button, add preferred owners to the list and adjust the order as necessary.  (The server listed first in order will be the node most preferred to own the group).

    [Screenshots: preferred owners list on the general tab and the modify dialog]

  • After changing preferred owners (if necessary), select the failback tab.
    • Select the radio button allow failback.
    • Select the radio button immediately.

    [Screenshot: failback tab]

The configuration of preferred owners and a failback policy can also be performed from the command line.

To set the list of preferred owners and configure failback:

  • cluster.exe <ClusterFQDN> group <GroupName> /setOwners:<FirstNode>,<SecondNode>
  • cluster.exe <ClusterFQDN> group <GroupName> /prop AutoFailbackType=1

Examples of these commands:

  • cluster.exe 2003-Cluster3.exchange.msft group SCRTargetDisks /setOwners:2003-Node1,2003-Node2
  • cluster.exe 2003-Cluster3.exchange.msft group SCRTargetDisks /prop AutoFailbackType=1

Establishing the disk group, Preferred Owner, and Failback Policy in Windows 2008

  • Launch failover cluster management and connect to the SCR target cluster.
  • In the left hand pane, under the cluster name, right click on Services and Applications, select More Actions -> Create Empty Service or Application.

[Screenshot: Create Empty Service or Application menu option]

  • Under services and applications you will now see a group named "New service or application".
  • Right click on "New service or application", select rename, and assign an appropriate name.

[Screenshot: renaming the new group]

  • Right click on the group, select add storage.  From here, choose the storage that should be added to this group.
    • Note:  By default Windows 2008 adds both lettered volumes and mounted volumes (mount points) to the available storage group at cluster creation.  Mounted volumes must manually be made dependent on their lettered physical disk.  Please update dependencies if necessary.
    • If the desired storage does not appear in the storage picker, ensure that it has been added to the available storage group.  Only storage that has first been added to the available storage group is allowed to be added to services and applications.

[Screenshots: add storage option and storage picker]

  • Once storage has been added, right click on group and select properties.
  • On the general tab, select the checkbox next to each node of the cluster.  Use the up / down buttons to establish the preferred order.  Machines appearing first in the list will have preference over other nodes.

[Screenshot: preferred owners on the general tab]

  • Select the Failover tab.
    • Under the failback portion, select the radio button next to "Allow Failback".
    • Under "Allow Failback" select the radio button next to "Immediately".

[Screenshot: failback settings on the Failover tab]

  • Apply the changes to complete the configuration.

The configuration of preferred owners and a failback policy can also be performed from the command line.

To set the list of preferred owners and configure failback:

  • cluster.exe <ClusterFQDN> group <GroupName> /setOwners:<FirstNode>,<SecondNode>
  • cluster.exe <ClusterFQDN> group <GroupName> /prop AutoFailbackType=1

Examples of these commands:

  • cluster.exe 2008-Cluster3.exchange.msft group SCRTargetDisks /setOwners:2008-Node1,2008-Node2
  • cluster.exe 2008-Cluster3.exchange.msft group SCRTargetDisks /prop AutoFailbackType=1

Consider reviewing the following references for more information.

http://support.microsoft.com/kb/197047

http://support.microsoft.com/kb/299631

http://support.microsoft.com/kb/823955

Get-storagegroupcopystatus = Initializing

A question that I see a lot, and one that causes some confusion out there, is why get-storagegroupcopystatus reports a status of initializing for replicated storage groups.  This status can occur when using any form of replication (LCR = local continuous replication / CCR = cluster continuous replication / SCR = standby continuous replication).

Some history…

In Exchange 2007 RTM there was no initializing status.  When the replication service was restarted for any reason, the replication service would display a status of healthy for each storage group.  The issue here is that the storage groups may not have actually been healthy.  It would be possible that two database copies had diverged, but this would not be detected until the first log on the source was generated, copied, inspected, and compared to the passive database instance.  While waiting for this to happen, the administrator was given a false sense that all was well because all instances were assumed healthy by default.

An example of why this is not the most desired approach can be seen with CCR installations.  In this configuration we’ll assume there is a reason that the passive database copy has diverged from the active copy.  When the customer issues a move-clusteredmailboxserver, we are able to move the instance between the two nodes.  Since the replication service reported the instances as healthy, nothing prevented us from moving the resources between the two nodes.

The change…

In Exchange 2007 SP1 we made a change to the replication service so that it now displays a status of initializing.  Initializing tells the administrator that:

  • The replication service has not received a notification to copy a log.
  • The replication service has not yet copied a log.
  • The replication service has not yet inspected a log.
  • The replication service has not placed a log out for replay and determined divergence information.

Once these steps have taken place, the replication service will change the instance from initializing to either healthy or failed.  Here is an example of get-storagegroupcopystatus showing initializing status:

Name                      SummaryCopySt CopyQueueLeng ReplayQueueL LastInspecte
                          atus          th            ength        dLogTime
----                      ------------- ------------- ------------ ------------
2008-MBX3-SG1             Initializing  0             0
2008-MBX3-SG2             Initializing  0             0
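
If you capture this output to a text file, a few lines of script can list the storage groups that are still initializing.  The following is a minimal sketch in Python; it assumes the default table layout shown above and storage group names without spaces, and the function name is hypothetical.

```python
from typing import List


def initializing_groups(output: str) -> List[str]:
    """Return the names of storage groups whose status column reads Initializing."""
    groups = []
    for line in output.splitlines():
        parts = line.split()
        # Data rows start with the name followed by the status; the wrapped
        # header lines and the dashed separator never match this test.
        if len(parts) >= 2 and parts[1] == "Initializing":
            groups.append(parts[0])
    return groups


if __name__ == "__main__":
    sample = (
        "Name                      SummaryCopySt CopyQueueLeng ReplayQueueL LastInspecte\n"
        "                          atus          th            ength        dLogTime\n"
        "----                      ------------- ------------- ------------ ------------\n"
        "2008-MBX3-SG1             Initializing  0             0\n"
        "2008-MBX3-SG2             Initializing  0             0\n"
    )
    print(initializing_groups(sample))  # ['2008-MBX3-SG1', '2008-MBX3-SG2']
```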

 

Where I most commonly hear this question is in test labs where activities like move-clusteredmailboxserver do not work.  When storage groups are in an initializing state, cmdlets report that the action cannot be taken because replication is failed.  It is not really that replication has failed, but that replication is in some state other than healthy.  The initializing state also appears after VSS backups of the passive database copies, or when a replication instance is suspended and resumed.

 

Here is an example of the result of a move-clusteredmailboxserver command when the storage group copy status is initializing:

 

Move-ClusteredMailboxServer : Continuous replication is in a failed, seeding, or suspended state on ‘2008-MBX3-SG1’. Move-clusteredMailboxServer cannot be performed if one or more of the server’s storage group copies are in failed, seeding or suspended states.
At line:1 char:27
+ Move-ClusteredMailboxServer <<<<
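
The rule behind this error can be expressed as a simple pre-check: a move is only expected to proceed when every storage group copy is healthy.  Below is a minimal Python sketch of that rule, assuming the statuses have already been collected into a dictionary; the function name is hypothetical and the code does not call Exchange.

```python
def storage_groups_blocking_move(statuses):
    """Return the storage groups whose copy status is anything but Healthy.

    `statuses` maps a storage group name to its SummaryCopyStatus text.
    An empty result means no copy is expected to block the move.
    """
    return sorted(
        name for name, status in statuses.items()
        if status.strip().lower() != "healthy"
    )


if __name__ == "__main__":
    # Illustrative values only.
    example = {"2008-MBX3-SG1": "Initializing", "2008-MBX3-SG2": "Healthy"}
    blocking = storage_groups_blocking_move(example)
    if blocking:
        print("Move is likely to be refused because of:", ", ".join(blocking))
    else:
        print("All copies are healthy.")
```

Note that Initializing is treated as blocking even though the error text only names failed, seeding, and suspended states.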

 

To change replication to a healthy state I usually take one of two steps:

 

  1. Dismount and re-mount all databases.  This should cause a log roll to occur and the replication service to replicate a log.
  2. Create test mailboxes in each store and send mail to them.  Mail flow will create log files.  If the mail flow is not significant enough to roll the logs, automatic log roll will occur at a later time even though the log file is not full.

 

Monitoring…

 

There are two reliable ways to monitor continuous replication status for initializing state. 

 

The first method is to run get-storagegroupcopystatus.

 

The second method is to use performance monitor and the "MSExchange Replication" performance object.  Under this object is a counter named "Initializing".  If the counter displays a value of 1, the storage group instance is in the initializing state.  If the counter displays a value of 0, the storage group is not in the initializing state.  Below is an example of performance monitor showing storage groups moving from Initializing to Healthy.

 

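[Image: performance monitor showing the Initializing counter as storage groups move from Initializing to Healthy]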
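
The counter can also be read from a saved sample instead of the performance monitor graph.  The sketch below is a hedged Python example that reads a PDH-CSV file such as one written by typeperf with the -o option, and lists the instances whose Initializing counter is 1.  The function name, the sample server name, the timestamp, and the assumption that counter instances are named after storage groups are all illustrative and not taken from Exchange documentation.

```python
import csv
import io
import re

# Matches a counter path ending in \MSExchange Replication(<instance>)\Initializing
INSTANCE_RE = re.compile(
    r"\\MSExchange Replication\((?P<instance>.+)\)\\Initializing$"
)


def initializing_instances(pdh_csv_text):
    """Return instances whose Initializing counter is 1 in the last sample."""
    rows = [row for row in csv.reader(io.StringIO(pdh_csv_text)) if row]
    header, latest = rows[0], rows[-1]
    found = []
    # Column 0 is the timestamp; remaining columns are one counter each.
    for path, value in zip(header[1:], latest[1:]):
        match = INSTANCE_RE.search(path)
        if match and float(value) == 1.0:
            found.append(match.group("instance"))
    return found


# Illustrative sample only: two instances, one still initializing.
SAMPLE = "\n".join([
    r'"(PDH-CSV 4.0)",'
    r'"\\2008-MBX3\MSExchange Replication(2008-mbx3-sg1)\Initializing",'
    r'"\\2008-MBX3\MSExchange Replication(2008-mbx3-sg2)\Initializing"',
    '"12/23/2008 10:00:00.000","1.000000","0.000000"',
]) + "\n"

if __name__ == "__main__":
    print(initializing_instances(SAMPLE))
```

Only the last row of the file is examined, so a file holding several samples reports the most recent state.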

 

Information on monitoring continuous replication can be found here -> http://technet.microsoft.com/en-us/library/bb629521.aspx

Cluster stability issues with Exchange 2007 SP1 RU5 Single Copy Clusters (SCC) enabled for Standby Continuous Replication (SCR) on Windows 2008

In recent weeks I’ve worked several cases where Exchange 2007 SP1 Single Copy Clusters with storage groups enabled for Standby Continuous Replication on Windows 2008 have had performance and network connectivity issues.  These issues have shown the following symptoms:

 

  • Outlook clients appear to have no issues connecting to or using the Exchange instances.
  • Access to the host machine console is slow.  RDP may connect unreliably.
  • Management tools may fail to connect; for example, event viewer and failover cluster manager will fail to connect.
  • Get-storagegroupcopystatus -standbymachine may report failed replication instances between source and target.
  • Review of share management under server management shows that SCR shares are being deleted and recreated on the source cluster.

 

The following was common to every case I worked:

 

  • Operating system is Windows 2008
  • Source cluster is an Exchange 2007 SP1 RU5 Single Copy Cluster (SCC)
  • One or all storage groups are enabled for Standby Continuous Replication (SCR)
  • Network interface drivers were older than July 2008.
  • Network teaming was used on the public cluster interfaces and configured for Fault Tolerance with Load Balancing
  • If standby continuous replication was disabled on any enabled storage group, cluster stability could be maintained.

 

The following was performed to resolve the stability issues.  Note:  All steps are considered part of the solution.

 

1)  Upgrade network interface drivers to a revision dated July 2008 or newer.

 

2)  Reconfigure all network teams for Fault Tolerance Only.

 

3)  Install KB 955733 on all cluster nodes.

 

This update corrects known issues with status codes returned by the Windows operating system for certain function calls.  Note that the download is labeled Vista x64, although the fix applies to Windows 2008 x64.

 

4)  Open a case with Microsoft CSS and request Exchange interim update KB957834.  (I recommend that customers upgrade to Exchange 2007 SP1 RU5 and deploy the RU5 IU.)  The article for the Windows fix in step 3, KB 955733, can be found at:  http://support.microsoft.com/kb/955733

 

The Exchange interim update is available for Exchange 2007 SP1 RU4 and Exchange 2007 SP1 RU5 at this time.  It corrects issues with the share endpoint checking used to service SCR.  For more information regarding the replication service and shares see http://blogs.technet.com/timmcmic/archive/2008/12/23/exchange-replication-service-exchange-2007-sp1-and-windows-2008-clusters.aspx