Monthly Archives: March 2009

EXTRA!! EXTRA!! Read all about it…get your SCR fixes here (aka Exchange 2007 SP1 RU7 released!)

Exchange 2007 SP1 RU7 was released today.  It makes the following important SCR (Standby Continuous Replication) fixes available.

  • KB 961281 Update Rollup 5 for Exchange Server 2007 Service Pack 1 introduced an error when you enable SCR for a storage group in an environment with a parent/child domain structure. See Tim’s blog post for more info.
  • KB 957834 This fixes an issue where network shares are deleted and created intermittently by the Replication service on a CMS in a single copy cluster when the CMS is an SCR source.
  • KB 958331 This fixes an issue where Restore-StorageGroupCopy can fail in an SCR environment.

You can download Exchange 2007 SP1 RU7 here.

http://www.microsoft.com/downloads/details.aspx?FamilyID=2074fefd-fa1a-4c3e-bf72-94585e454150&displaylang=en

Running setup.com /clearLocalCMS on a Windows 2008 cluster disables the machine account (VCO) associated with the CMS name.

The setup.com /clearLocalCMS command purges the clustered configuration for Exchange resources without making any changes to Active Directory for the Exchange instance.  This command is commonly used to clean the cluster configuration on the source Exchange cluster prior to running setup.com /recoverCMS, or to clear the clustered resources from the target Exchange cluster (for example, standby continuous replication using a single node cluster target).

The setup.com /clearLocalCMS command removes the Exchange resources and, if possible, deletes the remaining clustered group.  If the cluster is a single copy cluster, the physical disk resources are retained and the Exchange CMS group is renamed.

On Windows 2003, deleting clustered resources does not affect the status of the AD machine account object associated with the CMS name.  Whether the deletion is processed programmatically or through Cluster Administrator, the machine account associated with the CMS remains enabled in Active Directory.

On Windows 2008, deleting clustered resources does affect the status of the AD machine account object associated with the CMS name (machine account = VCO, or Virtual Computer Object).  Whether the deletion is processed programmatically or through Failover Cluster Management, the VCO associated with the CMS becomes disabled in Active Directory.

When no CMS is online using this machine account, this is generally not an issue.  There are times, though, when a CMS is online and servicing clients, and setup.com /clearLocalCMS is run on another cluster that originally owned the same CMS.  In that case, administrators need to manually re-enable the VCO to ensure that the online CMS continues to function properly.

Additional steps to re-enable the machine account are documented at http://technet.microsoft.com/en-us/library/bb738150.aspx.

“Because the cluster nodes are running Windows Server 2008, after you run Setup /ClearLocalCMS, the virtual computer object (VCO) will be disabled. To re-enable the VCO, click Start, point to All Programs, point to Administrative Tools, and then click Active Directory Users and Computers. Expand the domain, expand Computers, right-click the CMS VCO, and then click Enable Account.”

Here are sample LDP dumps showing the enabled account and disabled account.

  • Enabled VCO:

Expanding base ‘CN=TEST-CLUSTER,CN=Computers,DC=exchange,DC=msft’…
Getting 1 entries:
Dn: CN=TEST-CLUSTER,CN=Computers,DC=exchange,DC=msft
accountExpires: 9223372036854775807 (never);
badPasswordTime: 0 (never);
badPwdCount: 0;
cn: TEST-CLUSTER;
codePage: 0;
countryCode: 0;
distinguishedName: CN=TEST-CLUSTER,CN=Computers,DC=exchange,DC=msft;
dNSHostName: TEST-CLUSTER.exchange.msft;
dSCorePropagationData: 0x0 = (  );
instanceType: 0x4 = ( WRITE );
isCriticalSystemObject: FALSE;
lastLogoff: 0 (never);
lastLogon: 0 (never);
localPolicyFlags: 0;
logonCount: 0;
name: TEST-CLUSTER;
objectCategory: CN=Computer,CN=Schema,CN=Configuration,DC=exchange,DC=msft;
objectClass (5): top; person; organizationalPerson; user; computer;
objectGUID: 90c511fa-6dcc-4357-8d2e-3762bfc62ce2;
objectSid: S-1-5-21-3347541649-2078682762-2984813736-1143;
primaryGroupID: 515 = ( GROUP_RID_COMPUTERS );
pwdLastSet: 3/15/2009 10:42:37 AM Eastern Daylight Time;
sAMAccountName: TEST-CLUSTER$;
sAMAccountType: 805306369 = ( MACHINE_ACCOUNT );
servicePrincipalName (6): MSServerClusterMgmtAPI/TEST-CLUSTER.exchange.msft; MSServerClusterMgmtAPI/TEST-CLUSTER; MSClusterVirtualServer/TEST-CLUSTER.exchange.msft; MSClusterVirtualServer/TEST-CLUSTER; HOST/TEST-CLUSTER.exchange.msft; HOST/TEST-CLUSTER;
userAccountControl: 0x1020 = ( PASSWD_NOTREQD | WORKSTATION_TRUST_ACCOUNT );
uSNChanged: 671913;
uSNCreated: 671898;
whenChanged: 3/15/2009 10:42:38 AM Eastern Daylight Time;
whenCreated: 3/15/2009 10:40:46 AM Eastern Daylight Time;

———–

 

  • Disabled VCO after removing the cluster resources.

Expanding base ‘CN=TEST-CLUSTER,CN=Computers,DC=exchange,DC=msft’…
Getting 1 entries:
Dn: CN=TEST-CLUSTER,CN=Computers,DC=exchange,DC=msft
accountExpires: 9223372036854775807 (never);
badPasswordTime: 0 (never);
badPwdCount: 0;
cn: TEST-CLUSTER;
codePage: 0;
countryCode: 0;
distinguishedName: CN=TEST-CLUSTER,CN=Computers,DC=exchange,DC=msft;
dNSHostName: TEST-CLUSTER.exchange.msft;
dSCorePropagationData: 0x0 = (  );
instanceType: 0x4 = ( WRITE );
isCriticalSystemObject: FALSE;
lastLogoff: 0 (never);
lastLogon: 0 (never);
localPolicyFlags: 0;
logonCount: 0;
name: TEST-CLUSTER;
objectCategory: CN=Computer,CN=Schema,CN=Configuration,DC=exchange,DC=msft;
objectClass (5): top; person; organizationalPerson; user; computer;
objectGUID: 90c511fa-6dcc-4357-8d2e-3762bfc62ce2;
objectSid: S-1-5-21-3347541649-2078682762-2984813736-1143;
primaryGroupID: 515 = ( GROUP_RID_COMPUTERS );
pwdLastSet: 3/15/2009 10:42:37 AM Eastern Daylight Time;
sAMAccountName: TEST-CLUSTER$;
sAMAccountType: 805306369 = ( MACHINE_ACCOUNT );
servicePrincipalName (6): MSServerClusterMgmtAPI/TEST-CLUSTER.exchange.msft; MSServerClusterMgmtAPI/TEST-CLUSTER; MSClusterVirtualServer/TEST-CLUSTER.exchange.msft; MSClusterVirtualServer/TEST-CLUSTER; HOST/TEST-CLUSTER.exchange.msft; HOST/TEST-CLUSTER;
userAccountControl: 0x1022 = ( ACCOUNTDISABLE | PASSWD_NOTREQD | WORKSTATION_TRUST_ACCOUNT );
uSNChanged: 671916;
uSNCreated: 671898;
whenChanged: 3/15/2009 10:45:02 AM Eastern Daylight Time;
whenCreated: 3/15/2009 10:40:46 AM Eastern Daylight Time;
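
The difference that matters between the two dumps is the userAccountControl value (0x1020 versus 0x1022).  As a minimal sketch, the Python below decodes that value using only the three flag bits that appear in these dumps.  The function names decode_uac and is_disabled are illustrative and not part of any Exchange or Windows tool.

```python
# Bit values of the userAccountControl flags that appear in the dumps above.
UAC_FLAGS = {
    0x0002: "ACCOUNTDISABLE",
    0x0020: "PASSWD_NOTREQD",
    0x1000: "WORKSTATION_TRUST_ACCOUNT",
}


def decode_uac(value):
    """Return the names of the known flags set in a userAccountControl value."""
    return [name for bit, name in sorted(UAC_FLAGS.items()) if value & bit]


def is_disabled(value):
    """True when the ACCOUNTDISABLE bit (0x2) is set."""
    return bool(value & 0x0002)


# Values taken from the enabled and disabled VCO dumps.
print(decode_uac(0x1020), is_disabled(0x1020))
print(decode_uac(0x1022), is_disabled(0x1022))
```

The two values differ only in the 0x2 bit, which is the bit that the Enable Account action in Active Directory Users and Computers clears.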

———–

Exchange 2007 /newCMS or /recoverCMS fails when installing on Windows 2003 clusters.

When running Exchange 2007 /newCMS or /recoverCMS on a Windows 2003 clustered node, setup may report the following error:

[ERROR] The computer account ‘<CMSName>’ was created on the domain controller
‘\\PDCEmulator.domain.com’, but has not replicated to the desired domain controller (LocalDC.domain.com) after waiting approximately 60 seconds. Please wait for the account to replicate and re-run setup /newcms.

When running Exchange 2007 /newCMS or /recoverCMS, you cannot specify which domain controller to use.  As a result, Exchange setup chooses a domain controller in the same Active Directory site where Exchange is being installed.

By default, the Windows 2003 cluster service ALWAYS creates Kerberos-enabled machine accounts on the PDC emulator when the PDC emulator is available, regardless of the Active Directory site membership of the nodes.

This error generally results from two situations:

  • Exchange nodes are in Active Directory Site B, the PDC Emulator is in Active Directory Site A.
    • In this case cluster created the machine account in ADSiteA.
    • Exchange setup, using a domain controller in ADSiteB, has not seen the machine account because it has not replicated to the chosen domain controller in ADSiteB.
    • Depending on AD Site Links for replication, it could be a significant time until the two AD Sites replicate and converge.
    • This error will continue until the two sites are replicated and the machine account can be found on the domain controller that Exchange is utilizing.
  • Exchange nodes are in Active Directory Site A, the PDC Emulator is in Active Directory Site A.
    • In this case the Active Directory site may be flat, and contain several domain controllers.
    • Here the error is thrown because of intra-site replication latency between domain controllers that Exchange is utilizing and the PDC emulator.

When this failure occurs, you can expect a second failure during the setup process even after the AD sites have replicated.  If you are monitoring Cluster Administrator, you will see that the core Exchange resources are created without issue.  When setup goes to create and bring online the resource for the first database instance, the following error is thrown:

[ERROR] Setup cannot continue because the RPC server is unavailable. This could be
due to DNS information for clustered mailbox server ‘<CMSName>’ has not finished
replicating. Run Setup again after DNS replication has completed. You can verify
that DNS replication has completed by running "nslookup <CMSName>".

At a casual glance it would appear that a name resolution issue is preventing setup from continuing, and that may actually be the case.  Usually, though, when this error follows the previously mentioned error, it is actually the result of an access denied on a local RPC call to create the first database instance.  Let’s explore why this happens and take a look at some supporting information.

When the first error regarding replication is encountered, re-running the setup command will fail until replication occurs.  If the installer waits, or replication is forced, the setup command will find the machine account available on the local domain controller and will be allowed to continue.  Here are sample LDP dumps of the initial account creation on the PDC emulator and of the account after it replicated to the local domain controller.

  • LDP dump of machine account on PDC emulator.

Expanding base ‘CN=Mail,CN=Computers,DC=domain,DC=com’…
Result <0>: (null)
Matched DNs:
Getting 1 entries:
>> Dn: CN=Mail,CN=Computers,DC=domain,DC=com
5> objectClass: top; person; organizationalPerson; user; computer;
1> cn: Mail;
1> distinguishedName: CN=Mail,CN=Computers,DC=domain,DC=com;
1> instanceType: 0x4 = ( IT_WRITE );
1> whenCreated: 03/26/2008 15:30:41 Eastern Standard Time Eastern Daylight Time;

1> whenChanged: 03/26/2008 15:30:41 Eastern Standard Time Eastern Daylight Time;

1> uSNCreated: 20888;
1> uSNChanged: 20893;
1> name: Mail;
1> objectGUID: a5d09e3e-2d7d-4e3d-8bf9-30b423ead057;
1> userAccountControl: 0x1020 = ( UF_PASSWD_NOTREQD | UF_WORKSTATION_TRUST_ACCOUNT
);
1> badPwdCount: 0;
1> codePage: 0;
1> countryCode: 0;
1> badPasswordTime: 01/01/1601 00:00:00 UNC ;
1> lastLogoff: 01/01/1601 00:00:00 UNC ;
1> lastLogon: 01/01/1601 00:00:00 UNC ;
1> localPolicyFlags: 0;
1> pwdLastSet: 03/26/2008 15:30:41 Eastern Standard Time Eastern Daylight Time;
1> primaryGroupID: 515;
1> objectSid: S-1-5-21-2075556647-3556310751-2339872061-1111;
1> accountExpires: 09/14/30828 02:48:05 UNC ;
1> logonCount: 0;
1> sAMAccountName: MAIL$;
1> sAMAccountType: 805306369;
1> objectCategory: CN=Computer,CN=Schema,CN=Configuration,DC=domain,DC=com;
1> isCriticalSystemObject: FALSE;
2> dSCorePropagationData: 03/26/2008 15:52:12 Eastern Standard Time Eastern
Daylight Time; 30650/29691/8424 20284:77:2544 UNC;

  • LDP dump of replicated machine account to the local domain controller.

Expanding base ‘CN=Mail,CN=Computers,DC=domain,DC=com’…
Result <0>: (null)
Matched DNs:
Getting 1 entries:
>> Dn: CN=Mail,CN=Computers,DC=domain,DC=com
5> objectClass: top; person; organizationalPerson; user; computer;
1> cn: Mail;
1> distinguishedName: CN=Mail,CN=Computers,DC=domain,DC=com;
1> instanceType: 0x4 = ( IT_WRITE );
1> whenCreated: 03/26/2008 15:30:41 Eastern Standard Time Eastern Daylight Time;

1> whenChanged: 03/26/2008 15:30:50 Eastern Standard Time Eastern Daylight Time;

1> uSNCreated: 16994;
1> uSNChanged: 16994;
1> name: Mail;
1> objectGUID: a5d09e3e-2d7d-4e3d-8bf9-30b423ead057;
1> userAccountControl: 0x1020 = ( UF_PASSWD_NOTREQD | UF_WORKSTATION_TRUST_ACCOUNT
);
1> codePage: 0;
1> countryCode: 0;
1> localPolicyFlags: 0;
1> pwdLastSet: 03/26/2008 15:30:41 Eastern Standard Time Eastern Daylight Time;
1> primaryGroupID: 515;
1> objectSid: S-1-5-21-2075556647-3556310751-2339872061-1111;
1> accountExpires: 09/14/30828 02:48:05 UNC ;
1> sAMAccountName: MAIL$;
1> sAMAccountType: 805306369;
1> objectCategory: CN=Computer,CN=Schema,CN=Configuration,DC=domain,DC=com;
1> isCriticalSystemObject: FALSE;
2> dSCorePropagationData: 03/26/2008 15:52:48 Eastern Standard Time Eastern
Daylight Time; 30650/29691/8424 20876:77:2544 UNC;

If you are looking at these and saying they are almost exactly the same, you are correct.  Please note the following are the same:

  • whenCreated: 03/26/2008 15:30:41 Eastern Standard Time Eastern Daylight Time;
  • objectGUID: a5d09e3e-2d7d-4e3d-8bf9-30b423ead057
  • pwdLastSet: 03/26/2008 15:30:41 Eastern Standard Time Eastern Daylight Time

In this specific case, we’re interested in pwdLastSet.

After replicating the machine account to the local domain controller you can re-run the setup.com /newCMS or setup.com /recoverCMS commands.  When you do, the command will now continue and no longer throw a replication error.  If you are watching cluster administrator, you will notice that the network name and IP address created from the first attempt are deleted and a new network name and IP address created.  This is where the circumstances for the second failure are introduced. 

When the network name is deleted and recreated, cluster goes back to the PDC emulator and hijacks the account that was previously created there.  When it does, the Kerberos password on the account is updated.  As a result, the Kerberos password on the PDC emulator (and synced into cluster) differs from the one that exists on the local domain controller.  You can verify this with LDP.

  • LDP dump on PDC emulator after replicating and rerunning setup.

Expanding base ‘CN=Mail,CN=Computers,DC=domain,DC=com’…
Result <0>: (null)
Matched DNs:
Getting 1 entries:
>> Dn: CN=Mail,CN=Computers,DC=domain,DC=com
5> objectClass: top; person; organizationalPerson; user; computer;
1> cn: Mail;
1> distinguishedName: CN=Mail,CN=Computers,DC=domain,DC=com;
1> instanceType: 0x4 = ( IT_WRITE );
1> whenCreated: 03/26/2008 15:30:41 Eastern Standard Time Eastern Daylight Time;
1> whenChanged: 03/30/2008 10:35:42 Eastern Standard Time Eastern Daylight Time;
1> uSNCreated: 16994;
1> uSNChanged: 24911;
1> name: Mail;
1> objectGUID: a5d09e3e-2d7d-4e3d-8bf9-30b423ead057;
1> userAccountControl: 0x1020 = ( UF_PASSWD_NOTREQD | UF_WORKSTATION_TRUST_ACCOUNT
);
1> codePage: 0;
1> countryCode: 0;
1> localPolicyFlags: 0;
1> pwdLastSet: 03/30/2008 10:34:44 Eastern Standard Time Eastern Daylight Time;
1> primaryGroupID: 515;
1> objectSid: S-1-5-21-2075556647-3556310751-2339872061-1111;
1> accountExpires: 09/14/30828 02:48:05 UNC ;
1> sAMAccountName: MAIL$;
1> sAMAccountType: 805306369;
1> dNSHostName: MAIL.domain.com;
8> servicePrincipalName: exchangeMDB/Mail.domain.com; exchangeMDB/Mail;
exchangeRFR/Mail.domain.com; exchangeRFR/Mail;
MSClusterVirtualServer/Mail.domain.com; MSClusterVirtualServer/MAIL;
HOST/Mail.domain.com; HOST/MAIL;
1> objectCategory: CN=Computer,CN=Schema,CN=Configuration,DC=domain,DC=com;
1> isCriticalSystemObject: FALSE;
2> dSCorePropagationData: 03/26/2008 15:52:48 Eastern Standard Time Eastern
Daylight Time; 30650/29691/8424 20876:77:2544 UNC;

  • LDP dump on local domain controller after replicating and rerunning setup.

Expanding base ‘CN=Mail,CN=Computers,DC=domain,DC=com’…
Result <0>: (null)
Matched DNs:
Getting 1 entries:
>> Dn: CN=Mail,CN=Computers,DC=domain,DC=com
5> objectClass: top; person; organizationalPerson; user; computer;
1> cn: Mail;
1> distinguishedName: CN=Mail,CN=Computers,DC=domain,DC=com;
1> instanceType: 0x4 = ( IT_WRITE );
1> whenCreated: 03/26/2008 15:30:41 Eastern Standard Time Eastern Daylight Time;

1> whenChanged: 03/26/2008 15:30:50 Eastern Standard Time Eastern Daylight Time;

1> uSNCreated: 16994;
1> uSNChanged: 16994;
1> name: Mail;
1> objectGUID: a5d09e3e-2d7d-4e3d-8bf9-30b423ead057;
1> userAccountControl: 0x1020 = ( UF_PASSWD_NOTREQD | UF_WORKSTATION_TRUST_ACCOUNT
);
1> codePage: 0;
1> countryCode: 0;
1> localPolicyFlags: 0;
1> pwdLastSet: 03/26/2008 15:30:41 Eastern Standard Time Eastern Daylight Time;
1> primaryGroupID: 515;
1> objectSid: S-1-5-21-2075556647-3556310751-2339872061-1111;
1> accountExpires: 09/14/30828 02:48:05 UNC ;
1> sAMAccountName: MAIL$;
1> sAMAccountType: 805306369;
1> dNSHostName: MAIL.domain.com;
8> servicePrincipalName: exchangeMDB/Mail.domain.com; exchangeMDB/Mail;
exchangeRFR/Mail.domain.com; exchangeRFR/Mail;
MSClusterVirtualServer/Mail.domain.com; MSClusterVirtualServer/MAIL;
HOST/Mail.domain.com; HOST/MAIL;
1> objectCategory: CN=Computer,CN=Schema,CN=Configuration,DC=domain,DC=com;
1> isCriticalSystemObject: FALSE;
2> dSCorePropagationData: 03/26/2008 15:52:48 Eastern Standard Time Eastern
Daylight Time; 30650/29691/8424 20876:77:2544 UNC;

Taking a look at our comparison points from before:

  • whenCreated
    • SiteA DC – 03/26/2008 15:30:41 Eastern Standard Time Eastern Daylight Time
    • SiteB DC – 03/26/2008 15:30:41 Eastern Standard Time Eastern Daylight Time
  • objectGUID
    • SiteA DC – a5d09e3e-2d7d-4e3d-8bf9-30b423ead057
    • SiteB DC – a5d09e3e-2d7d-4e3d-8bf9-30b423ead057
  • pwdLastSet
    • SiteA DC – pwdLastSet: 03/30/2008 10:34:44 Eastern Standard Time Eastern Daylight Time
    • SiteB DC – pwdLastSet: 03/26/2008 15:30:41 Eastern Standard Time Eastern Daylight Time

As you can clearly see, the object is the same (based on objectGUID), but the Kerberos password stored on the DC in SiteA and the DC in SiteB is not the same.  The password in SiteA is newer than the one in SiteB.  This makes sense: when cluster Kerberos enables the new network name, it does not delete and recreate the computer account, hence the whenCreated time and objectGUID did not change.  It does, however, hijack the existing account and update the Kerberos password, resulting in pwdLastSet getting updated.  Therefore, when running setup a second time, we fail with the RPC / DNS error (which is really an access denied).
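
The comparison above can be sketched in a few lines of Python.  This is only an illustration: the attribute values are copied by hand from the two LDP dumps (with the time zone text dropped), and the helper names differing_attributes and password_is_stale are hypothetical.

```python
from datetime import datetime

# Values copied from the LDP dumps taken after re-running setup.
pdc_emulator = {
    "whenCreated": "03/26/2008 15:30:41",
    "objectGUID": "a5d09e3e-2d7d-4e3d-8bf9-30b423ead057",
    "pwdLastSet": "03/30/2008 10:34:44",
}
local_dc = {
    "whenCreated": "03/26/2008 15:30:41",
    "objectGUID": "a5d09e3e-2d7d-4e3d-8bf9-30b423ead057",
    "pwdLastSet": "03/26/2008 15:30:41",
}


def differing_attributes(a, b):
    """Names of attributes present in both dumps whose values differ."""
    return sorted(k for k in a.keys() & b.keys() if a[k] != b[k])


def password_is_stale(reference, replica):
    """True when the replica's pwdLastSet is older than the reference copy's."""
    fmt = "%m/%d/%Y %H:%M:%S"
    ref_time = datetime.strptime(reference["pwdLastSet"], fmt)
    rep_time = datetime.strptime(replica["pwdLastSet"], fmt)
    return rep_time < ref_time


print(differing_attributes(pdc_emulator, local_dc))
print(password_is_stale(pdc_emulator, local_dc))
```

Same objectGUID with an older pwdLastSet on the local domain controller is the signature of this condition: the account exists locally, but its password has not yet caught up with the PDC emulator.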

If you again replicate domain controllers, or allow time for replication to complete naturally, the Kerberos passwords will sync between domain controllers.  This time when re-running setup we are watermarked at a different location, and the database instance is successfully created and brought online.

The moral of this story is that when the PDC emulator is not in the same Active Directory site as the Exchange nodes, or you have a flat AD site with multiple domain controllers (with some intra-site replication latency), you can expect Exchange setup to fail at least two times when creating the clustered resources.

There are three potential ways to work around this issue.

  • Use Windows 2008 instead of Windows 2003.
    • Windows 2008 will use local domain controllers to Kerberos enable objects and update existing objects.
    • Note:  If you are in a flat AD site you may still run into this issue due to intra-site AD replication latencies.
  • Shut the PDC emulator down while running setup.
    • By shutting the PDC emulator down you make it unavailable for cluster service use. 
    • When the cluster service cannot find the PDC emulator, it reverts back to using a local domain controller to Kerberos enable objects and update existing objects.
  • Virtually make the PDC emulator unavailable by putting a host file entry in place.
    • This by far is the easiest solution since it can be controlled from the node where setup is being run.
    • In the host file, place an entry for the PDC emulator using a completely unresponsive IP address.
    • Here is an example host file:

# Copyright (c) 1993-2006 Microsoft Corp.
#
# This is a sample HOSTS file used by Microsoft TCP/IP for Windows.
#
# This file contains the mappings of IP addresses to host names. Each
# entry should be kept on an individual line. The IP address should
# be placed in the first column followed by the corresponding host name.
# The IP address and the host name should be separated by at least one
# space.
#
# Additionally, comments (such as these) may be inserted on individual
# lines or following the machine name denoted by a ‘#’ symbol.
#
# For example:
#
#      102.54.94.97     rhino.acme.com          # source server
#       38.25.63.10     x.acme.com              # x client host

168.0.0.1    pdcEmulator.company.com
168.0.0.1    pdcEmulator

    • The IP address used is completely unavailable / not responsive on the network.
    • Entries are made for both the FQDN and the NetBIOS name of the PDC emulator.
    • When the PDC emulator is made unresponsive, cluster will revert back to using a local domain controller to Kerberos enable objects and update existing objects.
    • You may notice that the network name resource takes longer to come online.  It will eventually come online if nothing else is preventing it.
    • You need to remove the host file entries when setup has completed successfully.
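
As a rough sketch of the host file workaround, the Python below adds and later removes the two blocking entries.  It works on the file contents as a string so it can be tried without touching a real hosts file.  The function names are hypothetical, and the address and host name are simply the ones from the example above.

```python
def add_block_entries(hosts_text, fqdn, unreachable_ip):
    """Append entries for the FQDN and the short (NetBIOS-style) name."""
    short_name = fqdn.split(".")[0]
    lines = hosts_text.splitlines()
    lines.append(f"{unreachable_ip}    {fqdn}")
    lines.append(f"{unreachable_ip}    {short_name}")
    return "\n".join(lines) + "\n"


def remove_block_entries(hosts_text, fqdn):
    """Drop every entry that maps the FQDN or the short name."""
    names = {fqdn.lower(), fqdn.split(".")[0].lower()}
    kept = []
    for line in hosts_text.splitlines():
        # Ignore trailing comments; fields[0] is the IP, the rest are names.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and any(f.lower() in names for f in fields[1:]):
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"


original = "# sample hosts file\n127.0.0.1    localhost\n"
patched = add_block_entries(original, "pdcEmulator.company.com", "168.0.0.1")
print(patched)
print(remove_block_entries(patched, "pdcEmulator.company.com") == original)
```

Keeping the removal step next to the add step is deliberate: a leftover entry would keep the node from reaching the PDC emulator long after setup has finished.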

A lossy failover causes duplicate mails to be delivered to clients from the hub transport dumpster when using Exchange 2007 SP1 Cluster Continuous Replication (CCR).

In recent weeks I have worked with some customers who have experienced duplicate mails in the Outlook client after a CCR cluster experiences a lossy failover.

A lossy failover occurs when the Exchange resources move between nodes and logs exist on the source that could not be copied to the target.  Depending on settings, the actual loss is incurred when a failover occurs and the databases are automatically mounted based on the availability setting, when the administrator forces the databases online, or when the forceDatabaseMountAfter time period expires.

The following events can be noted during a lossy failover:

Log Name:      Application
Source:        MSExchangeIS
Date:          2/23/2009 9:24:17 AM
Event ID:      9796
Task Category: General
Level:         Warning
Keywords:      Classic
User:          N/A
Computer:      2008-Node6.exchange.msft
Description:
Database "2008-MBX5-SG2\2008-MBX5-SG2-DB1" has been subject to a lossy failover. The database may be patched if the Information Store detects it is necessary.

Log Name:      Application
Source:        MSExchangeRepl
Date:          2/23/2009 9:24:24 AM
Event ID:      2099
Task Category: Service
Level:         Information
Keywords:      Classic
User:          N/A
Computer:      2008-Node6.exchange.msft
Description:
The Microsoft Exchange Replication Service requested that Hub Transport server 2008-DC1 resubmit messages between time periods 2/22/2009 8:15:10 PM (UTC) and 2/23/2009 6:24:10 PM (UTC).

The hub transport dumpster is designed to attempt to re-deliver messages that may have been lost during a lossy failover.  A dumpster is maintained on the hub transport server for each CCR or LCR enabled storage group.  The dumpster has a user-configured maximum size and a user-configured retention time (see references below).  In terms of how the dumpster operates, it is a FIFO queue of messages.  Let’s look at an example:

My dumpster size is 5 meg.  My dumpster retention time is 7 days.  I have a single storage group with a single mailbox.  I send a 2 meg message.  This 2 meg message is committed to the hub transport dumpster queue.  I then send another 2 meg message – this message is committed to the hub transport dumpster queue.  Finally I finish by sending another 2 meg message – in this case the oldest message in the queue is popped off so that the next 2 meg message can be committed to the hub transport dumpster.  Using this method you can see that messages will recycle through the hub transport dumpster with the first in messages becoming the first out in order to accommodate new messages.

Another example is where messages are moved out due to time expiring.  My dumpster size is 5 meg.  My dumpster retention time is 2 days.  I have a single storage group with a single mailbox.  I send a 2 meg message on Thursday at noon.  I then send another 2 meg message on Friday at noon.  No other messages are sent to this mailbox.  At this point both messages will be committed to the queue since the dumpster has not reached a full condition.  Saturday at noon the message I sent on Thursday is automatically removed from the queue since it has expired based on my maximum retention time.
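
The two examples can be reproduced with a toy model.  This is only a sketch of the FIFO behavior described above, not how the transport service is implemented.  The class name Dumpster and the use of plain day numbers for time are assumptions made for illustration.

```python
from collections import deque


class Dumpster:
    """Toy model of one storage group's dumpster: FIFO, size and age limited."""

    def __init__(self, max_size_mb, max_age_days):
        self.max_size_mb = max_size_mb
        self.max_age_days = max_age_days
        self.queue = deque()  # (message_id, size_mb, day_received), oldest first

    def used_mb(self):
        return sum(size for _, size, _ in self.queue)

    def expire(self, today):
        # Drop messages that have been held for the full retention time.
        while self.queue and today - self.queue[0][2] >= self.max_age_days:
            self.queue.popleft()

    def add(self, message_id, size_mb, today):
        self.expire(today)
        # Pop the oldest messages until the new one fits.
        while self.queue and self.used_mb() + size_mb > self.max_size_mb:
            self.queue.popleft()
        self.queue.append((message_id, size_mb, today))

    def ids(self):
        return [message_id for message_id, _, _ in self.queue]


# Example 1: 5 meg dumpster, three 2 meg messages.
by_size = Dumpster(max_size_mb=5, max_age_days=7)
for name in ("first", "second", "third"):
    by_size.add(name, 2, today=0)
print(by_size.ids())

# Example 2: 2 day retention; Thursday is day 0, Friday day 1, Saturday day 2.
by_age = Dumpster(max_size_mb=5, max_age_days=2)
by_age.add("thursday", 2, today=0)
by_age.add("friday", 2, today=1)
by_age.expire(today=2)
print(by_age.ids())
```

In the first run the oldest message is pushed out to make room for the third.  In the second run the Thursday message ages out on Saturday even though the dumpster never filled.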

When a lossy failover occurs the hub transport servers, when requested, will flush the entire contents of their dumpster queues back into transport.  This causes the hub transport server to evaluate the messages as if they were just received, and begin the delivery process.  In order to prevent duplicates, each MAILBOX server maintains a list of message IDs per storage group.  Each message that is received is evaluated against this table; if the message ID is found, the message is turfed (by the store, as opposed to being turfed on the transport server) so that no duplicate occurs.  If the message ID is not found, the message is re-delivered.  Please note again that this table is per storage group.

Where we run into potential issues is when a move mailbox operation occurs.  When a user is moved between stores or between servers, the duplicate tracking table is not updated.  Since this table is not updated, when the messages come from the hub transport server each is evaluated as a new message, resulting in duplicate message delivery.  Let’s take a look at an example:

I have a user that is on ServerA\StorageGroupA\MailboxDatabaseA.  ServerA is an Exchange 2007 SP1 CCR cluster.  Due to a power outage of the primary node the Exchange resources are failed from ServerA\NodeA to ServerA\NodeB.  In the process 3 logs are lost, availability is set to “best availability”, and the databases mount automatically.  The cluster informs the hub transport servers of a lossy failover and the dumpsters are flushed.  The user does not receive any duplicate messages.  This is by design: the duplicate tracking table exists for this storage group, is populated with entries for this mailbox, and successfully turfs any potential duplicates for this user.

Let’s take a look at another example:

I have a user that is on ServerA\StorageGroupA\MailboxDatabaseA.  ServerA is an Exchange 2007 SP1 CCR cluster.  On Monday I move this user from ServerA to ServerB.  The user is now located in ServerB\StorageGroupB\MailboxDatabaseB.  On Wednesday, due to a power outage, ServerA experiences a lossy failover between nodes.  The hub transport dumpsters are flushed.  When this happens messages destined to this user are re-evaluated, and re-routed to ServerB (where the user now exists).  The duplicate tracking table is consulted, but there are no matches for the messages originating from the dumpster (remember that a move mailbox does not update the duplicate tracking table).  Since there are no matches, all messages for this user that were submitted from the dumpster are re-delivered as new to the user.  This appears to the user as duplicate messages.
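
Both scenarios can be modeled with a small simulation.  This is a deliberately simplified sketch: the duplicate tracking table is reduced to a set of message IDs per storage group, and the names StorageGroup and move_mailbox are invented for illustration rather than taken from Exchange.

```python
class StorageGroup:
    """Toy storage group with its own duplicate tracking table."""

    def __init__(self, name):
        self.name = name
        self.seen_ids = set()   # duplicate tracking table for this storage group
        self.mailboxes = {}     # user -> list of delivered message ids

    def deliver(self, user, message_id):
        """Deliver unless the id is already tracked; return True if delivered."""
        if message_id in self.seen_ids:
            return False        # turfed by the store as a duplicate
        self.seen_ids.add(message_id)
        self.mailboxes.setdefault(user, []).append(message_id)
        return True


def move_mailbox(user, source, target):
    """Move mailbox contents only; the tracking table is NOT carried over."""
    target.mailboxes[user] = source.mailboxes.pop(user, [])


sg_a = StorageGroup("StorageGroupA")
sg_b = StorageGroup("StorageGroupB")
for message_id in ("msg1", "msg2"):
    sg_a.deliver("user1", message_id)

# Scenario 1: dumpster resubmits while the user is still in StorageGroupA.
print([sg_a.deliver("user1", m) for m in ("msg1", "msg2")])

# Scenario 2: the user is moved, then the dumpster resubmits to StorageGroupB.
move_mailbox("user1", sg_a, sg_b)
print([sg_b.deliver("user1", m) for m in ("msg1", "msg2")])
print(sg_b.mailboxes["user1"])
```

The first resubmission is rejected for every message, while the second is accepted for every message because StorageGroupB has never seen those IDs, leaving two copies of each message in the mailbox.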

As of today this behavior is by design.  From an administrator standpoint there is nothing that can be done to mitigate this issue outside of ensuring that the Maximum size per storage group and maximum retention time for the hub transport dumpster are configured appropriately for your organization.

For more information see the following:

http://technet.microsoft.com/en-us/library/bb676352.aspx

http://technet.microsoft.com/en-us/library/bb288910.aspx

http://technet.microsoft.com/en-us/library/aa997963.aspx

http://msexchangeteam.com/archive/2007/01/17/432237.aspx

Using Standby Continuous Replication in both single node cluster implementation and database portability implementation when only a single machine is available.

Last week I received an interesting question from one of our Exchange MVPs. 

“Can I use a single node cluster as an SCR target for both recovering the entire cluster or recovering just a single database?”

My original thought on this was no…when implementing SCR you have to make a choice.  Are you implementing a single node cluster target where you would recover the entire cluster, or are you implementing a standalone mailbox server target where you would use database portability?

The more I thought about this, the more I determined that YES, there is a way to have your cake and eat it too when you only have a single machine to act as the SCR target.  Let me explain…

Since we are talking about using a single node cluster as the SCR target that means that we have some form of cluster source, either SCC or CCR. 

In the event that you were to lose the entire source cluster, you would activate the SCR target by:

  • Running restore-storagegroupcopy –standbymachine <NODE>
  • Running setup.com /recoverCMS /cmsName:<NAME> /cmsIPAddress:<IP>
  • Mounting the databases.

Now the question becomes: how can I use this single node cluster to recover just a single database, i.e. using the database portability recovery method?

In this case the first thing we need to do is identify the database or databases that we wish to recover to the SCR target.  These databases would have to be dismounted on the source server if they were not already.  We would then run restore-storagegroupcopy –standbymachine on these database instances.

The databases that have been ported to this machine are not in a “clean shutdown” state.  At this time I recommend that we bring the databases to a clean shutdown state.  To perform this operation:

  • Open a command prompt.
  • Run the following command –> eseutil /r <LOGPREFIX> /l <LOG PATH> /s <CHECKPOINT PATH> /d <DATABASE PATH>

Assume the following:

  • The log files and checkpoint file are located at x:\SG3
  • The database files are located at y:\SG3
  • The log prefix is E02 (the first 3 characters of a log file name).
  • The command would be eseutil /r E02 /l “x:\SG3” /s “x:\SG3” /d “y:\SG3”

If you had to use the –force switch with restore-storagegroupcopy, this means that not all logs could be copied over.  In this case you would run eseutil /r E02 /l “x:\SG3” /s “x:\SG3” /d “y:\SG3” /a (note the addition of /a) given the above example.  Some situations where you would have to use –force might include network failures preventing the copy from proceeding successfully, corrupted log files on the source, or missing log files / database on the source.
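
To make the two variants of the command easier to compare, here is a small Python sketch that only builds the command line as text; it does not run eseutil.  The function name and the logs_missing parameter are hypothetical, and the paths assume the example locations are the drive paths x:\SG3 and y:\SG3.

```python
def eseutil_recovery_command(log_prefix, log_path, checkpoint_path, db_path,
                             logs_missing=False):
    """Build the eseutil /r command line described above as a string."""
    parts = [
        "eseutil", "/r", log_prefix,
        "/l", f'"{log_path}"',
        "/s", f'"{checkpoint_path}"',
        "/d", f'"{db_path}"',
    ]
    if logs_missing:
        # Add /a when restore-storagegroupcopy needed -force (logs were lost).
        parts.append("/a")
    return " ".join(parts)


print(eseutil_recovery_command("E02", r"x:\SG3", r"x:\SG3", r"y:\SG3"))
print(eseutil_recovery_command("E02", r"x:\SG3", r"x:\SG3", r"y:\SG3",
                               logs_missing=True))
```

Generating the string first gives you a chance to review the prefix and paths before pasting the command into a prompt on the SCR target.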

When this is completed, all SCR replication to the target must be disabled.  The reason for this is that an active CMS (we’ll get to this later) cannot also be a target for SCR replication.  To disable SCR in bulk consider running get-storagegroup –server <SOURCE> | disable-storagegroupcopy –standbymachine <TARGET>.  Because this command modifies an AD attribute, we will need to take some time here to allow for AD replication to converge and for the replication services to detect this change and discontinue the replication instances.  (Note:  The disable command will fail on the storage groups where you ran restore-storagegroupcopy –standbymachine.  When running restore-storagegroupcopy –standbymachine on a database enabled for SCR, it is automatically disabled for SCR when the restore command completes successfully.)

So at this point what we have accomplished is to have all databases and logs necessary for recovery of the desired database instances present on the SCR target and all other replica instances disabled to that same target.  We can now continue with the recovery.

Since the SCR target in this case is running the passive node installation of Exchange, it cannot be made into a standalone mailbox server without uninstalling Exchange.  In this case that’s fine, we are going to create a single node active CMS.  In order to create the active CMS, we will run setup.com /newCMS /cmsName:<NAME> /cmsIPAddress:<IP>.  This setup command gives us a single node CCR cluster.  Please note that cluster type does not matter here since we are assuming that we will always be using a SINGLE NODE cluster.  In essence what we have done here is created a new Exchange server in the environment.

To take stock of where we are now…we have a single node cluster with X number of databases that are fully recovered and in a clean shutdown state.  We now need to mount these databases.

To mount these databases:

  • Create a storage group for each database that was restored.  DO NOT use the same paths as the restored databases.
  • Create a new mailbox database in each of these storage groups to correspond to the restored databases.  DO NOT use the same database paths as the restored databases.  DO NOT mount the databases (thus creating a blank database).
    • If the mount database option is accidentally selected, the database must be dismounted before continuing.
  • Run move-storagegrouppath –identity <NewSG> –logfilePath:<RestoredLogPath> –systemfilePath:<RestoredCheckpointPath> –configurationOnly:$TRUE
  • Run move-databasepath –identity <NewDB> –edbFilePath:“<RestoredDBPath\DBName.edb>” –configurationOnly:$TRUE  [You will need to use the same *.edb name as the database on disk]
  • Run get-mailboxdatabase –server <NewCMS> | set-mailboxdatabase –AllowFileRestore:$TRUE
  • Run get-mailboxdatabase –server <NewCMS> | mount-database

This should mount the databases that were restored.

The finishing steps are to rehome the users to the restored databases.  To perform this step:

Get-Mailbox –Database:“<OriginalServer>\<SGName>\<DatabaseName>” | where {$_.ObjectClass -NotMatch ‘(SystemAttendantMailbox|ExOleDbSystemMailbox)’} | Move-Mailbox -ConfigurationOnly –TargetDatabase:“<NewCMS>\<NewSG>\<RestoredDatabaseName>”

Please note the following:

  • When using /recoverCMS the source cluster can be cleared and its nodes re-subscribed as an SCR target.
  • If following these instructions to implement database portability, the original clustered nodes that are running the other databases are ineligible to be SCR targets since there is an active CMS.  Moving back to the original cluster could possibly mean moving all mailboxes back using the move mailbox process.
  • Using this approach to recover a single database means disabling replication for all other databases.  This may have an impact on your overall disaster recovery design and planning.
  • Depending on replayLagTimes of the SCR target there could be a large delta of logs to be recovered before the database can be mounted.  Planning for activation must include the time to copy delta logs as well as replay all outstanding log files.
  • If using a cluster source consider using CCR for maintaining database availability and using SCR for disaster recoveries.
  • Using move-mailbox –configurationOnly causes multiple changes to be committed to Active Directory as well as client profile changes.  Please also consider this in the overall time to recovery.

*Thanks to Channing Heffney and Ross Smith for reviewing my information.

**********************************

Update 2/15/2010

Updated ESEUTIL /r command to reflect correct storage group paths.

**********************************