11-09-2010 01:32 PM
You may want to go get your favorite beverage (at this point, I suggest a good strong Scotch) and some munchies... this one is a doozy.
I've installed the new BES 5.0.2 server:
W2K8 64 Bit
SQL 2005 Express SP3 on local machine
I've upgraded our Old BES to 4.0.7 so we can use the transporter:
2 NICs, one disabled
I have a test server:
SQL 2005 SP3
I ran the transporter on a few test users and everything worked like a charm. I declared the system ready to start testing with production users the following Monday. (First mistake.) My boss asked me to move him right away as a test on the Friday before testing was to begin. KaBOOM. The migration said it completed, but he couldn't send or receive any emails and the device simply locked up.
At 3:00 on a Friday afternoon.
I tried activating him directly on the new server and it would get to the "Synchronizing" screen and sit at 0%. Then it wouldn't remove him from the new BES; he was stuck in the queue. I called RIM and they didn't find the answer. It was end of day on a Friday, so I reactivated my boss on the 4.0.7 server (took seconds) and capped it until Monday morning. Monday comes and I test another user and it works out just fine, so we try the boss again. Once again the transporter says it worked, but the handheld won't send or receive. Back onto the phone with RIM.
I spent from 9:30 AM until 4:00PM on the phone with RIM and we're no further than we were before. Here's what I was able to deduce and what RIM noted:
There seems to be some sort of communication DELAY between the new BES and ONE mail server in our organization (we have fewer than 12, more than 6). This can most notably be observed when running "iemstest". If I run that test from the new BES to any other user on any other server, it takes about 6 seconds. Users on the mail server in question take 1 minute and 40 seconds. If I run "iemstest" on the test BES server to the mail server in question it takes about 2 seconds, and from the 4.0 server about the same.
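For anyone wanting to put numbers on this kind of comparison, here is a rough Python sketch that times a command over a few runs and works out the slowdown factor. The exact iemstest command line isn't shown in this thread, so the command is passed in by the caller; `time_command` and `slowdown` are made-up helper names, and the sketch assumes the command runs without prompting for input.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, runs=3):
    """Run cmd `runs` times; return the wall-clock seconds of each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=False: only the duration matters here, not the exit code.
        subprocess.run(cmd, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=False)
        timings.append(time.perf_counter() - start)
    return timings


def slowdown(baseline_seconds, observed_seconds):
    """How many times slower the observed run is than the baseline."""
    return observed_seconds / baseline_seconds


if __name__ == "__main__":
    # Figures quoted in the post: about 6 s to other mail servers,
    # 1 min 40 s (100 s) to the mail server in question.
    print(f"{slowdown(6, 100):.1f}x slower")  # prints "16.7x slower"

    # Stand-in command (the Python interpreter itself); substitute the
    # real test command on the BES.
    runs = time_command([sys.executable, "-c", "pass"])
    print(f"median: {statistics.median(runs):.3f} s")
```

Running the same command from each BES against each mail server and comparing medians would show whether the delay really is confined to the one pairing.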
If I activate a user from the mail server in question, they will activate but it sits at "Synchronizing - 0%". If I let a handheld sit long enough (well over an hour) it will progress. I didn't bother letting it finish, but it got to 16% before it re-entered its state of sitting idle.
There are no errors showing in any logs, and RIM states everything is working: just painfully slow to users on the mail server in question. We have tried reinstalling the CDO on the new BES to no avail. I swapped out the older version of the CDO on the test BES with the new version and it still works with no issues. I've rebooted the new BES several times and we even vMotioned the server to a new host to see if there was a NIC issue on the original, all to no avail.
RIM has more or less washed their hands of it, stating there is a communication issue between the 2 boxes. They have no idea what or how, state that nothing appears to be wrong, and don't know why iemstest drags between the boxes. Since all other BES servers can talk to this box with no issue, I don't think it's the mail server; however, the new BES can talk to all the other mail servers with no issue, so this problem is very specific to BES communication from the new BES to JUST this server.
Spotlight on Exchange shows the health of the Exchange server in question is fine, ping times are normal, and nothing in any of the event logs or BES logs indicates an issue. I suspect there is some sort of MAPI/CDO issue stuck between this BES and the mail server but I have no way of testing that or knowing what to look for.
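Ping only exercises ICMP, so normal ping times don't rule out a delay at the TCP level. As a rough sketch (not a MAPI test), the Python below times how long it takes to open a TCP connection to port 135, the RPC endpoint mapper that MAPI clients contact first on Exchange 2003. The host names are placeholders, and `tcp_connect_seconds` is a made-up helper name.

```python
import socket
import time


def tcp_connect_seconds(host, port, timeout=10.0):
    """Return seconds taken to open a TCP connection, or None on failure."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.perf_counter() - start
    except OSError:
        # Covers refused connections, timeouts, and name-resolution errors.
        return None


if __name__ == "__main__":
    # Hypothetical host names; substitute the real mail servers.
    for host in ["mail-ok.example.local", "mail-slow.example.local"]:
        elapsed = tcp_connect_seconds(host, 135)  # 135 = RPC endpoint mapper
        if elapsed is None:
            print(f"{host}: connection failed")
        else:
            print(f"{host}: connected in {elapsed:.3f} s")
```

If the connect time to the problem server is in line with the others, the delay is more likely higher up the stack (RPC, MAPI/CDO, or name resolution) than in the network path itself.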
So, here I am, asking the bigger brain collective what to do next. The easy out is to move a few hundred mailboxes (executives at that!) to a new server and move on but I don't think that's a solution, it's just avoiding the issue and starting a new server off with problems right out of the gate.
I don't have much hair left to pull out.
11-09-2010 01:59 PM
What version mail server?
Are you using the same BESAdmin service account on all BES servers?
How many BES servers in total?
When was the last time you rebooted/power cycled the LAN switch?
11-09-2010 02:09 PM
Mail Server: Exchange 2003 SP2
Same service account on all servers.
Tricky: (2) Production boxes in 2 separate physical datacenters. (4.0.6 and 4.0.7)
(1) Pre-Production 5.0.2 server
(1) Testing 5.0.1 MR2
(1) 5.0.1 Free BES
Which switch? The core or the rack switch? For the mail server or the BES? Since we've vMotioned the BES to another server in another rack and still have the same issue, I doubt it's the rack switch for the BES anyhow. The core...well, that never gets bounced without somebody doing a year's worth of paperwork, and the rack that contains the mail server holds several other production boxes so I doubt we can bounce that either. I could look at the event logs for the switch the mail server is on, but since it talks to everything else properly, I doubt that's the problem.
Never know until I look though I suppose.
11-09-2010 02:14 PM
Is IPv6 enabled on the BES? Try turning it off.
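One quick way to see whether IPv6 is even in play for a given mail server is to check which address families its name resolves to from the BES. A minimal Python sketch, assuming a placeholder host name (`resolved_families` is a made-up helper name):

```python
import socket


def resolved_families(host, port=135):
    """Return (family, address) pairs in the order the resolver returns them."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    names = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}
    return [(names.get(family, str(family)), sockaddr[0])
            for family, _type, _proto, _canon, sockaddr in infos]


if __name__ == "__main__":
    # Hypothetical host name; substitute the mail server in question.
    for family, address in resolved_families("mail-slow.example.local"):
        print(family, address)
```

If IPv6 addresses come back for the slow server but not for the others, that would be worth chasing before anything more drastic.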
11-09-2010 02:22 PM - edited 11-09-2010 02:31 PM
Also, the server that is having the time issues is the same server the BES service account resides on.
Just for full disclosure.
Just checked, and no errors on the switch either.
11-09-2010 02:41 PM
5 BES servers connecting to the same Exchange server.
What happens if you stop services on 2 of them and then try the test again?
How many BES users do you have and what type of disk structure do you have?
BES users per spindle on Exchange?
We had some delays here before adding a second RAID array; on Exchange 2003, one BES user is roughly equivalent to 3.75 regular Exchange users.
Did you try turning off the Windows firewall on the BES?
11-09-2010 03:08 PM
Allow me to clarify:
The 2 physically separate servers do not attach to the same mail server. One handles one region of the country while the other handles ...well...the other.
The test box has ZERO users on it, so it doesn't connect that I'm aware of.
The free box has 1 user on another mail server.
The new BES server only tries to connect when I try to add a user. Otherwise it should not be connected.
700 Users on this server, I THINK 500 on the other.
We're in a SAN environment, so users per spindle is hard to account for.
Windows firewall is off.
My concern with most of those ideas is that the new BES can talk to EVERY mail server perfectly fine except for this one, which indicates a specific issue between the new BES and this one mail server alone. I suspect some MAPI/CDO hung session or corruption related directly to these two boxes, but I have no way of knowing how to troubleshoot that and neither does RIM.
11-09-2010 03:34 PM
Could you make a new storage group on the Exchange server, apply permissions, make a test account and test it again?
I have seen once in a while that a BES server will not like a particular store for whatever reason.
11-09-2010 04:16 PM
We have the maximum number of storage groups on all servers and I've tested various stores on this server. Here's what we're going to try going forward (in order):
A scheduled reboot Thursday night to see if something is simply hung in MAPI.
Replacement of the NIC on the mail server to verify that there isn't some undetected physical-level issue.
Rebuild the new BES from scratch and see if the issue is with the BES.
Migrate users off to a new mail server.
I'll keep you posted.