I have a Dell PowerEdge 2850 running Windows Server 2003. It is the primary file server for one of my clients. I have another server also running Windows Server 2003 that acts as the core media server for Symantec Backup Exec 12.
I recently upgraded from Backup Exec 11d to 12. This upgrade was necessary because we also just upgraded from Exchange 2003 to Exchange 2007. After the upgrade I had to push-install the new version 12 Backup Exec Remote Agents to each of the servers I am backing up (about 6 total). 5 of my servers are doing just fine, faithfully completing backups every night. My file server routinely crashes.
Observations:
When the server crashes, it does not blue screen, it just locks up completely. Even the mouse is unresponsive. If you leave the server locked up long enough, it will eventually reboot itself and hang on the Windows splash screen.
There is absolutely zero useful Event Viewer evidence of a problem. The logs go from routine logging to an Unexplained Shutdown Event the next morning when I have to hard reset the server to get it to boot.
90% of the time the server does not boot cleanly, it hangs on the Windows splash screen. I don't have any light to shed here. When the server hangs all I can do is hard reset it and try again. Even after a successful boot and chkdsk /r operation, if you reboot the machine, you have a 90% chance it won't back up again cleanly.
The back story:
This server started crashing during nightly backups about a month ago. I tried everything I could think of to troubleshoot the problem and eventually had to give up because I could not keep coming to the office at 4 AM to try to get the server back online. One Friday I got lucky and the server stayed up for its entire full backup. I took this opportunity to restore the full backup to a temporary server I set up and switched all my users to the temporary. Then I reloaded the ailing file server.
I kept all my users on the temporary file server for about 3 weeks. I installed the same Backup Exec Remote Agent and Trend Micro A/V client on the temporary server that I was using on the regular file server. During this time, I had absolutely no problems backing up the temporary server.
I tested the reloaded file server extensively. I rebooted the server once an hour every day for 3 weeks trying to make it fail. It never did. I felt confident that the reload was the answer to my problems. I moved all of the data from the temporary server back to the regular server. I got 3 nightly backups out of it before it locked up again and started the familiar failure to boot cleanly behavior.
This weekend I decided to monitor the file server through the entire backup job. I RDPd into the file server and also into the server running Backup Exec. On the file server I opened the Task Manager so I could view the processes and watch CPU and memory usage. Everything was running smoothly for about 60GB worth of backup. Then I noticed that the byte count of the backup job in Backup Exec had stopped progressing. I looked back over at my RDP session into the file server, and I was getting real time updates about CPU and memory usage still - both nearly 0%, which is unusual. Backups usually hover around 40% usage for the duration of the backup job.
Let me reiterate this point: The screen was refreshing and I was getting real time Task Manager updates - until I clicked on the Start menu. The screen went black and the server locked up. In truth, I think the server had already locked up, the video card just hadn't figured it out yet.
I went back into my bag of trick: driving to the office and hard reseting the server over and over again when it hangs up at the Windows splash screen. I did this for 2 hours without getting a successful boot. I started panicking because I did not have a decent backup to use to get everything back onto the working temporary file server.
Once I exhausted everything I knew to do, I took a deep breath, booted to the Windows Server 2003 CD and performed a repair installation of Windows. The server came back up fine, with all of my data intact. I can now reboot the server at will and it will come back up cleanly. The problem is that I'm afraid as soon as I try to back that data up again I will back at square one.
So let me sum things up:
Here is what I've done so far to troubleshoot this server:
Deleted and recreated the RAID 5 sets. Initialized the drives. Reloaded the server with a fresh Server 2003 install.
Confirmed with Dell that I have installed the latest, Dell approved BIOS and NIC drivers.
Uninstalled / reinstalled the Backup Exec Remote Agent.
Uninstalled the Trend Micro A/V client.
Configured the server not to reboot itself after a blue screen so I can see any stop error. I used to think the server was blue screening, but since I enabled this setting I now know that the server just completely locks up.
Run chkdsk /r from the Windows Recovery Console. Several errors were found and corrected, but did not help my problem.
Help confirm or deny the following assumptions:
There are two problems at work here. Why the server is locking up in the first place, and why the server won't boot cleanly after a lockup.
This is ultimately a software problem. The server works fine and can be rebooted cleanly all day long - until the first lockup - following a fresh OS load or even a Repair installation.
This is not a problem with Backup Exec in general. All of my other servers back up just fine. For the record, all of the other servers run Server 2003, and some of them house more data than the file server in question here.
Any help is appreciated. The irony is almost too much to bear. Backing up my data is what is jeopardizing it.