We have a SQL Server 2008 active/active cluster running on Windows Server 2008 R2, with 14 GB RAM and 4 CPUs. We have set a memory ceiling of 12 GB for SQL Server. We're running an Agent job that loads 3 million records into a database. During this load the job fails, and the cluster seems to attempt to fail over to the other node but does not succeed, i.e. the cluster address is no longer accessible. We then have to fail the cluster node back manually.
During the load, Task Manager shows memory usage peaking at 12.5 GB. CPU at times hits 100% on all 4 CPUs, but for the most part fluctuates around an average of about 60%.
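To make the memory headroom concrete, here is a rough sketch of the arithmetic in Python. It uses only the figures above and assumes the Task Manager number is total memory in use on the node; the function and variable names are just for illustration.

```python
# Figures taken from the post; names are illustrative only.
TOTAL_RAM_GB = 14.0      # physical RAM on the node
SQL_CEILING_GB = 12.0    # memory ceiling configured for SQL Server
PEAK_OBSERVED_GB = 12.5  # peak usage seen in Task Manager (assumed node-wide)


def headroom_gb(total_gb, used_gb):
    """Return how much memory is left over after used_gb is taken."""
    return total_gb - used_gb


# Memory left for the OS and everything else if SQL Server sits at its ceiling.
reserved_for_os = headroom_gb(TOTAL_RAM_GB, SQL_CEILING_GB)

# Memory actually free at the observed peak.
free_at_peak = headroom_gb(TOTAL_RAM_GB, PEAK_OBSERVED_GB)

# How far the observed peak sits above the configured ceiling.
over_ceiling = PEAK_OBSERVED_GB - SQL_CEILING_GB

print(f"Left for OS at ceiling: {reserved_for_os} GB")
print(f"Free at observed peak:  {free_at_peak} GB")
print(f"Peak above ceiling:     {over_ceiling} GB")
```

So at the peak there is roughly 1.5 GB free on the node, with usage about 0.5 GB above the SQL Server ceiling.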
I suppose my question is: will a cluster try to fail over if memory or CPU is taking a heavy hit? Or am I barking up the wrong tree?
Also, any ideas why it wouldn't fail over fully? We've crawled through the logs, of which there are a lot, and can't find anything useful. We also tried to recreate the issue, but the load ran successfully at a later time. 3 million rows doesn't seem like a lot, so shouldn't 14 GB RAM and 4 CPUs be sufficient in terms of resources?
Further information on this: we ran the load again today and corrupted the database!
We received the error message: LogWriter: Operating system error 170.
It looks like, under the heavy load, the SQL cluster attempted to fail over and in doing so migrated a LUN (or drive), which meant the disk was no longer reachable (this is just our theory). The database is now marked 'suspect' and requires restoration.
The 170 error above also seems to indicate that, on failing over to the other node, the SQL service could not start because a resource it needed was already in use, so it couldn't fail over fully? But I'm wondering why it would need to fail over in the first place.
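For reference, Windows system error 170 is ERROR_BUSY ("The requested resource is in use."), which is what we are basing the "already in use" reading on. A minimal lookup sketch in Python follows; the table holds only this one code, and the names `KNOWN_OS_ERRORS` and `describe_os_error` are made up for illustration.

```python
# Tiny lookup of Windows system error codes relevant to this post.
# Only code 170 is included; the table and function names are hypothetical.
KNOWN_OS_ERRORS = {
    170: ("ERROR_BUSY", "The requested resource is in use."),
}


def describe_os_error(code):
    """Return 'NAME: message' for a known code, or a fallback string."""
    entry = KNOWN_OS_ERRORS.get(code)
    if entry is None:
        return f"Unknown OS error {code}"
    name, message = entry
    return f"{name}: {message}"


print(describe_os_error(170))
```

This only tells us that something was busy; it does not say which resource (disk, LUN, or file) was held, or by which node.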
My assumptions could be completely wrong on this, so any ideas would be appreciated.