Windows Server 2008 Software Raid 5 - Data integrity issues
- by Fopedush
I've got a server running Windows Server 2008 R2, with a (windows native) software raid-5 array. The array consists of 7x 1TB Western Digital RE3 and RE4 drives. I have offline backups of this array.
The problem is this: I noticed a few days ago after copying a large file to the disk that there was an integrity issue with that file - it was a ~12GB file that I had downloaded via uTorrent. After moving it to the raid array, I used uTorrent to relocate the download location, and performed a re-check so I could seed it from that location. The recheck found that only 6308/6310 chunks of the copied file were intact.
My next step was to write a quick powershell script that would copy files to the array, while performing a SHA1 hash of the original and resultant files and comparing them. Smaller files (100-1000MB) copied over just fine. When I started copying larger data (~15GB), I found that the hash check failed about 2/3rds of the time. The corrupt files had very, very small inconsistencies - less than .01%. I further eliminated the possibility of networking or client issues by placing this large file on the C:\ of the server, and copying it repeatedly from there to the array, seeing similar results.
Copying the data via explorer, powershell, or the standard windows command prompt yield the same results. None of the copies fail or report any problems. The raid array itself is listed as healthy in disk management.
After a few experiments, I shut down the server and ran memtest overnight. No errors were detected. A basic run of chkdsk found no problems, but I did not use the /R flag, as I was unsure how that might affect a software raid-5 volume.
I next ran Crystal Disk Info to check the smart data on the drives - but found that CDI only detected 5 out of 7 of the disks in the array. I have no idea why. Nevertheless, CDI shows the following "caution" flags on a single one of the drives:
05 199 199 140 000000000001 Reallocated Sectors Count
C5 200 200 __0 000000000001 Current Pending Sector Count
Which is a little bit alarming, but I don't really know what to do with the information. I hardly feel like one reallocated sector could be causing this.
At this point, I'm looking for some guidance on what to do next. I need to determine the cause of this issue, but I'm hesitant to run chkdsk /R or any bootable disk health checkers because I'm afraid they might break the array. I've considered triggering a re-sync of the array, but I'm not actually sure how to do that without doing something silly like manually dropping a disk and then restoring it.
Any advice that could help me ferret out the precise cause of this issue would be greatly appreciated.