Investigating a potential CPU failure
- by Jernej
On an Ubuntu server that I use for computations, I have recently observed that some CPU-intensive programs (GUROBI, CPLEX) often segfault.
When I contacted tech support for the respective programs, they suggested that it may be a hardware issue.
The administrator of the server performed a detailed memtest, and the RAM modules appear to be fine.
I therefore used the tool mprime to test the CPU, and the following two lines appear multiple times during the execution of the stress tests:
[Worker #4 Oct 18 18:47] FATAL ERROR: Rounding was 0.498046875, expected less than 0.4
[Worker #4 Oct 18 18:47] Hardware failure detected, consult stress.txt file.
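To check whether the errors always come from the same worker, the saved mprime output could be tallied per worker. This is only a minimal sketch: the regular expression is inferred from the two lines quoted above, the function name tally_errors is my own, and other mprime versions may format their output differently.

```python
import re
from collections import Counter

# Pattern inferred from the two log lines quoted above (an assumption);
# other mprime versions may format their output differently.
LINE_RE = re.compile(r"^\[Worker #(\d+) ([^\]]+)\] (.*)$")


def tally_errors(lines):
    """Count 'FATAL ERROR' lines per worker number."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group(3).startswith("FATAL ERROR"):
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    # The two lines from the stress test output above.
    sample = [
        "[Worker #4 Oct 18 18:47] FATAL ERROR: Rounding was 0.498046875, expected less than 0.4",
        "[Worker #4 Oct 18 18:47] Hardware failure detected, consult stress.txt file.",
    ]
    for worker, n in sorted(tally_errors(sample).items()):
        print(f"Worker #{worker}: {n} fatal error(s)")
```

If the count is concentrated on a single worker across several runs, that would at least hint at one particular core rather than the whole package, although I am not sure how mprime maps workers to cores.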
The stress.txt file itself says little about what could cause this error, so I would like to ask whether anyone here knows what the cause might be. Is there some other test I could run to narrow the problem down further?
The temperature of the system (and of all cores) was fine during the entire stress test (+69.0°C (high = +80.0°C, crit = +98.0°C)). The CPU in question is an Intel Core i7-2600K CPU @ 3.40GHz and is not overclocked or modified in any way.
What is also interesting is that if I run mprime to stress only the CPU, all tests pass. The error is triggered only when I let mprime stress the CPU+RAM.