"postgres blocked for more than 120 seconds" - is my db still consistent?
- by nn4l
I am using an iscsi volume on an Open-E storage system for several virtual machines running on a XenServer host. Occasionally, when there is a very high disk I/O load on the virtual machines (and therefore also on the storage system), I got this error message on the vm consoles:
[2594520.161701] INFO: task kjournald:117 blocked for more than 120 seconds.
[2594520.161787] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2594520.162194] INFO: task flush-202:0:229 blocked for more than 120 seconds.
[2594520.162274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2594520.162801] INFO: task postgres:1567 blocked for more than 120 seconds.
[2594520.162882] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
I understand this error message is caused by the kernel to inform that these processes haven't been run for 120 seconds, most likely because a disk access to the storage system has not yet been processed.
But what is the effect on the processes. For example, will the postgres process eventually write its data when the storage system is idle again after a few minutes, so that all data is still consistent? Or will it abort the write, leaving some tables in an inconsistent state?
I certainly expect that the former should be the case - if the disk access is slow, postgres (or any other affected process) should just wait as long as it takes. I can live with the application hanging for a few minutes. But if there is a chance for data corruption then any of these errors is really bad news.
Please advise what to do here.