I found out that the performance problems with a Mumble server, which I described in a previous question, are caused by an I/O latency problem of unknown origin. As I have no idea what is causing this or how to debug it further, I'm asking for your ideas on the topic.
I'm running a Hetzner EX4S root server as a KVM hypervisor. The server runs Debian Wheezy Beta 4, and KVM virtualisation is used through LibVirt.
The server has two different 3TB hard drives, as one of them was replaced after S.M.A.R.T. errors were reported. The first hard disk is a Seagate Barracuda XT ST33000651AS (512 bytes logical, 4096 bytes physical sector size), the other one a Seagate Barracuda 7200.14 (AF) ST3000DM001-9YN166 (512 bytes logical and physical sector size). There are two Linux software RAID1 devices, each spanning both hard drives: one for the unencrypted boot partition and one as a container for the encrypted rest.
Inside the latter RAID device lies an AES-encrypted LUKS container, which in turn holds an LVM physical volume. The hypervisor's VFS is split across three logical volumes on that LVM physical volume: one for /, one for /home and one for swap.
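Since one of the drives reports 4096-byte physical sectors, partition alignment is one thing that could be checked. The following is only a minimal sketch, not a claim that misalignment is the cause: it assumes the sysfs files `start` (partition offset in 512-byte units) and `queue/physical_block_size` are available, and the helper names are made up for illustration.

```python
def is_aligned(start_sector, physical_block_size, sector_unit=512):
    """True if the partition's byte offset is a multiple of the physical sector size.

    start_sector is in 512-byte units, which is how sysfs reports it.
    """
    return (start_sector * sector_unit) % physical_block_size == 0


def read_int(path):
    """Read a single integer from a sysfs file."""
    with open(path) as f:
        return int(f.read().strip())


def partition_alignment(disk, partition):
    """Check one partition of one disk, e.g. ("sda", "sda2") (names are examples)."""
    start = read_int(f"/sys/block/{disk}/{partition}/start")
    phys = read_int(f"/sys/block/{disk}/queue/physical_block_size")
    return is_aligned(start, phys)
```

Calling `partition_alignment("sda", "sda2")` on the hypervisor would return `False` for a partition whose start does not fall on a physical sector boundary; the partition name here is an assumption.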
Here is a diagram of the block device configuration stack:
sda (Physical HDD)
- md0 (RAID1)
- md1 (RAID1)

sdb (Physical HDD)
- md0 (RAID1)
- md1 (RAID1)

md0 (Boot RAID)
- ext4 (/boot)

md1 (Data RAID)
- LUKS container
  - LVM Physical volume
    - LVM volume hypervisor-root
    - LVM volume hypervisor-home
    - LVM volume hypervisor-swap
    - … (Virtual machine volumes)
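The stack shown above can also be read back from the running system. This is a rough sketch, assuming the kernel exposes a `slaves` directory for stacked devices (md, device-mapper) under `/sys/class/block`; the function names are hypothetical.

```python
import os


def underlying_devices(dev, sysfs_root="/sys/class/block"):
    """Return the device names listed in <sysfs_root>/<dev>/slaves, sorted."""
    slaves_dir = os.path.join(sysfs_root, dev, "slaves")
    if not os.path.isdir(slaves_dir):
        return []  # no slaves directory: nothing stacked below this device
    return sorted(os.listdir(slaves_dir))


def format_stack(dev, sysfs_root="/sys/class/block", depth=0):
    """Return indented lines describing dev and everything it sits on."""
    lines = ["  " * depth + dev]
    for slave in underlying_devices(dev, sysfs_root):
        lines.extend(format_stack(slave, sysfs_root, depth + 1))
    return lines
```

Printing `"\n".join(format_stack("dm-1"))` for a logical volume should list the LUKS mapping, the RAID device and the partitions beneath it; the name `dm-1` is only an example and will differ per system.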
The guest systems (virtual machines) are mostly running Debian Wheezy Beta 4 as well; we have one additional Ubuntu Precise instance. They get their block devices from the LVM physical volume, too. The volumes are accessed through Virtio drivers in native writethrough mode. The I/O scheduler (elevator) on both the hypervisor and the guest systems is set to deadline instead of the default cfq, as that turned out to be the most performant setup in our bonnie++ test series.
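To confirm which scheduler is actually active on a given device, the sysfs file `/sys/block/<device>/queue/scheduler` can be read; it lists the available schedulers and marks the active one with square brackets. A small sketch, with helper names chosen for illustration:

```python
def active_scheduler(sysfs_line):
    """Return the scheduler marked with [brackets], or None if none is marked."""
    for token in sysfs_line.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_scheduler(device):
    """Read the active I/O scheduler for a block device such as "sda"."""
    with open(f"/sys/block/{device}/queue/scheduler") as f:
        return active_scheduler(f.read())
```

Running `read_scheduler("sda")` on the hypervisor and the equivalent inside a guest (where the Virtio disk is typically named `vda`, which is an assumption about this setup) would show whether deadline is really in effect everywhere.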
The I/O latency problem is experienced not only inside the guest systems but also affects services running on the hypervisor itself. The setup seems complex, but I'm sure the basic structure is not what causes the latency problems, as my previous server ran for four years with almost the same basic setup without any of these performance problems.
On the old setup the following things were different:
- Debian Lenny was the OS for both the hypervisor and almost all guests
- Xen software virtualisation (and therefore no Virtio)
- No LibVirt management
- Different hard drives, each 1.5TB in size (one of them was a Seagate Barracuda 7200.11 ST31500341AS; I can no longer identify the other one)
- No IPv6 connectivity
Neither in the hypervisor nor in the guests did we have noticeable I/O latency problems.
According to the datasheets, the current hard drives and the one from the old machine all have an average latency of 4.12 ms.
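To put numbers on the observed latency, a simple probe could time small synchronous writes on the hypervisor and inside a guest. This is only a sketch with hypothetical function names; note that an fsync through the RAID, LUKS and LVM layers measures something different from the datasheet's rotational latency, so the figures are not directly comparable.

```python
import os
import time


def fsync_write_latencies(path, count=100, block_size=4096):
    """Write `count` blocks to `path`, fsync after each, return latencies in ms."""
    buf = b"\0" * block_size
    latencies = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for _ in range(count):
            start = time.perf_counter()
            os.write(fd, buf)
            os.fsync(fd)  # force the block down to stable storage
            latencies.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
    return latencies


def summarize(latencies_ms):
    """Return min, median (upper median for even counts) and max."""
    ordered = sorted(latencies_ms)
    return {
        "min": ordered[0],
        "median": ordered[len(ordered) // 2],
        "max": ordered[-1],
    }
```

For example, `summarize(fsync_write_latencies("/home/probe.bin"))` would give a rough picture for the /home volume; the file path is an assumption, and the probe file should be deleted afterwards.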