I am experiencing issues with VMware vSphere 5.1 and NFS storage on two different setups, which result in an "All Paths Down" (APD) state for the NFS shares. At first this happened once or twice a day, but lately it occurs much more frequently, especially when Acronis Backup jobs are running.
Setup 1 (Production):
2 ESXi 5.1 hosts (Essentials Plus) + OpenFiler with NFS as storage
Setup 2 (Lab):
1 ESXi 5.1 host + Ubuntu 12.04 LTS with NFS as storage
Here is an example from the vmkernel.log:
2013-05-28T08:07:33.479Z cpu0:2054)StorageApdHandler: 248: APD Timer started for ident [987c2dd0-02658e1e]
2013-05-28T08:07:33.479Z cpu0:2054)StorageApdHandler: 395: Device or filesystem with identifier [987c2dd0-02658e1e] has entered the All Paths Down state.
2013-05-28T08:07:33.479Z cpu0:2054)StorageApdHandler: 846: APD Start for ident [987c2dd0-02658e1e]!
2013-05-28T08:07:37.485Z cpu0:2052)NFSLock: 610: Stop accessing fd 0x410007e4cf28 3
2013-05-28T08:07:37.485Z cpu0:2052)NFSLock: 610: Stop accessing fd 0x410007e4d0e8 3
2013-05-28T08:07:41.280Z cpu1:2049)StorageApdHandler: 277: APD Timer killed for ident [987c2dd0-02658e1e]
2013-05-28T08:07:41.280Z cpu1:2049)StorageApdHandler: 402: Device or filesystem with identifier [987c2dd0-02658e1e] has exited the All Paths Down state.
2013-05-28T08:07:41.281Z cpu1:2049)StorageApdHandler: 902: APD Exit for ident [987c2dd0-02658e1e]!
2013-05-28T08:07:52.300Z cpu1:3679)NFSLock: 570: Start accessing fd 0x410007e4d0e8 again
2013-05-28T08:07:52.300Z cpu1:3679)NFSLock: 570: Start accessing fd 0x410007e4cf28 again
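For what it's worth, this is roughly how I watch the state from the ESXi shell while it happens (esxcli storage nfs list shows an Accessible column per NFS datastore):

# list NFS datastores and whether they are currently accessible
esxcli storage nfs list

# follow the storage messages in the vmkernel log while a backup job runs
tail -f /var/log/vmkernel.log | grep -E 'StorageApdHandler|NFSLock'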
While the issue only occurred once or twice a day it wasn't really a problem, but now it has a noticeable impact on the VMs: they become slow or even hang, which has forced me to reset them through vCenter in the production environment.
I have searched the web extensively and asked in forums, but so far nobody has been able to help me. Based on blog posts and VMware KB articles I tried the following NFS-related advanced settings:
Net.TcpipHeapSize = 32
Net.TcpipHeapMax = 128
NFS.HeartbeatFrequency = 12
NFS.HeartbeatMaxFailures = 10
NFS.HeartbeatTimeout = 5
NFS.MaxQueueDepth = 64
Instead of NFS.MaxQueueDepth = 64 I have also tried other values such as NFS.MaxQueueDepth = 32 or even NFS.MaxQueueDepth = 1, unfortunately without any luck.
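For reference, I applied the values roughly like this from the ESXi shell (they can also be changed under Advanced Settings in the vSphere Client; as far as I know the Net.TcpipHeap* values only take effect after a host reboot):

esxcli system settings advanced set -o /Net/TcpipHeapSize -i 32
esxcli system settings advanced set -o /Net/TcpipHeapMax -i 128
esxcli system settings advanced set -o /NFS/HeartbeatFrequency -i 12
esxcli system settings advanced set -o /NFS/HeartbeatMaxFailures -i 10
esxcli system settings advanced set -o /NFS/HeartbeatTimeout -i 5
esxcli system settings advanced set -o /NFS/MaxQueueDepth -i 64

# verify a value, e.g.:
esxcli system settings advanced list -o /NFS/MaxQueueDepth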
It would be great if someone could help me with this issue; it is getting really annoying.
Thanks in advance for all the help.
[UPDATE]
As I explained in the comment below, here is the network setup:
On the production setup the NFS traffic is bound to a separate VLAN with ID 20. I am using an HP 1810 24-port switch. The OpenFiler system is connected to that VLAN with 4 Intel GbE NICs using dynamic LACP. Each ESXi host has 4 Intel GbE NICs configured as 2 static LACP trunks of 2 NICs each; one pair is connected to the regular LAN and the other one to VLAN 20.
And here is a screenshot of the vSwitch:
Switch configuration:
Port configuration:
On the lab setup it is a single Intel NIC on each side, without a VLAN but on a separate IP subnet.
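In case it is relevant for the production setup: since the trunks towards the hosts on the HP 1810 are static, my understanding is that the teaming policy of the NFS vSwitch has to be "Route based on IP hash". This is roughly how it can be checked or set from the ESXi shell (vSwitch1 is just a placeholder name for the vSwitch carrying the NFS VMkernel port):

# show the current teaming / load-balancing policy of the NFS vSwitch
esxcli network vswitch standard policy failover get --vswitch-name=vSwitch1

# set it to IP hash (needed for static EtherChannel trunks, as far as I know)
esxcli network vswitch standard policy failover set --load-balancing=iphash --vswitch-name=vSwitch1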