Issue with VMware vSphere and NFS: recurring APD state

Posted by Bastian N. on Server Fault, published on 2013-05-30.


I am experiencing issues with VMware vSphere 5.1 and NFS storage on two different setups, which result in an "All Paths Down" (APD) state for the NFS shares. At first this happened once or twice a day, but lately it occurs much more frequently, especially while Acronis Backup jobs are running.

Setup 1 (Production): 2 ESXi 5.1 hosts (Essentials Plus) + OpenFiler with NFS as storage

Setup 2 (Lab): 1 ESXi 5.1 host + Ubuntu 12.04 LTS with NFS as storage
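
For the lab setup, the NFS export on the Ubuntu box and the datastore mount on the ESXi host look roughly like the sketch below (ESXi 5.1 only speaks NFSv3; the paths, addresses and datastore name are placeholders, not my exact values):

# /etc/exports on the Ubuntu 12.04 NFS server (placeholder path and subnet)
/srv/nfs/vmware  192.168.20.0/24(rw,sync,no_root_squash,no_subtree_check)

# reload the export table on the NFS server, then mount the share as a datastore on the ESXi host
exportfs -ra
esxcli storage nfs add -H 192.168.20.10 -s /srv/nfs/vmware -v nfs_lab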

Here is an example from the vmkernel.log:

2013-05-28T08:07:33.479Z cpu0:2054)StorageApdHandler: 248: APD Timer started for ident [987c2dd0-02658e1e]
2013-05-28T08:07:33.479Z cpu0:2054)StorageApdHandler: 395: Device or filesystem with identifier [987c2dd0-02658e1e] has entered the All Paths Down state.
2013-05-28T08:07:33.479Z cpu0:2054)StorageApdHandler: 846: APD Start for ident [987c2dd0-02658e1e]!
2013-05-28T08:07:37.485Z cpu0:2052)NFSLock: 610: Stop accessing fd 0x410007e4cf28  3
2013-05-28T08:07:37.485Z cpu0:2052)NFSLock: 610: Stop accessing fd 0x410007e4d0e8  3
2013-05-28T08:07:41.280Z cpu1:2049)StorageApdHandler: 277: APD Timer killed for ident [987c2dd0-02658e1e]
2013-05-28T08:07:41.280Z cpu1:2049)StorageApdHandler: 402: Device or filesystem with identifier [987c2dd0-02658e1e] has exited the All Paths Down state.
2013-05-28T08:07:41.281Z cpu1:2049)StorageApdHandler: 902: APD Exit for ident [987c2dd0-02658e1e]!
2013-05-28T08:07:52.300Z cpu1:3679)NFSLock: 570: Start accessing fd 0x410007e4d0e8 again
2013-05-28T08:07:52.300Z cpu1:3679)NFSLock: 570: Start accessing fd 0x410007e4cf28 again
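
For reference, the NFS mount state and the APD events can be checked from an SSH session on the host like this (standard ESXi commands, nothing specific to my setup):

# the "Accessible" column of the NFS mount drops to false while the share is in APD
esxcli storage nfs list

# quick way to see how often the datastore went in and out of the APD state
grep "All Paths Down" /var/log/vmkernel.log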

As long as the issue occurred once or twice a day it wasn't really a problem, but now it affects the VMs: they become slow or even hang, and in the production environment I end up resetting them through vCenter.

I have searched the web extensively and asked in forums, but so far nobody has been able to help me. Based on blog posts and VMware KB articles I tried the following NFS settings:

Net.TcpipHeapSize = 32
Net.TcpipHeapMax = 128
NFS.HeartbeatFrequency = 12
NFS.HeartbeatMaxFailures = 10
NFS.HeartbeatTimeout = 5
NFS.MaxQueueDepth = 64

Instead of NFS.MaxQueueDepth = 64 I have also tried other values such as NFS.MaxQueueDepth = 32 or even NFS.MaxQueueDepth = 1, unfortunately without any luck.
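
For reference, the same values can also be set from the ESXi shell with esxcli; this is equivalent to changing them in the vSphere Client under Advanced Settings, and as far as I know the two TCP/IP heap values only take effect after a reboot:

esxcli system settings advanced set -o /Net/TcpipHeapSize -i 32
esxcli system settings advanced set -o /Net/TcpipHeapMax -i 128
esxcli system settings advanced set -o /NFS/HeartbeatFrequency -i 12
esxcli system settings advanced set -o /NFS/HeartbeatMaxFailures -i 10
esxcli system settings advanced set -o /NFS/HeartbeatTimeout -i 5
esxcli system settings advanced set -o /NFS/MaxQueueDepth -i 64

# verify a single option
esxcli system settings advanced list -o /NFS/MaxQueueDepth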

It would be great if someone could help me with this issue. It is really annoying.

Thanks in advance for all the help.

[UPDATE] As I explained in the comment below, here is the network setup:

On the production setup the NFS traffic is bound to a separate VLAN with ID 20. I am using an HP 1810 24-port switch. The OpenFiler system is connected to that VLAN with 4 Intel GbE NICs using dynamic LACP. Both ESXi hosts have 4 Intel GbE NICs configured as 2 static LACP trunks of 2 NICs each; one pair is connected to the regular LAN and the other one to VLAN 20.
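
With static trunks like this, the vSwitch teaming policy has to be "Route based on IP hash". Just as a sketch, this is how that policy can be checked or set with esxcli; the vSwitch name below is a placeholder for whichever vSwitch carries the two VLAN 20 uplinks:

# placeholder vSwitch name -- use the vSwitch that holds the VLAN 20 uplinks
esxcli network vswitch standard policy failover get -v vSwitch1
esxcli network vswitch standard policy failover set -v vSwitch1 -l iphash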

And here is a screenshot of the vSwitch: [vSwitch screenshot not reproduced here]

Switch configuration: [screenshot not reproduced here]

Port configuration: [screenshot not reproduced here]

On the lab setup it is a single Intel NIC on each side, without a VLAN but on a separate IP subnet.

