Issue with VMware vSphere and NFS: recurring APD state

Posted by Bastian N. on Server Fault, published on 2013-05-30.


I am experiencing issues with VMware vSphere 5.1 and NFS storage on two different setups, which result in an "All Paths Down" (APD) state for the NFS shares. At first this happened once or twice a day, but lately it occurs much more frequently, especially while Acronis Backup jobs are running.

Setup 1 (Production): 2 ESXi 5.1 hosts (Essentials Plus) + OpenFiler with NFS as storage

Setup 2 (Lab): 1 ESXi 5.1 host + Ubuntu 12.04 LTS with NFS as storage
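
For the lab setup, the NFS export on the Ubuntu box and the datastore mount on the ESXi host look roughly like the sketch below (ESXi 5.1 only speaks NFSv3; the paths, addresses and datastore name are placeholders, not my exact values):

# /etc/exports on the Ubuntu 12.04 NFS server (placeholder path and subnet)
/srv/nfs/vmware  192.168.20.0/24(rw,sync,no_root_squash,no_subtree_check)

# reload the export table on the NFS server, then mount the share as a datastore on the ESXi host
exportfs -ra
esxcli storage nfs add -H 192.168.20.10 -s /srv/nfs/vmware -v nfs_lab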

Here is an example from the vmkernel.log:

2013-05-28T08:07:33.479Z cpu0:2054)StorageApdHandler: 248: APD Timer started for ident [987c2dd0-02658e1e]
2013-05-28T08:07:33.479Z cpu0:2054)StorageApdHandler: 395: Device or filesystem with identifier [987c2dd0-02658e1e] has entered the All Paths Down state.
2013-05-28T08:07:33.479Z cpu0:2054)StorageApdHandler: 846: APD Start for ident [987c2dd0-02658e1e]!
2013-05-28T08:07:37.485Z cpu0:2052)NFSLock: 610: Stop accessing fd 0x410007e4cf28  3
2013-05-28T08:07:37.485Z cpu0:2052)NFSLock: 610: Stop accessing fd 0x410007e4d0e8  3
2013-05-28T08:07:41.280Z cpu1:2049)StorageApdHandler: 277: APD Timer killed for ident [987c2dd0-02658e1e]
2013-05-28T08:07:41.280Z cpu1:2049)StorageApdHandler: 402: Device or filesystem with identifier [987c2dd0-02658e1e] has exited the All Paths Down state.
2013-05-28T08:07:41.281Z cpu1:2049)StorageApdHandler: 902: APD Exit for ident [987c2dd0-02658e1e]!
2013-05-28T08:07:52.300Z cpu1:3679)NFSLock: 570: Start accessing fd 0x410007e4d0e8 again
2013-05-28T08:07:52.300Z cpu1:3679)NFSLock: 570: Start accessing fd 0x410007e4cf28 again
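
For reference, the NFS mount state and the APD events can be checked from an SSH session on the host like this (standard ESXi commands, nothing specific to my setup):

# the "Accessible" column of the NFS mount drops to false while the share is in APD
esxcli storage nfs list

# quick way to see how often the datastore went in and out of the APD state
grep "All Paths Down" /var/log/vmkernel.log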

As long as the issue occurred once or twice a day it wasn't really a problem, but now it affects the VMs: they become slow or even hang, and in the production environment I end up resetting them through vCenter.

I have searched the web extensively and asked in forums, but so far nobody has been able to help me. Based on blog posts and VMware KB articles I tried the following NFS settings:

Net.TcpipHeapSize = 32
Net.TcpipHeapMax = 128
NFS.HeartbeatFrequency = 12
NFS.HeartbeatMaxFailures = 10
NFS.HeartbeatTimeout = 5
NFS.MaxQueueDepth = 64

Instead of NFS.MaxQueueDepth = 64 I have also tried other values such as NFS.MaxQueueDepth = 32 or even NFS.MaxQueueDepth = 1, unfortunately without any luck.
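
For reference, the same values can also be set from the ESXi shell with esxcli; this is equivalent to changing them in the vSphere Client under Advanced Settings, and as far as I know the two TCP/IP heap values only take effect after a reboot:

esxcli system settings advanced set -o /Net/TcpipHeapSize -i 32
esxcli system settings advanced set -o /Net/TcpipHeapMax -i 128
esxcli system settings advanced set -o /NFS/HeartbeatFrequency -i 12
esxcli system settings advanced set -o /NFS/HeartbeatMaxFailures -i 10
esxcli system settings advanced set -o /NFS/HeartbeatTimeout -i 5
esxcli system settings advanced set -o /NFS/MaxQueueDepth -i 64

# verify a single option
esxcli system settings advanced list -o /NFS/MaxQueueDepth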

It would be great if someone could help me with this issue. It is really annoying.

Thanks in advance for all the help.

[UPDATE] As I explained in the comment below, here is the network setup:

On the production setup the NFS traffic is bound to a separate VLAN with ID 20. I am using an HP 1810 24-port switch. The OpenFiler system is connected to that VLAN with 4 Intel GbE NICs using dynamic LACP. Both ESXi hosts have 4 Intel GbE NICs configured as 2 static LACP trunks of 2 NICs each; one pair is connected to the regular LAN and the other one to VLAN 20.
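
With static trunks like this, the vSwitch teaming policy has to be "Route based on IP hash". Just as a sketch, this is how that policy can be checked or set with esxcli; the vSwitch name below is a placeholder for whichever vSwitch carries the two VLAN 20 uplinks:

# placeholder vSwitch name -- use the vSwitch that holds the VLAN 20 uplinks
esxcli network vswitch standard policy failover get -v vSwitch1
esxcli network vswitch standard policy failover set -v vSwitch1 -l iphash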

And here is a screenshot of the vSwitch: [vSwitch screenshot not reproduced here]

Switch configuration: [screenshot not reproduced here]

Port configuration: [screenshot not reproduced here]

On the lab setup it is a single Intel NIC on each side, without a VLAN but on a separate IP subnet.

