NFS v4, HA Migration, and stale handles on clients
- by Karl Katzke
I'm managing a server running NFS v4 with Pacemaker/OpenAIS. NFS is configured to use TCP. When I migrate the NFS server to another node in the Pacemaker cluster, even though the metadata is persisted, connections from the clients 'hang' and eventually time out after 90 seconds. After that 90 seconds, the old mountpoint becomes 'stale' and the mounted files can no longer be accessed.
The 90 second grace period seems to be part of the server configuration and not the client configuration. I see this message on the server:
kernel: NFSD: starting 90-second grace period
If I restart the NFS client on the client nodes after I migrate (unmounting and then remounting the share), then I don't experience the problem, but connections and file transfers still interrupted.
Three questions:
What is the 90 second grace period? What's it there for?
How can I keep the files from going stale on the clients without restarting them after I migrate the NFS server to another node?
Is it actually possible to migrate the NFS server without having large file uploads drop?