In a customer site, the network team added a firewall between the client and the server. This is causing idle connections to get disconnected after about 40 minutes of idle time. The network people say that the firewall doesn't have any idle connection timeout, but the fact is that the idle connections get broken.
In order to get around this, we first configured the server (a Linux machine) with TCP keepalives turned on with tcp_keepalive_time=300, tcp_keepalive_intvl=300, and tcp_keepalive_probes=30000. This works, and the connections stay viable for days or more. However, we would also like the server to detect dead clients and kill the connection, so we changed the settings to time=300,intvl=180,probes=10, thinking that if the client was indeed alive, the server would probe every 300s (5 minutes) and the client would respond with an ACK and that would keep the firewall from seeing this as an idle connection and killing it. If the client was dead, after 10 probes, the server would abort the connection. To our surprise, the idle but alive connections get killed after about 40 minutes as before.
Wireshark running on the client side shows no keepalives at all between the server and client, even when keepalives are enabled on the server.
What could be happening here?
If the keepalive settings on the server are time=300,intvl=180,probes=10, I would expect that if the client is alive but idle, the server would send keepalive probes every 300 seconds and leave the connection alone, and if the client is dead, it would send one after 300 seconds, then 9 more probes every 180 seconds before killing the connection. Am I right?
One possibility is that the firewall is somehow intercepting the keepalive probes from the server and failing to pass them on to the client, and the fact that it got a probe makes it think that the connection is active. Is this common behavior for a firewall? We don't know what kind of firewall is involved.
The server is a Teradata node and the connection is from a Teradata client utility to the database server, port 1025 on the server side, but we have seen the same problem with an SSH connection so we think it affects all TCP connections.