New details added at the end of this question; it's possible that I'm zeroing in on the cause.
I have a UDP OpenVPN-based VPN set up in tap mode (I need tap because I need the VPN to pass multicast packets, which doesn't seem to be possible with tun networks) with a handful of clients across the Internet. I've been experiencing frequent TCP connection freezes over the VPN. That is, I will establish a TCP connection (e.g. an SSH connection, but other protocols have similar issues), and at some point during the session, it seems that traffic will cease being transmitted over that TCP session.
This seems to be related to points at which large data transfers occur, such as if I execute an ls command in an SSH session, or if I cat a long log file. Some Google searches turn up a number of answers like this previous one on Server Fault, indicating that the likely culprit is an MTU issue: that during periods of high traffic, the VPN is trying to send packets that get dropped somewhere in the pipes between the VPN endpoints. The above-linked answer suggests using the following OpenVPN configuration settings to mitigate the problem:
fragment 1400
mssfix
This should limit the MTU used on the VPN to 1400 bytes and fix the TCP maximum segment size to prevent the generation of any packets larger than that. This seems to mitigate the problem a bit, but I still frequently see the freezes. I've tried a number of sizes as arguments to the fragment directive: 1200, 1000, 576, all with similar results. I can't think of any strange network topology between the two ends that could trigger such a problem: the VPN server is running on a pfSense machine connected directly to the Internet, and my client is also connected directly to the Internet at another location.
One other strange piece of the puzzle: if I run the tracepath utility, then that seems to band-aid the problem. A sample run looks like:
[~]$ tracepath -n 192.168.100.91
1: 192.168.100.90 0.039ms pmtu 1500
1: 192.168.100.91 40.823ms reached
1: 192.168.100.91 19.846ms reached
Resume: pmtu 1500 hops 1 back 64
The above run is between two clients on the VPN: I initiated the trace from 192.168.100.90 to the destination of 192.168.100.91. Both clients were configured with fragment 1200; mssfix; in an attempt to limit the MTU used on the link. The above results would seem to suggest that tracepath was able to detect a path MTU of 1500 bytes between the two clients. I would assume that it would be somewhat smaller due to the fragmentation settings specified in the OpenVPN configuration. I found that result somewhat strange.
Even stranger, however: if I have a TCP connection in the stalled state (e.g. an SSH session with a directory listing that froze in the middle), then executing the tracepath command shown above causes the connection to start up again! I can't figure out any reasonable explanation for why this would be the case, but I feel like this might be pointing toward a solution to ultimately eradicate the problem.
Does anyone have any recommendations for other things to try?
Edit: I've come back and looked at this a bit further, and have found only more confounding information:
I set the OpenVPN connection to fragment at 1400 bytes, as shown above. Then, I connected to the VPN from across the Internet and used Wireshark to look at the UDP packets that were sent to the VPN server while the stall occurred. None were greater than the specified 1400 byte count, so the fragmentation seems to be functioning properly.
To verify that even a 1400-byte MTU would be sufficient, I pinged the VPN server using the following (Linux) command:
ping <host> -s 1450 -M do
This (I believe) sends a 1450-byte packet with fragmentation disabled (I at least verified that it didn't work if I set it to an obviously-too-large value like 1600 bytes). These seem to work just fine; I get replies back from the host with no issue.
So, maybe this isn't an MTU issue at all. I'm just confused as to what else it might be!
Edit 2: The rabbit hole just keeps getting deeper: I've now isolated the problem a bit more. It seems to be related to the exact OS that the VPN client uses. I have successfully duplicated the problem on at least three Ubuntu machines (versions 12.04 through 13.04). I can reliably duplicate an SSH connection freeze within a minute or so by just cat-ing a large log file.
However, if I do the same test using a CentOS 6 machine as a client, then I don't see the problem! I've tested using the exact same OpenVPN client version as I was using on the Ubuntu machines. I can cat log files for hours without seeing the connection freeze. This seems to provide some insight as to the ultimate cause, but I'm just not sure what that insight is.
I have examined the traffic over the VPN using Wireshark. I'm not a TCP expert, so I'm not sure what to make of the gory details, but the gist is that at some point, a UDP packet gets dropped due to the limited bandwidth of the Internet link, causing TCP retransmissions inside the VPN tunnel. On the CentOS client, these retransmissions occur properly and things move on happily. At some point with the Ubuntu clients, though, the remote end starts retransmitting the same TCP segment over and over (with the transmit delay increasing between each retransmission). The client sends what looks like a valid TCP ACK to each retransmission, but the remote end still continues to transmit the same TCP segment periodically. This extends ad infinitum and the connection stalls. My question here would be:
Does anyone have any recommendations for how to troubleshoot and/or determine the root cause of the TCP issue? It's as if the remote end isn't accepting the ACK messages sent by the VPN client.
One common difference between the CentOS node and the various Ubuntu releases is that Ubuntu has a much more recent Linux kernel version (from 3.2 in Ubuntu 12.04 to 3.8 in 13.04). A pointer to some new kernel bug maybe? I'm assuming that if that were so, then I wouldn't be the only one experiencing the problem; I don't think this seems like a particularly exotic setup.