Application stuck in TCP retransmit

Posted by SandeepJ on Ask Ubuntu
Published on 2014-06-02T11:39:30Z Indexed on 2014/06/02 15:59 UTC

I am running Linux kernel 3.13 (Ubuntu 14.04) on two virtual machines, each hosted on a different physical server running ESXi 5.1. A ZeroMQ client-server application runs between the two VMs. After running for about 10-30 minutes, the application consistently hangs because TCP gets stuck retransmitting a lost packet.

When I run the same setup on Ubuntu 12.04 (Linux 3.11), the application never fails.

As the output below shows, "ss" (socket statistics) reports 1 lost packet, an sk_wmem_queued of 14110 bytes (w14110) and a very high RTO (rto:120000, i.e. 120 seconds).

    State   Recv-Q  Send-Q  Local Address:Port    Peer Address:Port
    ESTAB   0       12350   192.168.2.122:41808   192.168.2.172:55550
        timer:(on,16sec,10) uid:1000 ino:35042
        sk:ffff880035bcb100 <-> skmem:(r0,rb648720,t0,tb1164800,f2274,w14110,o0,bl0) ts sack cubic wscale:7,7 rto:120000 rtt:7.5/3 ato:40 mss:8948 cwnd:1 ssthresh:21 send 9.5Mbps unacked:1 retrans:1/10 lost:1 rcv_rtt:1476 rcv_space:37621
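To track these fields over time while waiting for the hang, the info line can be parsed programmatically. Below is a minimal Python sketch, assuming the key:value layout of the "ss" output above; parse_ss_info, parse_skmem and backed_off_rto are hypothetical helper names. It also models Linux's exponential RTO backoff, assuming a starting RTO of 200 ms (the kernel minimum; the real starting value depends on the RTT estimate) and treating the 10 retransmissions reported as consecutive timeouts. Under those assumptions, ten doublings are enough to reach the 120 s cap that shows up as rto:120000.

```python
import re

# One "ss" info line, copied from the output above.
SS_LINE = (
    "sk:ffff880035bcb100 <-> "
    "skmem:(r0,rb648720,t0,tb1164800,f2274,w14110,o0,bl0) "
    "ts sack cubic wscale:7,7 rto:120000 rtt:7.5/3 ato:40 mss:8948 cwnd:1 "
    "ssthresh:21 send 9.5Mbps unacked:1 retrans:1/10 lost:1 "
    "rcv_rtt:1476 rcv_space:37621"
)

RTO_MIN_MS = 200      # Linux TCP_RTO_MIN (200 ms); assumed starting RTO
RTO_MAX_MS = 120_000  # Linux TCP_RTO_MAX (120 s)


def parse_ss_info(line: str) -> dict[str, str]:
    """Collect every key:value token; bare flags such as 'ts' or 'sack' are skipped."""
    return dict(re.findall(r"(\w+):(\S+)", line))


def parse_skmem(skmem: str) -> dict[str, int]:
    """Split '(r0,rb648720,...,w14110,...)' into {'r': 0, 'rb': 648720, ...}."""
    counters = {}
    for item in skmem.strip("()").split(","):
        match = re.fullmatch(r"([a-z]+)(\d+)", item)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def backed_off_rto(base_ms: int, timeouts: int) -> int:
    """RTO after `timeouts` consecutive expiries: doubled each time, capped at the max."""
    rto = base_ms
    for _ in range(timeouts):
        rto = min(rto * 2, RTO_MAX_MS)
    return rto


info = parse_ss_info(SS_LINE)
# "retrans:1/10" is "currently retransmitted / total retransmissions".
total_retrans = int(info["retrans"].split("/")[1])
queued = parse_skmem(info["skmem"])["w"]
print(f"rto={info['rto']} ms cwnd={info['cwnd']} lost={info['lost']} "
      f"queued={queued} B retransmits={total_retrans}")
print("modelled rto:", backed_off_rto(RTO_MIN_MS, total_retrans), "ms")
```

Run in a loop against a live "ss" capture, the same parser would show whether rto and the retransmit count keep climbing while cwnd stays at 1.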

Because the failure is so reproducible, I was able to capture the TCP traffic in Wireshark. The lost packet does get retransmitted, and the TCP stack on the other VM even acknowledges it (its sequence number is covered by the returning ACK), but the sender does not seem to recognize this ACK and keeps retransmitting.
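The Wireshark observation comes down to one comparison: a cumulative ACK covers a segment when the ACK number is at or beyond the segment's end (sequence number plus payload length), computed modulo 2^32. Below is a minimal Python sketch of that check, mirroring the signed 32-bit difference the Linux kernel uses in its before()/after() helpers. The segment numbers in the usage lines are made up for illustration (chosen near the wrap point) and are not taken from the capture.

```python
MOD = 1 << 32  # TCP sequence numbers are 32-bit and wrap around


def seq_before(a: int, b: int) -> bool:
    """True if sequence number a precedes b, modulo 2**32 (like Linux before())."""
    diff = (a - b) % MOD
    # Interpret the 32-bit difference as signed: values >= 2**31 are negative.
    return diff >= (1 << 31)


def ack_covers(ack: int, seq: int, length: int) -> bool:
    """True if a cumulative ACK number acknowledges every byte of [seq, seq + length)."""
    end = (seq + length) % MOD
    return not seq_before(ack, end)


# Hypothetical segment: one 8948-byte payload that straddles the 2**32 wrap.
seq = 4_294_960_000
length = 8948
ack_at_end = (seq + length) % MOD
print(ack_covers(ack_at_end, seq, length))  # ACK reaches the end of the segment
print(ack_covers(seq, seq, length))         # ACK only reaches the segment start
```

The modular comparison matters because a plain integer comparison gives the wrong answer whenever the segment straddles the wrap point, as the hypothetical one above does.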

The MTU is 9000 on both virtual machines and throughout the path. The packets being sent are large.
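The 9000-byte MTU is consistent with the mss:8948 reported by "ss": with TCP timestamps negotiated (the "ts" flag in the same output), each segment carries a 20-byte IPv4 header, a 20-byte TCP header and 12 bytes of timestamp option (10 bytes padded to a 4-byte boundary). A short Python sketch of that arithmetic, assuming IPv4 with no IP options and no TCP options other than timestamps; effective_mss is a hypothetical helper name:

```python
IPV4_HEADER = 20        # bytes, no IP options
TCP_HEADER = 20         # bytes, base header without options
TIMESTAMP_OPTION = 12   # 10-byte timestamp option padded to a 4-byte boundary


def effective_mss(mtu: int, timestamps: bool = True) -> int:
    """Payload bytes per full-size segment for a given IPv4 path MTU."""
    mss = mtu - IPV4_HEADER - TCP_HEADER
    if timestamps:
        mss -= TIMESTAMP_OPTION
    return mss


print(effective_mss(9000))  # jumbo-frame path, as on these VMs
print(effective_mss(1500))  # standard Ethernet path, for comparison
```

For the 9000-byte path this gives 8948, matching the ss output, so each stuck retransmission is a full jumbo-sized frame.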

As I said earlier, this does not happen on Ubuntu 12.04 (kernel 3.11). So I diffed the TCP configuration options (as listed by "sysctl -a | grep tcp") between 12.04 and 14.04 and found the following differences.

I also noticed that net.ipv4.tcp_mtu_probing=0 in both configurations.

Left side is 3.11, right side is 3.13

    < net.ipv4.tcp_abc = 0
    < net.ipv4.tcp_cookie_size = 0
    < net.ipv4.tcp_dma_copybreak = 4096

    14c11
    < net.ipv4.tcp_early_retrans = 2
    ---
    > net.ipv4.tcp_early_retrans = 3

    17c14
    < net.ipv4.tcp_fastopen = 0
    ---
    > net.ipv4.tcp_fastopen = 1

    20d16
    < net.ipv4.tcp_frto_response = 0

    26,27c22
    < net.ipv4.tcp_max_orphans = 16384
    < net.ipv4.tcp_max_ssthresh = 0
    ---
    > net.ipv4.tcp_max_orphans = 4096

    29,30c24,25
    < net.ipv4.tcp_max_tw_buckets = 16384
    < net.ipv4.tcp_mem = 94377 125837  188754
    ---
    > net.ipv4.tcp_max_tw_buckets = 4096
    > net.ipv4.tcp_mem = 23352 31138   46704

    34a30
    > net.ipv4.tcp_notsent_lowat = -1
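The comparison above can be repeated on other machines without a line-oriented diff, which separates each setting's old and new values into different hunks. Below is a minimal Python sketch, assuming each input is the saved text output of "sysctl -a | grep tcp" from one kernel. The sample strings reuse a few values from the diff above, and parse_sysctl and compare are hypothetical helper names.

```python
def parse_sysctl(text: str) -> dict[str, str]:
    """Map 'key = value' lines to {key: value with whitespace runs collapsed}."""
    settings = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Collapse runs of spaces/tabs, e.g. in net.ipv4.tcp_mem.
        settings[key.strip()] = " ".join(value.split())
    return settings


def compare(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Describe keys that were removed, added, or changed between two dumps."""
    report = []
    for key in sorted(old.keys() | new.keys()):
        if key not in new:
            report.append(f"removed {key} (was {old[key]})")
        elif key not in old:
            report.append(f"added   {key} = {new[key]}")
        elif old[key] != new[key]:
            report.append(f"changed {key}: {old[key]} -> {new[key]}")
    return report


OLD = """net.ipv4.tcp_abc = 0
net.ipv4.tcp_early_retrans = 2
net.ipv4.tcp_mem = 94377 125837  188754
net.ipv4.tcp_mtu_probing = 0"""

NEW = """net.ipv4.tcp_early_retrans = 3
net.ipv4.tcp_mem = 23352 31138   46704
net.ipv4.tcp_mtu_probing = 0
net.ipv4.tcp_notsent_lowat = -1"""

for entry in compare(parse_sysctl(OLD), parse_sysctl(NEW)):
    print(entry)
```

Settings that are identical on both kernels, such as net.ipv4.tcp_mtu_probing here, are left out of the report.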

My question to the networking experts on this forum: are there any other debugging tools or options I can install or enable to dig further into why this TCP retransmit failure occurs so consistently? And are there any configuration changes that might account for this strange behaviour?

© Ask Ubuntu or respective owner
