Observing flow control idle time in TCP
- by user12820842
Previously I described how to observe congestion control strategies during transmission, and here I talked about TCP's sliding window approach for handling flow control on the receive side. A neat trick would now be to put the pieces together and ask the following question: how often is TCP transmission blocked by congestion control (send-side flow control) versus a zero-sized send window (which is the receiver saying it cannot process any more data)? So in effect we are asking whether the size of the peer's receive window or the congestion control strategy may be sub-optimal. The result of such a problem would be that we have TCP data we could be transmitting but are not, potentially affecting throughput.
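Before measuring stalls, it can help just to look at the two windows themselves. The following is a hypothetical warm-up sketch, not part of the main script below and with aggregation names of my own choosing: it averages the congestion window and the peer's advertised window per remote host and port, assuming the tcps_cwnd and tcps_swnd fields carried by the tcp:::send probe arguments.

#!/usr/sbin/dtrace -s
#pragma D option quiet

/* Average congestion window and peer-advertised send window per remote host/port. */
tcp:::send
{
        @avgcwnd[args[2]->ip_daddr, args[4]->tcp_dport] = avg(args[3]->tcps_cwnd);
        @avgswnd[args[2]->ip_daddr, args[4]->tcp_dport] = avg(args[3]->tcps_swnd);
}

END
{
        printf("%-20s %-8s %-12s %-12s\n", "Remote host", "Port",
            "Avg cwnd", "Avg swnd");
        printa("%-20s %-8d %@-12d %@-12d\n", @avgcwnd, @avgswnd);
}

Whether either window is actually the limiting factor depends on how much data is outstanding at the time, which is what the script below measures.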
So flow control is in effect:
- when the congestion window is less than or equal to the number of bytes outstanding on the connection. We can derive the bytes outstanding from args[3]->tcps_snxt - args[3]->tcps_suna, i.e. the difference between the next sequence number to send and the lowest unacknowledged sequence number; and
- when the window in the TCP segment received is advertised as 0.
We time from these events until we next send new data (i.e. until args[4]->tcp_seq >= the snxt value recorded when the window closed). Here's the script:
#!/usr/sbin/dtrace -s
#pragma D option quiet
/* Congestion window closed: all of it is consumed by unacknowledged data. */
tcp:::send
/ (args[3]->tcps_snxt - args[3]->tcps_suna) >= args[3]->tcps_cwnd /
{
        cwndclosed[args[1]->cs_cid] = timestamp;
        cwndsnxt[args[1]->cs_cid] = args[3]->tcps_snxt;
        @numclosed["cwnd", args[2]->ip_daddr, args[4]->tcp_dport] = count();
}

/* New data sent at or beyond the recorded snxt: the congestion window has reopened. */
tcp:::send
/ cwndclosed[args[1]->cs_cid] && args[4]->tcp_seq >= cwndsnxt[args[1]->cs_cid] /
{
        @meantimeclosed["cwnd", args[2]->ip_daddr, args[4]->tcp_dport] =
            avg(timestamp - cwndclosed[args[1]->cs_cid]);
        @stddevtimeclosed["cwnd", args[2]->ip_daddr, args[4]->tcp_dport] =
            stddev(timestamp - cwndclosed[args[1]->cs_cid]);
        cwndclosed[args[1]->cs_cid] = 0;
        cwndsnxt[args[1]->cs_cid] = 0;
}

/* Peer advertised a zero receive window (ignoring SYN/RST/FIN segments). */
tcp:::receive
/ args[4]->tcp_window == 0 &&
  (args[4]->tcp_flags & (TH_SYN|TH_RST|TH_FIN)) == 0 /
{
        swndclosed[args[1]->cs_cid] = timestamp;
        swndsnxt[args[1]->cs_cid] = args[3]->tcps_snxt;
        @numclosed["swnd", args[2]->ip_saddr, args[4]->tcp_dport] = count();
}

/* New data sent at or beyond the recorded snxt: the send window has reopened. */
tcp:::send
/ swndclosed[args[1]->cs_cid] && args[4]->tcp_seq >= swndsnxt[args[1]->cs_cid] /
{
        @meantimeclosed["swnd", args[2]->ip_daddr, args[4]->tcp_sport] =
            avg(timestamp - swndclosed[args[1]->cs_cid]);
        @stddevtimeclosed["swnd", args[2]->ip_daddr, args[4]->tcp_sport] =
            stddev(timestamp - swndclosed[args[1]->cs_cid]);
        swndclosed[args[1]->cs_cid] = 0;
        swndsnxt[args[1]->cs_cid] = 0;
}

END
{
        printf("%-6s %-20s %-8s %-25s %-8s %-8s\n", "Window",
            "Remote host", "Port", "TCP Avg WndClosed(ns)", "StdDev",
            "Num");
        printa("%-6s %-20s %-8d %@-25d %@-8d %@-8d\n", @meantimeclosed,
            @stddevtimeclosed, @numclosed);
}
So this script will show us whether the peer's receive window size is preventing flow ("swnd" events) or whether congestion control is limiting flow ("cwnd" events). As an example, I traced on a server with a large file transfer in progress via a webserver and with an active ssh connection running "find / -depth -print". Here is the output:
^C
Window Remote host          Port     TCP Avg WndClosed(ns)     StdDev   Num
cwnd   10.175.96.92         80       86064329                  77311705 125
cwnd   10.175.96.92         22       122068522                 151039669 81
So we see that in this case the congestion window closes 125 times for the port 80 connections and 81 times for ssh. The average time the window is closed is 0.086 seconds for port 80 and 0.12 seconds for port 22.
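Note that the standard deviations are of the same order as the averages (larger, in the ssh case), which suggests the closed-window times are quite spread out rather than clustered around the mean. If you want to see the shape of that distribution, avg()/stddev() can be swapped for quantize(); the following is a sketch of the congestion-window half of the script above rewritten that way (the aggregation name is mine), not something used to produce the output shown.

#!/usr/sbin/dtrace -s
#pragma D option quiet

/* Record when the congestion window closes, as in the main script. */
tcp:::send
/ (args[3]->tcps_snxt - args[3]->tcps_suna) >= args[3]->tcps_cwnd /
{
        cwndclosed[args[1]->cs_cid] = timestamp;
        cwndsnxt[args[1]->cs_cid] = args[3]->tcps_snxt;
}

/* On reopening, feed the closed time into a power-of-two distribution. */
tcp:::send
/ cwndclosed[args[1]->cs_cid] && args[4]->tcp_seq >= cwndsnxt[args[1]->cs_cid] /
{
        @timeclosed[args[2]->ip_daddr, args[4]->tcp_dport] =
            quantize(timestamp - cwndclosed[args[1]->cs_cid]);
        cwndclosed[args[1]->cs_cid] = 0;
        cwndsnxt[args[1]->cs_cid] = 0;
}

The unconsumed aggregation is printed automatically when you interrupt the script, giving one histogram of closed times per remote host and port.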
So if you wish to change the congestion control algorithm in Oracle Solaris 11, a useful first step may be to see whether congestion really is an issue on your network. Scripts like the one posted above can help assess this, but it's worth reiterating that if congestion control is occurring, that's not necessarily a problem that needs fixing. Recall that congestion control is about controlling flow to prevent large-scale drops, so looking at congestion events in isolation doesn't tell us the whole story. For example, are we seeing more congestion events with one control algorithm, but more drops/retransmissions with another? As always, it's best to start with measures of throughput and latency before arriving at a specific hypothesis such as "my congestion control algorithm is sub-optimal".
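To put rough numbers on the drops/retransmissions side of that comparison, the tcpRetransSegs statistic reported by netstat -s gives a system-wide total, but it isn't broken down per connection. As far as I'm aware the stable tcp provider used above has no dedicated retransmit probe, so the sketch below relies on a heuristic assumption of mine: a non-SYN/RST/FIN segment sent with a sequence number below tcps_snxt has very likely been sent before. It will also catch things like keepalives and zero-window probes, so treat the counts as approximate.

#!/usr/sbin/dtrace -s
#pragma D option quiet

/* Heuristic: count apparently-retransmitted segments per remote host/port. */
tcp:::send
/ args[4]->tcp_seq < args[3]->tcps_snxt &&
  (args[4]->tcp_flags & (TH_SYN|TH_RST|TH_FIN)) == 0 /
{
        @retrans[args[2]->ip_daddr, args[4]->tcp_dport] = count();
}

END
{
        printf("%-20s %-8s %-12s\n", "Remote host", "Port", "Retransmits");
        printa("%-20s %-8d %@-12d\n", @retrans);
}

Run alongside the script above, this gives a crude congestion-events-versus-retransmissions comparison per peer when trying out different congestion control algorithms.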