Exploring TCP throughput with DTrace
- by user12820842
One key measure when assessing TCP throughput is the amount of unacknowledged data in the pipe. This is sometimes termed the Bandwidth Delay Product (BDP) (note that BDP is often used more generally to mean the product of the link capacity and the end-to-end delay). In DTrace terms, the amount of unacknowledged data in bytes for the connection is the difference between the next sequence number to send and the lowest unacknowledged sequence number (tcps_snxt - tcps_suna). According to the theory, when the number of unacknowledged bytes for the connection is less than the receive window of the peer, the path bandwidth is the limiting factor for throughput. In other words, if we can fill the pipe without the peer TCP complaining (by virtue of its window size reaching 0), we are purely bandwidth-limited. If the peer's receive window is too small, however, the sending TCP has to wait for acknowledgements before it can send more data, and the round-trip time (RTT) limits throughput. In such cases the effective throughput limit is the window size divided by the RTT, e.g. if the window size is 64K and the RTT is 0.5 sec, the throughput limit is 128K/s.
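The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the modulo-2**32 subtraction is an assumption added here to mirror the fact that TCP sequence numbers are 32-bit values that wrap.

```python
def unacked_bytes(snxt: int, suna: int) -> int:
    """Bytes sent but not yet acknowledged (tcps_snxt - tcps_suna).

    TCP sequence numbers are 32-bit and wrap around, so the
    subtraction is done modulo 2**32.
    """
    return (snxt - suna) % (1 << 32)


def window_limited_throughput(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput in bytes/sec when the receive window is the limit."""
    if rtt_seconds <= 0:
        raise ValueError("RTT must be positive")
    return window_bytes / rtt_seconds


if __name__ == "__main__":
    # The example from the text: 64K window, 0.5 sec RTT.
    limit = window_limited_throughput(64 * 1024, 0.5)
    print(limit / 1024)  # 128.0, i.e. 128K/s
```

Halving the RTT or doubling the window doubles this limit, which is why a small receive window hurts most on long-delay paths.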
So a neat way to visually determine whether the receive window of clients may be too small is to compare the distribution of BDP values for the server with the distribution of the client's advertised receive window, which the server sees as its send window (SWND). If the BDP distribution overlaps the send window distribution such that it is to the right (or lower down in DTrace, since quantizations are displayed vertically), the amount of unacknowledged data regularly exceeds the client's receive window, so the sender may have more data to send but be blocked by a zero window on the client side.
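To read the quantizations, it helps to remember that each row is a power-of-two bucket labelled with its lower bound, so the row labelled 32768 counts values from 32768 to 65535. A rough Python sketch of that bucketing follows. It handles non-negative values only, the function names are invented for this illustration, and the sample values are hypothetical rather than taken from any measurement.

```python
from collections import Counter


def quantize_bucket(value: int) -> int:
    """Lower bound of the power-of-two bucket holding a non-negative value."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative values")
    if value == 0:
        return 0
    # The largest power of two that is <= value.
    return 1 << (value.bit_length() - 1)


def quantize(samples):
    """Count samples per bucket, roughly what a quantize() aggregation reports."""
    return Counter(quantize_bucket(s) for s in samples)


if __name__ == "__main__":
    # Hypothetical samples, for illustration only.
    unacked = [0, 300, 40000, 70000, 90000]
    window = [50000, 50000, 50000]
    print(sorted(quantize(unacked).items()))
    print(sorted(quantize(window).items()))
```

With these made-up samples, two of the unacknowledged-data values land in the 65536 row while every window sample lands in the 32768 row, which is the "lower down" pattern described above.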
In the following example, we compare the distribution of BDP values to the receive window advertised by the receiver (10.175.96.92) for a large file download via HTTP.
# dtrace -s tcp_tput.d
^C
BDP(bytes) 10.175.96.92 80
value ------------- Distribution ------------- count
-1 | 0
0 | 6
1 | 0
2 | 0
4 | 0
8 | 0
16 | 0
32 | 0
64 | 0
128 | 0
256 | 3
512 | 0
1024 | 0
2048 | 9
4096 | 14
8192 | 27
16384 | 67
32768 |@@ 1464
65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 32396
131072 | 0
SWND(bytes) 10.175.96.92 80
value ------------- Distribution ------------- count
16384 | 0
32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 17067
65536 | 0
Here we have a puzzle. We can see that the receiver's advertised window is in the 32768-65535 range, while the amount of unacknowledged data in the pipe is largely in the 65536-131071 range. What's going on here? Surely in a case like this we should see zero-window events, since the amount of data in the pipe regularly exceeds the window size of the receiver. Yet there are no zero-window events: the SWND distribution displays no 0 values and stays within the 32768-65535 range.
The explanation is straightforward enough. TCP window scaling is in operation for this connection: the Window Scale TCP option is used on connection setup to allow a connection to advertise (and have advertised to it) a window greater than 65535 bytes, the largest value the 16-bit window field can hold. In this case the scaling shift is 1, which explains why the SWND values are clustered in the 32768-65535 range rather than the 65536-131071 range: the advertised window value needs to be multiplied by two, since the receiver is scaling its window by a shift factor of 1.
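The scaling step itself is just a left shift of the 16-bit window field. A minimal Python sketch follows, with a hypothetical function name and a made-up advertised value. The upper bound of 14 on the shift comes from the window scale specification (RFC 7323).

```python
def scaled_window(advertised: int, shift: int) -> int:
    """Effective window in bytes: the 16-bit header value shifted left by the scale."""
    if not 0 <= advertised <= 0xFFFF:
        raise ValueError("the TCP window field is 16 bits")
    if not 0 <= shift <= 14:
        raise ValueError("the window scale shift is limited to 14")
    return advertised << shift


if __name__ == "__main__":
    # Hypothetical advertised value with a shift of 1, as in the text.
    print(scaled_window(40000, 1))  # 80000
```

An advertised value in the 32768-65535 range therefore becomes an effective window in the 65536-131071 range once a shift of 1 is applied.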
Here's the simple script that compares BDP and SWND distributions, fixed to take account of window scaling.
#!/usr/sbin/dtrace -s
#pragma D option quiet
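
/* Unacknowledged bytes on each data segment sent (SYN, RST and FIN excluded). */
tcp:::send
/ (args[4]->tcp_flags & (TH_SYN|TH_RST|TH_FIN)) == 0 /
{
	@bdp["BDP(bytes)", args[2]->ip_daddr, args[4]->tcp_sport] =
	    quantize(args[3]->tcps_snxt - args[3]->tcps_suna);
}

/* Window advertised by the peer, scaled by the send window scale shift. */
tcp:::receive
/ (args[4]->tcp_flags & (TH_SYN|TH_RST|TH_FIN)) == 0 /
{
	@swnd["SWND(bytes)", args[2]->ip_saddr, args[4]->tcp_dport] =
	    quantize((args[4]->tcp_window) * (1 << args[3]->tcps_snd_ws));
}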
And here's the fixed output.
# dtrace -s tcp_tput_scaled.d
^C
BDP(bytes) 10.175.96.92 80
value ------------- Distribution ------------- count
-1 | 0
0 | 39
1 | 0
2 | 0
4 | 0
8 | 0
16 | 0
32 | 0
64 | 0
128 | 0
256 | 3
512 | 0
1024 | 0
2048 | 4
4096 | 9
8192 | 22
16384 | 37
32768 |@ 99
65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 3858
131072 | 0
SWND(bytes) 10.175.96.92 80
value ------------- Distribution ------------- count
512 | 0
1024 | 1
2048 | 0
4096 | 2
8192 | 4
16384 | 7
32768 | 14
65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1956
131072 | 0