Odd tcp deadlock under windows
- by John Robertson
We are moving large amounts of data on a LAN and it has to happen very rapidly and reliably. Currently we use windows TCP as implemented in C++. Using large (synchronous) sends moves the data much faster than a bunch of smaller (synchronous) sends but will frequently deadlock for large gaps of time (.15 seconds) causing the overall transfer rate to plummet. This deadlock happens in very particular circumstances which makes me believe it should be preventable altogether. More importantly if we don't really know the cause we don't really know it won't happen some time with smaller sends anyway. Can anyone explain this deadlock?
Deadlock description (OK, zombie-locked, it isn't dead, but for .15 or so seconds it stops, then starts again)
The receiving side sends an ACK.
The sending side sends a packet containing the end of a message (push flag is set)
The call to socket.recv takes about .15 seconds(!) to return
About the time the call returns an ACK is sent by the receiving side
The the next packet from the sender is finally sent (why is it waiting? the tcp window is plenty big)
The odd thing about (3) is that typically that call doesn't take much time at all and receives exactly the same amount of data. On a 2Ghz machine that's 300 million instructions worth of time. I am assuming the call doesn't (heaven forbid) wait for the received data to be acked before it returns, so the ack must be waiting for the call to return, or both must be delayed by something else.
The problem NEVER happens when there is a second packet of data (part of the same message) arriving between 1 and 2. That part very clearly makes it sound like it has to do with the fact that windows TCP will not send back a no-data ACK until either a second packet arrives or a 200ms timer expires. However the delay is less than 200 ms (its more like 150 ms).
The third unseemly character (and to my mind the real culprit) is (5). Send is definitely being called well before that .15 seconds is up, but the data NEVER hits the wire before that ack returns. That is the most bizarre part of this deadlock to me. Its not a tcp blockage because the TCP window is plenty big since we set SO_RCVBUF to something like 500*1460 (which is still under a meg). The data is coming in very fast (basically there is a loop spinning out data via send) so the buffer should fill almost immediately. According to msdn the buffer being full and at least one pending send should cause the data to be sent (though in another place it mentions that there various "heuristics" used in deciding when a send hits the wire).
Anway, why the sender doesn't actually send more data during that .15 second pause is the most bizarre part to me. The information above was captured on the receiving side via wireshark (except of course the socket.recv return times which were logged in a text file). We tried changing the send buffer to zero and turning off Nagle on the sender (yes, I know Nagle is about not sending small packets - but we tried turning Nagle off in case that was part of the unstated "heuristics" affecting whether the message would be posted to the wire. Technically microsoft's Nagle is that a small packet isn't sent if the buffer is full and there is an outstanding ACK, so it seemed like a possibility).