WCF timeout exception detailed investigation
- by Jason Kealey
We have an application that has a WCF service (*.svc) running on IIS7 and various clients querying the service. The server is running Win 2008 Server. The clients are running either Windows 2008 Server or Windows 2003 server. I am getting the following exception, which I have seen can in fact be related to a large number of potential WCF issues.
System.TimeoutException: The request channel timed out while waiting for a reply after 00:00:59.9320000. Increase the timeout value passed to the call to Request or increase the SendTimeout value on the Binding. The time allotted to this operation may have been a portion of a longer timeout. ---> System.TimeoutException: The HTTP request to 'http://www.domain.com/WebServices/myservice.svc/gzip' has exceeded the allotted timeout of 00:01:00. The time allotted to this operation may have been a portion of a longer timeout.
I have increased the timeout to 30min and the error still occurred. This tells me that something else is at play, because the quantity of data could never take 30min to upload or download.
The error comes and goes. At the moment, it is more frequent. It does not seem to matter if I have 3 clients running simultaneously or 100, it still occurs once in a while. Most of the time, there are no timeouts but I still get a few per hour. The error comes from any of the methods that are invoked. One of these methods does not have parameters and returns a bit of data. Another takes in lots of data as a parameter but executes asynchronously. The errors always originate from the client and never reference any code on the server in the stack trace. It always ends with:
at System.Net.HttpWebRequest.GetResponse()
at System.ServiceModel.Channels.HttpChannelFactory.HttpRequestChannel.HttpChannelRequest.WaitForReply(TimeSpan timeout)
On the server:
I've tried (and currently have) the following binding settings:
maxBufferSize="2147483647" maxReceivedMessageSize="2147483647" maxBufferPoolSize="2147483647"
It does not seem to have an impact.
I've tried (and currently have) the following throttling settings:
<serviceThrottling maxConcurrentCalls="1500" maxConcurrentInstances="1500" maxConcurrentSessions="1500"/>
It does not seem to have an impact.
I currently have the following settings for the WCF service.
[ServiceBehavior(InstanceContextMode = InstanceContextMode.Single, ConcurrencyMode = ConcurrencyMode.Single)]
I ran with ConcurrencyMode.Multiple for a while, and the error still occurred.
I've tried restarting IIS, restarting my underlying SQL Server, restarting the machine. All of these don't seem to have an impact.
I've tried disabling the Windows firewall. It does not seem to have an impact.
On the client, I have these settings:
maxReceivedMessageSize="2147483647"
<system.net>
<connectionManagement>
<add address="*" maxconnection="16"/>
</connectionManagement>
</system.net>
My client closes its connections:
var client = new MyClient();
try
{
return client.GetConfigurationOptions();
}
finally
{
client.Close();
}
I have changed the registry settings to allow more outgoing connections:
MaxConnectionsPerServer=24, MaxConnectionsPer1_0Server=32.
I have now just recently tried SvcTraceViewer.exe. I managed to catch one exception on the client end. I see that its duration is 1 minute. Looking at the server side trace, I can see that the server is not aware of this exception. The maximum duration I can see is 10 seconds.
I have looked at active database connections using exec sp_who on the server. I only have a few (2-3). I have looked at TCP connections from one client using TCPview. It usually is around 2-3 and I have seen up to 5 or 6.
Simply put, I am stumped. I have tried everything I could find, and must be missing something very simple that a WCF expert would be able to see. It is my gut feeling that something is blocking my clients at the low-level (TCP), before the server actually receives the message and/or that something is queuing the messages at the server level and never letting them process.
If you have any performance counters I should look at, please let me know. (please indicate what values are bad, as some of these counters are hard to decypher). Also, how could I log the WCF message size? Finally, are there any tools our there that would allow me to test how many connections I can establish between my client and server (independently from my application)
Thanks for your time!
Extra information added June 20th:
My WCF application does something similar to the following.
while (true)
{
Step1GetConfigurationSettingsFromServerViaWCF(); // can change between calls
Step2GetWorkUnitFromServerViaWCF();
DoWorkLocally(); // takes 5-15minutes.
Step3SendBackResultsToServerViaWCF();
}
Using WireShark, I did see that when the error occurs, I have a five TCP retransmissions followed by a TCP reset later on. My guess is the RST is coming from WCF killing the connection. The exception report I get is from Step3 timing out.
I discovered this by looking at the tcp stream "tcp.stream eq 192". I then expanded my filter to "tcp.stream eq 192 and http and http.request.method eq POST" and saw 6 POSTs during this stream. This seemed odd, so I checked with another stream such as tcp.stream eq 100. I had three POSTs, which seems a bit more normal because I am doing three calls. However, I do close my connection after every WCF call, so I would have expected one call per stream (but I don't know much about TCP).
Investigating a bit more, I dumped the http packet load to disk to look at what these six calls where.
1) Step3
2) Step1
3) Step2
4) Step3 - corrupted
5) Step1
6) Step2
My guess is two concurrent clients are using the same connection, that is why I saw duplicates. However, I still have a few more issues that I can't comprehend:
a) Why is the packet corrupted? Random network fluke - maybe? The load is gzipped using this sample code: http://msdn.microsoft.com/en-us/library/ms751458.aspx - Could the code be buggy once in a while when used concurrently? I should test without the gzip library.
b) Why would I see step 1 & step 2 running AFTER the corrupted operation timed out? It seems to me as if these operations should not have occurred. Maybe I am not looking at the right stream because my understanding of TCP is flawed. I have other streams that occur at the same time. I should investigate other streams - a quick glance at streams 190-194 show that the Step3 POST have proper payload data (not corrupted). Pushing me to look at the gzip library again.