Timeout Considerations for Solicit Response
- by Michael Stephenson
Background
One of the clients I work with had been experiencing some issues for a while surrounding web service timeouts. It's been a little challenging to work through the problems due to limitations in the diagnostic information available from one of the applications, but I learned some interesting things while troubleshooting the problem which don't seem to have been discussed much in the community so I thought I'd share my findings.
In the scenario we have BizTalk trying to make calls to a .net web service which was exposed as a WSE 2 endpoint. In the process BizTalk will try to make a large number of concurrent web service calls to the application, and the backend application has more than enough infrastructure and capability to handle the load.
We have configured the <ConnectionManagement> section of the BizTalk configuration file to support up to 100 concurrent connections from each of our 2 BizTalk send servers to the web servers of the application.
The problem we were facing was that the BizTalk side was reporting a significant number of timeouts when calling the web service.
One of the biggest issues was the challenge of being able to correlate a message from BizTalk to the IIS log in the .net application and the custom logs in the application especially when there was a fairly large number of servers hosting the web services. However the key moment came when we were able to identify a specific call which had taken 40 seconds to execute on the server (yes a long time I know but that's a different story!). Anyway we were able to identify that this had timed out on the BizTalk side. Based on the normal 2 minute timeout we knew something unexpected was going on.
From here I decided to do some experimentation and I wanted to start outside of BizTalk because my hunch was this was not a BizTalk behaviour but something which was being highlighted by BizTalk because of our large load.
Server-side - Sample Web Service
To begin with I created a sample web service. Nothing special just a vanilla asmx web service hosted in IIS6 on Windows 2003 Standard Edition. The web service is just a hello world style web service as shown in the below picture. The only key feature is that the server side web method has a 30 second sleep in it and will trace out some information before and after the thread is set to sleep.
In the configuration for this web service there again is nothing special it's pretty much the most plain simple web service you could build.
Client-Side
To begin looking at what was happening with our example I created a number of different ways to consume the web service.
SoapHttpClientProtocol Example
I created a small application which would use a normal proxy generated to call the web service. It would iterate around a loop and make calls using the begin/end methods so I can do this asynchronously. I would do a loop of 20 calls with the ConnectionManager configuration section supporting only 5 concurrent connections to the server.
<connectionManagement>
<remove address="*"/>
<add address = "*" maxconnection = "12" />
<add address = "http://<ServerName>" maxconnection = "5" />
</connectionManagement>
</system.net>
The below picture shows an example of the service calling code, key points are:
I have configured the timeout of 40 seconds for the proxy
I am using the asynchronous methods on the proxy to call the web service
The Test
I would run the client and execute 21 calls to the web service.
The Results
Below is the client side trace showing what's happening on the client.
In the below diagram is the web service side trace showing what's happening on the server
Some observations on the results are:
All of the calls were successful from the clients perspective
You could see the next call starting on the server as soon as the previous one had completed
Calls took significantly longer than 40 seconds from the start of our call to the return. In fact call 20 took 2 minutes and 30 seconds from the perspective of my code to execute even though I had set the timeout to 40 seconds
WSE 2 Sample
In the second example I used the exact same code to call the web service again with a single exception that I modified the web service proxy to derive from WebServiceClient protocol which is part of WSE 2 (using SP3). The below picture shows the basic code and the key points are:
I have configured the timeout of 40 seconds for the proxy
I am using the asynchronous methods on the proxy to call the web service
The Test
This test would execute 21 calls from the client to the web service.
The Results
The below trace is from the client side:
The below trace is from the server side:
Some observations on the trace results for this scenario are:
With call 4 if you look at the server side trace it did not start executing on the server for a number of seconds after the other 4 initial calls which were accepted by the server. I re-ran the test and this happened a couple of times and not on most others so at this point I'm just putting this down to something unexpected happening on the development machine and we will leave this observation out of scope of this article.
You can see that the client side trace statement executed almost immediately in all cases
All calls after the initial few calls would timeout
On the client side the calls that did timeout; timed out in a longer duration than the 40 seconds we set as the timeout
You can see that as calls were completing on the server the next calls were starting to come through
The calls that timed out on the client did actually connect to the server and their server side execution completed successfully
Elaboration on the findings
Based on the above observations I have drawn the below sequence diagram to illustrate conceptually what is happening. Everything except the final web service object is on the client side of the call. In the diagram below I've put two notes on the Web Service Proxy to show the two different places where the different base classes seem to start their timeout counters.
From the earlier samples we can work out that the timeout counter for the WSE web service proxy starts before the one for the SoapHttpClientProtocol proxy and the WSE one includes the time to get a connection from the pool; whereas the Soap proxy timeout just covers the method execution.
One interesting observation is if we rerun the above sample and increase the number of calls from 21 to 100,000 then for the WSE sample we will see a similar pattern where everything after the first few calls will timeout on the client as soon as it makes a connection to the server whereas the soap proxy will happily plug away and process all of the calls without a single timeout. I have actually set the sample running overnight and this did happen.
At this point you are probably thinking the same thoughts I was at the time about the differences in behaviour and which is right and why are they different? I'm not sure there is a definitive answer to this in the documentation, or at least not that I could find! I think you just have to consider that they are different and they could have different effects depending on your messaging solution.
In lots of situations this is just not an issue as your concurrent requests doesn't get to the situation where you end up throttling the web service calls on the client side, however this is definitely more common with an integration broker such as BizTalk where you often have high throughput requirements.
Some of the considerations you should make
Based on this behaviour you should be aware of the following:
In a .net application if you are making lots of concurrent web service calls from an application in an asynchronous manner your user may thing they are experiencing poor performance but you think your web service is working well. The problem could be that the client will have a default of 2 connections to remote servers so you should bear this in mind
When you are developing a BizTalk solution or a .net solution with the WSE 2 stack you may experience timeouts under load and throttling the number of connections using the max connections element in the configuration file will not help you
For an application using WSE2 or SoapHttpClientProtocol an expired timeout will not throw an error until after a connection to the server has been made so you should consider this in your transaction and durability patterns
Our Work Around
In the short term for our specific scenario we know that we can handle this by just increasing our timeout value. There is only a specific small window when we get lots of concurrent traffic that causes this scenario so we should be able to increase the timeout to take into consideration the additional client side wait, and on the odd occasion where we do get a timeout the BizTalk send port retry will handle this.
What was causing our original problem was that for that short window we were getting a lot of retries which significantly increased the load on our send servers and highlighted the issue.
Longer Term Solution
As a longer term solution this really gives us more ammunition to argue a migration to WCF. The application we are calling has some factors which limit the protocols we can use but with WCF we would have more control on the various timeout options because in WCF you can configure specific parts of the timeout.
Summary
I've had this blog post on my to do list for ages but hopefully it will be useful to some people to just understand this behaviour and to possibly help you with some performance issues you may have. I do not believe there is too much in the way of documentation particularly around WSE2 and ASMX in this area so again another bit of ammunition for migrating to WCF.
I'll try to do a follow up post with the sample for WCF to show how this changes things.