On sporadic occasions we receive the following error when attempting to call an .asmx web service from a .Net client application:
"The underlying connection was closed: A connection that was expected to be kept alive was closed by the server. Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host."
By sporadic I mean that it might occur zero, once every few days, or a half-dozen times a day for some users. It will never occur for the first web service call of a user. And the subsequent (usually the same) call will always work immediately after the failure. The failures happen across a variety of methods in the service and usually happens between 15-20 seconds (according to the log) from the time of the request.
Looking in the IIS site log for the particular call will show one or the other of the following windows error codes:
121: The semaphore timeout period has elapsed.
1236: The network connection was aborted by the local system.
Some additional environment details:
Running on internal network web farm consisting of two servers running IIS7 on Windows Server 2008 OS. These problems did not occur when running in an older IIS6 web farm of three servers running on Windows Server 2003 (and we use a single IIS6/2003 instance for our development and staging environments with no issues). EDIT: Also, all of these server instances are VMWare virtual machines, not sure if that is a surprise anymore or not.
The web service is a .Net 2.0/3.5 compiled .asmx web service that has its own application pool (.Net 2.0, integrated pipeline). Only has Windows Authentication enabled.
We have another web service on the farm that uses the same physical path as the primary service, the only difference being that Basic Authentication is enabled. This is used for a portion of our ERP system. Have tried using the same and different application pool - no effect on the error. This site isn't hit as often as the primary site and has never had an error.
As mentioned, the error will only happen when called from the .Net client - not from other applications. The client application is always creating a new web service object for each request and setting the service credentials to System.Net.CredentialCache.DefaultCredentials.
The application is either deployed locally to a client or run in a Citrix server session. Those users running in Citrix doesn't seem to experience the issue, only locally deployed clients. The Citrix servers and the web farm are located in the same physical location and are located in the same IP range (10.67.xx.xx). Locally deployed clients experiencing the error are located elsewhere (10.105.xx.xx, 10.31.xx.xx).
I've checked the OS logs to see if I can see any problems but nothing really sticks out.
EDIT: Actually, I myself just ran into the error a little bit ago. I decided to check out the logs again and saw that there was a Security log entry of "Audit Failure" at the 'same' time (IIS log entry at 1:39:59, event log entry at 1:39:50). Not sure if this is a coincidence or not, I'll have to check out the logs of previous errors. I'm probably grasping for straws but the details:
Log Name: Security
Source: Microsoft-Windows-Security-Auditing
Date: 7/8/2009 1:39:50 PM
Event ID: 5159
Task Category: Filtering Platform Connection
Level: Information
Keywords: Audit Failure
User: N/A
Computer: is071019.<**.net
Description:
The Windows Filtering Platform has blocked a bind to a local port.
Application Information:
Process ID: 1260
Application Name: \device\harddiskvolume1\windows\system32\svchost.exe
Network Information:
Source Address: 0.0.0.0
Source Port: 54802
Protocol: 17
Filter Information:
Filter Run-Time ID: 0
Layer Name: Resource Assignment
Layer Run-Time ID: 36
I've also tried to use Failed Request Tracing in IIS7 but the service call never actually gets to where FRT can capture it (even though the failure is logged in the web service log).
The network infrastructure group said they checked out the DNS and any NIC settings are correct so there is no 'flapping'. Everything pans out. I'm not sure that they checked out any domain controller servers though to see if that could be an issue.
Any ideas? Or any other debugging strategies to get to the bottom of this? I'm just the developer in charge of the software and don't really have the knowledge on what to investigate from the networking side of things - although it does sound like a networking issue to me based on what is happening.
Thanks in advance for any help.