We have a bunch of WCF services that work almost all of the time, using various bindings, ports, max sizes, etc. The super-frustrating thing about WCF is that when it (rarely) fails, we are powerless to find out why it failed. Sometimes you will get a message that looks like this:
System.ServiceModel.CommunicationException:
The socket connection was aborted.
This could be caused by an error
processing your message or a receive
timeout being exceeded by the remote
host, or an underlying network
resource issue. Local socket timeout
was '01:00:00'. ---
System.IO.IOException: Unable to read
data from the transport connection: An
existing connection was forcibly
closed by the remote host.
The problem is that the local socket timeout it's giving you is merely an attempt to be convenient. It may or may not be the cause of the problem. But OK, sometimes networks have issues. No big deal. We can retry or something. But here's the huge problem. On top of failing to tell you which precisely which timeout (if any) resulted in the failure ("your server-side receive timeout was exceeded," or something, would be helpful), WCF seems to have two types of timeouts.
Timeout Type #1) A timeout, that, if increased, would increase the chance of your operation's success. So, the pertinent timeout is an hour, you are uploading a huge file that will take an hour and twenty minutes. It fails. You increase the timeout, it succeeds. I have no no problem with this type of timeout.
Timeout Type #2) A timeout which merely defines how long you have to wait for the service to actually fail and give you an error, but modifying the value of this timeout has no impact on the chance of success. Basically, something happens during the first second of the service request which mucks things up. It will never recover. WCF doesn't magically retry the network connection for you. Fine, sometimes establishing a network connection doesn't go well. But, if your timeout is 2 hours, you have to wait 2 whole hours with no chance of it ever working before it finally acknowledges that it didn't work and gives you the error.
But the error you see in both cases looks the same. With timeout Type #2, it still looks like you are running into a timeout. But, you could increase all of your timeouts to 4 years, and all it would do is make it take 4 years to get an error message. I know that Type #2 exists because I can do an operation that is known to complete in less than a minute when successful, and have it take 2 hours to fail. But, if I kill it and retry, it succeeds quickly. (If you are wondering why there might be a 2 hour timeout on an operation that takes less than a minute, there are times I run the operation with a much larger file and it could take over an hour.)
So, to combat the problem with Type #2, you'd want your timeout to be really quick so you immediately know if there is a problem. Then you can retry. But the insurmountable problem is that because I don't know which timeouts are the cause of failure, I don't know what timeouts are Type #1 and which ones are Type #2. There may be one timeout (let's say the client-side send timeout) that acts like Type #1 in some cases and Type #2 in others. I have no idea, and I have no way of finding out.
Does anyone know how to track down Type #2 timeouts so I can set them to low values without having to shorten actual (read: Type #1) timeouts and lower the chance of success?
Thank you.