I am responsible for a third-party application (no access to source) running on IIS and SQL Server 2005 (500 concurrent users, 1TB data, 8 IIS servers). We have recently started to see significant blocking on the database (after months of running this application in production with no problems). This occurs at random intervals during the day, approximately every 30 minutes, and affects between 20 and 100 sessions each time. All of the sessions eventually hit the application time out and the sessions abort.
The problem disappears and then gradually re-emerges. The SPID responsible for the blocking always has the following features:
WAIT TYPE = ASYNC_NETWORK_IO
The SQL being run is “(@claimid
varchar(15))SELECT claimid, enrollid,
status, orgclaimid, resubclaimid,
primaryclaimid FROM claim WHERE
primaryclaimid = @claimid AND
primaryclaimid < claimid)”. This is
relatively innocuous SQL that should
only return one or two records, not a
large dataset.
NO OTHER SQL statements have been
implicated in the blocking, only this
SQL statement.
This is parameterized SQL for which
an execution plan is cached in
sys.dm_exec_cached_plans.
This SPID has an object-level S lock on the claim table, so all UPDATEs/INSERTs to the claim table are also blocked.
HOST ID varies. Different web servers are responsible for the blocking sessions. E.g., sometimes we trace back to web server 1, sometimes web server 2.
When we trace back to the web server implicated in the blocking, we see the following:
There is always some sort of
application related error in the
Event Log on the web server, linked
to the Host ID and Host Process ID
from the SQL Session.
The error messages vary, usually some
sort of SystemOutofMemory. (These
error messages seem to be similar to
error messages that we have seen in
the past without such dramatic
consequences. We think was happening
before, but didn’t lead to blocking.
Why now?)
No known problems with the network
adapters on either the web servers or
the SQL server.
(In any event the record set returned by the offending query would be small.)
Things ruled out:
Indexes are regularly defragmented.
Statistics regularly updated.
Increased sample size of statistics
on claim.primaryclaimid.
Forced recompilation of the cached
execution plan.
Created a compound index with
primaryclaimid, claimid.
No networking problems.
No known issues on the web server.
No changes to application software on
web servers.
We hypothesize that the chain of events goes something like this:
Web server process submits SQL
above.
SQL server executes the SQL, during
which it acquires a lock on the
claim table.
Web server process gets an error and
dies.
SQL server session is hung waiting
for the web server process to read
the data set.
SQL Server sessions that need to get
X locks on parts of the claim table
(anyone processing claims) are
blocked by the lock on the claim
table and remain blocked until they
all hit the application time out.
Any suggestions for troubleshooting while waiting for the vendor's assistance would be most welcome.
Is there a way to force SQL Server to lock at the row/page level for this particular SQL statement only?
Is there a way to set a threshold on ASYNC_NETWORK_IO waits only?