I recently received the following email from a blog reader:
"We are having an OLTP database instance, using SQL Server 2005 with little to moderate traffic (10-20 requests/min). There are also bulk imports that occur at regular intervals in this DB and the import duration ranges between 10secs to 1 min, depending on the data size. Intermittently (2-3 times in a week), we face an issue, where queries get timed out (default of 30 secs set in application). On analyzing, we found two stored procedures, having queries with multiple table joins inside them of taking a long time (5-10 mins) in getting executed, when ideally the execution duration ranges between 5-10 secs. Execution plan of the same displayed Clustered Index Scan happening instead of Clustered Index Seek. All required Indexes are found to be present and Index fragmentation is also minimal as we Rebuild Indexes regularly alongwith Updating Statistics. With no other alternate options occuring to us, we restarted SQL server and thereafter the performance was back on track. But sometimes it was still giving timeout errors for some hits and so we also restarted IIS and that stopped the problem as of now."
Rather than respond directly to the blog reader, I thought it would be more interesting to share my thoughts on this issue in a blog.
There are a few things that I can think of that could cause abnormal timeouts:
Blocking
Bad plan in cache
Outdated statistics
Hardware bottleneck
To determine if blocking is the issue, we can easily run sp_who/sp_who2 or a query directly on sysprocesses (select * from master..sysprocesses where blocking <> 0). If blocking is present and consistent, then you'll need to determine whether or not to kill the parent blocking process. Killing a process will cause the transaction to rollback, so you need to proceed with caution. Killing the parent blocking process is only a temporary solution, so you'll need to do more thorough analysis to figure out why the blocking was present. You should look into missing indexes and perhaps consider changing the database's isolation level to READ_COMMITTED_SNAPSHOT.
The blog reader mentions that the execution plan shows a clustered index scan when a clustered index seek is normal for the stored procedure. A clustered index scan might have been chosen either because that is what is in cache already or because of out of date statistics. The blog reader mentions that bulk imports occur at regular intervals, so outdated statistics is definitely something that could cause this issue. The blog reader may need to update statistics after imports are done if the imports are changing a lot of data (greater than 10%). If the statistics are good, then the query optimizer might have chosen to scan rather than seek in a previous execution because the scan was determined to be less costly due to the value of an input parameter. If this parameter value is rare, then its execution plan in cache is what we call a bad plan. You want the best plan in cache for the most frequent parameter values. If a bad plan is a recurring problem on your system, then you should consider rewriting the stored procedure. You might want to break up the code into multiple stored procedures so that each can have a different execution plan in cache.
To remove a bad plan from cache, you can recompile the stored procedure. An alternative method is to run DBCC FREEPROCACHE which drops the procedure cache. It is better to recompile stored procedures rather than dropping the procedure cache as dropping the procedure cache affects all plans in cache rather than just the ones that were bad, so there will be a temporary performance penalty until the plans are loaded into cache again.
To determine if there is a hardware bottleneck occurring such as slow I/O or high CPU utilization, you will need to run Performance Monitor on the database server. Hopefully you already have a baseline of the server so you know what is normal and what is not. Be on the lookout for I/O requests taking longer than 12 milliseconds and CPU utilization over 90%. The servers that I support typically are under 30% CPU utilization, but your baseline could be higher and be within a normal range.
If restarting the SQL Server service fixes the problem, then the problem was most likely due to blocking or a bad plan in the procedure cache. Rather than restarting the SQL Server service, which causes downtime, the blog reader should instead analyze the above mentioned things. Proceed with caution when restarting the SQL Server service as all transactions that have not completed will be rolled back at startup. This crash recovery process could take longer than normal if there was a long-running transaction running when the service was stopped. Until the crash recovery process is completed on the database, it is unavailable to your applications.
If restarting IIS fixes the problem, then the problem might not have been inside SQL Server. Prior to taking this step, you should do analysis of the above mentioned things.
If you can think of other reasons why the blog reader is facing this issue a few times a week, I'd love to hear your thoughts via a blog comment.