One morning, you wake up and feel funny. You can’t quite put your finger on it, but something isn’t quite right. What now? Unless you happen to be a hypochondriac, you likely drag yourself out of bed, get on with the day and gather more “evidence”. You check your symptoms over the next few days; do you feel the same, better, worse? If better, then great, it was some temporal issue, perhaps caused by an allergic reaction to some suspiciously spicy chicken. If the same or worse then you go to the doctor for some health advice, but armed with some data to share, and having ruled out certain possible causes that are fixed with a bit of rest and perhaps an antacid. Whether you realize it or not, in comparing how you feel one day to the next, you have taken baseline measurements. In much the same way, a DBA uses baselines to gauge the gauge health of their database servers. Of course, while SQL Server is very willing to share data regarding its health and activities, it has almost no idea of the difference between good and bad.
Over time, experienced DBAs develop “mental” baselines with which they can gauge the health of their servers almost as easily as their own body. They accumulate knowledge of the daily, natural state of each part of their database system, and so know instinctively when one of their databases “feels funny”. Equally, they know when an “issue” is just a passing tremor. They see their SQL Server with all of its four CPU cores running close 100% and don’t panic anymore. Why? It’s 5PM and every day the same thing occurs when the end-of-day reports, which are very CPU intensive, are running. Equally, they know when they need to respond in earnest when it is the first time they have heard about an issue, even if it has been happening every day.
Nevertheless, no DBA can retain mental baselines for every characteristic of their systems, so we need to collect physical baselines too. In my experience, surprisingly few DBAs do this very well. Part of the problem is that SQL Server provides a lot of instrumentation. If you look, you will find an almost overwhelming amount of data regarding user activity on your SQL Server instances, and use and abuse of the available CPU, I/O and memory. It seems like a huge task even to work out which data you need to collect, let alone start collecting it on a regular basis, managing its storage over time, and performing detailed comparative analysis.
However, without baselines, though, it is very difficult to pinpoint what ails a server, just by looking at a single snapshot of the data, or to spot retrospectively what caused the problem by examining aggregated data for the server, collected over many months.
It isn’t as hard as you think to get started. You’ve probably already established some troubleshooting queries of the type SELECT Value FROM SomeSystemTableOrView. Capturing a set of baseline values for such a query can be as easy as changing it as follows:
INSERT into BaseLine.SomeSystemTable (value, captureTime)
SELECT Value, SYSDATETIME() FROM SomeSystemTableOrView;
Of course, there are monitoring tools that will collect and manage this baseline data for you, automatically, and allow you to perform comparison of metrics over different periods. However, to get yourself started and to prove to yourself (or perhaps the person who writes the checks for tools) the value of baselines, stick something similar to the above query into an agent job, running every hour or so, and you are on your way with no excuses! Then, the next time you investigate a slow server, and see x open transactions, y users logged in, and z rows added per hour in the Orders table, compare to your baselines and see immediately what, if anything, has changed!