Wednesday, May 7, 2014

SQL Server Wait Stats Harvesting and Reporting

One the the keys - if not THE key - to identifying SQL Server performance issues is by investigating wait statistics.  As is the case with various SQL Server performance metrics, however, is the fact that these stats are cumulative.  That means you can't just query sys.dm_os_wait_stats and determine the root cause of an issue at a particular point in time unless your issue is happening right now.  If that were the case, you run one of the various scripts to calculate the differential of multiple sys.dm_os_wait_stats executions (example from Brent Ozar).  But what if your issue occurred off-hours, during the night, while you were traveling or in a meeting?

While I realize there are various ways to go about analyzing wait stats, I decided to put together a solution that was very easy to manage, deploy and write custom reports from.  You will need to configure a Central Management Server (CMS) and register your target servers - the servers you want to deploy this solution to.

Once you have your CMS configured, you will create a SQL Server Agent job ("Collect and Cycle Wait Stats", for example).  The job is going to have four steps:
  1. Check for DBADB database, which is the database I keep all DBA related scripts and data in
    • If not present, create it
    • Get script here
  2. Check for table that will house the wait stats
    • If not present, create it
    • Get script here
  3. Collect wait stats
    • I personally harvest the wait stats hourly but you may adjust as you see fit
    • This is a pretty typical query you will find in most blogs
    • Get script here
  4. Clear wait stats
    • By clearing the wait stats, I know each hourly harvest are the wait stats just for that hour and I don't have to do any differential calculations
    • If I want to know wait stats to a point in time, I can add them up accordingly
    • Get script here
  5. Report on it!
    • Preview of report below
    • Get the RDL here
By checking for and creating the database and table if they aren't present, you can deploy this by going to your CMS checking a box to target another instance.

Below is a preview of a report I just put together.  I'll likely update it in the coming days/weeks to better highlight anomalies, but this works for now.  You choose a date and the top 95% of waits are summarized by hour, which you can further drill into by clicking the plus sign by the hour on the left.  I added a few helpful reminders/notes from Paul Randal and Joe Sack's blogs.  As you can see, I had a CPU issue on this particular day at or around 10am as indicated by the elevated/irregular SOS_SCHEDULER_YIELD signal waits.