Thursday, March 20, 2014

Harvest and Report on SQL Server IO

I'm a big advocate of automation, and as a result I often create processes and reports to help with such automation.  My most recent task was to create a process to properly report on IO stall/wait stats.  A recent post of mine called IO Stats - Putting Lipstick on the Pig was a report I created to lay on top of statistics that derive from querying sys.fn_virtualfilestats.  At the end of the day, it's still a pig.  While various DBA's have shared queries that do computations on results from that function, it's very misleading because the data is cumulative.  That means the IO wait stats include all your maintenance activity: backups, index rebuild/reorgs, DBCC activity, etc.  Including all that activity can and will dramatically skew the data.  In fact the results I get for IO waits on one of my TempDB files is >200ms currently when in reality I rarely see values exceed 5ms on a per-hour basis.

I would like to share a better way of harvesting and reporting on your IO data.  Now, I personally have no problem putting these objects in the master database - adjust to suit your needs accordingly.  I will admit that a portion of this process was found online - if/when I find that author, I will credit them accordingly.  The summarizing and reporting portion is what I put together.

  1. First, you need to create the tables (all links go to my public GDrive folder)
  2. Next, create the stored procedure that will harvest the the snapshots
    • You will note that this procedure uses the compare process to store summarized data for reporting purposes
  3. Then, create the stored procedure that will compare the snapshots
  4. Lastly, create a scheduled job.  My reports are based on hourly snapshots, so if you change the frequency you may have to adjust some things.
    • exec master.dbo.sp_IOStatsSnapshot
That's it! Your data will start to collect as soon as your job schedule kicks in.  The data won't really mean much until a couple days go by so you'll have plenty of time to get the reports in place.

The first report is an IO summary report (Figure 1) that computes differentials based on the times you select.  So with this report you can get the IO wait stat summary for core hours, for example.  As is the case with all RDL I share, you need to change the data source information.

The second report is an IO analysis report (Figure 2) that shows the hourly data, highlighting metrics that are outside certain ranges.  I go with the standard recommendation that a read IO should take <20ms while a write should take <4ms.  The other values I chose for warning and critical were personal preferences and can be adjusted accordingly.

*Recent updates to the second report (Figure 2):
  • You can now type in a list of instances and the file listing will automatically pull from that instance via a dynamic data source - this keeps from having one report per instance
  • I changed the file listing to come from sys.master_files so it's always got just the current files and not those that may have moved drives and/or instances

Figure 1

Figure 2