Tuesday, May 13, 2014

SQL Server Spinlock Reference

I've been troubleshooting some performance issues of late, and it was brought to my attention that there is yet another DMV that may shed some light on the subject: sys.dm_os_spinlock_stats.

First, what is a spinlock?
Per Microsoft's whitepaper Diagnosing and Resolving Spinlock Contention on SQL Server: "Spinlocks are lightweight synchronization primitives which are used to protect access to data structures.  They are generally used when it is expected that access to a given data structure will need to be held for a very short period of time."

How does a spinlock differ than a latch?
Per the whitepaper: "The main difference between a latch and a spinlock is the fact that spinlocks will spin (execute a loop) for a period of time checking for availability of a data structure while a thread attempting to acquire access to a structure protected by a latch will immediately yield if the resource is not available. Yielding requires context switching of a thread off the CPU so that another thread can execute. This is a relatively expensive operation and for resources that are held for a very short duration it is more efficient overall to allow a thread to execute in a loop periodically checking for availability of the resource."

What condition might indicate spinlock contention?
Per the whitepaper:
1. A high number of spins and backoffs are observed for a particular spinlock type.

AND

2. The system is experiencing heavy CPU utilization or spikes in CPU consumption. In heavy CPU scenarios one may also observe high signal waits on SOS_SCHEDULER_YEILD (reported by the DMV sys.dm_os_wait_stats).

AND

3. The system is experiencing very high concurrency.

AND

4. The CPU usage and spins are increased disproportionate to throughput.
Options and workarounds for handling spinlock contention:
Per the whitepaper:
  1. Fully Qualified Names: Fully qualifying names of all objects will result in removing the need for SQL Server to execute code paths that are required to resolve names. We have observed contention points also on the SOS_CACHESTORE spinlock type encountered when not utilizing fully qualified names in calls to stored procedures. Failure to fully qualify these the names results in the need for SQL Server to lookup the default schema for the user which results in a longer code path required to execute the SQL.
  2. Parameterized Queries: Another example is utilizing parameterized queries and stored procedure calls to reduce the work needed to generate execution plans. This again results in a shorter code path for execution.
  3. LOCK_HASH Contention: Contention on certain lock structure or hash bucket collisions is unavoidable in some cases. Even though the SQL Server engine partitions the majority of lock structures, there are still times when acquiring a lock results in access the same hash bucket. For example, an application the accesses the same row by many threads concurrently (i.e. reference data). This type of problems can be approached by techniques which either scale out this reference data within the database schema or leverage NOLOCK hints when possible.
The diagram below provides a conceptual view of the spinlock algorithm:


Here are some additional resources I found on the topic:

Paul Randal
SOS_SCHEDULER_YIELD waits and the LOCK_HASH spinlock, April 2011
Advanced SQL Server performance tuning, April 2011

Thomas Kejser
Thread Synchronization In SQL Server, November 2011
Diagnosing And Fixing SOS_OBJECT_STORE Spins for Singleton INSERTS, May 2011
Secret Scale Sauce: Heavy Singleton INSERT Spinlocks, May 2011

Arvind Shyamsundar
Exotic spinlocks: XTS_MGR, January 2014

Fabricio Catae
Spinlock (Part I), July 2010
Spinlock (Part II), July 2010
Spinlock (Part III), July 2010
Spinlock Contention, December 2009

Wednesday, May 7, 2014

SQL Server Wait Stats Harvesting and Reporting

One the the keys - if not THE key - to identifying SQL Server performance issues is by investigating wait statistics.  As is the case with various SQL Server performance metrics, however, is the fact that these stats are cumulative.  That means you can't just query sys.dm_os_wait_stats and determine the root cause of an issue at a particular point in time unless your issue is happening right now.  If that were the case, you run one of the various scripts to calculate the differential of multiple sys.dm_os_wait_stats executions (example from Brent Ozar).  But what if your issue occurred off-hours, during the night, while you were traveling or in a meeting?

While I realize there are various ways to go about analyzing wait stats, I decided to put together a solution that was very easy to manage, deploy and write custom reports from.  You will need to configure a Central Management Server (CMS) and register your target servers - the servers you want to deploy this solution to.

Once you have your CMS configured, you will create a SQL Server Agent job ("Collect and Cycle Wait Stats", for example).  The job is going to have four steps:
  1. Check for DBADB database, which is the database I keep all DBA related scripts and data in
    • If not present, create it
    • Get script here
  2. Check for table that will house the wait stats
    • If not present, create it
    • Get script here
  3. Collect wait stats
    • I personally harvest the wait stats hourly but you may adjust as you see fit
    • This is a pretty typical query you will find in most blogs
    • Get script here
  4. Clear wait stats
    • By clearing the wait stats, I know each hourly harvest are the wait stats just for that hour and I don't have to do any differential calculations
    • If I want to know wait stats to a point in time, I can add them up accordingly
    • Get script here
  5. Report on it!
    • Preview of report below
    • Get the RDL here
By checking for and creating the database and table if they aren't present, you can deploy this by going to your CMS checking a box to target another instance.

Below is a preview of a report I just put together.  I'll likely update it in the coming days/weeks to better highlight anomalies, but this works for now.  You choose a date and the top 95% of waits are summarized by hour, which you can further drill into by clicking the plus sign by the hour on the left.  I added a few helpful reminders/notes from Paul Randal and Joe Sack's blogs.  As you can see, I had a CPU issue on this particular day at or around 10am as indicated by the elevated/irregular SOS_SCHEDULER_YIELD signal waits.


Useful SQL Server Articles

I used to have these on My DBA Toolbox post, but it was getting a little full so I'm pulling out my list of favorite articles and sharing them in their own post.  Enjoy!

Paul Randal (b | t)
Misconceptions around Trace Flag 1118
Turn off backup success messages, Trace Flag 3226
Wait statistics, or please tell me where it hurts
Script - open transactions with text and plans

Kimberly Tripp (b | t)
Transaction Log VLFs - too many or too few?
Ever-increasing clustering key - the Clustered Index Debate.... again!

Brent Ozar (b | t)
Triage Wait Stats in SQL Server
What is the CXPACKET Wait Type, and How Do You Reduce It?
SQLIO Tutorial: How to Test Disk Performance
SQL Server 2005/2008/2012/2014 Setup Checklist

Joe Sack (b | t)
Performance Tuning with Wait Stats

Thomas LaRock (b | t)
Are you Using the Right SQL Server Performance Metrics

Kalen Delaney (b | t)
Geek City: Reducing Cache Bloat and a Metadata Bug

Aaron Bertrand (b | t)
SQL Server DDL Triggers to Track All Database Changes
Useful, free resources for SQL Server
Bad habits : Putting NOLOCK everywhere

Robert Davis (b | t)
Schema-only Backups and Restores

Paul White (b | t)
Parameter Sniffing, Embedding, and the RECOMPILE Options
Temporary Tables in Stored Procedures
Temporary Table Caching Explained

Allan Hirt (b | t)
How to Properly Configure DTC for Clustered Instances of SQL Server

Jonathan Kehayias (b | t)
Great SQL Server Debates: Lock Pages in Memory (LPiM)
Extended Events - Targets Week - ring_buffer New!

Other