Tuesday, August 29, 2017

Monitor and stop long-running SQL Server Agent Jobs

In recent days I've been having a periodic issue with the distribution cleanup job for replication. The vast majority of the time it runs in a few seconds.  There are times at night when it may take a few minutes which is not an issue.  But recently I've had a couple cases where the job ran for 13 or 14 hours and that subsequently led to prolonged blocking of other replication activities such as the distribution agent SPIDs, etc.  That in turn led to all sorts of other issues that I won't detail here.

To address the situation, I would essentially have to stop the SQL Server Agent job called 'Distribution clean up: distribution'.  Oddly enough, subsequent runs would be right back on track.

To mitigate the issue as quickly as possible (perhaps there are other solutions out there) I decided to put together a simple monitoring job to check how long the cleanup job has been running.  If it has been running for more than 10 minutes, stop the job.

Here is the code I used.  It has to run in msdb and I scheduled it to run every 10 seconds.  You could use this syntax to monitor any jobs - perhaps an index maintenance job is running abnormally long on the weekend and bloating TempDB or something - but you wouldn't know right away.  This will at least stop the activity until there is time to further investigate.

The syntax:

DECLARE @Duration_secs INT;

SELECT @Duration_secs = DATEDIFF(ss, ja.start_execution_date, GETDATE())
FROM   msdb.dbo.sysjobs jobs
       LEFT JOIN msdb.dbo.sysjobactivity ja ON ja.job_id = jobs.job_id
            AND ja.start_execution_date IS NOT NULL
WHERE  jobs.name = 'Distribution clean up: distribution'
AND stop_execution_date IS NULL;

IF @Duration_secs > 600 -- 10 minutes
dbo.sp_stop_job N'Distribution clean up: distribution';