Monitor SQL Server Replication Jobs
- by Yaniv Etrogi
The Replication
infrastructure in SQL Server is implemented using SQL Server Agent to execute the various components involved in the form of a
job (e.g. LogReader agent job, Distribution agent job, Merge agent job)
SQL Server jobs
execute a binary executable file which is basically C++ code. 
You can download all the scripts for this article here
SQL Server Job Schedules
By default
each of job has only one schedule that is set to Start automatically when SQL
Server Agent starts. This schedule ensures that when ever the SQL Server
Agent service is started all the replication components are also put into
action. This is OK and makes sense but there is one problem with this default configuration
that needs improvement  -  if for any reason one of the components fails it remains
down in a stopped state.  
Unless you
monitor the status of each component you will typically get to know about such
a failure from a customer complaint as a result of missing data or data that is
not up to date at the subscriber level. Furthermore, having any of these
components in a stopped state can lead to more severe problems if not corrected
within a short time.
The action
required to improve on this default settings is in fact very simple. Adding a
second schedule that is set as a Daily Reoccurring schedule which runs
every 1 minute does the trick. SQL Server Agent’s scheduler module knows how to
handle overlapping schedules so if the job is already being executed by another
schedule it will not get executed again at the same time. So, in the event of a
failure the failed job remains down for at most 60 seconds.
Many DBAs are
not aware of this capability and so search for more complex solutions such as
having an additional dedicated job running an external code in VBS or another
scripting language that detects replication jobs in a stopped state and starts
them but there is no need to seek such external solutions when what is needed
can be accomplished by T-SQL code.
SQL Server Jobs Status 
In addition
to the 1 minute schedule we also want to ensure that key components in the
replication are enabled so I can search for those components by their Category,
and set their status to enabled in case they are disabled, by executing the
stored procedure MonitorEnableReplicationAgents.
The jobs that
I typically have handled are listed below but you may want to extend this, so
below is the query to return all jobs along with their category.
SELECT
category_id, name FROM
msdb.dbo.syscategories
ORDER BY
category_id;
Distribution
Cleanup
LogReader Agent
Distribution Agent
 Snapshot Agent Jobs
By default
when a publication is created, a snapshot agent job also gets created with a
daily schedule. I see more organizations where the snapshot agent job does not
need to be executed automatically by the SQL Server Agent  scheduler than organizations
who   need a new snapshot generated automatically. To assure this setting is in
place I created the stored procedure MonitorSnapshotAgentsSchedules which
disables snapshot agent jobs and also deletes the job schedule.
It is worth
mentioning that when the publication property immediate_sync is turned
off then the snapshot files are not created when the Snapshot agent is executed
by the job. You control this property when the publication is created with a
parameter called @immediate_sync passed to sp_addpublication
and for an existing publication you can use sp_changepublication.
Implementation
The
scripts assume the existence of a database named PerfDB. 
Steps:
Run the scripts to
create the stored procedures in the PerfDB database.
Create a job that
executes the stored procedures every hour.
-- Verify that the 1_Minute schedule exists.
EXEC PerfDB.dbo.MonitorReplicationAgentsSchedules @CategoryId = 10; /* Distribution */
EXEC PerfDB.dbo.MonitorReplicationAgentsSchedules @CategoryId = 13; /* LogReader     */
-- Verify all replication agents are enabled.
EXEC PerfDB.dbo.MonitorEnableReplicationAgents @CategoryId = 10; /* Distribution */
EXEC PerfDB.dbo.MonitorEnableReplicationAgents @CategoryId = 13; /* LogReader   */
EXEC PerfDB.dbo.MonitorEnableReplicationAgents @CategoryId = 11; /* Distribution clean up   */
-- Verify that Snapshot agents are disabled and have no schedule
EXEC PerfDB.dbo.MonitorSnapshotAgentsSchedules;
Want to read
more of about replication? Check at my replication posts
at my blog.