Restoring MSDB across instances, a bad idea
We've recently been migrating instances of SQL Server between different clusters in order to dismantle an old cluster. To move the jobs we've been copying the MSDB from the old instance to the new one by performing a restore and we didn't run into any problems … until yesterday. A job that used CmdExec instead of a T-SQL jobstep wouldn't run, put the job in suspended state and give the following error:
Unable to start execution of step 1 (reason: The CmdExec subsystem failed to load [see the SQLAGENT.OUT file for details]; The job has been suspended). The step failed.
Checking the agent log provided the following error message:
Step 1 of job '<jobname>' (<job_id>) cannot be run because the CmdExec subsystem failed to load. The job has been suspended
After further examination of the agent log I found the following startup errors:
Subsystem 'ActiveScripting' could not be loaded (reason: The specified module could not be found)
Subsystem 'CmdExec' could not be loaded (reason: The specified module could not be found)
Subsystem 'Snapshot' could not be loaded (reason: The specified module could not be found)
Subsystem 'LogReader' could not be loaded (reason: The specified module could not be found)
Subsystem 'Distribution' could not be loaded (reason: The specified module could not be found)
Subsystem 'Merge' could not be loaded (reason: The specified module could not be found)
Subsystem 'QueueReader' could not be loaded (reason: The specified module could not be found)
Subsystem 'ANALYSISQUERY' could not be loaded (reason: The specified module could not be found)
Subsystem 'ANALYSISCOMMAND' could not be loaded (reason: The specified module could not be found)
Subsystem 'SSIS' could not be loaded (reason: The specified module could not be found)
As it turns out, SQL agent stores a path name to the DLL files required to perform the non T-SQL jobsteps in the MSDB. This is documented here: Microsoft KB 903205
Moving (reinstalling) the server caused the directory of these DLL's to change (in this case from MSSQL.6 to MSSQL.12) and the agent was unable to load them. As we rarely use any non T-SQL jobsteps we hadn't run into this problem yet.
The solution was performing an update to the msdb.dbo.syssubsystems table updating the values of subsystem_dll column to the current situation and restarting the agent.
EDIT: It appears only SQL 2005 is affected, SQL 2000 doesn't store these references in the MSDB. I've compiled a script to fix wrong references:
SET NOCOUNT ON
DECLARE @SQLRoot NVARCHAR(512)
IF @@VERSION LIKE ('% 9.%')
BEGIN
EXEC master.dbo.xp_instance_regread
N'HKEY_LOCAL_MACHINE'
,N'SOFTWAREMicrosoftMSSQLServerSetup'
,N'SQLPath', @SQLRoot OUTPUT
UPDATE msdb.dbo.syssubsystems
SET subsystem_dll = @SQLRoot+RIGHT(subsystem_dll,LEN(subsystem_dll)-CHARINDEX('MSSQLbinn',subsystem_dll)-5)
WHERE
subsystem_dll NOT LIKE @SQLRoot + '%'
AND subsystem_dll <> '[Internal]'
IF @@ROWCOUNT > 0
BEGIN
PRINT 'Bad references found on '+@@SERVERNAME
PRINT 'Server root: '+@SQLRoot
END
END