Recovering workflows that did not complete work

*Moved to: http://fluentbytes.com/recovering-workflows-that-did-not-complete-work/

Last week I was working on our workflow solution when someone discovered that, in some cases, certain workflows did not complete and would never wake up again.

It took me a while to figure out why. We host our workflows in IIS and use the ManualWorkflowSchedulerService to schedule them. When a WCF call comes in, we persist some data in the database and return an ID that the customer can use for future reference. We then use a thread pool (a custom implementation, since we needed to prioritize the initial requests over the background work done later) to schedule the remaining work that needs to complete in the background. We use the SqlWorkflowPersistenceService to make sure the workflows can be recovered if the IIS worker process is recycled.

We did a lot of testing previously (as you might have read before), and in those tests we saw that the workflows did get recovered.

So what happened…?

<edit>
During performance testing we found that the configuration section for workflow contained a wrong name for the workflow configuration. This caused a second workflow runtime to be started with default settings. The default settings use the default scheduler, and that caused a fight for threads in the host process (lots of context switching). This made us decide to do two things: first, remove the wrong configuration so we would only have one workflow runtime, and second, write a custom thread pool implementation that lets us throttle the number of requests we want to process.
</edit>

What we did not realize when we removed the erroneous entry was that this also killed our automatic recovery of workflows. The second workflow runtime got loaded with the same services as the main workflow runtime, except that it used the default scheduler. As you might recall, this scheduler uses a thread pool to schedule its workflows, and that was where the workflow recovery took place.

In our main workflow runtime we use the ManualWorkflowSchedulerService and the SqlWorkflowPersistenceService, but this combination does not take care of the actual recovery of the workflows.

It appears (after some reflection work :)) that the SqlWorkflowPersistenceService does do recovery of instances, but it assumes these will be scheduled automatically. Unfortunately this is not documented, but when you dig around in the implementation (using Reflector, of course) you can see that the only thing the recovery does is get all running instances from the database and call WorkflowRuntime.GetWorkflow(ID).Load() on each of them.

This means it will only load the workflow into the workflow runtime memory, but not schedule the workflow to actually proceed with its work!

As a matter of fact, the whole recovery implementation does not take into account, for example, that we have multiple nodes running in a load-balanced scenario, where you may want to recover the workflows in a balanced fashion as well. To get around this problem I wrote a custom scheduler that derives from the manual scheduler and adds only one thing: it also takes care of the recovery of workflows. Since this might be a problem you will run into yourself, I have pasted the code below for you to reuse :)

The only thing you need to do is copy the code, compile it, and then configure your workflow runtime to use this scheduler service.
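To make that last step a bit more concrete, here is a rough sketch of what the configuration could look like when the runtime is created from a named configuration section. The section name WorkflowRuntime, the namespace and assembly name MyCompany.Workflow, and the connection string are placeholders I made up for illustration, so replace them with your own. The three Recovery* attributes correspond to the parameter names read in the NameValueCollection constructor of the class below, and the periods are in milliseconds because they are passed straight to the timer.

```xml
<configuration>
  <configSections>
    <!-- The section name must match the name you pass to the WorkflowRuntime constructor -->
    <section name="WorkflowRuntime"
             type="System.Workflow.Runtime.Configuration.WorkflowRuntimeSection, System.Workflow.Runtime, Version=3.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35" />
  </configSections>

  <WorkflowRuntime Name="WorkflowRuntime">
    <CommonParameters>
      <!-- Placeholder connection string for the persistence database -->
      <add name="ConnectionString"
           value="Data Source=.;Initial Catalog=WorkflowPersistence;Integrated Security=True" />
    </CommonParameters>
    <Services>
      <!-- Hypothetical namespace and assembly; this entry takes the place of the ManualWorkflowSchedulerService entry -->
      <add type="MyCompany.Workflow.RecoveringWorkflowSchedulerService, MyCompany.Workflow"
           UseActiveTimers="true"
           RecoveryDuePeriod="60000"
           RecoveryPollPeriod="300000"
           RecoveryBatchSize="5" />
      <add type="System.Workflow.Runtime.Hosting.SqlWorkflowPersistenceService, System.Workflow.Runtime, Version=3.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35"
           UnloadOnIdle="true" />
    </Services>
  </WorkflowRuntime>
</configuration>
```

With a section like this you would create the runtime with new WorkflowRuntime("WorkflowRuntime"). Any recovery attribute you leave out falls back to the defaults defined in the class.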

It will behave just like the manual scheduler, with the only difference that it recovers your workflows on the thread pool in the background when needed. You can limit the number of workflows that are recovered in one batch, and you can configure how long to wait after startup before recovery begins (normally you want to start after a minute or so, since the first request needs to be serviced before you start doing recovery). The poll time is configurable as well.

Hope you find it useful.

Cheers,

Marcel

using System;
using System.Collections.Specialized;
using System.Linq;
using System.Threading;
using System.Workflow.Runtime;
using System.Workflow.Runtime.Hosting;

[Serializable]
public class RecoveringWorkflowSchedulerService : ManualWorkflowSchedulerService, IDisposable
{
    private Timer _recoveryPollTimer;
    private int _recoveryDuePeriod;
    private int _recoveryPollPeriod;
    private int _recoveryBatchSize;
    // set in Dispose and read on the timer thread
    private volatile bool _disposed = false;

    // default to one minute, so we don't frustrate the initial request that activates the service
    private static readonly int DEFAULTRECOVERYDUEPERIOD = 60000;

    // default to every 5 minutes to check if there are workflows to recover
    private static readonly int DEFAULTRECOVERYPOLLPERIOD = 300000;

    // default to recovery batch size of 5
    private static readonly int DEFAULTRECOVERYBATCHSIZE = 5;

    public RecoveringWorkflowSchedulerService(bool useActiveTimers)
        : base(useActiveTimers)
    {
        _recoveryPollPeriod = DEFAULTRECOVERYPOLLPERIOD;
        _recoveryDuePeriod = DEFAULTRECOVERYDUEPERIOD;
        _recoveryBatchSize = DEFAULTRECOVERYBATCHSIZE;
    }

    public RecoveringWorkflowSchedulerService(bool useActiveTimers, int recoveryDuePeriod,
        int recoveryPollPeriod, int recoveryBatchSize)
        : base(useActiveTimers)
    {
        _recoveryPollPeriod = recoveryPollPeriod;
        _recoveryDuePeriod = recoveryDuePeriod;
        _recoveryBatchSize = recoveryBatchSize;
    }

    public RecoveringWorkflowSchedulerService(NameValueCollection parameters)
        : base(BaseParameters(parameters))
    {
        string recoveryDuePeriodString = parameters["RecoveryDuePeriod"];
        string recoveryPollPeriod = parameters["RecoveryPollPeriod"];
        string recoveryBatchSize = parameters["RecoveryBatchSize"];

        // fall back to the defaults for every setting that is missing or not a number
        if (!int.TryParse(recoveryDuePeriodString, out _recoveryDuePeriod))
        {
            _recoveryDuePeriod = DEFAULTRECOVERYDUEPERIOD;
        }

        if (!int.TryParse(recoveryPollPeriod, out _recoveryPollPeriod))
        {
            _recoveryPollPeriod = DEFAULTRECOVERYPOLLPERIOD;
        }

        if (!int.TryParse(recoveryBatchSize, out _recoveryBatchSize))
        {
            _recoveryBatchSize = DEFAULTRECOVERYBATCHSIZE;
        }
    }

    // Hands the base class only the setting it understands, so the recovery
    // settings cannot trip any parameter validation in the base constructor.
    private static NameValueCollection BaseParameters(NameValueCollection parameters)
    {
        if (parameters == null)
        {
            throw new ArgumentNullException("parameters");
        }

        var baseParameters = new NameValueCollection();
        string useActiveTimers = parameters["UseActiveTimers"];
        if (useActiveTimers != null)
        {
            baseParameters.Add("UseActiveTimers", useActiveTimers);
        }
        return baseParameters;
    }

    protected override void OnStarted()
    {
        base.OnStarted();

        // initialize a recovery timer, where we pick up any instances that got terminated by
        // system failure, like a power outage, IISReset commands or other process recycles.
        _recoveryPollTimer = new Timer(RecoveryThreadCallback, this.Runtime,
            _recoveryDuePeriod, _recoveryPollPeriod);
    }

    protected override void OnStopped()
    {
        base.OnStopped();

        // clean up the polling timer
        if (_recoveryPollTimer != null)
        {
            _recoveryPollTimer.Dispose();
        }
    }

    /// <summary>
    /// Timer callback that looks up the persisted workflows that need recovery
    /// and queues one batch of them on the threadpool.
    /// </summary>
    /// <param name="stateInfo"></param>
    void RecoveryThreadCallback(object stateInfo)
    {
        if (_disposed)
        {
            return;
        }

        try
        {
            RecoveringWorkflowSchedulerService schedulerService =
                this.Runtime.GetService<RecoveringWorkflowSchedulerService>();
            SqlWorkflowPersistenceService persistenceService =
                this.Runtime.GetService<SqlWorkflowPersistenceService>();

            // only recover when this scheduler and a persistence service are registered
            // and the runtime is started
            if (schedulerService == null || persistenceService == null || !Runtime.IsStarted)
            {
                return;
            }

            var allPersistedWorkflows = persistenceService.GetAllWorkflows();
            var allRecoverableWorkflows =
                (from persistedWorkflow in allPersistedWorkflows
                 where !persistedWorkflow.IsBlocked &&
                       persistedWorkflow.Status == WorkflowStatus.Running
                 select persistedWorkflow.WorkflowInstanceId).Take(_recoveryBatchSize);

            foreach (Guid workflowInstanceID in allRecoverableWorkflows)
            {
                // process the recovery work on a threadpool thread
                ThreadPool.QueueUserWorkItem(WorkflowWaitCallback, workflowInstanceID);
            }
        }
        catch (System.ObjectDisposedException)
        {
            // appdomain is being torn down, gracefully swallow the exception
            // and exit by clearing the timer
            if (_recoveryPollTimer != null)
            {
                _recoveryPollTimer.Dispose();
            }
        }
    }

    /// <summary>
    /// This is where the execution continues after we queue a recovery workflow
    /// on the threadpool.
    /// </summary>
    /// <param name="stateInfo"></param>
    void WorkflowWaitCallback(object stateInfo)
    {
        // here we get the threadpool thread donated, now use it to schedule the requested
        // workflow
        this.RunWorkflow((Guid)stateInfo);
    }

    #region IDisposable Members

    public void Dispose()
    {
        _disposed = true;

        if (_recoveryPollTimer != null)
        {
            _recoveryPollTimer.Dispose();
        }
    }

    #endregion
}
