State machine workflows

At the beginning of last week a report came in from our users which said that they were experiencing timeout issues while working with the application that has been running in production quite stable since the beginning of this year. Because this was a blocking issue we quickly started investigating what went wrong. After looking at the code for about half an hour we decided we needed to put on tracing to find out what really happens. So we did that and we found out that the events we were sending to our workflows weren’t delivered. We couldn’t quickly find out what the problem was, so we decided to remove the running workflow from the persistence store, which would cause the application to create a new workflow and force it to the correct state. It wasn’t the most ideal solution, but at least it meant that the users could continue working with the application. We did however request a backup of the database so we could debug the problem a bit more.

So after I had restored the backup from the production environment on my local development machine and got everything set up I found out that the reason the events weren’t delivered to the workflow was because the scheduler was busy executing another activity. It took quite some debugging and reflectoring to find out that when a message is written to the workflow queue, the event doesn’t get fired until the SchedulerBusy flag on the StateMachineExecutionState internal class is set to false. But I still couldn’t figure out why this specific flag was set to True, instead of False.

A colleague of mine suggested to use the Workflow Monitor, which is one of the applications that comes with the Windows Workflow Foundation samples. It queries on a SQL tracking database and displays the current state of all the running workflows. Quite a nifty tool if you ask me. So I started it up and had a look at the workflows that were working and the workflows that were broken. Although it wasn’t obvious immediately, I found out that the workflows that were broken were currently executing a ConfigurableDelayActivity.

This was an activity that was written by my colleague Robert and it was supposed to solve two problems. The first one was that there was a requirement to be able to configure various delays that were used by the workflow. The second problem it tried to solve was a bug that we found in Windows Workflow Foundation. You can find the full details of this bug in the linked post, but in short it meant that if you were using the ManualWorkflowScheduler (which is recommended when hosting in IIS) and you want a timer that expires more than 47 days in the future, you get an exception. So with this activity we thought we had a workaround.

Unfortunately we found out that our work around wasn’t working correctly. What was happening is that when the ConfigurableDelayActivity becomes the next activity to execute, it would create a timer queue for itself and set it to some time in the future. When that time in the future was more than the earlier mentioned 47 days away, it would make a timer for 47 days. Then, when the timer went off, the activity started executing, so the Execute method was called. There we determined whether the actual delay has expired. If it hadn’t expired yet, we would wait some more by repeating the process. After that, the activity would return ActivityExecutionState.Executing. This is were the problems start.

Because the ConfigurableDelayActivity implements the IEventActivityListener interface, the activity itself isn’t really subscribed to the event that gets raised when a message is put in the queue. Rather, the activity hosting the ConfigurableDelayActivity, which would be an EventDrivenActivity gets sent the message, which will in turn call the Execute method on the activity implementing IEventActivityListener. The problem we were having is that once the 47 days timer went off, the activity gets executed and then says it is still executing, which resulted in the scheduler being busy. This blocks any other EventDrivenActivity in the same state from receiving a message on a queue, because the SchedulerBusy flag is set.

Now, this might all be the way it is supposed to work. It’s too bad thought that there isn’t any documentation on it, or at least we haven’t been able to find any. In fact, I had a look at what the out of the box DelayActivity does, and it explicitly checks whether it is running inside an EventDrivenActivity, and if it is, it will always return ActivityExecutionState.Closed. But as this is all internal stuff, how are we supposed to know? So that’s why I thought I’d write a blog about it, because I’m sure somebody else will run into these problems.