Bug in WF SQLTracking when using pure XOML workflows

*Moved to: http://fluentbytes.com/bug-in-wf-sqltracking-when-using-pure-xoml-workflows/

Last week we ran in to a nasty problem with windows workflow foundation in our project. We ran one of our load tests for more than a few hours. We noticed that after a few hours all workflows ended up as being aborted for no apparent reason.

Digging further into the problem we decided that we would add some more exception handling into the fault handling of the workflow and logged that to the eventlog. From that we could see that we got an SQL exception from the SQLTracking service stating that there was an error converting a value into a small int. What we found out is that when we cleaned the tracking database all workflows would work again and when keeping the database after we ran into this error we would keep the error hitting us with every subsequent workflow we start.

This marked the tracking database as the suspect of the problems. So we ran a debug session on the faulting workflow with the filled database and we could nicely debug into the faulting steps. There we found that the error was caused by the tracking infra structure and was related to the InsertWorkflow stored procedure.

After some more digging around in the stored procedure and using reflector on the SQLTracking assemblies, we found that if you are using pure XOML based workflows (so not the XOML versions with code behind from visual studio) the tracking infrastructure will insert the workflow InstanceID into the type table of the tracking database. Now that on its own is not a real problem, but the stored procedure appeared to have a flaw in it. The Type table uses an integer value as its primary key and can hold 2^32 items. But the stored procedure assumes the typeID returned to be of type smallint!

This was causing our trouble. Luckily the tracking services database is provide as a script. So I could go and change it myself. I asked on the MSDN forums of someone can confirm this problem but I did not get any response yet, so I decided to post it as a bug into connect (https://connect.microsoft.com/wf/feedback/ViewFeedback.aspx?FeedbackID=275636)

You can download the corrected stored procedure here(https://accblogs.infosupport.com/files/folders/marcelv/entry12099.aspx)  and I also created a repro project (that you can find here https://accblogs.infosupport.com/files/folders/marcelv/entry12100.aspx )to show the bug is there. (And that it is fixed with my changes J)

The limitation of the fix is that we still will hit a problem after we executed 2^32 workflows, but in our case that will be in about 20 Years. I expect that we will have a next version of workflow by that time that fixed this issue already, so we decided this is fine for us now J

Hope this helps you as well fixing the problem if you hit it.

Follow my new blog on http://fluentbytes.com