Warning: ORCAS B2 build bug that can cause your machine to be wiped!

*Moved to: http://fluentbytes.com/warningorcas-b2-build-bug-that-can-cause-your-machine-to-be-wiped/

First of all, I want to make very clear that although we hit this bug, the chance that you will hit it too is very small. You can only run into this issue if you have a fake or customized build like ours. Let me explain what we ran into and when it could affect your installation as well.

During our upgrade process last night and today, we were confronted with the fact that after we upgraded our server and deleted an "old" build from it (using the new delete build feature of ORCAS), the server was wiped completely, including the OS.

This happened for the following reason: we have a customized build that runs in NAnt. In that build we use the TFS Build object model to report all build information back to the server, so our build shows the same integration features as the default Team Build feature.

Hence we use the object model to create a FAKE build, and in that process we need to specify the drop location of the build. In our case our scripts already copy the build results to our portal, so in our implementation we just passed c: as the drop location. (This is no longer possible with the new object model, since you need to specify a UNC path there.)

We upgraded the server to B2 and discovered that our complete build history had been placed under the build definition "Unknown". This makes viewing the builds practically impossible, so we decided to delete them. The great thing is that ORCAS has a new feature that lets you delete a build. And this is where things went wrong.

When you delete a build, ORCAS attempts to delete the files at the drop location as well. This is fine if the drop location points to a network location, but in our case it pointed to c:.

That would not have gotten us into trouble if our TFS services had not been running with administrative privileges. This is discouraged in the installation guide, but our server has been running since the early betas of Whidbey, the first version needed admin privileges, and we never changed that afterwards. So TFS started to delete the c: location recursively, resulting in an almost clean C: hard drive and wiping the OS completely. After rebooting the server I got the message that ntoskrnl.exe could not be found, and we had to pronounce the server officially dead 🙁

I must say it took us a lot of time (we reproduced the problem again at 3 AM and only then discovered what caused it), but the support we got from Microsoft was great. We got scripts to update the database within a few hours, and we were able to do the upgrade again from the server backups we had.

So what have we learned from this?

1) Run the TFS services under a non-admin account! As the installation guide states, this would have prevented us from hitting the problem in the first place. Check this before your upgrade: if your server has been running as long as ours, you might still have the old privileges, as we did.

2) Always make good backups of your servers before you run a BETA upgrade; you can hit bugs never encountered by others!

3) Always test your upgrade thoroughly. This is why we hit this problem only in the test environment and not on our production server. We were able to address the problem in test and have a painless migration on the production server.

Microsoft will address this problem in the RTM bits, but for now be very careful when deleting builds if you have a customized build environment. Make sure the drop location does not point to something you don't want deleted, and ensure your service accounts run under restricted privileges.
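To make the drop location advice concrete, here is a minimal sketch of the kind of check a custom build script could run before it reports a drop location to the server. It is only an illustration: the function name is_safe_drop_location and the server and share names are made up, and none of this is part of the TFS Build object model. It assumes that a safe drop location is a UNC path that points to a folder below the share root, so that a recursive delete can never start at a drive root or at the share itself.

```python
from pathlib import PureWindowsPath


# Hypothetical helper for a custom build script; not part of the TFS object model.
def is_safe_drop_location(location: str) -> bool:
    """Return True only for a UNC path that points below the share root."""
    path = PureWindowsPath(location)
    drive = path.drive
    # Device paths such as \\?\C:\ look like UNC paths but address local disks.
    if drive.startswith(("\\\\?\\", "\\\\.\\")):
        return False
    # A UNC drive looks like \\server\share; "c:" and "" are rejected here.
    is_unc = drive.startswith("\\\\")
    # parts[0] is the \\server\share\ anchor; require at least one folder below it.
    has_subfolder = len(path.parts) > 1
    return is_unc and has_subfolder


if __name__ == "__main__":
    # Example values only; the server and share names are invented.
    candidates = ["c:", "c:\\", r"\\buildserver\drops", r"\\buildserver\drops\nightly_01"]
    for candidate in candidates:
        verdict = "ok" if is_safe_drop_location(candidate) else "REFUSE"
        print(f"{candidate!r}: {verdict}")
```

A check like this would have rejected the c: value we passed in. It does not replace running the services under a restricted account; it only keeps an obviously dangerous value out of the build record.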

Hope this helps you stay out of trouble.

Cheers,
Marcel

 

 
