Tuesday, April 11, 2006

Crappy Release - events to the rescue

This past weekend we had a monthly release of an auto insurance system that I spend a lot of my time working on.

It didn't go very well. The week leading up to the official code freeze uncovered some critical defects, and we found more after the freeze. The team pulled it together and got the system into good enough shape to release.

The release was deployed to the production environment without incident on Saturday. I multitasked, showing my aunt and uncle (in town from Sun Valley, ID) Multnomah Falls while occasionally glancing at the CrackBerry. We got through Monday with only one problem, and it was "just" the monitoring software. So we'll fly blind for a few days, not knowing what Mercury Biz Activity Monitor says our performance is in Atlanta vs. California, etc. Big deal, right? My boss says we missed a Choice Point outage because of it, so actually it is somewhat of a big deal, but anyway ...

But today, the wheels fell off. Two major problems were discovered.

This is going to be a total pain to resolve for the rest of the week, but the good thing is that the users of the system have no idea this is going on. The only reason they don't is that the system uses an event-driven architecture. Yeah, ok, they would have had the same result if the features in question were merely async, but we get bailed out by this a lot because most of what we do is async and event driven.

Typically, in this system, recoverable errors are rolled back to the ESB/messaging layer, where they are paused and then resent. Non-recoverable errors are simply routed to an appropriate error queue. We have an application that manages the error queues. Pretty standard stuff - browse all queues, browse a specific queue, browse a message, delete, re-send, edit-and-resend. This type of error handling gets it done quite nicely for 99.99% of our errors. But today the critical defect was causing threads to hang and events to get stuck in two of the services. Hung threads don't roll back and don't route the poisoned message to an error queue.
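For the curious, the rollback-vs-error-queue pattern looks roughly like this in plain JMS. This is just a sketch of the idea, not our actual code - the queue names and the RecoverableException marker are made up:

import javax.jms.*;

public class EventConsumer {
    private final Connection connection;

    public EventConsumer(ConnectionFactory factory) throws JMSException {
        this.connection = factory.createConnection();
    }

    public void drain(String queueName, String errorQueueName) throws JMSException {
        // Transacted session: commit acks the event, rollback lets the broker redeliver it.
        Session session = connection.createSession(true, Session.SESSION_TRANSACTED);
        MessageConsumer consumer = session.createConsumer(session.createQueue(queueName));
        MessageProducer errorProducer = session.createProducer(session.createQueue(errorQueueName));
        connection.start();

        Message msg;
        while ((msg = consumer.receive(1000)) != null) {
            try {
                process(msg);              // business logic
                session.commit();          // ack: the event is done
            } catch (RecoverableException e) {
                session.rollback();        // pause, then the broker resends it
            } catch (Exception e) {
                errorProducer.send(msg);   // non-recoverable: park it on the error queue
                session.commit();          // ack the original so it doesn't loop forever
            }
        }
    }

    private void process(Message msg) throws Exception { /* ... */ }

    // Hypothetical marker for errors the broker should retry.
    static class RecoverableException extends Exception {}
}

The catch with this pattern is exactly what bit us: it only works if the consuming thread lives long enough to commit, roll back, or forward the message.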

So ... what to do besides panic?

Luckily, as I stated above, events - and more specifically guaranteed delivery and durable subscriptions - came to the rescue anyway. From examining the thread dumps of one of the services, it looked like the service had just woken up on the wrong side of the bed. So we restarted it. Whatever pissed it off before had passed ... it happily drained its durable subscription. Since its threads had been hung, it had never ack'd any of its events, so they were all automatically resent to the service. Now what if this wasn't event driven? What if it used web services (i.e., JAX-RPC or JAX-WS)? You would be hosed ... yes, that is exactly right. Or your poor developers would have to account for this stuff in each of your service impls. Good luck getting that right.
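To make the "it had never ack'd any of its events" part concrete, here's a rough sketch of a durable subscriber in plain JMS. The client ID, topic, and subscription names are made up, and our real services obviously do more than this, but the mechanics are the same: no acknowledge means the broker still owns the event and redelivers it when the subscriber reconnects after a restart.

import javax.jms.*;

public class PolicyEventService {
    public static void listen(ConnectionFactory factory) throws JMSException {
        Connection connection = factory.createConnection();
        connection.setClientID("policy-service");   // identifies the durable subscription
        Session session = connection.createSession(false, Session.CLIENT_ACKNOWLEDGE);
        Topic topic = session.createTopic("policy.events");

        // The durable subscription survives restarts; unacknowledged events
        // are still sitting there waiting when the service comes back up.
        MessageConsumer consumer = session.createDurableSubscriber(topic, "policy-service-sub");
        consumer.setMessageListener(msg -> {
            try {
                handle(msg);         // business logic
                msg.acknowledge();   // only ack once the work is actually done
            } catch (Exception e) {
                // no ack -> the broker keeps the event and redelivers it
            }
        });
        connection.start();
    }

    private static void handle(Message msg) throws Exception { /* ... */ }
}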

<sarcasm text="Remember, web services make interop / integration easier. You don't need all the complexity of any integration products ... just roll it all yourself. Interop is easy. You just need web services. The tool support is great."/>

Sadly, we did not get so lucky on our second problem. This one is going to leave a mark on some people for most of the week, I imagine. It looks like we have to drain a durable subscription, purge out some poisoned messages, and then redrive the good ones. And likely some other misery ... but without an event-driven architecture, and more specifically the goodness that guaranteed delivery and durable subscriptions provide, we'd be totally f'd.
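If you're wondering what that cleanup looks like, here's a rough sketch of the drain-purge-redrive job, again in plain JMS and assuming the poisoned events can be recognized by some message property. The property name and value, the destinations, and the subscription name are all hypothetical:

import javax.jms.*;

public class SubscriptionDrainer {
    public static void drainAndRedrive(ConnectionFactory factory) throws JMSException {
        Connection connection = factory.createConnection();
        connection.setClientID("policy-service");
        Session session = connection.createSession(true, Session.SESSION_TRANSACTED);
        MessageConsumer consumer =
            session.createDurableSubscriber(session.createTopic("policy.events"), "policy-service-sub");
        MessageProducer redrive = session.createProducer(session.createQueue("policy.events.redrive"));
        connection.start();

        Message msg;
        while ((msg = consumer.receive(2000)) != null) {
            if (isPoisoned(msg)) {
                // purge: commit the receive without forwarding the event anywhere
            } else {
                redrive.send(msg);   // good event: push it back through for reprocessing
            }
            session.commit();
        }
        connection.close();
    }

    // Hypothetical check for the bad batch of events.
    private static boolean isPoisoned(Message msg) throws JMSException {
        return "bad-batch".equals(msg.getStringProperty("batchId"));
    }
}

Tedious, yes, but at least the events are all still sitting safely in the subscription waiting for us instead of being lost.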
