I don’t think there is a single sysadmin on this planet who hasn’t made a mistake significant enough to cost their employer or customers cash or reputation. So, what do you do when faced with such a situation?
I’ve read some ‘I stuffed up, should I ‘fess up?’ posts on Reddit recently. I and others I’ve worked with are no strangers to this question. Should one come clean with mistakes, and why?
Some years ago I was part of a team migrating some back-end storage to a shiny new SAN. The migration included moving the data of some not-so-shiny (read ‘legacy’) servers to the new SAN. These servers were hosting critical functions and were becoming quite long in the tooth. However, the cost of replacing them was turning out to be far greater than the cost of keeping them limping along, so we were stuck with them for the present. It’s a common enough story in the field.
The customer brief was to migrate the legacy servers’ data to the new SAN with limited impact to services, i.e., online. We did our research and planned the migration as carefully as was possible given the age of the environment and the limited vendor support available. We decided to mitigate possible trouble ahead by breaking the work up into manageable stages.
Finally, with a carefully reviewed work plan and change management approvals in hand, the first stage of the migration began at a time of low user activity (i.e., around midnight!). But part-way through, it all went wrong. One server metaphorically kicked another one in the guts, forcing a reset, and the hosted services ground to a halt.
The team was onto service recovery like a flash. But the work that had been planned to be online had now become offline. Critical services were disrupted. It was tempting to pretend that the outage had been caused by something other than the work I’d been doing. The servers had a history of instability and it would be easy to point the finger there.
I chose to be up-front and let my manager and the customer know what had caused the outage. Why? Your reasons might be different, but some of mine were:
1. There was an agreed process that needed to be followed.
The organisation I worked for had clear engagement rules in place, internally and externally, around change and incident management. These covered things like what to do if scheduled work went wrong. If I didn’t follow these processes, I would be breaking faith with both my employer and the customer. To me this is the most compelling reason to ‘fess up when we stuff up. However, I have other reasons too.
2. Stakeholders shouldn’t be kept in the dark.
My manager and the customer’s critical incident manager needed to know what had really happened so they could do their job. If I didn’t let them know, they wouldn’t have the information they needed. For instance, my manager needed to have full disclosure from me in order to act appropriately if the incident got escalated. Similarly, the customer’s incident manager needed to know what was going on so he could manage escalations at his end.
This might sound like stating the proverbial obvious, but non-stakeholders didn’t need to know. It would have been completely inappropriate for me to share what happened with, say, a competitor of my customer! Likewise, I have withheld identifying details from you as a reader of this post. 🙂
3. ‘Fessing up is a trust-building exercise.
- My manager knew I wasn’t hiding facts to cover my behind. Why? Because I have a track record of making full disclosures of my mistakes, and by doing so again now, I was confirming our relationship of trust. A trusted relationship with one’s boss is of great value!
- The customer’s incident manager didn’t yell down the phone at me or start a witch-hunt afterwards – both of which he could have done. Instead he listened to the facts and did his best to help me manage the incident. Why? Because I’d built up trust with this customer over time, by being honest when I didn’t know something, being willing to help when I could, and being upfront with them when I’d made mistakes before. They knew they would get a straight story from me. My honest communication in this situation helped maintain that trust. The incident manager and I later worked on a report, which fed into a modified work plan for the rest of the migration after we identified that yet another unresolved bug had caused our work plan to go awry.
4. The truth is easier to sustain than a lie!
Always. It’s much less exhausting to tell the truth than keep up a lie. Even in fiction. Agatha Christie’s Hercule Poirot believed that people found it a relief to tell the truth. A lie is too much effort to sustain. The great Sheldon Cooper would agree, spinning his “un-unravelable web” in Season 1 Episode 10 of The Big Bang Theory (‘The Loobenfeld Decay’)!
5. Last but not least, it’s about reputation.
I am not really such an unselfish character as this story may make me sound. There is something in it for me. Making choices like this, uncomfortable as they are in the short term, helps my reputation. And that benefits me too in the long run.
Growth happens when we acknowledge our mistakes, don’t make excuses, and instead make a plan to move forward.
Unfortunately, in today’s litigation-and-blame culture we are not encouraged to come clean. It can be difficult to straddle the line between transparency and liability. I think we have gone overboard with avoiding liability, to the point that most tech professionals work in an atmosphere of blame management. I’m pointing the finger at myself too here. Blame management isn’t healthy, and it doesn’t produce the kind of relationships that in turn produce dividends in the long run.
What do you think? Please feel free to leave a comment below.
Featured image by Andrea Piacquadio from Pexels.com.
2 thoughts on “What do you do when you stuff up?”
Great post and so very, very true. Make a mistake and put up your hands. Everyone thanks you for it in the end. Don’t make them go through a huge investigation and root cause analysis and burn more hours and money in the process. You’ll really get canned when you get found out.
My worst was mistakenly upgrading the live system and not the sandbox. To be fair, they were on the same host. But it took down the system an entire set of hospital wards used for recording patient observations, for 20 minutes. I put my hands up straight away; the customer didn’t blink, just wanted a recovery time. Management didn’t yell; they called it a learning exercise and said there was no point beating me up about it, as I was doing that myself.
Thank you Warlord for sharing your insights and experience! So true: don’t make others burn hours and money investigating something when you are sitting on the knowledge of what happened all along… That’s just a massive bridge-burning exercise!