At HUMAN, we run a B2B SaaS service that handles a high throughput of requests at minimal RTT under a very strict SLA. We pride ourselves on meeting that SLA so that every request is analyzed for potential risk, leaving no loopholes for attackers to abuse.

How do we keep our system at five nines?

Quite simply, by incorporating a learning system: the debrief. This approach has helped us design and improve our software architecture and processes.

We debrief A LOT, on any internal item we consider an SLO breach (usually well before the actual SLA is breached or an outage affects our customers). We have done it so often that it has become a habit: teams run a debrief automatically after every incident and use it to improve our system.
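To give a sense of why we debrief long before the SLA itself is at risk, here is a minimal sketch of the downtime budget that a five-nines target leaves. The function name and the 365-day year are assumptions made for illustration, not part of our tooling.

```python
def downtime_budget_seconds(availability: float, period_days: float = 365.0) -> float:
    """Return the downtime (in seconds) allowed by an availability target.

    `availability` is a fraction, e.g. 0.99999 for five nines.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    period_seconds = period_days * 24 * 60 * 60
    return period_seconds * (1.0 - availability)


five_nines = 0.99999
per_year = downtime_budget_seconds(five_nines)
per_month = downtime_budget_seconds(five_nines, period_days=30)

print(f"Budget per year:    {per_year / 60:.2f} minutes")
print(f"Budget per 30 days: {per_month:.1f} seconds")
```

At five nines the yearly budget is a little over five minutes, so a single slowly detected incident can consume all of it.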

We even got an added benefit from the process: the lessons are documented, so new employees can learn how and why we started doing things a certain way.

Now, debriefs can be a source of great value, but they can also wreak havoc. The main ingredients for getting the desired outcome are communication and trust.

In our case, the goal is simple – ask a few questions, usually the same ones, and see if we can do better next time.

In the words of W. Edwards Deming:

“A bad system will beat a good person every time.”

It doesn’t matter who did it: in essence, if one person broke the system today, another might tomorrow. The fact that we didn’t have the right measures, controls, or protections in place is what allowed the incident to happen.

What do we ask:

What happened? – timeline
What did we do?
How could we have identified the issue in under 5 minutes? 30 seconds?
How could we have fixed it in under 10 minutes? 1 minute? Automatically?
What do we need to do so that we can meet these detection and recovery targets next time?
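The detection and recovery questions can be made measurable from the timeline. The following is a minimal sketch, assuming each incident records when it started, when it was detected, and when it was resolved. The `IncidentTimeline` class, its field names, and the sample timestamps are hypothetical. The default targets are the first thresholds from the questions above (5 and 10 minutes), and measuring fix time from the moment of detection is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimeline:
    """Hypothetical record of the three key moments of an incident."""
    started_at: datetime
    detected_at: datetime
    resolved_at: datetime

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected_at - self.started_at

    @property
    def time_to_fix(self) -> timedelta:
        # Assumption: the fix clock starts at detection, not at incident start.
        return self.resolved_at - self.detected_at


def missed_targets(
    timeline: IncidentTimeline,
    detect_target: timedelta = timedelta(minutes=5),
    fix_target: timedelta = timedelta(minutes=10),
) -> list:
    """Return the names of the targets this incident did not meet."""
    missed = []
    if timeline.time_to_detect > detect_target:
        missed.append("detection")
    if timeline.time_to_fix > fix_target:
        missed.append("fix")
    return missed


# Made-up example: detected 12 minutes after it started, fixed 8 minutes later.
example = IncidentTimeline(
    started_at=datetime(2024, 1, 1, 10, 0),
    detected_at=datetime(2024, 1, 1, 10, 12),
    resolved_at=datetime(2024, 1, 1, 10, 20),
)
missed = missed_targets(example)
print(missed)
```

In this made-up case only detection misses its target, which points the debrief toward monitoring and alerting rather than toward the fix itself.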

What do we avoid:

Blame
Focus on why someone did something

Keeping to these questions every time, across many debriefs, instilled a healthy culture of positive discussion with minimal backfire.

The key here is to be consistent in how you approach it. As with any change to a process, start every debrief by stating its purpose and explaining how the meeting will be conducted. Sticking to a rigorous, repeatable, and simple process creates confidence in the people going through it, as well as in the people who consume the results (i.e., higher management tiers and customers).

Keeping a consistent approach isn’t easy. Sometimes you debrief a severe incident that had a significant impact on your system and customers, or an event caused by pure recklessness. The thing to keep in mind is the goal: today it happened for one reason, tomorrow it can happen for another. How do we improve?

Over time, I found myself not needed in the debriefs and only reading the summaries. After the first few, I did spot checks with team members on the tone, the questions raised, and the general conduct within the debriefs. This was mostly done to validate that the guidance stuck. I was happy to see the team kept the same mindset.

Several months in, the process has taken hold. It is now part of our R&D culture (a topic that deserves its own discussion).

Our Template:

Overview

Short description of what happened

Timeline

Dates, and hours if relevant, of what happened.

Impact

What was the overall impact of the incident?

Root cause

Make sure we understand what caused it.
Sometimes this is complicated, so we set an action item (AI) to keep investigating.

Action Items

Every action item has a prioritized ticket assigned to someone so we can review and validate that suggested improvements are being incorporated into the system, code, and processes.
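For teams that keep debriefs in a structured form, the template could be sketched as a small data model. This is only an illustration under assumptions: the class names, field names, priority scale, ticket ID, owner name, and sample incident are all hypothetical and do not describe our actual tooling. The one rule it encodes comes from the template itself: every action item needs a ticket and an owner.

```python
from dataclasses import dataclass, field


@dataclass
class ActionItem:
    description: str
    priority: int      # hypothetical scale: 1 is the highest priority
    ticket: str = ""   # ticket ID in whatever tracker the team uses
    owner: str = ""    # person the ticket is assigned to


@dataclass
class Debrief:
    overview: str
    timeline: list
    impact: str
    root_cause: str
    action_items: list = field(default_factory=list)

    def untracked_action_items(self) -> list:
        """Return action items that lack a ticket or an owner."""
        return [ai for ai in self.action_items if not ai.ticket or not ai.owner]


# Made-up example data, for illustration only.
debrief = Debrief(
    overview="Elevated response latency on one cluster",
    timeline=["10:00 latency rises", "10:12 alert fires", "10:20 rollback done"],
    impact="Some requests exceeded the internal latency SLO for 20 minutes",
    root_cause="Still under investigation",
    action_items=[
        ActionItem("Alert on latency within 5 minutes", 1, ticket="OPS-101", owner="dana"),
        ActionItem("Keep investigating the root cause", 2),
    ],
)

for item in debrief.untracked_action_items():
    print(f"Missing ticket or owner: {item.description}")
```

A check like this can run before a debrief is marked as done, so that no action item is left without someone responsible for it.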

What is your experience with such a process? Do you have a different approach?


How to Create a Learning Culture using Debriefs
