5 Whys

We all run into issues in our work environments. Most of the time we just find a solution and move on. But some issues keep recurring and need a deeper understanding, and some have a large impact on the business or on you. Such issues need to be handled differently: we need to find their root cause. It is always better to resolve the root cause than the symptoms of a problem.

I often use the 5 whys method to perform root cause analysis (RCA). It is simple and effective.




Let us start with a simple example from day-to-day life. Today I was driving a bit rashly and did some risky overtaking. That was really unnecessary. Let us check why I was putting myself in unnecessary danger.


Why was I driving rashly?

Cause: I was late for a meeting.


Why was I late for the meeting?

Cause: I woke up late.


Why did I wake up late?

Cause: I slept late yesterday.


Why did I sleep late despite needing to be on time for the meeting?

Cause: There was a support call last night, and it took a while to find a solution.


Now you can see how a support call led to reckless driving. The solution would have been to inform the meeting organizer that I would not be able to attend the meeting. No need for heroism.


Now consider a practical example.

A few days back there was a production issue. We went through the log files and determined that a deadlock was causing the issue. We fixed the transactions and released a patch. It is good practice to do an RCA for any issue in production.



Problem: Production server stopped responding for some inventory queries.

Why did the production server stop responding?

Cause: Database connection was timing out.

Evidence: Attached relevant entries from the log file...


Why was the database connection timing out?

Cause: There was a deadlock on a table.

Evidence: Attached entries from DB logs.


Why was there a deadlock?

Cause: A newly added feature caused the deadlock.

Evidence: Log file entries attached.


Why did the new feature cause this issue?

Cause: Transaction handling was not right.

Evidence: Code changes for the fix. 


Why did testing or code review not catch this issue?

Cause: Code review was rushed to include the important feature.

Evidence: Internal communication.


Notice that this RCA was done after the issue was fixed. Many times the RCA happens while we are finding the solution, but we do it intuitively and it is never documented as an RCA. It is recommended to document the RCA for all major breakdowns. Sometimes it brings out issues that are different from the one that was fixed, or it shows a better way to resolve the issue. During production issues, it is important to find the quickest solution. After the issue is fixed, take time to identify the root cause and fix it properly.
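One way to make documenting an RCA a habit is to keep it as a small structured record. The following is only a minimal sketch in Python, not a standard tool: the class names FiveWhys and WhyStep and their methods are hypothetical names chosen for this illustration. It records the production example above and treats the cause of the last "why" as the root cause.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WhyStep:
    """One question/answer pair in the chain (hypothetical structure)."""
    question: str
    cause: str
    evidence: str = ""


@dataclass
class FiveWhys:
    """A linear 5 whys record for a single problem (hypothetical structure)."""
    problem: str
    steps: List[WhyStep] = field(default_factory=list)

    def ask(self, question: str, cause: str, evidence: str = "") -> "FiveWhys":
        # Append the next "why" and return self so calls can be chained.
        self.steps.append(WhyStep(question, cause, evidence))
        return self

    def root_cause(self) -> Optional[str]:
        # The cause recorded for the last "why" is treated as the root cause.
        return self.steps[-1].cause if self.steps else None

    def report(self) -> str:
        lines = [f"Problem: {self.problem}"]
        for number, step in enumerate(self.steps, start=1):
            lines.append(f"Why {number}: {step.question}")
            lines.append(f"  Cause: {step.cause}")
            if step.evidence:
                lines.append(f"  Evidence: {step.evidence}")
        lines.append(f"Root cause: {self.root_cause()}")
        return "\n".join(lines)


rca = FiveWhys("Production server stopped responding for some inventory queries.")
rca.ask("Why did the production server stop responding?",
        "Database connection was timing out.",
        "Entries from the log file")
rca.ask("Why was the database connection timing out?",
        "There was a deadlock on a table.",
        "Entries from DB logs")
rca.ask("Why was there a deadlock?",
        "A newly added feature caused the deadlock.",
        "Log file entries")
rca.ask("Why did the new feature cause this issue?",
        "Transaction handling was not right.",
        "Code changes for the fix")
rca.ask("Why did testing or code review not catch this issue?",
        "Code review was rushed to include the important feature.",
        "Internal communication")

print(rca.report())
```

The printed report can be pasted into a ticket or a wiki page, so the RCA lives next to the fix instead of only in someone's memory.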


Some points to note when doing RCA with 5 whys:

In some cases there will be more than one root cause. You will then see two answers to a single why, and the two lines of questioning will diverge as you explore deeper.

In other cases the lines will converge again into a single cause. Do not be discouraged when you find more than one cause. Explore them.

If you find more than two root causes, recheck the analysis, because there might be a mistake in it. If the analysis holds, a system with that many root causes will be highly unstable.
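Diverging and converging causes are easier to follow when the whys are kept as a small graph instead of a straight line. The sketch below is an illustration under assumptions: the function find_root_causes is a hypothetical helper, and the entries marked "(hypothetical)" are invented to show a branch and are not findings from the incident above. A root cause here is any cause with no further cause recorded under it.

```python
from typing import Dict, List


def find_root_causes(causes: Dict[str, List[str]], problem: str) -> List[str]:
    """Walk from the problem through every recorded cause and return the
    causes that have no deeper cause recorded (sorted for stable output)."""
    roots = []
    seen = set()
    stack = [problem]
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # already explored; also guards against circular answers
        seen.add(node)
        deeper = causes.get(node, [])
        if not deeper:
            roots.append(node)
        stack.extend(deeper)
    return sorted(roots)


# Two answers to one why: the analysis diverges into two root causes.
diverging = {
    "server stopped responding": ["database connection timed out"],
    "database connection timed out": ["deadlock on a table"],
    "deadlock on a table": ["transaction handling was not right"],
    "transaction handling was not right": [
        "code review was rushed",
        "no concurrency test (hypothetical)",
    ],
}

# The same branches, but both lead back to one shared cause: they converge.
converging = dict(diverging)
converging["code review was rushed"] = ["deadline pressure (hypothetical)"]
converging["no concurrency test (hypothetical)"] = ["deadline pressure (hypothetical)"]

for name, graph in (("diverging", diverging), ("converging", converging)):
    roots = find_root_causes(graph, "server stopped responding")
    print(name, roots)
    if len(roots) > 2:
        print("More than two root causes: recheck the analysis.")
```

Keeping the answers in a mapping like this makes it cheap to add a second answer to any why and to see at a glance how many root causes the analysis currently ends in.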






