What is Root Cause Analysis?
Root cause analysis is a systematic method of understanding the underlying cause of a problem.
Correcting only the immediate cause may eliminate a symptom of a problem but not the problem itself, which can then resurface in other parts of the system.
Root cause analysis prevents recurring issues, thereby saving time and money for the business and giving the end-user a much better experience.
A Real-world Example
While driving, Mike noticed that it took more braking effort than usual to stop the car.
Mike took the car to a mechanic. The mechanic reported that the brake pads were worn out and needed to be replaced.
The mechanic removed the brake pads and noticed that the coolant oil was unusually black. He concluded that the spoilt coolant was the problem and that it needed to be replaced. However, he decided to have a closer look. The mechanic inspected the coolant tank thoroughly and noticed that the oil filter was broken.
The mechanic had arrived at the root of the problem. He explained that the broken filter was letting the oil become contaminated. The contaminated oil did not cool the brakes adequately, so they generated excessive heat and wore out sooner than expected.
Without this root cause analysis, another mechanic would have just replaced the brake pads. Mike, frustrated, would soon return with worn-out brake pads and, in all probability, the mechanic would lose Mike as a customer.
Just replacing the brake pads was not enough; the mechanic had to look beyond them to arrive at the root cause.
Root cause analysis saved Mike time and money; the mechanic did not lose a customer, and everybody benefitted in the end.
The Process of Root Cause Analysis
Define the Problem
Problems can be solved only when they are understood. The first step is to describe the issue clearly. This clarity makes it easier to gather data and analyze the situation. For example, instead of stating the problem generally as "pages are slow to load", it is better to state it as "pages are slow to load on Sundays".
Gather Data
Gather as much quantitative and qualitative data related to the problem as possible. In our example of a slow website, the questions to answer would be:
- Who reported the problem?
- When was the slowness observed?
- Which pages?
- What data can we get from our server logs?
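The last question can often be answered with a short script. The sketch below is only an illustration: it assumes a hypothetical log format in which each line holds an ISO 8601 timestamp, a request path, and a response time in milliseconds. The sample lines, the function names, and the 500 ms threshold are all made up for this example, and real server logs will look different.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical sample lines: "<ISO timestamp> <path> <response time in ms>".
SAMPLE_LOG = """\
2024-03-03T01:15:00 /home 120
2024-03-03T02:30:00 /home 950
2024-03-03T03:10:00 /cart 1100
2024-03-03T04:45:00 /home 870
2024-03-03T09:00:00 /cart 140
2024-03-03T14:20:00 /home 110
"""


def mean_response_by_hour(log_text):
    """Return {hour_of_day: mean response time in ms} for the given log text."""
    by_hour = defaultdict(list)
    for line in log_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        timestamp, _path, millis = line.split()
        hour = datetime.fromisoformat(timestamp).hour
        by_hour[hour].append(float(millis))
    return {hour: mean(times) for hour, times in sorted(by_hour.items())}


def slow_hours(log_text, threshold_ms=500.0):
    """Return the hours whose mean response time exceeds threshold_ms."""
    averages = mean_response_by_hour(log_text)
    return [hour for hour, avg in averages.items() if avg > threshold_ms]


if __name__ == "__main__":
    for hour, avg in mean_response_by_hour(SAMPLE_LOG).items():
        print(f"{hour:02d}:00  {avg:7.1f} ms")
    print("Slow hours:", slow_hours(SAMPLE_LOG))
```

Grouping by hour of day is just one possible cut of the data. Grouping by page or by day of the week would help answer the other questions in the list.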
Identify Possible Causes
Based on the data, systematically list all possible explanations for the problem.
For example, from the data, you may realize that the problem happens only between 2 AM and 5 AM EDT.
There could be several reasons for this.
The next step is to drill down and gather evidence that supports or contradicts each hypothesis.
Identify the Root Cause
As you gather and analyze the data, you will narrow the possible explanations down to the issue's leading cause: the root of the problem.
In our example, you may conclude that a full system backup occurs during that time, and that causes the rest of the system to slow down.
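One simple way to look for evidence for or against such a conclusion is to compare response times inside and outside the suspected window. The sketch below assumes made-up observations and takes the 2 AM to 5 AM window from the example. The names `OBSERVATIONS`, `in_window`, and `compare_window` are hypothetical.

```python
from datetime import datetime, time
from statistics import mean

# Hypothetical observations: (timestamp, response time in ms).
OBSERVATIONS = [
    (datetime(2024, 3, 3, 1, 15), 120.0),
    (datetime(2024, 3, 3, 2, 30), 950.0),
    (datetime(2024, 3, 3, 3, 10), 1100.0),
    (datetime(2024, 3, 3, 4, 45), 870.0),
    (datetime(2024, 3, 3, 9, 0), 140.0),
    (datetime(2024, 3, 3, 14, 20), 110.0),
]

# Assumed backup window, taken from the 2 AM to 5 AM observation in the example.
BACKUP_START = time(2, 0)
BACKUP_END = time(5, 0)


def in_window(moment, start=BACKUP_START, end=BACKUP_END):
    """Return True if the time of day of `moment` falls in [start, end)."""
    return start <= moment.time() < end


def compare_window(observations):
    """Return (mean ms inside the window, mean ms outside the window).

    Both groups must be non-empty, otherwise statistics.mean raises an error.
    """
    inside = [ms for ts, ms in observations if in_window(ts)]
    outside = [ms for ts, ms in observations if not in_window(ts)]
    return mean(inside), mean(outside)


if __name__ == "__main__":
    inside_avg, outside_avg = compare_window(OBSERVATIONS)
    print(f"During window:  {inside_avg:.1f} ms")
    print(f"Outside window: {outside_avg:.1f} ms")
```

A large gap between the two averages supports the hypothesis but does not prove it. Skipping or rescheduling the backup for one night and measuring again would be a stronger test.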
Implement the Solution
Once you identify the principal reason behind the problem, the next step is to implement a solution that fixes the issue at hand and prevents its recurrence.
In our example, the solution could be to take regular incremental backups instead of one full backup every week, which is the better practice anyway.
Root Cause Analysis Techniques for Software Defects
Two widely used methods for root cause analysis when investigating software defects are the 5 Whys technique and fishbone analysis.
The 5 Whys Technique
Five whys is a technique used to explore the cause-and-effect relationships underlying a particular defect or problem. The technique's primary goal is to determine the root cause of the issue by repeating the question "Why?". Each answer forms the basis of the next question.
The "five" in the name derives from practical observations of the number of iterations typically needed to reach the root cause. It is a rule of thumb rather than a fixed count: some problems need fewer questions, and others need more.
Let's take a simple example:
Problem: You got a speeding ticket on the way to the office.
- Why? – I was in a hurry.
- Why? – I was late for the office.
- Why? – I got up late.
- Why? – The alarm clock did not ring.
- Why? – I forgot to set the alarm last night.
The last answer will usually point to a process that is not working well or does not exist, so that the process can be either improved or put in place.
In the above example, the solution could be to set a repeating alarm on your mobile phone.
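The chain of questions above can be written down as a small data structure, which is handy when recording an analysis in a defect report. This is a minimal sketch: the `Why` class and the `five_whys` and `root_cause` helpers are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Why:
    """One step in the chain: a question and the answer it produced."""
    question: str
    answer: str


def five_whys(problem, answers):
    """Build the chain so that each answer becomes the subject of the next "Why?"."""
    chain = []
    subject = problem
    for answer in answers:
        chain.append(Why(question=f"Why? ({subject})", answer=answer))
        subject = answer  # the next question asks about this answer
    return chain


def root_cause(chain):
    """Return the last answer, which is the candidate root cause."""
    return chain[-1].answer


PROBLEM = "I got a speeding ticket on the way to the office."
ANSWERS = [
    "I was in a hurry.",
    "I was late for the office.",
    "I got up late.",
    "The alarm clock did not ring.",
    "I forgot to set the alarm last night.",
]

if __name__ == "__main__":
    chain = five_whys(PROBLEM, ANSWERS)
    for step in chain:
        print(step.question, "->", step.answer)
    print("Candidate root cause:", root_cause(chain))
```

The code does not enforce exactly five steps, so the chain can stop wherever the answers stop pointing to a deeper cause.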
Fishbone Analysis
Fishbone analysis is a visual technique used to break down a problem and its causes until you arrive at the fundamental reasons behind it.
This technique is useful when there could be multiple reasons and aspects to the problem rather than just a simple cause.
It's like reverse-engineering the problem to arrive at the possible causes. As you can see from the image at the top of this page, it's a cause-and-effect diagram with the fish's head representing the problem and the spine leading to its potential causes.
The technique first lists all the areas that affect the problem. It then lists the primary causes in each area, followed by the secondary causes that are responsible for the primary ones.
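The same structure (areas, then primary causes, then secondary causes) can be captured as nested data when a drawing tool is not at hand. The sketch below applies it to the slow-pages example. The areas and causes shown are hypothetical placeholders rather than findings, and `render` and `leaf_causes` are illustrative helper names.

```python
# Hypothetical fishbone: area -> primary cause -> list of secondary causes.
FISHBONE = {
    "problem": "Pages are slow to load",
    "areas": {
        "Infrastructure": {
            "Disk I/O saturated": ["Full system backup runs at the same time"],
        },
        "Application": {
            "Slow database queries": ["Missing index", "Large result sets"],
        },
    },
}


def render(fishbone):
    """Return the diagram as indented text lines, one level per cause depth."""
    lines = [f"Problem: {fishbone['problem']}"]
    for area, primaries in fishbone["areas"].items():
        lines.append(f"  {area}")
        for primary, secondaries in primaries.items():
            lines.append(f"    {primary}")
            for secondary in secondaries:
                lines.append(f"      {secondary}")
    return lines


def leaf_causes(fishbone):
    """Return the deepest cause on each branch: the candidates to investigate."""
    leaves = []
    for primaries in fishbone["areas"].values():
        for primary, secondaries in primaries.items():
            # A primary cause with no secondary causes is itself a leaf.
            leaves.extend(secondaries if secondaries else [primary])
    return leaves


if __name__ == "__main__":
    print("\n".join(render(FISHBONE)))
    print("Candidates:", leaf_causes(FISHBONE))
```

Each leaf is still only a candidate. As with the other steps, it needs evidence before it is treated as the root cause.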