Root Cause Analysis (RCA)
Why an incident happened — not just what broke
Every near miss, injury, and system failure has a surface cause and an underlying system cause. A good RCA finds the second one. The standards are not complex, but the discipline required to stop at the real answer — instead of the first convenient answer — is what separates safety programs that improve from safety programs that repeat the same incidents.
What Root Cause Analysis Actually Is
Root cause analysis (RCA) is a structured process for tracing an incident, near miss, or system failure past its immediate trigger to the organizational or design factor that made the incident possible. It is not a blame-finding tool. It is not a five-minute checklist at the bottom of the incident form. Done seriously, it takes hours to days and produces a short list of system changes that reduce the probability of the same category of incident ever recurring.
The core insight: every accident is the final step of a chain. A worker falls off a ladder — the immediate cause is the slip. The proximate cause is the wet rung. The contributing cause is the worker rushing to finish before a storm. The root cause is a work-order system that rewards speed over weather assessment. Fix only the slip and the next worker falls off the same ladder in the next storm.
The single biggest RCA failure mode
Stopping at “operator error” or “worker did not follow procedure.” People make mistakes every day — the system is what fails when a single mistake causes harm. An investigation that ends with a re-training slip and no system change will see the same incident again with a different employee.
When an RCA Is Triggered
Different frameworks set different thresholds. Common triggers in a US facility include an OSHA recordable injury, a TJC sentinel event, and any near miss with serious-injury potential.
The rule of thumb: if you had to stop work, file a report, or tell a regulator, you also owe an RCA. Near misses are the highest-leverage opportunity — they give you the same lesson at zero cost of harm.
The Five Methods You Should Know
RCA is a family of methods, not a single recipe. Pick the one that matches the complexity of your incident.
1. The 5 Whys (Toyota)
Start with a clear problem statement. Ask “Why?” Answer it. Ask “Why?” of that answer. Repeat until you hit a systemic cause you can actually change. The name is a guideline — some chains end at 3, others at 7.
Worked example — sprinkler failed to flow
- Problem: Sprinkler in Room 214 did not flow during fire activation on March 12.
- Why? The zone control valve feeding Room 214 was closed.
- Why? A plumber closed it during a bathroom remodel and never reopened it.
- Why? There was no written impairment procedure for contractor work on fire protection valves.
- Why? The impairment program covered in-house maintenance but not third-party contractors.
- Why? The impairment SOP was written in 2009 and never revised after the facility started outsourcing plumbing.
Stopping point: the fifth why reveals the actionable system failure — a stale SOP that does not reflect current operating reality. The fix is to revise the SOP to cover all personnel who can operate a fire-protection valve, plus add an annual SOP-review step. “Re-train the plumber” would have been a shallow answer; the plumber in question will never work on the building again.
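The chain above can be sketched as plain data: a problem statement plus an ordered list of answers, where the deepest answer reached is the candidate systemic cause. The structure and function names below are illustrative, not a standard format:

```python
# The worked example as plain data: a problem plus an ordered chain of
# answers. The last answer is the candidate systemic cause.
problem = "Sprinkler in Room 214 did not flow during fire activation."
whys = [
    "The zone control valve feeding Room 214 was closed.",
    "A plumber closed it during a remodel and never reopened it.",
    "No written impairment procedure covered contractor work on valves.",
    "The impairment program covered in-house maintenance only.",
    "The SOP was written in 2009 and never revised after outsourcing.",
]

def root_cause(problem: str, whys: list) -> str:
    """Print the chain and return the deepest answer reached."""
    print(f"Problem: {problem}")
    for i, answer in enumerate(whys, 1):
        print(f"  Why {i}: {answer}")
    return whys[-1]

cause = root_cause(problem, whys)
```

The point of the data shape: the chain has no fixed length, and the output of the exercise is the single deepest answer, which feeds the corrective/preventive action plan.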
2. Fishbone / Ishikawa Diagram
Named after Kaoru Ishikawa (Tokyo, 1960s). Draw the problem at the head of a fish skeleton. Each major bone is a category of potential cause. Classic categories for manufacturing are the 6 Ms: Man, Machine, Material, Method, Measurement, Mother Nature (environment). For service/healthcare work the 4 Ps are often used: People, Policies, Procedures, Plant.
For each bone, brainstorm every contributing factor. Then drill each factor with a short 5 Whys. Use Fishbone when the incident clearly has multiple contributing streams that need to be mapped before any one can be drilled.
When to reach for Fishbone over 5 Whys
- Multiple people or teams involved in the failure path
- Multiple equipment or system failures occurring at once
- A pattern of similar incidents that suggests a deeper systemic issue
- You want to prove to auditors that you considered every angle, not just the first one
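As a data sketch, a fishbone is just a mapping from category (bone) to brainstormed factors; bones that collect factors become the candidates for a short 5 Whys drill. Every entry below is hypothetical:

```python
# A fishbone as a plain mapping: each 6M category (bone) holds the
# contributing factors brainstormed for it. All entries are hypothetical.
fishbone = {
    "Man":         ["plumber unaware of impairment rules"],
    "Machine":     ["no supervisory switch on the zone valve"],
    "Material":    [],
    "Method":      ["impairment SOP excludes third-party contractors"],
    "Measurement": ["no post-work valve-position verification"],
    "Environment": [],
}

# Bones that collected factors become candidates for a short 5 Whys drill.
candidates = {bone: factors for bone, factors in fishbone.items() if factors}
print(sorted(candidates))  # ['Machine', 'Man', 'Measurement', 'Method']
```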
3. Fault Tree Analysis (FTA)
A top-down logic diagram originating at Bell Labs in 1962 for the Minuteman missile program. The top event is the failure under investigation. Each branch below uses AND gates (all sub-events required) or OR gates (any sub-event sufficient) to decompose how the top event can occur. Probabilities assigned to each leaf let you calculate the probability of the top event mathematically.
FTA is the right tool for complex life-safety systems — fire pump failure analysis, clean-agent system nonactivation, ICU power-loss cascade. Overkill for a sprained ankle. Use when the incident involves multiple systems and you need to quantify which branch deserves the limited corrective-action budget.
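The gate math is simple enough to sketch. Assuming independent events, an AND gate multiplies leaf probabilities, and an OR gate is the complement of every sub-event failing to occur. The tree and all probabilities below are hypothetical:

```python
from math import prod

def and_gate(probs):
    """All sub-events required (assumes independent events)."""
    return prod(probs)

def or_gate(probs):
    """Any sub-event sufficient (assumes independent events)."""
    return 1 - prod(1 - p for p in probs)

# Hypothetical tree. Top event: fire pump fails to deliver water on demand.
# OR( power loss AND generator failure, controller fault, suction valve closed )
p_power_loss = 0.05
p_gen_fail   = 0.10
p_controller = 0.01
p_valve      = 0.02

p_top = or_gate([
    and_gate([p_power_loss, p_gen_fail]),  # both must happen: 0.005
    p_controller,
    p_valve,
])
print(round(p_top, 4))  # ≈ 0.0347
```

Note that the closed-valve branch dominates the power branch here, which is exactly the kind of answer that tells you where the limited corrective-action budget should go.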
4. Swiss Cheese Model (James Reason, 1990)
Rather than a method, the Swiss Cheese model is a mental framework. Every organizational defense — engineering controls, procedures, training, supervision, final inspection — is a slice of Swiss cheese with holes in it (the latent conditions and active failures that plague every system). An incident happens when holes in every slice momentarily align.
Use it during an RCA to ask: which defense layers had holes, and why was each slice weak? The answer points directly at which defenses need reinforcement and which ones are working. This framing is especially useful in healthcare (TJC sentinel event RCAs), where the chain from latent system flaw → active failure → harm is long.
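The model's arithmetic shows why layered defenses work. If each slice fails independently with some probability (a deliberate simplification — Reason's point is that latent conditions often correlate across slices), the probability of full alignment is the product, so each added or strengthened slice reduces incident probability multiplicatively. All numbers below are hypothetical:

```python
from math import prod

def p_alignment(p_holes):
    """Probability that the holes in every defense layer line up at once,
    assuming independent layers (a deliberate simplification)."""
    return prod(p_holes)

# Hypothetical per-slice failure probabilities:
# engineering control, procedure, training, final inspection
slices = [0.01, 0.10, 0.20, 0.30]
base = p_alignment(slices)                             # ≈ 6e-05
with_extra_slice = p_alignment(slices + [0.30])        # ≈ 1.8e-05
halved_hole = p_alignment([0.005, 0.10, 0.20, 0.30])   # ≈ 3e-05
```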
5. 8D Problem Solving
Ford's 8-discipline process: (D1) team, (D2) problem description, (D3) containment, (D4) root cause, (D5) corrective action, (D6) validation, (D7) preventive action, (D8) congratulate the team. Common in manufacturing and supplier quality. The value for safety work is the explicit separation of containment (stop the bleeding now) from preventive action (change the system so it cannot bleed again).
Corrective vs Preventive Action
Every RCA should produce at least one of each. Stopping at only corrective action leaves the underlying defect intact.
Corrective Action
Fixes the specific problem now.
- Replace the damaged guard
- Re-train the injured employee and crew
- Post the missing warning label
- Close the valve that was left open
- Re-pour the missing thrust block
Preventive Action
Changes the system so the problem cannot recur.
- Redesign the machine so the guard is not removable during operation
- Add the failure mode to the monthly walkthrough checklist
- Automate label verification through asset-management software
- Rewrite the impairment SOP to cover contractors
- Change the design spec to require restrained joints on every hydrant
The hierarchy of controls applies here: preventive actions at the top of the hierarchy (elimination, substitution, engineering) are more durable than preventive actions at the bottom (administrative, PPE). A corrective action plan that consists entirely of bottom-of-hierarchy interventions is a warning sign that the RCA stopped too soon.
Common RCA Pitfalls
- Stopping at “human error.” Human error is nearly always a symptom, not a cause. Ask why the system allowed the error to reach harm.
- Picking the first plausible explanation. The whole point of the 5 Whys is to keep going past the first comfortable answer. Honor the methodology.
- Confusing correlation with causation. “The last time we painted the valve room, a valve was later found closed” is a pattern worth looking at, not a proven cause.
- Writing the RCA before the evidence is gathered. Photograph the scene, preserve physical evidence, interview witnesses separately, pull logs and maintenance records — then analyze.
- Blame language. The RCA report should describe system failures, not character failures. “The technician failed to verify” is less useful than “The procedure did not require a verification checkpoint, and none was performed.”
- No follow-through. Preventive actions must have an owner, a deadline, and a verification step — otherwise they are opinions. OSHA, TJC, and CMS all look for the closure evidence in audits.
- Narrow scope. If three similar incidents happened in the same quarter, the RCA scope is the pattern, not the most recent event.
- Retrospective bias. After the fact, every missed signal looks obvious. Ask yourself whether the signal was genuinely detectable at the time, with the information the team had — or only after you knew the outcome.
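The follow-through pitfall implies a minimum record shape: every preventive action carries an owner, a deadline, and a verification field, and it is not closed until the verification evidence exists. A minimal sketch, with hypothetical field names and values:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class PreventiveAction:
    """Minimum closure fields auditors look for: owner, deadline, evidence."""
    description: str
    owner: str
    deadline: date
    verified_on: Optional[date] = None  # date of closure evidence, once it exists

    @property
    def closed(self) -> bool:
        # An action without verification evidence is still an opinion.
        return self.verified_on is not None

action = PreventiveAction(
    description="Revise impairment SOP to cover third-party contractors",
    owner="Facilities Manager",  # hypothetical owner
    deadline=date(2026, 6, 1),
)
assert not action.closed             # open until verification is recorded
action.verified_on = date(2026, 5, 20)
assert action.closed
```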
How LifeSafetyWiki Tools Fit In
We built three interactive tools that together form a working RCA workflow.