Root Cause Analysis (RCA)
Why an incident happened — not just what broke
Every near miss, injury, and system failure has a surface cause and an underlying system cause. A good RCA finds the second one. The standards are not complex, but the discipline required to stop at the real answer — instead of the first convenient answer — is what separates safety programs that improve from safety programs that repeat the same incidents.
What Root Cause Analysis Actually Is
Root cause analysis (RCA) is a structured process for tracing an incident, near miss, or system failure past its immediate trigger to the organizational or design factor that made the incident possible. It is not a blame-finding tool. It is not a five-minute checklist at the bottom of the incident form. Done seriously, it takes hours to days and produces a short list of system changes that reduce the probability of the same category of incident ever recurring.
The core insight: every accident is the final step of a chain. A worker falls off a ladder — the immediate cause is the slip. The proximate cause is the wet rung. The contributing cause is the worker rushing to finish before a storm. The root cause is a work-order system that rewards speed over weather assessment. Fix only the slip and the next worker falls off the same ladder in the next storm.
The single biggest RCA failure mode
Stopping at “operator error” or “worker did not follow procedure.” People make mistakes every day — the system is what fails when a single mistake causes harm. An investigation that ends with a re-training slip and no system change will see the same incident again with a different employee.
When an RCA Is Triggered
Different frameworks set different thresholds. Common triggers in a US facility include an OSHA recordable injury, a TJC sentinel event, and any near miss with serious-injury potential.
The rule of thumb: if you had to stop work, file a report, or tell a regulator, you also owe an RCA. Near misses are the highest-leverage opportunity — they give you the same lesson at zero cost of harm.
The Five Methods You Should Know
RCA is a family of methods, not a single recipe. Pick the one that matches the complexity of your incident.
1. The 5 Whys (Toyota)
Start with a clear problem statement. Ask “Why?” Answer it. Ask “Why?” of that answer. Repeat until you hit a systemic cause you can actually change. The name is a guideline — some chains end at 3, others at 7.
Worked example — sprinkler failed to flow
- Problem: Sprinkler in Room 214 did not flow during fire activation on March 12.
- Why? The zone control valve feeding Room 214 was closed.
- Why? A plumber closed it during a bathroom remodel and never reopened it.
- Why? There was no written impairment procedure for contractor work on fire protection valves.
- Why? The impairment program covered in-house maintenance but not third-party contractors.
- Why? The impairment SOP was written in 2009 and never revised after the facility started outsourcing plumbing.
Stopping point: the fifth why reveals the actionable system failure — a stale SOP that does not reflect current operating reality. The fix is to revise the SOP to cover all personnel who can operate a fire-protection valve, plus add an annual SOP-review step. “Re-train the plumber” would have been a shallow answer; the plumber in question will never work on the building again.
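The chain above can be sketched as plain data: a problem statement plus an ordered list of answers, where the deepest answer reached is the candidate systemic cause. The structure and function names below are illustrative, not a standard format:

```python
# The worked example as plain data: a problem plus an ordered chain of
# answers. The last answer is the candidate systemic cause.
problem = "Sprinkler in Room 214 did not flow during fire activation."
whys = [
    "The zone control valve feeding Room 214 was closed.",
    "A plumber closed it during a remodel and never reopened it.",
    "No written impairment procedure covered contractor work on valves.",
    "The impairment program covered in-house maintenance only.",
    "The SOP was written in 2009 and never revised after outsourcing.",
]

def root_cause(problem: str, whys: list) -> str:
    """Print the chain and return the deepest answer reached."""
    print(f"Problem: {problem}")
    for i, answer in enumerate(whys, 1):
        print(f"  Why {i}: {answer}")
    return whys[-1]

cause = root_cause(problem, whys)
```

The point of the data shape: the chain has no fixed length, and the output of the exercise is the single deepest answer, which feeds the corrective/preventive action plan.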
2. Fishbone / Ishikawa Diagram
Named after Kaoru Ishikawa (Tokyo, 1960s). Draw the problem at the head of a fish skeleton. Each major bone is a category of potential cause. Classic categories for manufacturing are the 6 Ms: Man, Machine, Material, Method, Measurement, Mother Nature (environment). For service/healthcare work the 4 Ps are often used: People, Policies, Procedures, Plant.
For each bone, brainstorm every contributing factor. Then drill each factor with a short 5 Whys. Use Fishbone when the incident clearly has multiple contributing streams that need to be mapped before any one can be drilled.
When to reach for Fishbone over 5 Whys
- Multiple people or teams involved in the failure path
- Multiple equipment or system failures occurring at once
- A pattern of similar incidents that suggests a deeper systemic issue
- You want to prove to auditors that you considered every angle, not just the first one
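As a data sketch, a fishbone is just a mapping from category (bone) to brainstormed factors; bones that collect factors become the candidates for a short 5 Whys drill. Every entry below is hypothetical:

```python
# A fishbone as a plain mapping: each 6M category (bone) holds the
# contributing factors brainstormed for it. All entries are hypothetical.
fishbone = {
    "Man":         ["plumber unaware of impairment rules"],
    "Machine":     ["no supervisory switch on the zone valve"],
    "Material":    [],
    "Method":      ["impairment SOP excludes third-party contractors"],
    "Measurement": ["no post-work valve-position verification"],
    "Environment": [],
}

# Bones that collected factors become candidates for a short 5 Whys drill.
candidates = {bone: factors for bone, factors in fishbone.items() if factors}
print(sorted(candidates))  # ['Machine', 'Man', 'Measurement', 'Method']
```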
3. Fault Tree Analysis (FTA)
A top-down logic diagram originating at Bell Labs in 1962 for the Minuteman missile program. The top event is the failure under investigation. Each branch below uses AND gates (all sub-events required) or OR gates (any sub-event sufficient) to decompose how the top event can occur. Probabilities assigned to each leaf let you calculate the probability of the top event mathematically.
FTA is the right tool for complex life-safety systems — fire pump failure analysis, clean-agent system nonactivation, ICU power-loss cascade. Overkill for a sprained ankle. Use when the incident involves multiple systems and you need to quantify which branch deserves the limited corrective-action budget.
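The gate math is simple enough to sketch. Assuming independent events, an AND gate multiplies leaf probabilities, and an OR gate is the complement of every sub-event failing to occur. The tree and all probabilities below are hypothetical:

```python
from math import prod

def and_gate(probs):
    """All sub-events required (assumes independent events)."""
    return prod(probs)

def or_gate(probs):
    """Any sub-event sufficient (assumes independent events)."""
    return 1 - prod(1 - p for p in probs)

# Hypothetical tree. Top event: fire pump fails to deliver water on demand.
# OR( power loss AND generator failure, controller fault, suction valve closed )
p_power_loss = 0.05
p_gen_fail   = 0.10
p_controller = 0.01
p_valve      = 0.02

p_top = or_gate([
    and_gate([p_power_loss, p_gen_fail]),  # both must happen: 0.005
    p_controller,
    p_valve,
])
print(round(p_top, 4))  # ≈ 0.0347
```

Note that the closed-valve branch dominates the power branch here, which is exactly the kind of answer that tells you where the limited corrective-action budget should go.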
4. Swiss Cheese Model (James Reason, 1990)
Rather than a method, the Swiss Cheese model is a mental framework. Every organizational defense — engineering controls, procedures, training, supervision, final inspection — is a slice of Swiss cheese with holes in it (the latent conditions and active failures that plague every system). An incident happens when holes in every slice momentarily align.
Use it during an RCA to ask: which defense layers had holes, and why was each slice weak? The answer points directly at which defenses need reinforcement and which ones are working. This framing is especially useful in healthcare (TJC sentinel event RCAs), where the chain from latent system flaw → active failure → harm is long.
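The model's arithmetic shows why layered defenses work. If each slice fails independently with some probability (a deliberate simplification — Reason's point is that latent conditions often correlate across slices), the probability of full alignment is the product, so each added or strengthened slice reduces incident probability multiplicatively. All numbers below are hypothetical:

```python
from math import prod

def p_alignment(p_holes):
    """Probability that the holes in every defense layer line up at once,
    assuming independent layers (a deliberate simplification)."""
    return prod(p_holes)

# Hypothetical per-slice failure probabilities:
# engineering control, procedure, training, final inspection
slices = [0.01, 0.10, 0.20, 0.30]
base = p_alignment(slices)                             # ≈ 6e-05
with_extra_slice = p_alignment(slices + [0.30])        # ≈ 1.8e-05
halved_hole = p_alignment([0.005, 0.10, 0.20, 0.30])   # ≈ 3e-05
```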
5. 8D Problem Solving
Ford's 8-discipline process: (D1) team, (D2) problem description, (D3) containment, (D4) root cause, (D5) corrective action, (D6) validation, (D7) preventive action, (D8) congratulate the team. Common in manufacturing and supplier quality. The value for safety work is the explicit separation of containment (stop the bleeding now) from preventive action (change the system so it cannot bleed again).
Corrective vs Preventive Action
Every RCA should produce at least one of each. Stopping at only corrective action leaves the underlying defect intact.
Corrective Action
Fixes the specific problem now.
- Replace the damaged guard
- Re-train the injured employee and crew
- Post the missing warning label
- Close the valve that was left open
- Re-pour the missing thrust block
Preventive Action
Changes the system so the problem cannot recur.
- Redesign the machine so the guard is not removable during operation
- Add the failure mode to the monthly walkthrough checklist
- Automate label verification through asset-management software
- Rewrite the impairment SOP to cover contractors
- Change the design spec to require restrained joints on every hydrant
The hierarchy of controls applies here: preventive actions at the top of the hierarchy (elimination, substitution, engineering) are more durable than preventive actions at the bottom (administrative, PPE). A corrective action plan that consists entirely of bottom-of-hierarchy interventions is a warning sign that the RCA stopped too soon.
Common RCA Pitfalls
- Stopping at “human error.” Human error is nearly always a symptom, not a cause. Ask why the system allowed the error to reach harm.
- Picking the first plausible explanation. The whole point of the 5 Whys is to keep going past the first comfortable answer. Honor the methodology.
- Confusing correlation with causation. “The last time we painted the valve room, a valve was later found closed” is a pattern worth looking at, not a proven cause.
- Writing the RCA before the evidence is gathered. Photograph the scene, preserve physical evidence, interview witnesses separately, pull logs and maintenance records — then analyze.
- Blame language. The RCA report should describe system failures, not character failures. “The technician failed to verify” is less useful than “The procedure did not require a verification checkpoint, and none was performed.”
- No follow-through. Preventive actions must have an owner, a deadline, and a verification step — otherwise they are opinions. OSHA, TJC, and CMS all look for the closure evidence in audits.
- Narrow scope. If three similar incidents happened in the same quarter, the RCA scope is the pattern, not the most recent event.
- Retrospective bias. After the fact, every missed signal looks obvious. Ask yourself whether the signal was genuinely detectable at the time, with the information the team had — or only after you knew the outcome.
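The follow-through pitfall implies a minimum record shape: every preventive action carries an owner, a deadline, and a verification field, and it is not closed until the verification evidence exists. A minimal sketch, with hypothetical field names and values:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class PreventiveAction:
    """Minimum closure fields auditors look for: owner, deadline, evidence."""
    description: str
    owner: str
    deadline: date
    verified_on: Optional[date] = None  # date of closure evidence, once it exists

    @property
    def closed(self) -> bool:
        # An action without verification evidence is still an opinion.
        return self.verified_on is not None

action = PreventiveAction(
    description="Revise impairment SOP to cover third-party contractors",
    owner="Facilities Manager",  # hypothetical owner
    deadline=date(2026, 6, 1),
)
assert not action.closed             # open until verification is recorded
action.verified_on = date(2026, 5, 20)
assert action.closed
```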
How LifeSafetyWiki Tools Fit In
We built three interactive tools that together form a working RCA workflow.