Decision Fatigue on an On-Call Rotation
Staff engineer, 2am page, 40 minutes into prod debugging — judgment inverts by minute 45.
Adrenaline + sleep debt flip reasoning. The rule: never deploy at 3am without a rollback script and a second set of eyes.
Subject: staff engineer, primary on-call for payment systems. Paged 2:04am Thursday for a latency spike, p99 from 120ms to 2,400ms. Forty minutes in, three hypotheses tested, none conclusive. Sleep: 4h20m. Adrenaline: saturating.
judgment_quality = f(sleep_hours, minutes_in_incident, adrenaline)
deploy_allowed = (rollback_script_exists) AND (second_engineer_acked) AND (mitigation_only OR load < high)
if load == high AND clock ∈ [23:00, 06:00]: forward_fix_denied
- At page start: log load, sleep hours, time.
- At minute 30: re-check load. If high, name second on-call.
- Before any deploy: confirm rollback script exists, second engineer acked, mitigation is the target, not root cause.
- Between 23:00 and 06:00: forward fixes denied unless load nominal AND incident under 30 min.
- Post-incident 10am: re-investigate root cause with rest and witnesses.
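The deploy rule and checklist above can be sketched as a small gate function. This is a minimal sketch, not a definitive implementation: the names `Load`, `IncidentState`, `in_night_window`, and `deploy_allowed` are hypothetical, the three load levels are borrowed from the status tiers later in this guide, and where the formula and the checklist differ on overnight forward fixes the sketch assumes the stricter checklist wording (load nominal AND incident under 30 minutes). Treating 06:00 itself as outside the window is also an assumption.

```python
from dataclasses import dataclass
from datetime import time
from enum import IntEnum


class Load(IntEnum):
    # Hypothetical three-level scale; the guide does not define how load is measured.
    NOMINAL = 0
    RISING = 1
    HIGH = 2


@dataclass
class IncidentState:
    load: Load
    minutes_in_incident: int
    clock: time
    rollback_script_exists: bool
    second_engineer_acked: bool
    mitigation_only: bool  # True for revert, rollback, or feature-flag off


def in_night_window(clock: time) -> bool:
    # 23:00-06:00 wraps past midnight, so it is the union of two ranges.
    return clock >= time(23, 0) or clock < time(6, 0)


def deploy_allowed(s: IncidentState) -> bool:
    # Every deploy needs a rollback script and a second engineer's ack.
    if not (s.rollback_script_exists and s.second_engineer_acked):
        return False
    # Mitigation-only deploys pass once both preconditions hold.
    if s.mitigation_only:
        return True
    # Forward fixes are never allowed under high load.
    if s.load >= Load.HIGH:
        return False
    # Overnight, a forward fix needs nominal load and an incident under 30 min.
    if in_night_window(s.clock):
        return s.load == Load.NOMINAL and s.minutes_in_incident < 30
    return True
```

Under these assumptions, the 2:44am revert from the scenario passes the gate once the rollback script exists and the second engineer has acked, while a forward fix at the same moment is denied.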
▸ Key Specs
- ▸ Adrenaline is not alertness. It narrows attention while inverting risk-assessment.
- ▸ Minute 45 of an incident at 2am is worse than minute 5. Judgment has cost, not value, at that point.
- ▸ Load: if high, every deploy needs a rollback script and a second on-call named.
- ▸ Mitigation beats root cause at 3am. Revert, roll back, feature-flag off. Investigate at 10am.
- ▸ "I think I see it" is the phrase that precedes the outage that outlasts the original.
▸ Worked Examples
- The mitigation-first rule: 2:44am, hypothesis 3 suggests a recent deploy caused the spike. Rather than debug forward, revert. `git revert <sha>` + deploy = 4 minutes. p99 recovers to 140ms by 2:52am. Root cause investigated at 10:30am with rested eyes, second engineer, proper logs.
- The rollback-script requirement: 3:15am, hotfix needed. Draft: (1) commit on branch with revert SHA documented, (2) rollback command in the same PR description, (3) second on-call paged to review before deploy. Adds 8 minutes to MTTR. Prevents 8 hours of cascading deploys next morning.
- The inversion signal: "Let me just tweak this one thing and redeploy." At 2:58am this sentence has a 60% chance of extending the incident. At 10:58am it has a 10% chance. Same engineer, same code: rest state, not skill, is the variable.
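The trade-off in the rollback-script example (8 minutes added to MTTR against 8 hours of cascading deploys) can be turned into a break-even probability. The sketch below assumes the 8 hours is the full cost when an unreviewed hotfix goes wrong and that the review prevents it entirely; the function name `break_even_probability` is hypothetical.

```python
def break_even_probability(review_cost_min: float, failure_cost_min: float) -> float:
    """Failure probability above which paying for the review is the cheaper choice."""
    if failure_cost_min <= 0:
        raise ValueError("failure_cost_min must be positive")
    return review_cost_min / failure_cost_min


# 8 minutes of review against 8 hours (480 minutes) of cascading deploys.
p = break_even_probability(8, 8 * 60)
print(f"{p:.1%}")  # 1.7%
```

Under those assumptions, the review pays for itself whenever the chance of a cascading failure is above roughly 1.7%.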
When to use which tool
- CYAN · STABLE — Load nominal, incident under 30 min — continue with normal deploy checklist.
- GOLD · GUARDED — Load rising or incident 30–60 min — mitigation-first, second on-call named before any deploy.
- MAGENTA · CRITICAL — Load high or incident past 60 min at 3am — rollback only, no forward fix, investigate in daylight.
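The three tiers above can be expressed as a classifier. This is a sketch under stated assumptions: the names `Status` and `classify` are hypothetical, load is passed as one of the strings "nominal", "rising", or "high", the function is meant for the overnight window the guide describes, and the boundary handling (exactly 30 and exactly 60 minutes count as GUARDED) is one reading of "30–60 min".

```python
from enum import Enum


class Status(Enum):
    STABLE = "CYAN"
    GUARDED = "GOLD"
    CRITICAL = "MAGENTA"


VALID_LOADS = {"nominal", "rising", "high"}


def classify(load: str, minutes_in_incident: int) -> Status:
    if load not in VALID_LOADS:
        raise ValueError(f"unknown load level: {load!r}")
    # Check the most severe tier first so that either trigger escalates.
    if load == "high" or minutes_in_incident > 60:
        return Status.CRITICAL
    if load == "rising" or minutes_in_incident >= 30:
        return Status.GUARDED
    return Status.STABLE
```

Checking CRITICAL first means a long incident with nominal load still escalates, which matches the "or" in each tier's description.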
Related
- Decision Fatigue · Willpower Battery: Model remaining willpower across the day. Every decision draws from the same finite reserve: trivial × 1, moderate × 5, heavy × 10.
- Burnout Monitor: Estimate when extra work hours stop being worth the fatigue cost from lost sleep.
- Tech Debt Interest: Quantify the compounding hours to fix a shortcut as the codebase grows on top of it. Maintenance heatmap.
Frequently asked questions
› What if the incident legitimately requires a forward fix?
Rare. "Requires forward fix" usually means "I do not have a rollback path" — which is an infra gap to close the next business day. At 3am, feature-flag off or revert; forward fix belongs in daylight with a second engineer.
› Do I wake the second on-call for every deploy?
For every production write-path deploy between 11pm and 6am, yes. The cost is 10 minutes of their sleep. The cost of not doing it is a second outage built on top of the first, which costs hours for everyone.
› How do I know my judgment has inverted?
Signals: you skipped a step of your runbook, argued against a teammate who suggested rollback, or opened a third terminal tab. Any two of three = stop, roll back, escalate, sleep.
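The two-of-three rule is simple enough to write down as a check. The function and parameter names below are hypothetical; each flag is a self-report of one of the three signals listed above.

```python
def judgment_inverted(
    skipped_runbook_step: bool,
    argued_against_rollback: bool,
    opened_third_terminal_tab: bool,
) -> bool:
    """Return True when at least two of the three inversion signals are present."""
    signals = (skipped_runbook_step, argued_against_rollback, opened_third_terminal_tab)
    # Booleans sum as 0 or 1, so this counts the signals that fired.
    return sum(signals) >= 2


if judgment_inverted(True, True, False):
    print("stop, roll back, escalate, sleep")
```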