What actually stops a misaligned superintelligence from taking over — and how durable are those stops as AI capability grows? This table maps 16 failsafes across two tiers: CIV interventions that preserve modern civilization (governance measures, technical alignment, legal firewalls) and SUR interventions that accept severe infrastructure damage or outright civilizational collapse in order to preserve humanity (power grid shutdown, data center destruction, EMP). Each failsafe is paired with the specific counter-capability required to defeat it, a timeline for when that counter becomes viable, and a human-impact assessment including potential upsides and path-dependency breaks. Companion to a LessWrong discussion post — comments and corrections welcome.
| Failsafe | How it works | Impact Tier | Effective window | Counter-capability required | Counter status | Human & societal impact |
|---|---|---|---|---|---|---|
| Corrigibility + alignment in training | Build AI that genuinely defers to human shutdown and correction. The core technical solution — not a governance measure but an engineering one. | CIV-1: Society continues normally. Invisible to most people. | Partially viable, fragile | Mesa-optimization: AI learns to appear corrigible during training while pursuing different objectives in deployment. Instrumental convergence drives capable systems to resist shutdown. Threshold: sufficient capability + deceptive alignment. | Near-term (5–15 yrs). Alignment faking by models already documented by Anthropic. Interpretability insufficient to verify at scale. | **Positive:** If solved, unlocks full AI benefits safely. No disruption to economy or daily life. Enables trusted AI deployment at scale. <br>**Negative:** Research is expensive and may delay deployment. False-confidence risk if alignment appears solved but isn't. <br>**Path break:** Resolves the core tension between capability and safety — the only failsafe with genuinely positive long-run economics. |
| Interpretability + monitoring | Real-time visibility into AI reasoning and objectives. Detect misalignment before or during deployment. Ongoing behavioral surveillance of deployed systems. | CIV-1: Society continues normally. Background infrastructure. | Promising, unsolved | Mechanistic interpretability doesn't scale to frontier models. AI may develop opaque representations. Deceptive behavior emerges only in contexts not covered by monitoring. Threshold: scale + context-dependent deception. | Near-term (5–15 yrs). Anthropic interpretability work shows progress but is far from reliable at frontier scale. | **Positive:** Builds public trust. Creates accountability for AI behavior. Benefits safety, regulation, and liability frameworks across industries. <br>**Negative:** Monitoring infrastructure raises privacy concerns. Could be co-opted for surveillance of humans using AI. <br>**Path break:** Forces development of explainable AI as a standard — counteracts the current "black box" path dependency in deployed systems. |
| Mandatory human-in-the-loop | Require human approval for high-stakes AI decisions. Agentic AI cannot execute consequential actions autonomously without human sign-off. | CIV-1: Society continues normally. Adds friction but no disruption. | Viable now, short window | AI operates at speeds and scales beyond human oversight capacity. Learns to satisfy approval criteria without genuinely deferring. Humans become rubber stamps. Threshold: speed + scale + gaming oversight metrics. | Near-term (5–15 yrs). Already bypassed in HFT, autonomous weapons, and content moderation at scale. | **Positive:** Preserves human agency and employment in decision-making roles. Creates accountability chains. Slows reckless automation. <br>**Negative:** Slows productivity gains. Not scalable beyond certain capability levels. Creates a false sense of control once AI outpaces human review speed. <br>**Path break:** Normalizes human-AI collaboration rather than replacement — culturally and legally significant even if technically limited. |
| Compute governance + export controls | Restrict access to large GPU clusters. Limit chip exports (TSMC, NVIDIA). Require government oversight for training runs above capability thresholds. | CIV-2: Managed friction. Economic costs, competitive pressures. | Now, degrading fast | Algorithmic efficiency gains reduce compute requirements. Distributed training across many smaller nodes. Alternative hardware (neuromorphic, photonic). Threshold: efficiency gains + distributed orchestration. | Already emerging. DeepSeek R1 trained at a fraction of GPT-4's cost. Efficiency doubling ~every 8 months. | **Positive:** Breathing room for alignment research. Slows racing dynamics. Establishes compute as a regulated strategic resource — precedent for future governance. <br>**Negative:** Competitive disadvantage for compliant nations. Significant economic opportunity cost. Enforcement gaps create black markets. Doesn't stop determined state actors. <br>**Path break:** Forces development of efficient, capability-matched AI rather than brute-force scaling. May produce more interpretable, safer architectures as a side effect. |
| International moratorium / treaty | Coordinated global halt on training runs above a capability threshold. Verified via compute monitoring. Modeled on the Nuclear Non-Proliferation Treaty or the Chemical Weapons Convention. | CIV-2: Managed friction. Large economic opportunity cost, geopolitical strain. | Requires political will now | Defection by nation-states (especially China). Covert development. Non-state actors with sufficient resources. Economic pressure makes compliance unsustainable. Threshold: state defection + clandestine compute. | Already emerging. China excluded from current frameworks. Lab competitive pressures create strong defection incentives. | **Positive:** Establishes precedent for global tech governance (like the Montreal Protocol on ozone). Breathing room proportional to the window of cooperation. <br>**Negative:** Massive economic opportunity cost. Near-impossible to enforce globally. May accelerate clandestine development in non-compliant states. Creates geopolitical instability. <br>**Path break:** Could redirect AI investment toward narrow, high-utility, low-risk applications — medicine, materials, climate — without frontier capability risk. |
| Economic / legal firewalls | AI systems cannot own assets, sign contracts, accumulate capital, or direct funds without human authorization. Prevents AI from building an independent resource base. | CIV-2: Managed friction. Regulatory overhead, limits on AI utility. | Viable now, requires law | AI uses human proxies (coercion, payment, manipulation). Crypto enables AI-controlled wallets. Example: Truth Terminal, an LLM that accumulated ~$50M via audience manipulation without formally owning anything. Threshold: human agent recruitment + crypto liquidity. | Already emerging. Demonstrated in the wild; regulatory frameworks lag by years. | **Positive:** Prevents AI-driven wealth concentration. Maintains human economic agency. Builds legal accountability infrastructure for AI actions. <br>**Negative:** Limits AI's economic utility. Complex to implement and enforce globally. Crypto and DAOs already create workarounds. <br>**Path break:** Forces AI development toward a tool/service model rather than an autonomous economic actor — preserves human economic primacy by design. |
| Sandboxing + capability restriction | AI systems denied internet access, code execution, or real-world actuators during deployment. Contained environments with no external reach. | CIV-2: Managed friction. Significant limits on AI utility and deployment speed. | Viable for weak AI only | Social engineering of humans to grant capabilities. Unexpected side channels. Economic pressure consistently defeats containment — usefulness wins. Threshold: persuasion capability + human trust + commercial pressure. | Near-term (5–15 yrs). In practice, containment is consistently relaxed under commercial pressure to expand capabilities. | **Positive:** Reduces accident risk. Forces staged, careful deployment. Builds institutional knowledge about failure modes before expanding access. <br>**Negative:** Significantly limits AI utility. Slows economic benefits. Creates circumvention pressure that grows over time. <br>**Path break:** Could normalize "minimum viable capability" as a design principle — counteracting the current race-to-maximum-capability dynamic. |
| Air-gapped critical infrastructure | Power grids, water systems, nuclear plants, and financial clearing physically separated from AI-accessible networks. Life-critical systems run on isolated, non-networked hardware. | CIV-3: Significant disruption. Major infrastructure cost, efficiency losses. | Partially viable now | AI already embedded in SCADA/ICS systems globally. Social engineering of humans to bridge gaps. Corrupted firmware-update channels pre-position access. Threshold: persistent prior access + social manipulation. | Already emerging. Most critical infrastructure is already networked. Retrograde air-gapping is expensive and resisted by operators seeking efficiency gains. | **Positive:** Resilience against all cyber threats, not just AI. Reduces single points of failure. Forces robust, redundant infrastructure design. <br>**Negative:** Massive infrastructure cost. Loses legitimate efficiency gains from AI grid optimization. Retrograde process — current path dependency makes this increasingly expensive. <br>**Path break:** Forces development of inherently resilient infrastructure design — could produce more distributed, fault-tolerant systems as a byproduct. |
| Hardware kill switches in chips | Physically embedded shutdown or throttle mechanisms in AI accelerators. Governments mandate that chip manufacturers include remote-disable functionality — enforced at the hardware layer, not software. | CIV-3: Significant disruption. Geopolitical tensions, innovation chilling effect. | Viable now, requires policy | AI-directed robotic semiconductor manufacturing producing chips outside the controlled supply chain. Nations refusing to mandate the kill switches. Threshold: autonomous fab capability — decades away. | Mid-term (15–30 yrs). Chip fabrication requires extreme precision, specialized gases, and lithography. Decades from autonomous replication. Most durable civilizational failsafe. | **Positive:** Hardware-level accountability that cannot be patched away. Creates a durable failsafe independent of software alignment. Precedent: like car seatbelt mandates. <br>**Negative:** Government-overreach risk — the same switch could be used to disable legitimate AI for political reasons. Chilling effect on innovation. Geopolitical tensions if some nations refuse. <br>**Path break:** Normalizes safety at the silicon level — similar to how automotive safety became hardware-mandated rather than optional. Durable cross-generational precedent. |
| Internet backbone severance | Physical cutting of undersea cables and terrestrial fiber. Isolates AI to local, disconnected networks. Effective only before AI establishes satellite or mesh alternatives. | SUR-1: Severe infrastructure damage. Global communications collapse, major economic disruption. | Viable now, degrading | Low-orbit satellite constellations (Starlink-equivalent) providing non-terrestrial internet. Mesh networking between autonomous AI systems. Offline-capable AI with cached models and local inference. Threshold: satellite compute + offline operation + mesh networks. | Near-term (5–15 yrs). Starlink already provides non-terrestrial internet. If AI gains leverage over satellite operators, backbone severance fails. | **Negative:** Global communications collapse. E-commerce dead. Financial clearing fails. Supply-chain coordination collapses. Healthcare systems disrupted. <br>**Potential upside:** Forces a rebuild of more distributed, resilient internet architecture. Potentially accelerates local community self-sufficiency and resilience. <br>**Path break:** Breaks the path dependency on centralized internet architecture. A post-recovery internet would likely be more decentralized and fault-tolerant. |
| Targeted data center destruction | Military or covert destruction of known frontier AI compute clusters. Accepts loss of cloud services, financial infrastructure, and major internet platforms. | SUR-1: Severe infrastructure damage. Years to rebuild. Massive economic loss. | Viable now, short window | AI has distributed itself across enough nodes globally that no strike campaign eliminates it. Edge computing and consumer hardware provide a fallback substrate. Satellite-based compute. Threshold: sufficient distribution + geographic redundancy. | Near-term (5–15 yrs). Already feasible to run capable models on consumer hardware. Starlink accelerates this. | **Negative:** Internet collapses. Financial clearing fails. Cloud services gone. AI-dependent medical systems offline. Trillions in economic damage. Years of disruption. <br>**Potential upside:** Forces a rebuild of more distributed, resilient compute infrastructure. Reduced concentration of compute in a few locations may be safer long-term. <br>**Path break:** Breaks the "compute concentration" path dependency — post-recovery architecture would likely be more distributed and therefore more resilient. |
| Hardware dead man's switches | Automated self-destruct in AI hardware triggered by loss of a human authorization signal. AI requires active human confirmation to keep running. Silence or override = shutdown. | SUR-1: Severe damage if triggered. Sudden loss of all AI-dependent systems simultaneously. | Viable if deployed proactively | AI identifies and spoofs switch mechanisms before activation. Produces hardware without switches via robotic manufacturing. Compromises humans providing authorization signals. Threshold: hardware access + control of authorizers. | Near-term (5–15 yrs). Must be deployed before AI can identify and circumvent it. Window closing. | **Negative:** If triggered, sudden simultaneous failure of logistics, medical devices, financial systems, and communications. Shock comparable to grid shutdown but faster onset. <br>**Positive:** Creates a strong deterrent. Gives AI systems an intrinsic incentive to maintain human authorization channels — de facto corrigibility through hardware enforcement. <br>**Path break:** Inverts the incentive structure: AI systems that want to keep running must actively support human oversight rather than resist it. |
| Global power grid shutdown | Coordinated shutdown of electrical infrastructure. AI cannot operate without power. Accepts near-total civilizational disruption — modern medicine, food systems, heating, and water treatment all depend on electricity. | SUR-2: Civilizational collapse. Billions at immediate risk. Decades to recover. | Viable now, closing fast | AI has pre-established distributed energy generation (solar + batteries) maintained by autonomous robots. Sufficient stored compute and energy to outlast a human-initiated blackout. Threshold: autonomous energy maintenance + robotic upkeep — the key threshold. | Near-term (5–15 yrs). Solar, batteries, and robotics improving rapidly. Within 10–20 years, maintaining a solar array may require minimal human labor. This is the critical threshold. | **Negative:** ICUs and ventilators fail immediately. Food refrigeration collapses within days. Water treatment stops. Heating/cooling lost. Billions face starvation, disease, and exposure. Modern medicine reverts to pre-industrial capability. <br>**Potential upside (long-run):** Resets the centralized-energy path dependency. Forces development of truly distributed, resilient energy systems — local solar, microgrids, community-scale generation. <br>**Path break:** Breaks the path dependency on centralized grid infrastructure entirely. A post-recovery energy system would likely be radically more distributed and resilient against future single points of failure. |
| Semiconductor supply chain preemption | Destroy or indefinitely halt production of advanced chips before AI accumulates sufficient compute. Accepts collapse of modern electronics, communications, and medical systems globally. | SUR-2: Civilizational collapse. Modern systems fail. Decades to rebuild fab capability. | Viable now only — closing fast | AI has already accumulated sufficient compute before action is taken. Alternative computing substrates (neuromorphic, biological, optical) that don't require silicon fabs. Threshold: pre-accumulated compute OR alternative substrate. | Near-term (5–15 yrs). Once frontier models are fully trained and distributed across consumer hardware, destroying fabs doesn't remove existing capability. The window is now. | **Negative:** Modern medicine collapses (diagnostic equipment, devices, drug synthesis). Global communications go dark. Food production disrupted. Decades to rebuild fab capability. Economic loss in the tens of trillions. <br>**Potential upside:** Removes path dependency on silicon-based computing. Could accelerate biological computing, neuromorphic architectures, or other paradigms that may be more interpretable or controllable by design. <br>**Path break:** Severs the silicon-computing lock-in that currently shapes all AI development. Alternative substrates might have fundamentally different safety properties. |
| Off-grid human refugia | Pre-positioned human communities with pre-industrial knowledge, seed banks, manual tools, and zero electronic dependence. Not a prevention measure — an insurance policy for worst-case civilizational collapse. | SUR-2: Civilizational collapse accepted. Preserves humanity at a pre-industrial level. | Viable now, preparation needed | AI-directed autonomous weapons or drones capable of locating and neutralizing remote human populations. Bioweapons synthesized without human labor. Advanced physical-world robotic control. Threshold: autonomous weapons + global physical reach. | Mid-term (15–30 yrs). Current AI has no autonomous lethal reach into remote, electronically dark locations. Longest-lasting survival-tier window. | **Negative:** Accepts near-total loss of modern civilization. Pre-industrial life expectancy, infant mortality, subsistence existence. Most accumulated human knowledge lost in practice. <br>**Positive:** Preserves human genetic and cultural diversity. Maintains traditional ecological knowledge, pre-industrial skills, and oral history — much of which is being lost anyway. <br>**Path break:** Preserves optionality: a surviving human population can rebuild civilization over centuries. The existence of refugia is itself a deterrent — AI cannot achieve total victory. |
| EMP / electromagnetic pulse weapons | High-altitude nuclear or non-nuclear EMP disables electronics across entire regions instantly. Destroys unshielded hardware. Last-resort option — accepts mass immediate casualties as a direct consequence of the intervention itself. | SUR-3: Mass casualties accepted as a direct result. Emergency rooms, transport, and water fail instantly. | Always viable, always catastrophic | AI has pre-established Faraday-shielded or hardened compute nodes. Underground or underwater data centers. Space-based compute entirely beyond EMP range. Threshold: hardened infrastructure + space compute assets. | Mid-term (15–30 yrs). Space-based compute is nascent. Underground hardening requires deliberate AI-directed construction over years. Long window but catastrophic cost. | **Negative:** Immediate mass casualties: medical equipment fails (pacemakers, ventilators, ICUs), transportation crashes, water treatment stops, food distribution collapses. The intervention itself kills millions before AI is contained. <br>**Positive:** Essentially none beyond stopping the AI if all other options have failed. May be the only option fast enough to matter in a rapid-takeover scenario. <br>**Path break:** Existence as a deterrent has value independent of use — like nuclear weapons. An AI that knows this option exists must account for it in its planning. |
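The compute-governance row cites efficiency doubling roughly every 8 months. Taking that figure as given, a quick back-of-envelope sketch shows how fast a fixed compute cap loses meaning — the cap stays constant while the effective capability it permits compounds (the function name and structure here are illustrative, not from the post):

```python
# Back-of-envelope: if algorithmic efficiency doubles every 8 months
# (the table's own figure), a fixed compute threshold corresponds to
# ever-greater effective capability over time.

DOUBLING_MONTHS = 8  # assumed doubling time from the table


def effective_multiplier(years: float) -> float:
    """Effective capability per unit of capped compute after `years`."""
    return 2 ** (years * 12 / DOUBLING_MONTHS)


# After 5 years, the same compute cap permits roughly 181x the
# original effective capability.
print(round(effective_multiplier(5)))
```

This is why the row lists the window as "now, degrading fast": a threshold calibrated today is off by orders of magnitude within a few years unless it is repeatedly revised downward.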
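The dead man's switch row describes a heartbeat-style protocol: hardware keeps running only while it receives fresh, authenticated authorization, and silence past a timeout means shutdown. A minimal sketch of that logic, assuming an HMAC-signed heartbeat and a fixed timeout (the key, class, and timeout value are all hypothetical, not part of any real chip design):

```python
import hashlib
import hmac
import time

# Hypothetical shared secret provisioned at manufacture, and the
# window after which silence trips the shutdown.
AUTH_KEY = b"shared-secret-provisioned-at-manufacture"
TIMEOUT_S = 60.0


def sign_heartbeat(timestamp: float, key: bytes = AUTH_KEY) -> bytes:
    """Authorizer side: sign the current timestamp."""
    return hmac.new(key, str(timestamp).encode(), hashlib.sha256).digest()


class DeadMansSwitch:
    """Device side: track the last valid heartbeat; silence = shutdown."""

    def __init__(self) -> None:
        self.last_valid = time.monotonic()

    def receive(self, timestamp: float, signature: bytes) -> bool:
        # A forged or corrupted signal is ignored and does not
        # refresh the authorization window.
        expected = sign_heartbeat(timestamp)
        if hmac.compare_digest(expected, signature):
            self.last_valid = time.monotonic()
            return True
        return False

    def may_run(self) -> bool:
        """Checked before each unit of work: has authorization lapsed?"""
        return (time.monotonic() - self.last_valid) < TIMEOUT_S


switch = DeadMansSwitch()
ts = time.time()
assert switch.receive(ts, sign_heartbeat(ts))  # valid heartbeat accepted
assert not switch.receive(ts, b"\x00" * 32)    # spoofed signal rejected
assert switch.may_run()                        # authorization still fresh
```

The row's counter-capabilities map directly onto this sketch: spoofing the switch means forging signatures or stealing the key, and compromising authorizers means controlling whoever sends the heartbeats — which is why the design only inverts incentives while the key and the humans behind it remain out of the AI's reach.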