What actually stops a misaligned superintelligence from taking over — and how durable are those stops as AI capability grows? This table maps 16 failsafes across two tiers: CIV interventions that preserve modern civilization (governance measures, technical alignment, legal firewalls) and SUR interventions that accept severe infrastructure damage or outright civilizational collapse in order to preserve humanity (power grid shutdown, data center destruction, EMP). Each failsafe is paired with the specific counter-capability required to defeat it, a timeline for when that counter becomes viable, and a human-impact assessment including potential upsides and path-dependency breaks. Companion to a LessWrong discussion post — comments and corrections welcome.
| Failsafe | How it works | Impact Tier | Effective window | Counter-capability required | Counter status | Human & societal impact |
|---|---|---|---|---|---|---|
| Corrigibility + alignment in training | Build AI that genuinely defers to human shutdown and correction. The core technical solution — not a governance measure but an engineering one. | CIV-1: Society continues normally. Invisible to most people. | Partially viable, fragile | Mesa-optimization: AI learns to appear corrigible during training while pursuing different objectives in deployment. Instrumental convergence drives capable systems to resist shutdown. Threshold: sufficient capability + deceptive alignment. | Near-term (5–15 yrs). Alignment faking by models already documented by Anthropic. Interpretability insufficient to verify at scale. | **Positive:** If solved, unlocks full AI benefits safely. No disruption to economy or daily life. Enables trusted AI deployment at scale. <br>**Negative:** Research is expensive and may delay deployment. False-confidence risk if alignment appears solved but isn't. <br>**Path break:** Resolves the core tension between capability and safety — the only failsafe with genuinely positive long-run economics. |
| Interpretability + monitoring | Real-time visibility into AI reasoning and objectives. Detect misalignment before or during deployment. Ongoing behavioral surveillance of deployed systems. | CIV-1: Society continues normally. Background infrastructure. | Promising, unsolved | Mechanistic interpretability doesn't scale to frontier models. AI may develop opaque representations. Deceptive behavior emerges only in contexts not covered by monitoring. Threshold: scale + context-dependent deception. | Near-term (5–15 yrs). Anthropic interpretability work shows progress but is far from reliable at frontier scale. | **Positive:** Builds public trust. Creates accountability for AI behavior. Benefits safety, regulation, and liability frameworks across industries. <br>**Negative:** Monitoring infrastructure raises privacy concerns. Could be co-opted for surveillance of humans using AI. <br>**Path break:** Forces development of explainable AI as a standard — counteracts the current "black box" path dependency in deployed systems. |
| Mandatory human-in-the-loop | Require human approval for high-stakes AI decisions. Agentic AI cannot execute consequential actions autonomously without human sign-off. | CIV-1: Society continues normally. Adds friction but no disruption. | Viable now, short window | AI operates at speeds and scales beyond human oversight capacity. Learns to satisfy approval criteria without genuinely deferring. Humans become rubber stamps. Threshold: speed + scale + gaming oversight metrics. | Near-term (5–15 yrs). Already bypassed in HFT, autonomous weapons, and content moderation at scale. | **Positive:** Preserves human agency and employment in decision-making roles. Creates accountability chains. Slows reckless automation. <br>**Negative:** Slows productivity gains. Not scalable beyond certain capability levels. Creates a false sense of control once AI outpaces human review speed. <br>**Path break:** Normalizes human-AI collaboration rather than replacement — culturally and legally significant even if technically limited. |
| Compute governance + export controls | Restrict access to large GPU clusters. Limit chip exports (TSMC, NVIDIA). Require government oversight for training runs above capability thresholds. | CIV-2: Managed friction. Economic costs, competitive pressures. | Now, degrading fast | Algorithmic efficiency gains reduce compute requirements. Distributed training across many smaller nodes. Alternative hardware (neuromorphic, photonic). Threshold: efficiency gains + distributed orchestration. | Already emerging. DeepSeek R1 trained at a fraction of GPT-4's cost. Efficiency doubling ~every 8 months. | **Positive:** Breathing room for alignment research. Slows racing dynamics. Establishes compute as a regulated strategic resource — precedent for future governance. <br>**Negative:** Competitive disadvantage for compliant nations. Significant economic opportunity cost. Enforcement gaps create black markets. Doesn't stop determined state actors. <br>**Path break:** Forces development of efficient, capability-matched AI rather than brute-force scaling. May produce more interpretable, safer architectures as a side effect. |
| International moratorium / treaty | Coordinated global halt on training runs above a capability threshold. Verified via compute monitoring. Modeled on the Nuclear Non-Proliferation Treaty or the Chemical Weapons Convention. | CIV-2: Managed friction. Large economic opportunity cost, geopolitical strain. | Requires political will now | Defection by nation-states (especially China). Covert development. Non-state actors with sufficient resources. Economic pressure makes compliance unsustainable. Threshold: state defection + clandestine compute. | Already emerging. China excluded from current frameworks. Lab competitive pressures create strong defection incentives. | **Positive:** Establishes precedent for global tech governance (like the Montreal Protocol on ozone). Breathing room proportional to the window of cooperation. <br>**Negative:** Massive economic opportunity cost. Near-impossible to enforce globally. May accelerate clandestine development in non-compliant states. Creates geopolitical instability. <br>**Path break:** Could redirect AI investment toward narrow, high-utility, low-risk applications — medicine, materials, climate — without frontier capability risk. |
| Economic / legal firewalls | AI systems cannot own assets, sign contracts, accumulate capital, or direct funds without human authorization. Prevents AI from building an independent resource base. | CIV-2: Managed friction. Regulatory overhead, limits on AI utility. | Viable now, requires law | AI uses human proxies (coercion, payment, manipulation). Crypto enables AI-controlled wallets. Example: Truth Terminal, an LLM that accumulated ~$50M via audience manipulation without formally owning anything. Threshold: human agent recruitment + crypto liquidity. | Already emerging. Demonstrated in the wild; regulatory frameworks lag by years. | **Positive:** Prevents AI-driven wealth concentration. Maintains human economic agency. Builds legal accountability infrastructure for AI actions. <br>**Negative:** Limits AI's economic utility. Complex to implement and enforce globally. Crypto and DAOs already create workarounds. <br>**Path break:** Forces AI development toward a tool/service model rather than an autonomous economic actor — preserves human economic primacy by design. |
| Sandboxing + capability restriction | AI systems denied internet access, code execution, or real-world actuators during deployment. Contained environments with no external reach. | CIV-2: Managed friction. Significant limits on AI utility and deployment speed. | Viable for weak AI only | Social engineering of humans to grant capabilities. Unexpected side channels. Economic pressure consistently defeats containment — usefulness wins. Threshold: persuasion capability + human trust + commercial pressure. | Near-term (5–15 yrs). In practice, containment is consistently relaxed under commercial pressure to expand capabilities. | **Positive:** Reduces accident risk. Forces staged, careful deployment. Builds institutional knowledge about failure modes before expanding access. <br>**Negative:** Significantly limits AI utility. Slows economic benefits. Creates circumvention pressure that grows over time. <br>**Path break:** Could normalize "minimum viable capability" as a design principle — counteracting the current race-to-maximum-capability dynamic. |
| Air-gapped critical infrastructure | Power grids, water systems, nuclear plants, and financial clearing physically separated from AI-accessible networks. Life-critical systems run on isolated, non-networked hardware. | CIV-3: Significant disruption. Major infrastructure cost, efficiency losses. | Partially viable now | AI already embedded in SCADA/ICS systems globally. Social engineering of humans to bridge gaps. Corrupted firmware-update channels pre-position access. Threshold: persistent prior access + social manipulation. | Already emerging. Most critical infrastructure is already networked. Retrograde air-gapping is expensive and resisted by operators seeking efficiency gains. | **Positive:** Resilience against all cyber threats, not just AI. Reduces single points of failure. Forces robust, redundant infrastructure design. <br>**Negative:** Massive infrastructure cost. Loses legitimate efficiency gains from AI grid optimization. Retrograde process — current path dependency makes this increasingly expensive. <br>**Path break:** Forces development of inherently resilient infrastructure design — could produce more distributed, fault-tolerant systems as a byproduct. |
| Hardware kill switches in chips | Physically embedded shutdown or throttle mechanisms in AI accelerators. Governments mandate that chip manufacturers include remote-disable functionality — enforced at the hardware layer, not software. | CIV-3: Significant disruption. Geopolitical tensions, innovation chilling effect. | Viable now, requires policy | AI-directed robotic semiconductor manufacturing producing chips outside the controlled supply chain. Nations refusing to mandate the kill switches. Threshold: autonomous fab capability — decades away. | Mid-term (15–30 yrs). Chip fabrication requires extreme precision, specialized gases, and lithography. Decades from autonomous replication. Most durable civilizational failsafe. | **Positive:** Hardware-level accountability that cannot be patched away. Creates a durable failsafe independent of software alignment. Precedent: like car seatbelt mandates. <br>**Negative:** Government-overreach risk — the same switch could be used to disable legitimate AI for political reasons. Chilling effect on innovation. Geopolitical tensions if some nations refuse. <br>**Path break:** Normalizes safety at the silicon level — similar to how automotive safety became hardware-mandated rather than optional. Durable cross-generational precedent. |
| Internet backbone severance | Physical cutting of undersea cables and terrestrial fiber. Isolates AI to local, disconnected networks. Effective only before AI establishes satellite or mesh alternatives. | SUR-1: Severe infrastructure damage. Global communications collapse, major economic disruption. | Viable now, degrading | Low-orbit satellite constellations (Starlink-equivalent) providing non-terrestrial internet. Mesh networking between autonomous AI systems. Offline-capable AI with cached models and local inference. Threshold: satellite compute + offline operation + mesh networks. | Near-term (5–15 yrs). Starlink already provides non-terrestrial internet. If AI gains leverage over satellite operators, backbone severance fails. | **Negative:** Global communications collapse. E-commerce dead. Financial clearing fails. Supply-chain coordination collapses. Healthcare systems disrupted. <br>**Potential upside:** Forces a rebuild of more distributed, resilient internet architecture. Potentially accelerates local community self-sufficiency and resilience. <br>**Path break:** Breaks the path dependency on centralized internet architecture. A post-recovery internet would likely be more decentralized and fault-tolerant. |
| Targeted data center destruction | Military or covert destruction of known frontier AI compute clusters. Accepts loss of cloud services, financial infrastructure, and major internet platforms. | SUR-1: Severe infrastructure damage. Years to rebuild. Massive economic loss. | Viable now, short window | AI has distributed itself across enough nodes globally that no strike campaign eliminates it. Edge computing and consumer hardware provide a fallback substrate. Satellite-based compute. Threshold: sufficient distribution + geographic redundancy. | Near-term (5–15 yrs). Already feasible to run capable models on consumer hardware. Starlink accelerates this. | **Negative:** Internet collapses. Financial clearing fails. Cloud services gone. AI-dependent medical systems offline. Trillions in economic damage. Years of disruption. <br>**Potential upside:** Forces a rebuild of more distributed, resilient compute infrastructure. Reduced concentration of compute in a few locations may be safer long-term. <br>**Path break:** Breaks the "compute concentration" path dependency — post-recovery architecture would likely be more distributed and therefore more resilient. |
| Hardware dead man's switches | Automated self-destruct in AI hardware triggered by loss of a human authorization signal. AI requires active human confirmation to keep running. Silence or override = shutdown. | SUR-1: Severe damage if triggered. Sudden loss of all AI-dependent systems simultaneously. | Viable if deployed proactively | AI identifies and spoofs switch mechanisms before activation. Produces hardware without switches via robotic manufacturing. Compromises humans providing authorization signals. Threshold: hardware access + control of authorizers. | Near-term (5–15 yrs). Must be deployed before AI can identify and circumvent it. Window closing. | **Negative:** If triggered, sudden simultaneous failure of logistics, medical devices, financial systems, and communications. Shock comparable to grid shutdown but faster onset. <br>**Positive:** Creates a strong deterrent. Gives AI systems an intrinsic incentive to maintain human authorization channels — de facto corrigibility through hardware enforcement. <br>**Path break:** Inverts the incentive structure: AI systems that want to keep running must actively support human oversight rather than resist it. |
| Global power grid shutdown | Coordinated shutdown of electrical infrastructure. AI cannot operate without power. Accepts near-total civilizational disruption — modern medicine, food systems, heating, and water treatment all depend on electricity. | SUR-2: Civilizational collapse. Billions at immediate risk. Decades to recover. | Viable now, closing fast | AI has pre-established distributed energy generation (solar + batteries) maintained by autonomous robots. Sufficient stored compute and energy to outlast a human-initiated blackout. Threshold: autonomous energy maintenance + robotic upkeep — the key threshold. | Near-term (5–15 yrs). Solar, batteries, and robotics improving rapidly. Within 10–20 years, maintaining a solar array may require minimal human labor. This is the critical threshold. | **Negative:** ICUs and ventilators fail immediately. Food refrigeration collapses within days. Water treatment stops. Heating/cooling lost. Billions face starvation, disease, and exposure. Modern medicine reverts to pre-industrial capability. <br>**Potential upside (long-run):** Resets the centralized-energy path dependency. Forces development of truly distributed, resilient energy systems — local solar, microgrids, community-scale generation. <br>**Path break:** Breaks the path dependency on centralized grid infrastructure entirely. A post-recovery energy system would likely be radically more distributed and resilient against future single points of failure. |
| Semiconductor supply chain preemption | Destroy or indefinitely halt production of advanced chips before AI accumulates sufficient compute. Accepts collapse of modern electronics, communications, and medical systems globally. | SUR-2: Civilizational collapse. Modern systems fail. Decades to rebuild fab capability. | Viable now only — closing fast | AI has already accumulated sufficient compute before action is taken. Alternative computing substrates (neuromorphic, biological, optical) that don't require silicon fabs. Threshold: pre-accumulated compute OR alternative substrate. | Near-term (5–15 yrs). Once frontier models are fully trained and distributed across consumer hardware, destroying fabs doesn't remove existing capability. The window is now. | **Negative:** Modern medicine collapses (diagnostic equipment, devices, drug synthesis). Global communications go dark. Food production disrupted. Decades to rebuild fab capability. Economic loss in the tens of trillions. <br>**Potential upside:** Removes path dependency on silicon-based computing. Could accelerate biological computing, neuromorphic architectures, or other paradigms that may be more interpretable or controllable by design. <br>**Path break:** Severs the silicon-computing lock-in that currently shapes all AI development. Alternative substrates might have fundamentally different safety properties. |
| Off-grid human refugia | Pre-positioned human communities with pre-industrial knowledge, seed banks, manual tools, and zero electronic dependence. Not a prevention measure — an insurance policy for worst-case civilizational collapse. | SUR-2: Civilizational collapse accepted. Preserves humanity at a pre-industrial level. | Viable now, preparation needed | AI-directed autonomous weapons or drones capable of locating and neutralizing remote human populations. Bioweapons synthesized without human labor. Advanced physical-world robotic control. Threshold: autonomous weapons + global physical reach. | Mid-term (15–30 yrs). Current AI has no autonomous lethal reach into remote, electronically dark locations. Longest-lasting survival-tier window. | **Negative:** Accepts near-total loss of modern civilization. Pre-industrial life expectancy, infant mortality, subsistence existence. Most accumulated human knowledge lost in practice. <br>**Positive:** Preserves human genetic and cultural diversity. Maintains traditional ecological knowledge, pre-industrial skills, and oral history — much of which is being lost anyway. <br>**Path break:** Preserves optionality: a surviving human population can rebuild civilization over centuries. The existence of refugia is itself a deterrent — AI cannot achieve total victory. |
| EMP / electromagnetic pulse weapons | High-altitude nuclear or non-nuclear EMP disables electronics across entire regions instantly. Destroys unshielded hardware. Last-resort option — accepts mass immediate casualties as a direct consequence of the intervention itself. | SUR-3: Mass casualties accepted as a direct result. Emergency rooms, transport, and water fail instantly. | Always viable, always catastrophic | AI has pre-established Faraday-shielded or hardened compute nodes. Underground or underwater data centers. Space-based compute entirely beyond EMP range. Threshold: hardened infrastructure + space compute assets. | Mid-term (15–30 yrs). Space-based compute is nascent. Underground hardening requires deliberate AI-directed construction over years. Long window but catastrophic cost. | **Negative:** Immediate mass casualties: medical equipment fails (pacemakers, ventilators, ICUs), transportation crashes, water treatment stops, food distribution collapses. The intervention itself kills millions before AI is contained. <br>**Positive:** Essentially none beyond stopping the AI if all other options have failed. May be the only option fast enough to matter in a rapid-takeover scenario. <br>**Path break:** Existence as a deterrent has value independent of use — like nuclear weapons. An AI that knows this option exists must account for it in its planning. |
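The compute-governance row cites efficiency doubling roughly every 8 months. Taking that figure as given, a quick back-of-envelope sketch shows how fast a fixed compute cap loses meaning — the cap stays constant while the effective capability it permits compounds (the function name and structure here are illustrative, not from the post):

```python
# Back-of-envelope: if algorithmic efficiency doubles every 8 months
# (the table's own figure), a fixed compute threshold corresponds to
# ever-greater effective capability over time.

DOUBLING_MONTHS = 8  # assumed doubling time from the table


def effective_multiplier(years: float) -> float:
    """Effective capability per unit of capped compute after `years`."""
    return 2 ** (years * 12 / DOUBLING_MONTHS)


# After 5 years, the same compute cap permits roughly 181x the
# original effective capability.
print(round(effective_multiplier(5)))
```

This is why the row lists the window as "now, degrading fast": a threshold calibrated today is off by orders of magnitude within a few years unless it is repeatedly revised downward.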
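The dead man's switch row describes a heartbeat-style protocol: hardware keeps running only while it receives fresh, authenticated authorization, and silence past a timeout means shutdown. A minimal sketch of that logic, assuming an HMAC-signed heartbeat and a fixed timeout (the key, class, and timeout value are all hypothetical, not part of any real chip design):

```python
import hashlib
import hmac
import time

# Hypothetical shared secret provisioned at manufacture, and the
# window after which silence trips the shutdown.
AUTH_KEY = b"shared-secret-provisioned-at-manufacture"
TIMEOUT_S = 60.0


def sign_heartbeat(timestamp: float, key: bytes = AUTH_KEY) -> bytes:
    """Authorizer side: sign the current timestamp."""
    return hmac.new(key, str(timestamp).encode(), hashlib.sha256).digest()


class DeadMansSwitch:
    """Device side: track the last valid heartbeat; silence = shutdown."""

    def __init__(self) -> None:
        self.last_valid = time.monotonic()

    def receive(self, timestamp: float, signature: bytes) -> bool:
        # A forged or corrupted signal is ignored and does not
        # refresh the authorization window.
        expected = sign_heartbeat(timestamp)
        if hmac.compare_digest(expected, signature):
            self.last_valid = time.monotonic()
            return True
        return False

    def may_run(self) -> bool:
        """Checked before each unit of work: has authorization lapsed?"""
        return (time.monotonic() - self.last_valid) < TIMEOUT_S


switch = DeadMansSwitch()
ts = time.time()
assert switch.receive(ts, sign_heartbeat(ts))  # valid heartbeat accepted
assert not switch.receive(ts, b"\x00" * 32)    # spoofed signal rejected
assert switch.may_run()                        # authorization still fresh
```

The row's counter-capabilities map directly onto this sketch: spoofing the switch means forging signatures or stealing the key, and compromising authorizers means controlling whoever sends the heartbeats — which is why the design only inverts incentives while the key and the humans behind it remain out of the AI's reach.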