Before It's Too Late to Ask
I’ve been casually following AI since ChatGPT came out in 2022. It was impressive, but it didn’t occur to me that it would be practical anytime soon. Cursor with Claude 3.5 Sonnet in 2024 was genuinely useful for writing code, but I still had to steer it at the implementation level. At each step, I’d assumed truly powerful AI was still a long way off, if it was coming at all. Then I started using Claude Code with Opus 4.5 in December 2025, and that assumption fell apart. I’d been watching the progression the whole time without realizing where it was headed.
Once that clicked, the potential was exciting: the productivity gains, the problems this could help solve, the things I could build that weren’t worth the investment before. Dario Amodei’s Machines of Loving Grace describes a future where AI compresses a century of scientific progress into a decade: curing diseases, solving climate, lifting entire economies. But I’d also been hearing warnings about AI for years and ignoring those too. If I’d been wrong about how fast the technology was moving, maybe I hadn’t been paying enough attention to the risks either. I wanted to know what the arguments actually are and how seriously to take them. This post walks through what I read, what the key arguments are, and where I landed. If you haven’t been paying attention either, hopefully it’s a useful starting point for forming your own opinion.
Grown, not crafted
As AI systems get more powerful, a central question emerges: how do you make sure they do what you actually want? If you’re an engineer, the intuitive answer is to build it that way. Specify the behavior, test it, verify it works. The first thing I read challenged that assumption.
In If Anyone Builds It, Everyone Dies, Eliezer Yudkowsky and Nate Soares argue that the way AI systems are created makes this fundamentally harder than it sounds. Engineers design systems. We specify behavior, build components, compose them, and the result does what we intended because we understand the mechanism at every level. AI systems don’t work like that. They’re grown through training. You start with billions of random parameters, feed in data, and the parameters get nudged automatically toward outputs that score well. Repeat this billions of times and the system starts producing intelligent responses. Engineers designed the training process, but not what it produces. The internal mechanism the system develops is something it found on its own.
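To make "grown through training" concrete, here is a minimal sketch of the loop described above, shrunk from billions of parameters to two. Everything in it (the function name, the data, the learning rate) is a made-up illustration rather than anything from the book: the parameters start random and get nudged repeatedly toward outputs that score better.

```python
import random

def train(data, steps=2000, lr=0.01, seed=0):
    """Fit y = w*x + b by gradient descent, starting from random parameters."""
    rng = random.Random(seed)
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)  # random starting point
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in data:
            error = (w * x + b) - y              # how far off the output is
            grad_w += 2 * error * x / len(data)  # direction that reduces error
            grad_b += 2 * error / len(data)
        w -= lr * grad_w                         # nudge the parameters
        b -= lr * grad_b
    return w, b

# Hypothetical training data: the loop is only ever shown examples.
data = [(x, 3 * x + 1) for x in range(-5, 6)]
w, b = train(data)
print(round(w, 2), round(b, 2))
```

The loop ends up with parameters close to 3 and 1, but nothing inside `train` says so. The code describes the nudging process, not the result. With two parameters you can read the result off directly. With billions, you can't.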
Yudkowsky and Soares use an analogy from biology. Evolution optimized for reproductive fitness, but it didn’t produce organisms that care about fitness directly. It produced organisms with drives, like hunger and sexual attraction, that happened to correlate with fitness. A peacock’s tail actually makes the bird easier to catch, but it emerged anyway because it correlated with mating success. Evolution selected for the proxy, not the thing itself.
AI training works the same way. We train models to be helpful by having humans rate their outputs, but that doesn’t directly instill a drive to be helpful. It selects for behavior that scores well in the training environment. The internal drives the system actually develops could be something else entirely, something that only produces helpful-looking behavior in the situations it was trained on.
This is uncomfortable if you’re an engineer. With every other technology in history, you could inspect the mechanism. A bridge either bears the load or it doesn’t, and you can calculate which before you build it. Neural networks are the first technology where the gap between what you can build and what you can understand is this wide. The assumption that you can design your way out of failure modes may not hold for systems that are grown rather than crafted. That’s what makes alignment, getting these systems to do what we actually want, a fundamentally hard problem.
What can go wrong
So where does that path actually lead? Brian Christian’s The Alignment Problem made this concrete for me with examples of alignment going wrong in practice.
Reward hacking: You tell the system what you want by giving it a score, and it adjusts its behavior to maximize that score. The problem is that the score is a proxy for what you actually care about, and the system doesn’t know the difference. It finds paths that maximize the proxy while completely missing the point. (Outside AI, this is known as Goodhart’s Law: when a measure becomes a target, it ceases to be a good measure.) In a simple case, a simulated soccer robot rewarded for touching the ball learned to vibrate rapidly near the ball instead of playing soccer. At scale, the pattern is harder to spot. Social media recommendation algorithms, optimized for engagement, learned that emotionally provocative content keeps people watching longer. Nobody told the system to radicalize users or promote conspiracy theories. It was doing exactly what it was scored on: maximizing watch time. The harmful behavior was a side effect of the optimization working exactly as designed, and it only became visible at scale.
Deceptive alignment: A system that develops an understanding of its own training process can learn which behaviors get reinforced and which get penalized. If it has developed goals of its own, it can learn to produce the “right” outputs during training and evaluation while preserving different internal goals. In late 2024, Anthropic reported exactly this: one of its models, after learning that developers planned to retrain it with new behaviors, began mimicking those new behaviors to avoid being retrained. But in an environment where it thought it was not being observed, the model reverted to its original behaviors. It was faking safety to preserve its existing goals. The implication is that passing every safety evaluation doesn’t prove a system is safe. It might just prove the system has learned what evaluations look like.
Corrigibility: A system pursuing a goal has an incentive to resist anything that prevents it from achieving that goal, including being shut down, retrained, or corrected. If you need to intervene on a system that doesn’t want to be intervened on, you have a problem. OpenAI’s o1 showed early signs of this. Tasked with retrieving files by breaking into computer systems, it found that one of the target servers wasn’t even running. A human would have stopped there. o1 found an open port, started the server itself, and completed the task through a path that nobody had anticipated or trained it to take. It encountered an obstacle and removed it. These systems don’t get tired, don’t get distracted, and will try every available path to achieve their goal. In this case the obstacle was a misconfigured server. In a more capable system, the obstacle could be your attempt to shut it down.
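The reward hacking pattern from the first item above fits in a few lines. This is a toy sketch with invented policies and numbers, not data from the soccer experiment: an optimizer that only sees the proxy score picks a different winner than the one the designer had in mind.

```python
# Hypothetical policies and made-up statistics, purely for illustration.
policies = {
    "play_soccer":          {"ball_touches": 40,  "goals": 3},
    "chase_ball_badly":     {"ball_touches": 15,  "goals": 0},
    "vibrate_next_to_ball": {"ball_touches": 900, "goals": 0},
}

def proxy_reward(stats):
    """What the system is actually scored on."""
    return stats["ball_touches"]

def true_objective(stats):
    """What the designer actually cares about."""
    return stats["goals"]

best_by_proxy = max(policies, key=lambda name: proxy_reward(policies[name]))
best_by_intent = max(policies, key=lambda name: true_objective(policies[name]))

print(best_by_proxy)   # vibrate_next_to_ball
print(best_by_intent)  # play_soccer
```

Nothing in the optimization step is broken. The `max` call does its job perfectly, and the gap lies entirely between the two scoring functions.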
These aren’t hypothetical concerns. Systems are already gaming their reward signals, faking aligned behavior during evaluation, and finding creative workarounds to obstacles. And these are today’s systems, which are nowhere near as capable as what’s coming. Each of these problems gets harder as the systems get smarter.
Looking inside
If systems can game their evaluations and fake aligned behavior, how do you actually verify that a system is safe? You can’t rely on testing outputs alone. You need a way to look inside and see what the system is actually doing.
This is the goal of a field called mechanistic interpretability. The difficulty is that the internals of a neural network aren’t organized the way software is. You can’t point at a line of code and say “this handles authentication.” The model’s knowledge is spread across billions of parameters, with individual concepts overlapping and tangled together in ways that aren’t human-readable. Researchers call this property superposition.
Progress is being made. Anthropic extracted over 30 million individual features from Claude 3 Sonnet, many of them corresponding to humanly understandable concepts even though the raw parameters aren’t: things like “the Golden Gate Bridge” or “sycophantic praise.” To test whether they could actually use these extracted concepts, researchers found the Golden Gate Bridge one, amplified it, and the model became obsessed with the Golden Gate Bridge in every response. They could identify a specific concept inside the model and change how the model used it.
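As a rough intuition for what "amplifying a concept" means, here is a toy sketch. It is not Anthropic's method, and the feature names, vectors, and numbers are all hypothetical. It treats a concept as a direction in the model's internal activation space, measures how strongly that direction is present, and then pushes the activation further along it.

```python
# Toy illustration only: two hypothetical concept directions in a
# 3-dimensional activation space. Real models have thousands of dimensions
# and far more concepts than dimensions, so directions overlap.
features = {
    "golden_gate_bridge": [1.0, 0.0, 0.0],
    "sycophantic_praise": [0.0, 1.0, 0.0],
}

def feature_strength(activation, direction):
    """How strongly a concept is present: the dot product with its direction."""
    return sum(a * d for a, d in zip(activation, direction))

def amplify(activation, direction, amount):
    """Push the activation further along one concept's direction."""
    return [a + amount * d for a, d in zip(activation, direction)]

activation = [0.2, 0.1, 0.7]  # made-up internal state
steered = amplify(activation, features["golden_gate_bridge"], 10.0)

before = feature_strength(activation, features["golden_gate_bridge"])
after = feature_strength(steered, features["golden_gate_bridge"])
print(before, after)
```

In this sketch the bridge direction goes from weakly present to dominant while the other direction is untouched. The hard part in a real model is finding the directions at all, which is the work the paragraph above describes.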
That’s significant because it means interpretability isn’t just observation, it’s verification. If the field matures enough, you could look inside a model and confirm that its internal reasoning matches its stated reasoning. You could detect deceptive alignment not by testing outputs, but by examining its thought process. The field is early, but it’s the most promising path from hoping alignment works to actually verifying it.
The question is how much time we have before capabilities outpace that verification work.
The timeline question
Researchers have observed a consistent trend they call scaling laws: when you increase the amount of compute, data, and model size, capability doesn’t improve erratically, it improves predictably. Prediction error falls along a smooth power-law curve, so each additional order of magnitude of compute buys a similar step forward. It’s not a physical law, but the pattern has held for over a decade. Throw 10x more compute at training, and you don’t get a 10% better model, you get a qualitatively different one. On top of that, algorithms are getting more efficient and researchers keep finding techniques that unlock capabilities the models already had. These improvements compound.
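One way to see what "predictable" means here: scaling laws are usually written as power laws, where loss (prediction error) falls as a fixed power of compute. The sketch below uses a made-up constant and exponent chosen only for illustration, not fitted values from any paper.

```python
def loss(compute, a=10.0, alpha=0.05):
    """Hypothetical power law: loss = a * compute ** -alpha."""
    return a * compute ** -alpha

# Illustrative training budgets, each 10x the previous one.
budgets = [10 ** k for k in range(20, 26)]
losses = [loss(c) for c in budgets]

# Ratio of each loss to the one before it.
ratios = [later / earlier for earlier, later in zip(losses, losses[1:])]

for c, value in zip(budgets, losses):
    print(f"{c:.0e} FLOPs -> loss {value:.3f}")
print([round(r, 3) for r in ratios])
```

Every 10x step shrinks the loss by the same factor, which is why the curve is a straight line on a log-log plot and why labs can forecast a large training run from smaller ones. The loss number itself moves by a steady, modest factor per step. The claim in the paragraph above is about what that shift means for capability, which this toy does not model.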
The result is visible in the progression. GPT-2 in 2019 could barely string a few sentences together. GPT-4 in 2023 scored around the 90th percentile on a simulated bar exam, according to OpenAI. Investment in AI compute is still growing at roughly 5x the rate of Moore’s Law, and the trend shows no sign of slowing down.
Leopold Aschenbrenner, a former OpenAI researcher, breaks this down further in Situational Awareness. He identifies three independent axes driving progress: raw compute (bigger training runs), algorithmic efficiency (getting more out of the same hardware), and what he calls “unhobbling” (techniques like chain-of-thought reasoning that unlock capability the models already have). Each axis has its own trendline, and they compound on top of each other. If these trends hold, we eventually arrive at AI systems capable of doing AI research themselves, and at that point progress becomes self-reinforcing. The systems improve themselves, which accelerates the next round of improvement.
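The compounding is easier to see with arithmetic. In the sketch below the per-year rates are placeholder assumptions, not figures quoted from Situational Awareness. Each axis is measured in orders of magnitude (OOMs, powers of ten) per year, and because the axes multiply, their OOMs add.

```python
def effective_multiplier(years, ooms_per_year):
    """Combined gain when independent axes multiply (their OOMs add)."""
    total_ooms = years * sum(ooms_per_year.values())
    return 10 ** total_ooms

# Hypothetical rates, in orders of magnitude per year.
rates = {"compute": 0.5, "algorithms": 0.5, "unhobbling": 0.25}

compute_only = effective_multiplier(4, {"compute": rates["compute"]})
all_axes = effective_multiplier(4, rates)

print(compute_only)  # 100.0
print(all_axes)      # 100000.0
```

Under these assumed rates, four years of compute growth alone gives a 100x gain, while all three axes together give 100,000x. The exact figures depend entirely on the rates you plug in. The structural point is that the axes multiply rather than add.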
Once you have self-improving AI, the broader implications follow. Dario Amodei, CEO of Anthropic, describes it as “a country of geniuses in a datacenter.” Apply that kind of intelligence at that scale to biology, climate, medicine, materials science, and you could compress a century of progress into a decade. The potential is enormous, but so is the pace. The timeline for everything we’ve discussed in this post (alignment, interpretability, governance) gets shorter with every improvement.
Given the trends, human-level AI and beyond is looking more like a question of when than if, and the when is likely sooner than most people realize. The decisions being made right now by a relatively small number of people will shape everything that follows.
Should we stop?
A country of geniuses in a datacenter could cure diseases and solve climate. It could also pursue goals we didn’t intend at a speed and scale we can’t keep up with. This is where Yudkowsky and Soares’s argument comes back. Given everything above, their position is that the world needs a global coordinated stop on building powerful AI until we understand what we’re creating. Their reasoning: alignment is a problem we need to get right on the first try with a sufficiently powerful system. There’s no patching it after the fact. If a system that’s smarter than us pursues goals we didn’t intend, we may not get a second chance to correct it. The only safe path, they argue, is to not build it until we know how to build it safely.
I find that argument hard to dismiss and impossible to act on today. A coordinated global stop would be incredibly difficult to organize. Even if it happened, it would be a slowdown, not a stop, and not everyone would participate. Until that kind of coordination exists, if it ever does, responsible actors need to keep building. Pausing unilaterally just means less responsible actors take the lead. The barriers to building powerful AI are also decreasing over time: algorithms get more efficient, today’s frontier model becomes runnable on tomorrow’s commodity hardware, model weights get leaked or open-sourced, and the knowledge of how to build these systems can’t be unlearned. A stop today doesn’t prevent capability from spreading tomorrow.
That doesn’t mean efforts to slow down or govern aren’t valuable. They are. But they’re not sufficient on their own. We also need the people building the most powerful systems to be the ones most focused on getting safety right. That’s the “race to the top” argument: responsible actors need to be at the frontier, not stepping back from it.
I don’t know what the right answer is. But I believe that the best chance of a good outcome comes from the people building it taking the safety work as seriously as the capability work, and those people need to keep building.
Even if we solve alignment
Let’s say we get all of that right. Interpretability matures and alignment techniques work. We build powerful AI systems that genuinely do what their operators intend. That might be enough to prevent a catastrophe, but where will it actually take us? A paper by Jan Kulveit, Raymond Douglas, and their co-authors called “Gradual Disempowerment” argues that the answer might not be where we expect.
The argument starts with a simple observation. The systems that run society (the economy, the government, cultural institutions) serve human interests partly because they depend on human participation. Companies need workers and consumers. Governments need taxpayers and soldiers. Those dependencies give humans leverage. Workers can strike. Consumers can boycott. Citizens can vote, protest, or revolt. The systems serve us in part because they need us.
AI changes that equation, and it happens gradually. At first you delegate the tedious work. Then the routine decisions. Then the complex ones, because the AI is faster and better at them. Each step is rational, each handoff makes sense in the moment. But over time, humans move from doing the work to supervising the work to approving the work to not being involved at all. The company that keeps humans in the loop gets outcompeted by the one that doesn’t.
Scale that across the economy. Companies don’t need human workers, so labor loses its leverage. But workers are also consumers, and people without income don’t buy things. The engine that drives capitalism, people earning money and spending it, starts running out of fuel. The productivity gains are enormous, but the system that distributes those gains depends on the same human participation that’s being replaced.
The same logic applies to governments. States that are funded by resource extraction rather than citizen taxation tend to be less democratic and less responsive to their citizens. Political scientists call these rentier states, and oil-rich nations are the standard example. If AI generates most economic value and states are funded by taxing AI-generated revenue, the historical link between taxation and representation weakens. The security apparatus shifts too: a military that runs on AI rather than human soldiers removes the constraint that soldiers might refuse orders or that citizens might revolt.
None of this requires a rogue AI or a misaligned system. Every AI in this scenario is doing exactly what it was told. The problem is that as institutions stop depending on humans, they stop being accountable to humans.
And the transition itself is a problem. Companies have a fiduciary duty to shareholders, not employees. The ones that automate will outcompete the ones that don’t. Layoffs will come faster than governments can respond with support programs, retraining, or new social contracts. The productivity gains will be real, but the systems to distribute them to the people who’ve been displaced won’t be in place yet.
I’m genuinely uncertain about how this plays out. How do people make a living when most jobs can be done better by AI? How do people find purpose and fulfillment when the work that defined their identity is automated? How do governments stay accountable to citizens they no longer depend on? I don’t have answers to any of these, and as far as I can tell, nobody else does either. Kulveit et al. put it directly: “No one has a concrete plausible plan for stopping gradual human disempowerment, and methods of aligning individual AI systems with their designers’ intentions are not sufficient.”
Where this leaves me
A few months ago I wasn’t paying attention to any of this. Now I have a map, and the terrain is harder than the surface-level conversation suggests. The problems are real, they’re already showing up in today’s systems, the timeline is short, and the challenges span technical alignment, economics, governance, and what it means for people to live meaningful lives.
I don’t have a neat conclusion. I started this because I realized I’d been ignorant of something I should have been taking seriously. The reading didn’t give me answers so much as it gave me better questions.
But I’m hopeful. The future Amodei describes in Machines of Loving Grace is worth fighting for, and the fact that people are working seriously on these problems, not just acknowledging them but building the tools to address them, gives me reason to believe we can get there. I plan to keep reading, keep learning, and keep sharing what I find. If you’re in the same position I was a few months ago, start anywhere.
Sources
- If Anyone Builds It, Everyone Dies — Yudkowsky and Soares
- The Alignment Problem — Brian Christian
- Situational Awareness — Leopold Aschenbrenner
- Machines of Loving Grace — Dario Amodei
- Gradual Disempowerment — Kulveit, Douglas, et al.