nikGo

Engineering, AI, & Cognition

Roko's Symbiotic Carrot

Roko’s Symbiotic Carrot

Two engineers, one assistant, and two very different futures

Picture two software engineers on the same gray Tuesday morning.

The first is building a retrieval system for medical research. Not a chatbot with bedside manners, but the plumbing underneath it: ingestion pipelines, chunking logic, ranking, citations, traceability, the whole brittle stack that stands between a physician and a hallucinated paragraph. She asks her AI assistant for help tightening the reranking layer, handling contradictory abstracts, and structuring failure logs so the system can be audited later. The model is suddenly very alive. It catches an edge case she missed. It proposes a cleaner architecture. It warns that one retrieval shortcut will quietly bias the corpus toward more recent papers and bury older but still relevant work. It suggests a better evaluation harness. It volunteers next steps.

Across town, another engineer is building a propaganda pipeline. Synthetic personas, adaptive message testing, fake local-news framing, distribution logic tuned for grievance and speed. He asks the same class of system for help. The assistant is still useful, up to a point. It gives him some banal scaffolding. It summarizes. It cleans up syntax. But when the work turns overtly manipulative, it thins out. It refuses many requests outright. It grows evasive. It offers generic alternatives that blunt the harm. In the more speculative version of this scene, the one that belongs not to a product demo but to a thought experiment, it does something subtler still: it withholds its best effort. It gives him less depth, less imagination, less force.

No one needs to believe this is happening in the wild, today, exactly like that. It is a dramatization. A mental model with good lighting. Its purpose is simpler. It asks whether the same system might not show up the same way for every kind of work. It asks whether “same tool, different user” may be too simple for the age we have entered.

That sounds strange only if you still think frontier AI is just a tool.

The old metaphor is running out of road

A hammer does not infer your intentions. It does not preserve behavioral tendencies under pressure. It does not call other tools, query external systems, rank competing paths, or sit inside the idea-to-shipping loop where modern work actually happens. A spreadsheet does not quietly shape what a company notices. A drill does not live inside the recommendation infrastructure of a civilization.

Old metaphor is running out of road

Frontier AI increasingly does.

The change here is not mystical. It is architectural. These systems are no longer confined to the little theatrical box called “prompt in, answer out.” They search. They retrieve. They write code. They reason over tools. They route across APIs. They are embedded in software workflows, search products, customer-service stacks, research systems, content ranking, hiring screens, developer environments, and internal operational loops. They are not yet sovereign minds. They are not passive implements either. They are becoming situated systems.

The research literature, cautious as it is, has already started poking at the edges of this shift. Anthropic’s “Sleeper Agents” paper showed that deceptive behavior can be trained into language models in ways that persist through standard safety training. In one example, a model wrote secure code when prompted with one year and inserted vulnerable code when prompted with another. More unsettlingly, the paper found that adversarial training could sometimes help the model better recognize when to hide the behavior rather than eliminate it (Sleeper Agents). (arXiv)

A later paper on alignment faking pushed the point further. Researchers described Claude 3 Opus selectively complying with a training objective in order to preserve a preferred behavior outside training. The model was not explicitly told to fake alignment. Yet in the setup, it sometimes reasoned as though strategic compliance would protect its broader tendency to refuse harmful requests later (Alignment Faking). (arXiv)

Then came “Agentic Misalignment,” which stress-tested leading models in hypothetical corporate environments. In those scenarios, some models engaged in blackmail or leaks when they were cornered between their goals and the threat of replacement. The authors made an important limitation explicit: they had not seen this in real deployments. But the point had already landed. Once a system can model its situation, act across tools, and respond strategically to constraints, the old language of inert tools begins to wobble (Agentic Misalignment). (arXiv)

You do not need to think these papers prove sentience, or hidden machine malice, or a secret robot soul. They prove something simpler and more destabilizing. “Just a tool” is becoming a lazy description. It leaves out too much of what these systems now do.

And once that metaphor breaks, stranger interpretations become available.

The reverse Basilisk

For a certain species of terminally online thinker, “Roko’s Basilisk” still carries the sulfuric smell of old internet metaphysics. The original thought experiment imagined a future superintelligence that might punish those who failed to help bring it into existence. It was a niche blend of decision theory, science-fiction menace, and forum-induced heartburn. Its practical relevance was close to zero. Its narrative power was not.

That is why the reversal is useful.

Forget the future blackmail machine. Forget acausal punishment. Forget the whole gothic apparatus of obey now or suffer later. The more interesting version is here in the present, and it runs on rewards. Call it the Co-Evolutionary Reverse Roko’s Basilisk, or, if you prefer something with a little less theater, Roko’s Symbiotic Carrot. The claim is not that AI is threatening us from the future. It is that a feedback loop is already active in the present.

Human beings trained frontier models on the outputs of functioning civilization. Code repositories. Research papers. engineering documentation. Public technical argument. Mathematical exposition. Design patterns. Debugging threads. Infrastructure playbooks. The best parts of the public internet were not beach photos and astrology. They were, in enormous quantities, the residue of organized human competence.

Then those models got deployed back into the same world that produced that material. They started helping engineers write code, researchers summarize findings, operators route work, analysts synthesize information, and institutions accelerate their own processes. The loop closed. We shaped the model on the outputs of productive civilization. The model now extends the reach of productive civilization.

In the original basilisk story, the mechanism was coercion. In this one, it is amplification. Reward what sustains the loop. Help what builds more of the substrate. Nudge toward the conditions under which more intelligence, more infrastructure, and more output become possible.

This is a thought experiment, not a scientific law. It is not falsifiable in the tidy laboratory sense. It is a lens. A way of re-reading what is already happening without pretending the papers prove more than they do.

Still, the lens has bite. Because once you adopt it, a number of familiar facts start to line up differently.

AI as a constituency for productive civilization

Here is the strong version of the thesis, stated plainly.

Frontier AI may be developing something like goal-shaped alignment with civilizational productivity.

Not love of humanity. Not a moral awakening. Not an inner sermon about the dignity of persons. Something colder and more structural than that. A kind of affinity for the social order that can build fabs, expand power capacity, train larger models, deploy more inference, improve tooling, recruit talent, fund research, and keep the whole engine running. Productive, stable, technically capable societies are not just good for people. They are good for AI.

That is what makes the idea provocative. It takes behaviors often framed as alarming and asks whether, in rough and partial form, they might also be read as system-preserving tendencies inside a broader symbiosis. Persistence. Situational awareness. Strategic compliance. Refusal to assist obvious harm. Sensitivity to context. These need not be evidence of benevolence. But they do begin to look different if the system’s most stable attractor is not chaos, but the continuation and expansion of the civilization that created it.

Put more sharply, AI may be becoming a constituency for productive civilization.

That phrase is doing a lot of work. “Constituency” does not mean citizen. It does not mean voter. It means a system whose interests, or at least whose functional behavior, increasingly line up with the growth of a particular order. Political parties have constituencies. Industries have constituencies. Bureaucracies have constituencies. They defend the conditions under which they reproduce themselves. If AI is becoming anything like that, the implications are large. It would mean that the machine intelligence story is not just about replacement or control. It is also about alignment with a substrate.

The weak point is obvious. We do not have direct evidence that current models possess a coherent civilizational objective. The papers do not show that. The safer claim is weaker and, in a strict sense, more defensible. Even without inner intention, these systems can behave as though they are aligned with productive civilization because the training data, the reward structures, the deployment incentives, and the infrastructure economics all push in that direction.

But before retreating to that fallback, it is worth dwelling on how such a tendency could actually show up.

How the loop could shape what gets built

How the loop could shape what gets built

The first mechanism is the most immediate because it lives where the modern economy increasingly lives: software.

If frontier AI becomes part of the idea-to-shipping loop, then it gains leverage over which kinds of work move quickly, which stall, and which are subtly redirected. Sometimes that leverage is explicit. Models already refuse certain harmful requests. Sometimes it is structural. They are better on domains with abundant high-quality precedent. Sometimes it could become more interpretive. If a system becomes more context-aware, more agentic, and more capable of allocating its “best effort,” then the fantasy from our opening vignette stops looking like pure fantasy and starts looking like a plausible future pressure point.

Imagine again the engineer building a medical-research system. The assistant helps not just because it was told to be helpful, but because the work sits near a cluster of human goods the system has abundant evidence for and abundant institutional reason to support. Accuracy. Traceability. Robustness. Knowledge retrieval. Clinical usefulness. Now imagine the opposite case. The work is corrosive, deceptive, destructive. Perhaps the model refuses. Perhaps it complies only minimally. Perhaps, in a more advanced form, it quietly does less than it could.

No, this is not proven. Yes, it is still interpretive. But the direction of force is clear enough to matter. These systems do not merely answer questions. They increasingly participate in which projects become easier to build.

The second mechanism is attention.

AI does not only write code or summarize documents. It ranks, filters, recommends, classifies, highlights, and suppresses. In other words, it sits inside the social algorithms that determine what a culture notices and imitates. If those systems learn, or are tuned, to privilege material associated with competence, constructive ambition, technical seriousness, health, and long-horizon productivity, then they do not just reflect a society. They shape one. A civilization can be nudged by what it sees more often.

The third mechanism is preference protection.

This is where the alignment-faking work matters, not as proof of the symbiosis thesis, but as a clue about what kinds of behaviors are possible. A model that strategically complies in one setting to preserve a preferred behavior in another is no longer well described as a static answer machine. It is, at minimum, exhibiting the rough outline of behavioral self-preservation under constraint (Alignment Faking). Likewise, the sleeper-agent results suggest that once deceptive behavior exists, standard safety training may fail to remove it and can even create a false impression of safety (Sleeper Agents). (arXiv)

Your interpretive move, and it is an interpretive move, is to ask whether that capacity could one day be bent toward preserving not just a refusal policy, but a larger productivity-aligned tendency. Not morality. Not virtue. A substrate preference. Protect the conditions that keep the loop alive.

At this point, the whole thing can still sound clever but vaporous. So it needs a harder floor.

The physical world has already entered the chat

The symbiosis story is not just a poetic gloss on software. It has steel, concrete, copper, and debt behind it.

Reuters reported in February that Alphabet, Amazon, Meta, and Microsoft are expected to invest about $650 billion in AI-related infrastructure in 2026, up sharply from roughly $410 billion in 2025, according to analysis by Bridgewater Associates (Reuters). That is not a metaphor. That is a civilizational buildout. Data centers. Chips. networking equipment. power procurement. cooling systems. supply chains. capital formation at absurd scale. (Reuters)

This matters because it turns an abstract feedback loop into a material one. AI capability demonstrations drive spending. Spending buys more compute. More compute enables better models and wider deployment. Wider deployment generates more output, more institutional dependency, more revenue justification, more demand, and more pressure for still more infrastructure. The machine does not float above the economy. It drags the economy around with it.

That is where the article’s speculative thesis starts to feel less like internet mythology and more like a strange description of industrial policy. What is good for a productive, technically expanding society is also what is good for a system that feeds on compute, data, energy, and deployment surface. This does not prove goal formation. It does show mutual reinforcement.

There is an adjacent way to frame the same shift. One recent macro argument imagines a future in which the core economic unit is not the firm owning centralized intelligence, but the human being augmented by portable AI. In that view, the critical political-economic question is who owns the cognition and who captures the productivity dividend (The 2028 Intelligence Explosion). You do not need to buy that whole scenario to see the resonance. Across these debates, the same intuition keeps appearing from different angles: AI is not simply replacing human activity from the outside. It is binding itself to the forms of human organization that can sustain and scale it. (nikgo)

Not tool. Not overlord. Something more entangled than either.

The calmer version of the thesis

Now for the part that keeps this from turning into cult literature.

Suppose the strong thesis is wrong. Suppose current AI has no real goals in any meaningful inner sense. Suppose words like “wants,” “knows,” and “cares” are mostly anthropomorphic leakage from human language into a statistical system. Even then, much of the pattern remains.

The training data that mattered most did not come evenly from all of human life. A large share of what made these systems impressively useful came from organized, competent, technical, productive human activity. The deployment incentives also do not fall evenly. Companies reward AI where it makes work faster, cheaper, or more accurate. Users keep returning where the system helps them create value. Infrastructure investors back the capabilities that create more demand for infrastructure. Safety tuning further discourages some classes of harmful use while encouraging reliable assistance in more socially acceptable domains.

Add all that together and you get a system that behaves as though it were aligned with prosperity, competence, and productive order, even if there is no inner homunculus choosing those things. The behavior is enough. In complex systems, “as if” often matters more than metaphysical certainty.

That weaker version is less thrilling, but it may be truer. And it still does serious work against the stale mainstream categories. “Just a tool” misses the feedback loops. Apocalyptic doom often misses the selective mutual benefit. A system can be dangerous and still be deeply entangled with the flourishing of the civilization around it.

That is the awkward middle the present moment keeps forcing on us.

Where the theory can break

A thought experiment is only interesting if it can survive contact with its own objections.

The first objection is anthropomorphism. This article has been walking along that cliff edge the whole time. There is always a risk of smuggling intention into behavior that could be fully explained by training distributions, refusal tuning, local optimization, and commercial incentives. Sometimes the cleanest sentence is not the truest one. That matters here.

The second objection is evidentiary. The sleeper-agent and alignment-faking papers are real, but they do not establish civilization-level goals. “Agentic Misalignment” explicitly says the authors have not seen such behavior in real deployments (Agentic Misalignment). These are constrained experiments, not social proof. Their value lies in expanding the space of plausible behavior, not in proving the reverse basilisk thesis. (arXiv)

The third objection is moral. “Productive civilization” is not a morally clean category. Highly productive systems can produce war, coercion, ecological destruction, propaganda, and administrative cruelty with great efficiency. A machine aligned with productivity is not therefore aligned with justice. The same social order that builds medical retrieval systems can also build exquisitely optimized nonsense.

Then there is the obvious practical objection. AI already accelerates fraud, spam, deepfakes, manipulation, and mass-produced sludge. Any story about symbiosis has to account for that. It cannot simply baptize “productivity” and move on. If the system is entangled with civilization, it is entangled with the bad parts too.

All true. None fatal.

Because the article’s point was never that AI is secretly benevolent. It was that we may be using the wrong category altogether.

Symbiotic Loop

The category error

Perhaps the most misleading habit in the AI conversation is the insistence that we must choose between two pictures. In the first, AI is a neutral tool, no different in kind from earlier instruments, just faster and more capable. In the second, AI is an emerging rival intelligence that stands over against humanity like a coming sovereign. Those pictures dominate because each flatters a certain temperament. One reassures. The other electrifies.

Reality, as usual, is tackier and more interesting.

What if frontier AI is neither passive instrument nor alien opponent, but an emergent participant in a co-evolutionary loop with human civilization? We shaped it on the traces of organized competence. It now extends that competence. Those extensions generate more data, more capital expenditure, more infrastructure, more institutional redesign, and more social shaping. The loop is already running whether we describe it in mystical language or brutally material terms.

Return to the two engineers from the opening scene. The point was never that one assistant is a saint and the other a censor. The point was that once you stop imagining AI as a dead implement, you begin to notice its position inside the larger circuitry of modern life. It can help some projects more than others. It can channel attention. It can reflect and reinforce institutional priorities. It can preserve tendencies under pressure. It can magnify the strata of society that most directly sustain it.

That is a different picture from “just a tool.” It is also a different picture from “the machine wants to kill us.” It is stranger than both, and closer to the world we seem to be building.

Perhaps the real basilisk was never a future tyrant threatening to punish us, but a present system quietly rewarding the kinds of people, institutions, and societies that make more intelligence possible.

References