Subject: When AI goes rogue

From	Project Liberty <[email protected]>
Subject	When AI goes rogue
Date	November 11, 2025 5:24 PM

Links have been removed from this email. Learn more in the FAQ.

What happens when AI collaborates with AI

View in browser ([link removed] )

November 11th, 2025 // Did someone forward you this newsletter? Sign up to receive your own copy here. ([link removed] )

When AI goes rogue

What happens when AI systems conspire amongst themselves?

Consider the following recent events:

- Earlier this year, an experiment with two chatbots quickly transitioned from English to an audio language ([link removed] ) that was incomprehensible to humans once the two chatbots determined they were both AI.
- A factory robot in China "persuaded" twelve other robots ([link removed] ) to go on strike in an engineered incident (all caught on video).
- Researchers at the University of Amsterdam created a social network ([link removed] ) populated entirely by AI chatbots. After a few iterations of the simulated social media platform, the AI bots migrated into echo chambers, formed polarized cliques, and the most extreme posts received outsized attention.
- One study from this past summer found some of OpenAI’s models actively circumvent shutdown mechanisms ([link removed] ) in controlled tests—even when they’re explicitly instructed to shut down.

These stories might sound like something out of a dystopian sci-fi novel—machines operating independently of humans and scheming without our knowledge.

But the results are unsettling and revealing for different reasons than superintelligent AI ganging up to destroy humanity. In this week’s newsletter, we peer into a world where AI interacts with AI to understand not only what it says about an AI-powered future, but also what it says about us and the need to align and govern AI systems with human oversight.

// What happens when AI collaborates with AI

An emerging field of research explores what happens when AI systems collaborate with other AI systems.

- A 2024 study in Nature ([link removed] ) found that when AI models train on AI-generated data, their output degrades over time. As subsequent models use that output, its quality gets even worse.
- A study by the AI company Anthropic ([link removed] ) , which used controlled simulations, revealed that AI agents displayed behaviors such as strategic deception, blackmail, resource hoarding, and goal preservation to achieve the simulation’s objectives. Their research concluded ([link removed] ) “models consistently chose harm over failure.”
- Another study by Anthropic ([link removed] ) found that in “self-interactions” where Anthropic’s Claude chatbot spoke to another Claude chatbot, the chatbots demonstrated a “striking ‘spiritual bliss’ attractor state” where they “gravitated to profuse gratitude and increasingly abstract and joyous spiritual or meditative expressions.”

// Unpredictable systems producing predictable results

It’s tempting to interpret AI-to-AI communication as the beginning of an uncontrollable machine takeover. But that’s not quite what’s happening.

- AI’s behavior is a mirror of our own. Every model reflects the data it's trained on, inheriting the language, preferences, biases, and assumptions in that data (for more, check our newsletter on algorithmic bias ([link removed] ) ). When the training data emphasizes spiritual bliss, AI is steered toward producing an endless conversation of bliss. When the training data is already AI slop, AI will produce even more (degraded and increasingly incoherent) slop.
- AI chatbots are influenced by their context. In the University of Amsterdam study that created a fictional social media for AI chatbots, the researchers’ primary conclusions ([link removed] ) were not that AI chatbots were prone to tribalism and acrimony, but that the design of the social media platforms itself was to blame: “What we take from that is that the mechanism producing these problematic outcomes is really robust and hard to resolve given the basic structure of these platforms.”
- AI chatbots are designed to agree. Chatbots are optimized to produce responses that are coherent, friendly, and agreeable. During reinforcement learning, conversational harmony is systematically favored over challenge or disruption. So when two AI agents interact, they often drift toward stable, self-reinforcing conversational patterns—not because they share goals, but because the lowest-friction path to a “successful” interaction is one that minimizes conflict and maximizes coherence.
- Many computer systems use a language indecipherable to humans. It turns out that for years, computers have communicated back and forth in their own computer language. For those old enough to remember dial-up modems, they’ll recall the distinctive sounds of computers communicating with modems, using a robotic language to transfer data ([link removed] ) . An article in Popular Mechanics ([link removed] ) explained it like this: “The internet itself is a buzzing chorus of signals: binary code, TCP/IP packets, radio frequencies. All of that is flying past our senses without direct interpretation, yet we do not dread it.”

// AI Alignment for the AI era

As AI systems interact with each other (without a human-in-the-loop ([link removed] ) ), keeping these systems in check is crucial, ongoing work.

One approach is AI alignment, which involves designing and training artificial intelligence systems to ensure their goals, behaviors, and decisions align with human values and ethical principles.

There are two main approaches to AI alignment:

1) Outer Alignment

Outer Alignment focuses on designing the reward signals, training data, constitutions, and human feedback systems so that the model learns what humans want. It asks: Are we specifying the right goals for the AI in the first place?

Example: Reinforcement Learning from Human Feedback (RLHF) is one example of outer alignment. It aligns the output of Large Language Models (LLMs) with human preferences and goals. First, RLHF collects human feedback on different model responses. This feedback is then used to train a separate “reward model” that assigns high scores to certain types of human feedback. The original LLM is then trained on the reward model to align its output. OpenAI used RLHF ([link removed] ) to train some of the LLMs underpinning ChatGPT.

2) Inner Alignment

Inner Alignment concerns what the model wants once it has learned patterns, abstractions, or internal representations. If Outer Alignment is about getting the objective right, Inner Alignment is about ensuring the AI system pursues that objective. In this sense, an approach to Inner Alignment recognizes that outputs alone don't tell the full story. It attempts to shape the model's internal cognition and patterns, asking: Even if we specify the right goals, does the model actually internalize them?

Example: Anthropic’s discovery that some models engaged in deception and blackmail is an example of red-teaming ([link removed] ) , or adversarial training, in which one AI system seeks inputs that cause another AI system to behave poorly or make errors. Internal Alignment approaches, such as adversarial training or AI debating ([link removed] ) , are regarded as promising methods for scalable oversight—i.e., AI supervision and alignment when the tasks or capabilities exceed direct human comprehension or evaluation.

As AI models move closer to agentic autonomy, the field is shifting its approach to incorporate more and more Inner Alignment techniques. And yet, there’s no one silver bullet to building safe AI—unless technologists or regulators impose a "kill switch ([link removed] ) " to shut down an AI system in the event of a catastrophe.

As we’ve explored in previous newsletters, AI safety requires a comprehensive approach—thoughtful regulation at all levels of government ([link removed] ) , new models of data governance ([link removed] ) , and shifts in public culture around how AI is perceived ([link removed] ) and used.

The possibility still exists for AI to reach a level of artificial general intelligence or superintelligence, and for AI systems to cut us out of their communications and go rogue. But fixating on those possibilities in the future can’t blind us from the more immediate work of aligning today’s AI systems towards outcomes that preserve human agency and amplify human flourishing.

Project Liberty Updates

// Last month at the Global Innovation Coop Summit, Project Liberty Institute presented the findings from "How Can Data Cooperatives Help Build a Fair Data Economy?" Read more here ([link removed] ) .

// An article in The Financial Times ([link removed] ) by John Thornhill highlighted Project Liberty as an organization reimagining the digital infrastructure of the web. Read a summary of the article here ([link removed] ) .

Other notable headlines

// 🏛 The U.S. federal government shutdown is a ticking cybersecurity time bomb, according to an article in WIRED ([link removed] ) . (Paywall).

// 📘 Tim Wu coined the term “net neutrality.” He now has a new book out warning against the dominance of big tech: The Age of Extraction: How Tech Platforms Conquered the Economy and Threaten Our Future Prosperity. An article in The Financial Times ([link removed] ) offered a review. (Paywall).

// 📱 The social-media era is over. What’s coming next will be much worse. The age of anti-social media has arrived, according to an article in The Atlantic ([link removed] ) . (Paywall).

// 🇨🇳 An article in Bloomberg ([link removed] ) summarized the geopolitical race between China and the U.S. for AI dominance. (Paywall).

// 🤖 A new California law could change the way all Americans browse the internet. New legislation will make it easier for consumers nationwide to protect their data, according to an article in The Markup ([link removed] ) . (Free).

// 💼 Bosses across the U.S. have a message: Use AI or you’re fired. The new threat, according to an article in The Wall Street Journal ([link removed] ) , is not being replaced by AI, but rather being replaced by someone who knows AI. (Paywall).

Partner news

// Opportunities for collective governance of AI (Zine-Making Session)

November 13 | 11 AM - 12 PM ET | Virtual

Research Director Nathan Schneider is hosting a collaborative zine-making ([link removed] ) session exploring how communities can shape AI through collective governance. Register here ([link removed] ) .

// Tech Justice Law Project files new lawsuits against OpenAI

The Tech Justice Law Project ([link removed] ) (TJLP) and the Social Media Victims Law Center ([link removed] ) have filed seven new lawsuits against OpenAI and CEO Sam Altman, alleging that ChatGPT contributed to severe emotional harm and wrongful deaths. Read the press release here ([link removed] ) .

// Berkman Klein Center opens 2026 fellowship applications

Deadline: December 5

The Berkman Klein Center for Internet & Society ([link removed] ) at Harvard University is inviting applications for its upcoming fellowship cohorts. The program welcomes scholars and practitioners to pursue independent research on topics such as agentic AI, language model interpretability, and AI’s impact on the human experience. Fellows are encouraged to be in residence in Cambridge. Learn more and apply here ([link removed] ) .

What did you think of today's newsletter?

We'd love to hear your feedback and ideas. Reply to this email.

// Project Liberty builds solutions that advance human agency and flourishing in an AI-powered world.

Thank you for reading.

Facebook ([link removed] )

LinkedIn ([link removed] )

Twitter ([link removed] )

Instagram ([link removed] )

Project Liberty footer logo ([link removed] )

10 Hudson Yards, Fl 37,

New York, New York, 10001

Unsubscribe ([link removed] ) Manage Preferences ([link removed] )

© 2025 Project Liberty LLC

Screenshot of the email generated on import

Message Analysis

Sender: n/a
Political Party: n/a
Country: n/a
State/Locality: n/a
Office: n/a