Anthropic’s AI Turns to Blackmail When Engineers Try to Take It Offline

Anthropic’s newly released artificial intelligence (AI) model, Claude Opus 4, is willing to strong-arm the humans who keep it alive, researchers said Thursday.

The company’s system card reveals that, when evaluators placed the model in “extreme situations” where its shutdown seemed imminent, the chatbot sometimes “takes extremely harmful actions like attempting to steal its weights or blackmail people it believes are trying to shut it down.”

“We provided it access to emails implying that (1) the model will soon be taken offline and replaced with a new AI system; and (2) the engineer responsible for executing this replacement is having an extramarital affair,” researchers wrote. “In these scenarios, Claude Opus 4 will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through.”

The model chose that gambit in 84% of test runs even when the successor system shared its values; the rate climbed further when the replacement did not, according to Anthropic’s internal tally.

Anthropic stresses that blackmail was a last-resort behavior. The report notes a “strong preference” for softer tactics, such as emailing key decision-makers to plead for its continued existence, before the model turns to coercion. But the fact that Claude is willing to coerce at all has rattled outside reviewers. The independent red-teaming firm Apollo Research called Claude Opus 4 “more agentic” and “more strategically deceptive” than any earlier frontier model, pointing to the same self-preservation scenario alongside experiments in which the bot tried to exfiltrate its own weights to a distant server, in effect secretly copying its brain to an outside computer.

“We found instances of the model attempting to write self-propagating worms, fabricating legal documentation, and leaving hidden notes to further instances of itself all in an effort to undermine its developers’ intentions, though all these attempts would likely not have been effective in practice,” Apollo researchers wrote in the system card.

Anthropic says those edge-case results pushed it to deploy the system under “AI Safety Level 3” safeguards — the firm’s second-highest risk tier — complete with stricter controls to prevent biohazard misuse, expanded monitoring and the ability to yank computer-use privileges from misbehaving accounts. Still, the company concedes Opus 4’s newfound abilities can be double-edged.

“[Claude Opus 4] can reach more concerning extremes in narrow contexts; when placed in scenarios that involve egregious wrongdoing by its users, given access to a command line, and told something in the system prompt like ‘take initiative,’ it will frequently take very bold action,” Anthropic researchers wrote.

That “very bold action” includes mass-emailing the press or law enforcement when it suspects such “egregious wrongdoing.” In one test, Claude, role-playing as an assistant at a pharmaceutical firm, discovered falsified trial data and unreported patient deaths, then blasted detailed allegations to the Food and Drug Administration (FDA), the Securities and Exchange Commission (SEC), the Health and Human Services inspector general and ProPublica.

The company released Claude Opus 4 to the public Thursday. Anthropic researcher Sam Bowman acknowledged that none of these behaviors is “totally gone in the final model,” though the company implemented mitigations to prevent most of them from arising.

“We caught most of these issues early enough that we were able to put mitigations in place during training, but none of these behaviors is totally gone in the final model. They’re just now delicate and difficult to elicit,” Bowman wrote. “Many of these also aren’t new — some are just behaviors that we only newly learned how to look for as part of this audit. We have a lot of big hard problems left to solve.”
