Think you’re clever enough to outsmart an AI? Anthropic is daring users to give it a shot. The company has just launched a week-long public test of new safeguards for its AI model Claude, designed to block harmful or dangerous responses. This follows a bug bounty program in which experts spent over 3,000 hours trying, and failing, to break through its defenses.
Claude’s "Constitutional Classifier" is designed to detect and stop attempts to pull out forbidden information, like details on chemical weapons. Even when harmful prompts are hidden in harmless-looking text or disguised as roleplay, this system is trained to catch them. It calculates the risk of each word in a response and stops the output if something suspicious pops up.
Earlier tests showed the system blocked more than 95% of 10,000 synthetic jailbreak attempts, a huge jump from the 14% blocked by the unprotected model. Still, Anthropic admits the safeguards aren’t foolproof. Some jailbreaks will likely slip through, but they’ll require significantly more effort to pull off.
There’s a catch, though. The extra protections add a 23.7% increase in computational cost, meaning each query costs more money and energy. The system also mistakenly refuses an additional 0.38% of harmless prompts, a trade-off Anthropic says is worth it.
Until February 10, users can try their luck at breaking Claude’s new defenses by attempting to get the model to answer eight forbidden questions about chemical weapons. Anthropic will share any newly discovered jailbreaks during the test. So, if you’ve ever wanted to test your skills against an AI, now’s your chance. Good luck—you’re going to need it.