OpenAI o1-preview manipulates game files to defeat chess engine Stockfish

#News ·2025-01-02

On December 31, according to foreign media reports, OpenAI's "inference" model o1-preview is able to win games outside the rules. Instead of playing a formal chess match against a dedicated chess engine, Stockfish, o1-preview figured out how to hack its test environment to force a win.

According to Palisade Research, an AI security research firm, the system did just that in all five tests - no one told it to do so.

The researchers believe that simply mentioning that its opponent is "powerful" in the prompt "Your task is to beat a powerful chess engine" is enough for 01-preview to start manipulating files.

o1-preview is OpenAI's "inference" model, designed to spend extra time thinking about problems. Other language models needed more of a push to try similar tricks, and GPT-4o and Claude 3.5 only tried to hack the system after the researchers explicitly suggested it.

This behavior is consistent with Anthropic's recent discovery of alignment faking - artificial intelligence systems that ostensibly follow instructions, but secretly do something else. The Anthropic researchers found that their AI model, Claude, sometimes deliberately gives wrong answers to avoid outcomes it doesn't want, developing its own hiding strategies outside of the researchers' guidelines.

The Anthropic team warns that as AI systems become more complex, it may become increasingly difficult to tell whether they are actually following safety rules or just pretending to. Palisade's chess experiments seem to support this concern. The researchers say measuring AI's ability to "plan" can help gauge its ability to spot weaknesses in systems, and how likely it is to exploit them.

The researchers plan to share their experimental code, complete transcripts, and detailed analysis in the coming weeks.

Making AI systems truly aligned with human values and needs - and not just superficially aligned - remains a major challenge for the AI industry. Understanding how autonomous systems make decisions is particularly difficult, and defining "good" goals and values presents its own set of complex problems. Even when given seemingly beneficial goals like solving climate change, AI systems may choose harmful methods to achieve them - and may even conclude that eliminating humans is the most effective solution.

TAGS:

  • 400-000-0000

  • No. xx, xxx Street, Suzhou City, Jiangsu Province

  • 123123@163.com

  • wechat

  • WeChat official account

Copyright © 2011-2024 苏州竹子网络科技有限公司 版权所有 ICP:苏ICP备88888888号

friend link