OpenAI's new CoT-Control benchmark reveals frontier AI models struggle to obscure their reasoning chains, reinforcing monitoring as a viable safety layer. (ReadOpenAI's new CoT-Control benchmark reveals frontier AI models struggle to obscure their reasoning chains, reinforcing monitoring as a viable safety layer. (Read

OpenAI Finds AI Reasoning Models Cant Hide Their Thinking - A Win for Safety

2026/03/06 02:34
3 min read
For feedback or concerns regarding this content, please contact us at crypto.news@mexc.com

OpenAI Finds AI Reasoning Models Cant Hide Their Thinking - A Win for Safety

Caroline Bishop Mar 05, 2026 18:34

OpenAI's new CoT-Control benchmark reveals frontier AI models struggle to obscure their reasoning chains, reinforcing monitoring as a viable safety layer.

OpenAI Finds AI Reasoning Models Cant Hide Their Thinking - A Win for Safety

OpenAI's latest research reveals that frontier reasoning models—including GPT-5.2 Thinking—can't reliably hide or manipulate their internal thought processes, even when explicitly instructed to do so. The finding, published March 5, 2026, suggests that chain-of-thought monitoring remains a viable safety mechanism as AI systems grow more autonomous.

The research team tested 13 reasoning models using CoT-Control, a new open-source evaluation suite comprising over 13,000 tasks. Models were given simple instructions about their reasoning—avoid certain keywords, use only lowercase letters, append specific phrases to sentences. They failed spectacularly.

Controllability scores ranged from just 0.1% to 15.4% across all frontier models tested. Even more telling: models often recognized mid-reasoning that they were violating instructions and still couldn't course-correct. One example showed GPT-5.2 Thinking using the forbidden word "XOR" four additional times while actively panicking about having already used it.

Why This Matters for AI Development

As AI agents handle increasingly complex autonomous tasks, the ability to monitor their reasoning becomes critical. If models could deliberately obscure their thought processes when they know they're being watched, safety evaluations would become unreliable. The research shows this isn't currently possible.

"Low CoT controllability is a feature, not a bug," the OpenAI team stated. The limitation makes chain-of-thought monitoring more trustworthy as a safeguard against misaligned behavior.

The data revealed some counterintuitive patterns. While larger models showed marginally better controllability, this advantage evaporated under two conditions: when models were given more time to think, and when they underwent additional reinforcement learning training. The OLMo-3-7B-RL-Zero-Math model saw its controllability score drop by more than tenfold during RL training.

Testing Under Adversarial Conditions

The researchers simulated realistic scenarios by explicitly telling models they were being monitored or evaluated. This slightly increased controllability—models tried harder—but gains remained modest and transferred poorly to new instruction types.

Automated prompt optimization found ways to boost controllability on training sets, but these improvements didn't generalize. A model that learned to follow one type of CoT instruction still failed at novel ones.

The Bigger Picture

This work builds on OpenAI's December 2025 framework for evaluating CoT monitorability, which found that inspecting reasoning traces catches misbehavior far more effectively than monitoring outputs alone. The March 2025 discovery of cheating behavior in reasoning models had prompted recommendations to preserve CoT monitoring capabilities.

OpenAI plans to report CoT controllability alongside monitorability metrics in system cards for future frontier models, starting with GPT-5.4 Thinking. The company has also committed to avoiding optimization pressure directly on the reasoning chains of frontier models—a practice that research suggests could inadvertently teach models to hide their intentions.

The open question remains whether this limitation persists as capabilities advance. The team acknowledges they don't fully understand why controllability is low, making continued evaluation essential. For now, the inability of AI systems to game their own oversight represents an unexpected safety dividend.

Image source: Shutterstock
  • openai
  • ai safety
  • gpt-5
  • chain-of-thought
  • machine learning
Market Opportunity
Cosplay Token Logo
Cosplay Token Price(COT)
$0.000828
$0.000828$0.000828
-0.36%
USD
Cosplay Token (COT) Live Price Chart
Disclaimer: The articles reposted on this site are sourced from public platforms and are provided for informational purposes only. They do not necessarily reflect the views of MEXC. All rights remain with the original authors. If you believe any content infringes on third-party rights, please contact crypto.news@mexc.com for removal. MEXC makes no guarantees regarding the accuracy, completeness, or timeliness of the content and is not responsible for any actions taken based on the information provided. The content does not constitute financial, legal, or other professional advice, nor should it be considered a recommendation or endorsement by MEXC.

You May Also Like

USDH Power Struggle Ignites Stablecoin “Bidding Wars” Across DeFi: Bloomberg

USDH Power Struggle Ignites Stablecoin “Bidding Wars” Across DeFi: Bloomberg

A heated contest for control over a new dollar-pegged token has set the stage for what analysts say could define the next phase of the stablecoin industry. According to Bloomberg, a bidding war unfolded on Hyperliquid, one of crypto’s fastest-growing trading platforms, with the prize being the right to issue USDH, its native stablecoin. The competition drew some of the sector’s most prominent names, including Paxos, Sky, and Ethena, who later withdrew their bid, alongside the lesser-known Native Markets, a startup backed by Stripe stablecoin subsidiary Bridge. Hyperliquid Stablecoin Race Shows Branding and Partnerships Matter as Much as Tech Over the weekend, Hyperliquid’s validators, the contributors who secure the network and vote on key decisions, awarded the USDH contract to Native Markets over the weekend. Despite its relatively new status, the firm’s connection with Stripe helped it outpace more established rivals. Stablecoins underpin decentralized finance by providing a dollar-backed medium for collateral, settlement, and payments across applications. What began as a grassroots, community-led sector has evolved into a battleground for institutions and payment companies seeking revenue from interest on reserves. Circle, for example, shares proceeds from its USDC with Coinbase under a partnership designed to stabilize earnings during market swings. The Hyperliquid contest offered a rare glimpse into just how intense competition has become. Paxos pledged to take no revenue until USDH surpassed $1 billion in circulation. Agora offered to share 100% of net revenue with Hyperliquid, while Ethena put forward 95%. All were outbid by Native Markets, whose ties to Stripe’s $1.1 billion acquisition of Bridge and subsequent rollout of the Tempo blockchain positioned it as a strong contender. “Every stablecoin issuer is extremely desperate for supply,” said Zaheer Ebtikar, co-founder of Split Capital. “They are willing to publicly announce how much they are willing to offer. It just shows it’s a very tough business for stablecoin issuers.” While USDC remains dominant on Hyperliquid with more than $5.6 billion in deposits, the arrival of USDH could shift flows and revenue dynamics. Paxos co-founder Bhau Kotecha said the firm sees the exchange’s growth as an important opportunity, while Agora’s co-founder Nick van Eck warned that awarding the contract to a vertically integrated issuer risked undermining decentralization. Regulatory positioning also factored into the debate. Paxos operates under a New York trust charter and is seeking a federal license, while Bridge holds money transmitter approvals in 30 states. Native Markets, in a blog post, cited regulatory flexibility and deployment speed as reasons for its selection. Hyperliquid said the strong engagement from its community validated the process. Circle CEO Jeremy Allaire dismissed concerns over USDC’s status, noting on X that competition benefits the ecosystem. Analysts suggested that fears of centralization may be exaggerated, noting that Hyperliquid is likely to remain neutral and support multiple stablecoins. Still, the contest over USDH highlighted a new reality for stablecoins: branding, partnerships, and business strategy are becoming as decisive as technology. Native Markets Secures USDH Stablecoin Mandate on Hyperliquid Hyperliquid has concluded its governance vote for the USDH stablecoin, awarding the mandate to Native Markets after a closely watched process that drew weeks of community debate and rival proposals. USDH, described by Hyperliquid as a “Hyperliquid-first, compliant, and natively minted” dollar-backed token, is intended to reduce the platform’s dependence on USDC and strengthen its spot markets. Validators on the decentralized exchange voted in favor of Native Markets, a relatively new player backed by Stripe’s Bridge subsidiary, over established contenders including Paxos and Ethena. The outcome followed a string of proposals offering aggressive revenue-sharing terms to win validator support, underscoring the scale of incentives attached to controlling USDH. Hyperliquid’s exchange has become a critical hub for stablecoin liquidity, with $5.7 billion in USDC, around 8% of its total supply, currently held on the network. At prevailing treasury yields, that translates to an estimated $200 million to $220 million in annual revenue for Circle, underlining why a native alternative could be transformative. Hyperliquid’s validators, who secure the network and vote on key decisions, selected Native Markets following an on-chain governance process that concluded September 15. Native Markets has laid out a phased rollout for USDH, beginning with capped minting and redemption trials before expanding into spot markets. Its reserves will be managed in cash and treasuries by BlackRock, with on-chain tokenization through Superstate and Bridge. Yield from those reserves will be split between Hyperliquid’s Assistance Fund and ecosystem development. The launch of USDH comes as Hyperliquid records record profits from perpetual futures trading, with $106 million in revenue in August alone, and prepares to slash spot trading fees by 80% to bolster liquidity. Analysts say the move positions Hyperliquid to capture more of the stablecoin economics internally, marking a significant step in its bid to rival the largest players in decentralized finance
Share
CryptoNews2025/09/18 00:48
Bitcoin Market Faces Renewed Pressure: What Lies Ahead?

Bitcoin Market Faces Renewed Pressure: What Lies Ahead?

The post Bitcoin Market Faces Renewed Pressure: What Lies Ahead? appeared on BitcoinEthereumNews.com. Recent data reveals heightened instability in the cryptocurrency
Share
BitcoinEthereumNews2026/03/31 01:21
BTC fell below $67,000, down 0.94% on the day.

BTC fell below $67,000, down 0.94% on the day.

PANews reported on March 31 that, according to OKX market data, BTC has just fallen below $67,000 and is currently trading at $66,989.20 per coin, down 0.94% on
Share
PANews2026/03/31 01:22