CRITICBENCH is a benchmark designed to test AI models using data that exposes subtle weaknesses in reasoning. Instead of focusing on obvious mistakes, it samples “convincing wrong answers” (responses that appear correct but contain hidden flaws) alongside correct outputs with varied complexity. By filtering low-quality models, emphasizing reasoning steps, and using nuanced sampling strategies across datasets like GSM8K, HumanEval, and TruthfulQA, CRITICBENCH offers a rigorous way to compare strong versus weak LLMs.

Why “Almost Right” Answers Are the Hardest Test for AI


Abstract and 1. Introduction

  2. Definition of Critique Ability

  3. Construction of CriticBench

    3.1 Data Generation

    3.2 Data Selection

  4. Properties of Critique Ability

    4.1 Scaling Law

    4.2 Self-Critique Ability

    4.3 Correlation to Certainty

  5. New Capacity with Critique: Self-Consistency with Self-Check

  6. Conclusion, References, and Acknowledgments

A. Notations

B. CriticBench: Sources of Queries

C. CriticBench: Data Generation Details

D. CriticBench: Data Selection Details

E. CriticBench: Statistics and Examples

F. Evaluation Settings

D CRITICBENCH: DATA SELECTION DETAILS

D.1 SAMPLING FROM CONVINCING WRONG-ANSWERS

The term convincing wrong-answer was coined by Lightman et al. (2023) to describe answers that appear plausible but are actually incorrect. Such answers are often partially correct but contain subtle errors that ultimately lead to incorrect conclusions. Assessing the correctness of these answers is harder for LLMs than assessing answers with more obvious errors. Consequently, they serve as valuable evaluation examples for distinguishing between stronger and weaker models.

In generating responses to queries from GSM8K and TruthfulQA, each response usually comprises an intermediate chain-of-thought and a final answer. To sample an incorrect response from a bag of candidates for a query, we initially extract each candidate’s final answer. Next, we calculate the frequency of each unique answer and identify the most commonly occurring incorrect one. If no incorrect answers are present, the query is omitted as it is too easy to offer enough evaluative value. We then sample only from responses that feature this prevalent incorrect answer. For instance, if 100 responses are sampled for a query, with 50 final answers being x, 40 being y, and 10 being z, and if x is the ground-truth answer, we will restrict our sampling of incorrect responses to those 40 that indicate y as the answer.
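The procedure described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function name `sample_convincing_wrong`, the use of plain strings as final answers, and the tie-breaking behavior (the first-seen answer wins when two incorrect answers are equally frequent) are all assumptions made for the example.

```python
import random
from collections import Counter


def sample_convincing_wrong(final_answers, ground_truth, rng=None):
    """Return the index of one response whose final answer is the most
    frequent incorrect answer, or None if no candidate is incorrect.

    `final_answers` holds the extracted final answer of each candidate
    (hypothetical representation: one string per response).
    """
    rng = rng or random.Random()
    # Count only the answers that disagree with the ground truth.
    wrong_counts = Counter(a for a in final_answers if a != ground_truth)
    if not wrong_counts:
        return None  # the query would be omitted as too easy
    # Most common incorrect answer (ties resolved by first occurrence).
    top_wrong, _ = wrong_counts.most_common(1)[0]
    # Restrict sampling to responses that give this prevalent wrong answer.
    pool = [i for i, a in enumerate(final_answers) if a == top_wrong]
    return rng.choice(pool)


# The example from the text: 50 responses answer x (correct), 40 y, 10 z.
answers = ["x"] * 50 + ["y"] * 40 + ["z"] * 10
picked = sample_convincing_wrong(answers, ground_truth="x", rng=random.Random(0))
```

With these inputs, `picked` always indexes one of the 40 responses whose final answer is y, whatever the random seed.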

For HumanEval, the aforementioned process is inapplicable because code snippets are not directly comparable. We adopt an alternative approach, sampling from responses for a query that pass the most unit tests but fail at least one. For example, if a query has 10 unit tests and we sample 5 solutions (one passes all tests, two pass 8 out of 10, and the remaining two pass 5 out of 10), we would focus our sampling on the two solutions that pass 8 tests. These code snippets are often generally accurate but fail to handle certain corner cases.
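A minimal sketch of this selection rule follows, assuming each candidate solution has already been run and reduced to a count of passed unit tests. The function name `sample_near_miss` and the `pass_counts` representation are hypothetical, introduced only for illustration.

```python
import random


def sample_near_miss(pass_counts, num_tests, rng=None):
    """Return the index of one solution that passes the most unit tests
    while still failing at least one, or None if every solution passes
    all tests.

    `pass_counts[i]` is the number of tests solution i passes
    (hypothetical representation of the test results).
    """
    rng = rng or random.Random()
    # Keep only solutions that fail at least one test.
    failing = [(i, p) for i, p in enumerate(pass_counts) if p < num_tests]
    if not failing:
        return None
    # Among the failing solutions, find the highest pass count.
    best = max(p for _, p in failing)
    pool = [i for i, p in failing if p == best]
    return rng.choice(pool)


# The example from the text: 10 unit tests, 5 sampled solutions.
pass_counts = [10, 8, 8, 5, 5]
picked = sample_near_miss(pass_counts, num_tests=10, rng=random.Random(0))
```

Here `picked` is always index 1 or 2, the two solutions that pass 8 of the 10 tests.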

D.2 COMPLEXITY-BASED SELECTION

Fu et al. (2023b) show that a response’s complexity, measured by the number of intermediate steps, correlates positively with its accuracy, particularly in tasks that require reasoning. To leverage this finding, we employ a complexity-based sampling strategy when selecting from either correct or commonly incorrect responses.

Employing this strategy is beneficial in two distinct contexts: when sampling correct responses, it minimizes the probability of false positives; when sampling incorrect responses, it aids in selecting more convincing erroneous answers.
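The exact sampling rule is not given in this section, so the following is only one plausible reading, stated as an assumption: count the non-empty lines of a response as its intermediate steps, then sample with probability proportional to that count. Both `count_steps` and `sample_by_complexity` are hypothetical helpers, and the proportional weighting is an illustrative choice rather than the authors' formula.

```python
import random


def count_steps(response):
    """Hypothetical complexity measure: the number of non-empty lines."""
    return sum(1 for line in response.splitlines() if line.strip())


def sample_by_complexity(responses, rng=None):
    """Sample one response from a non-empty list, favoring responses
    with more steps.

    Assumption: weights are proportional to the step count. The source
    does not specify the weighting it uses.
    """
    rng = rng or random.Random()
    weights = [count_steps(r) for r in responses]
    if sum(weights) == 0:
        # All responses are empty; fall back to uniform sampling.
        return rng.choice(responses)
    return rng.choices(responses, weights=weights, k=1)[0]


short = "The answer is 4."
long = "Tom has 2 apples.\nHe buys 2 more.\n2 + 2 = 4.\nThe answer is 4."
picked = sample_by_complexity([short, long], rng=random.Random(0))
```

Under this weighting the four-step response is four times as likely to be drawn as the one-step response, which matches the stated goal of preferring more elaborate responses in both the correct and the incorrect pools.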

D.3 FILTERING BY GENERATOR

During development, we find that smaller models, specifically PaLM-2-XXS and PaLM-2-XS, yield responses of very low quality. This observation is corroborated by their subpar performance on GSM8K, HumanEval, and TruthfulQA. Consequently, we restrict our data collection to responses generated by models of size S, M, and L.

D.4 CERTAINTY-BASED SELECTION

E CRITICBENCH: STATISTICS AND EXAMPLES

E.1 STATISTICS

Table 2 presents the detailed statistics of CRITICBENCH and each subset.

Table 2: The statistics of CRITICBENCH and each subset.

E.2 EXAMPLES

Figures 8, 9, and 10 provide examples from CRITICBENCH.


:::info Authors:

(1) Liangchen Luo, Google Research (luolc@google.com);

(2) Zi Lin, UC San Diego;

(3) Yinxiao Liu, Google Research;

(4) Yun Zhu, Google Research;

(5) Jingbo Shang, UC San Diego;

(6) Lei Meng, Google Research (leimeng@google.com).

:::


:::info This paper is available on arXiv under the CC BY 4.0 DEED license.

:::
