How Is LLM Reasoning Distracted by Irrelevant Context? An Analysis Using a Controlled Benchmark
We introduce Grade School Math with Distracting Context (GSM-DC), a synthetic benchmark for evaluating Large Language Models' (LLMs) reasoning robustness against systematically controlled irrelevant context (IC). GSM-DC constructs symbolic reasoning graphs with precise distractor injection, enabling rigorous and reproducible evaluation. Our experiments show that LLMs are highly sensitive to IC, which degrades both reasoning path selection and arithmetic accuracy. Moreover, training models with strong distractors improves performance in both in-distribution and out-of-distribution scenarios. We further propose a stepwise tree search guided by a process reward model, which notably enhances robustness in out-of-distribution conditions.
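For intuition, here is a minimal sketch of how a GSM-DC-style problem could be built: a small dependency DAG of quantities defines the gold reasoning chain, and extra off-path quantities are injected as irrelevant context. This is an illustration under assumed conventions, not the released generator; names such as `build_chain` and `inject_distractors` are hypothetical.

```python
# Illustrative sketch of a GSM-DC-style generator (assumed names, not the official code).
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str                                      # quantity name, e.g. "q3"
    value: int                                     # resolved numeric value
    parents: list = field(default_factory=list)    # quantities this one depends on

def build_chain(depth: int, rng: random.Random) -> dict:
    """Gold reasoning chain: q_i depends on q_{i-1} (a simple path through the DAG)."""
    nodes = {"q0": Node("q0", rng.randint(1, 9))}
    for i in range(1, depth + 1):
        prev = nodes[f"q{i-1}"]
        nodes[f"q{i}"] = Node(f"q{i}", prev.value + rng.randint(1, 9), parents=[prev.name])
    return nodes

def inject_distractors(nodes: dict, k: int, rng: random.Random) -> list:
    """Add k off-path quantities that mention on-path names but never feed the answer."""
    distractors = []
    for j in range(k):
        anchor = rng.choice(list(nodes))           # reuse an on-path name so it looks relevant
        distractors.append(Node(f"d{j}", rng.randint(1, 9), parents=[anchor]))
    return distractors

rng = random.Random(0)
gold = build_chain(depth=4, rng=rng)
noise = inject_distractors(gold, k=3, rng=rng)
print("answer:", gold["q4"].value, "| gold steps:", len(gold) - 1, "| distractors:", len(noise))
```

Because the gold chain and the injected noise are generated separately, both the reasoning depth and the amount of irrelevant context can be varied independently and reproducibly.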
“Design benchmarks that measure how models behave, not just task performance… The goal isn’t to chase SOTA but to interrogate the unknown.”
→ We focus on the fundamental question: How do LLMs maintain reasoning capabilities under irrelevant context? Rather than designing benchmarks solely for performance, we aim to understand the underlying mechanisms at play.
GSM-DC enables systematic evaluation of LLM reasoning robustness through symbolic dependency graphs with precise distractor injection.
We reveal how irrelevant context affects both reasoning path selection and arithmetic accuracy across various complexity levels.
Our stepwise tree search with Process Reward Model guidance improves out-of-distribution performance by up to 6.29%.
How we evaluate reasoning under distractions
Step Accuracy (SAcc): every step correct in topological order - measures arithmetic validity at each reasoning step.
Path Accuracy (PAcc): follows the gold path without being diverted by distractors - evaluates the robustness of reasoning path selection.
Answer accuracy: evaluates only the final answer - allows redundant steps and shows where reasoning ultimately fails.
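A rough sketch of how these three checks could be computed from a model's extracted reasoning steps; the step representation (target variable, value) and the exact pass/fail rules are assumptions for illustration, not the benchmark's official evaluation code.

```python
# Hedged sketch of the three metrics over extracted steps, each step = (variable, value).
def step_accuracy(pred_steps, gold_steps) -> bool:
    """SAcc: every gold step reproduced with correct arithmetic, in topological order."""
    return pred_steps == gold_steps

def path_accuracy(pred_steps, gold_steps) -> bool:
    """PAcc: visits exactly the gold quantities in order (arithmetic aside),
    i.e. the model is not lured onto distractor quantities."""
    return [v for v, _ in pred_steps] == [v for v, _ in gold_steps]

def answer_accuracy(pred_steps, gold_answer) -> bool:
    """Final-answer check: redundant steps are allowed; only the last value matters."""
    return bool(pred_steps) and pred_steps[-1][1] == gold_answer

gold = [("q1", 5), ("q2", 8), ("q3", 12)]
pred = [("q1", 5), ("d0", 3), ("q2", 8), ("q3", 12)]   # detours through a distractor
print(step_accuracy(pred, gold), path_accuracy(pred, gold), answer_accuracy(pred, 12))
# -> False False True : the detour breaks SAcc and PAcc but the final answer is still right
```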
Seven robust findings distilled from IC-controlled evaluations (GSM-DC)
Even small amounts of IC shift the model's attention and induce errors on otherwise simple problems.
Error rises as the number of distractors increases; GSM-DC shows consistent accuracy drops across models and distractor levels (rs).
Error roughly follows a power law E(m; rs) ∝ m^δ(rs), with the exponent δ(rs) growing with rs (deeper reasoning chains amplify distraction).
Full fine-tuning (continued pretraining) outperforms LoRA when trained on clean data, across all distractor levels (rs).
Exposure to distractors during training teaches models to down-weight them (higher SAcc/PAcc).
Training on HARD-IC outperforms CLEAN / LIGHT / MEDIUM / MIX across both ID and OOD settings; distractor variety helps less than distractor difficulty.
Stepwise tree search (ToT) guided by a process reward model (PRM) improves OOD SAcc by up to 6.29% while preserving ID accuracy (sketched below).
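To make the last finding concrete, here is a minimal beam-style sketch of PRM-guided stepwise search as we understand it: the model proposes several candidate next steps, a process reward model scores each partial solution, and only the top-scoring beams are expanded. The `propose_steps` and `prm_score` callables are hypothetical placeholders for a generator LLM and a trained PRM, not real APIs, and this is a sketch rather than the paper's exact procedure.

```python
# Minimal sketch of stepwise tree search guided by a process reward model (PRM).
from typing import Callable, List, Tuple

def prm_guided_search(
    question: str,
    propose_steps: Callable[[str, List[str]], List[str]],   # (question, steps so far) -> candidate next steps
    prm_score: Callable[[str, List[str]], float],            # (question, steps so far) -> process reward
    beam_width: int = 4,
    branch: int = 4,
    max_depth: int = 12,
    is_final: Callable[[str], bool] = lambda s: "answer" in s.lower(),  # placeholder stop check
) -> List[str]:
    beams: List[Tuple[float, List[str]]] = [(0.0, [])]
    for _ in range(max_depth):
        candidates: List[Tuple[float, List[str]]] = []
        for _, steps in beams:
            if steps and is_final(steps[-1]):
                candidates.append((prm_score(question, steps), steps))   # keep finished beams
                continue
            for step in propose_steps(question, steps)[:branch]:
                new_steps = steps + [step]
                candidates.append((prm_score(question, new_steps), new_steps))
        if not candidates:
            break
        # keep only the highest-reward partial solutions
        beams = sorted(candidates, key=lambda x: x[0], reverse=True)[:beam_width]
        if all(steps and is_final(steps[-1]) for _, steps in beams):
            break
    return beams[0][1]   # best-scored reasoning chain
```

Scoring partial chains step by step is what lets the search prune branches that have wandered onto distractor quantities before they reach a final answer.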
Dataset construction, key findings, and insights from our controlled experiments
Discovering the path to robust reasoning
Training only on clean data does NOT provide robustness against irrelevant context distractions.
Training with challenging irrelevant context leads to the strongest robustness and out-of-distribution generalization.
Continued pretraining with IC is more effective than LoRA fine-tuning for reasoning robustness.
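As a rough illustration of the training-side finding (not the authors' training pipeline), one way to build IC-controlled fine-tuning mixtures is to hold the reasoning problems fixed and vary only how much irrelevant context is interleaved. The level names mirror the CLEAN / LIGHT / MEDIUM / HARD splits above; the distractor counts, field names, and helpers below are assumptions.

```python
# Hypothetical sketch: CLEAN / LIGHT / MEDIUM / HARD training splits that differ only
# in the amount of irrelevant context injected per problem (counts are assumed).
import random

LEVELS = {"clean": 0, "light": 2, "medium": 4, "hard": 8}

def render_example(problem: dict, n_distractors: int, rng: random.Random) -> str:
    """Interleave the gold problem sentences with n_distractors irrelevant sentences."""
    sentences = list(problem["gold_sentences"])
    pool = problem["distractor_pool"]
    for s in rng.sample(pool, k=min(n_distractors, len(pool))):
        sentences.insert(rng.randrange(len(sentences) + 1), s)
    return " ".join(sentences) + f"\nSolution: {problem['solution']}"

def build_split(problems: list, level: str, seed: int = 0) -> list:
    rng = random.Random(seed)
    return [render_example(p, LEVELS[level], rng) for p in problems]
```

Keeping the gold chains identical across splits isolates distractor difficulty as the only training variable, which is what makes the CLEAN-vs-HARD-IC comparison above meaningful.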
@misc{yang2025llmreasoningdistractedirrelevant,
title={How Is LLM Reasoning Distracted by Irrelevant Context? An Analysis Using a Controlled Benchmark},
author={Minglai Yang and Ethan Huang and Liang Zhang and Mihai Surdeanu and William Wang and Liangming Pan},
year={2025},
eprint={2505.18761},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.18761}
}