How Is LLM Reasoning Distracted by Irrelevant Context?
An Analysis Using a Controlled Benchmark

1University of Arizona, 2Peking University, 3University of California, Santa Barbara
EMNLP 2025 Main Conference (Oral Recommendation)

GSM-DC Framework Pipeline

Overview of the GSM-DC framework: dataset construction involves dependency graph construction, irrelevant context injection, and natural language realization.
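
To make the three construction stages concrete, below is a minimal Python sketch of the same idea: build a chain-shaped dependency graph of quantities, inject distractor quantities that nothing on the solution path depends on, and realize everything as templated sentences. All names, templates, and operation choices here are illustrative stand-ins, not the paper's actual generator.

import random

def build_dependency_graph(depth, seed=0):
    """Chain of quantities q0..q_depth, each derived from the previous one."""
    rng = random.Random(seed)
    nodes = [{"name": "q0", "parents": [], "op": None, "value": rng.randint(1, 9)}]
    for i in range(1, depth + 1):
        op, delta = rng.choice(["+", "-"]), rng.randint(1, 9)
        prev = nodes[-1]
        value = prev["value"] + delta if op == "+" else prev["value"] - delta
        nodes.append({"name": f"q{i}", "parents": [prev["name"]],
                      "op": (op, delta), "value": value})
    return nodes

def inject_irrelevant_context(nodes, num_distractors, seed=0):
    """Add quantities that mention on-path quantities but that nothing depends on."""
    rng = random.Random(seed)
    distractors = []
    for j in range(num_distractors):
        anchor, delta = rng.choice(nodes), rng.randint(1, 9)
        distractors.append({"name": f"d{j}", "parents": [anchor["name"]],
                            "op": ("+", delta), "value": anchor["value"] + delta})
    return distractors

def realize(nodes, distractors, seed=0):
    """Template-based natural-language realization, with all statements shuffled."""
    rng = random.Random(seed)
    sents = [f"{n['name']} equals {n['value']}." if n["op"] is None
             else f"{n['name']} equals {n['parents'][0]} {n['op'][0]} {n['op'][1]}."
             for n in nodes + distractors]
    rng.shuffle(sents)
    return " ".join(sents) + f" What is the value of {nodes[-1]['name']}?"

problem_graph = build_dependency_graph(depth=4)
ic = inject_irrelevant_context(problem_graph, num_distractors=3)
print(realize(problem_graph, ic))

In this toy framing, the chain depth plays the role of the reasoning-step count (rs) and num_distractors plays the role of the controlled IC intensity.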

Abstract

We introduce Grade School Math with Distracting Context (GSM-DC), a synthetic benchmark for evaluating Large Language Models' (LLMs) reasoning robustness against systematically controlled irrelevant context (IC). GSM-DC constructs symbolic reasoning graphs with precise distractor injection, enabling rigorous, reproducible evaluation. Our experiments demonstrate that LLMs are highly sensitive to IC, which affects both reasoning path selection and arithmetic accuracy. Additionally, training models with strong distractors improves performance in both in-distribution and out-of-distribution scenarios. We further propose a stepwise tree search guided by a process reward model, which notably enhances robustness in out-of-distribution conditions.

Key Contributions

“Design benchmarks that measure how models behave, not just task performance… The goal isn’t to chase SOTA but to interrogate the unknown.”

→ We focus on the fundamental question: How do LLMs maintain reasoning capabilities under irrelevant context? Rather than designing benchmarks solely for performance, we aim to understand the underlying mechanisms at play.

🎯

Controlled Benchmark

GSM-DC enables systematic evaluation of LLM reasoning robustness through symbolic dependency graphs with precise distractor injection.

🔍

Comprehensive Analysis

We reveal how irrelevant context affects both reasoning path selection and arithmetic accuracy across various complexity levels.

🚀

Enhanced Robustness

Our stepwise tree search with Process Reward Model guidance improves out-of-distribution performance by up to 6.29%.
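
As a rough illustration of the tree-search contribution above, here is a minimal beam-style sketch of PRM-guided stepwise search. The functions propose_steps, prm_score, and is_final are hypothetical stand-ins for the step generator, the process reward model, and an answer detector, and the beam width and depth limit are illustrative values, not the paper's settings.

from typing import Callable, List, Tuple

def prm_guided_search(problem: str,
                      propose_steps: Callable[[str, List[str]], List[str]],
                      prm_score: Callable[[str, List[str]], float],
                      is_final: Callable[[str], bool],
                      beam_width: int = 4,
                      max_depth: int = 16) -> List[str]:
    """Stepwise tree search: expand partial chains, keep the top PRM-scored beams."""
    beams: List[Tuple[float, List[str]]] = [(0.0, [])]
    for _ in range(max_depth):
        candidates: List[Tuple[float, List[str]]] = []
        for score, chain in beams:
            if chain and is_final(chain[-1]):
                candidates.append((score, chain))        # finished chains carry over
                continue
            for step in propose_steps(problem, chain):   # candidate next steps
                new_chain = chain + [step]
                candidates.append((prm_score(problem, new_chain), new_chain))
        if not candidates:                               # generator exhausted; keep last beams
            break
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(chain and is_final(chain[-1]) for _, chain in beams):
            break
    return max(beams, key=lambda c: c[0])[1]             # best-scoring complete chain

Scoring and pruning at every step, rather than only at the final answer, is what lets the search discard branches that wander onto distractor quantities before they are completed.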

Evaluation Framework

How we evaluate reasoning under distractions (a scoring sketch follows the three metric cards)

Step Accuracy (SAcc)

Requires every step to be correct, in topological order; measures arithmetic validity at each reasoning step.

Arithmetic Precision
🧩

Path Accuracy (PAcc)

Requires the model to follow the gold reasoning path without being diverted by distractors; evaluates the robustness of reasoning path selection.

Path Selection
〰️

Extraction Accuracy (EAcc)

Checks only the final answer, tolerating redundant steps; contrasting it with SAcc and PAcc reveals where reasoning fails.

Final Answer
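
A minimal sketch of how the three metrics could be scored for a single problem, assuming predictions have already been parsed into (quantity, value) pairs and the gold path is known; the parsing and matching rules below are illustrative, not the benchmark's exact scorer.

def extraction_accuracy(pred_steps, gold_steps):
    """EAcc: only the final answer must match; redundant steps are tolerated."""
    return int(bool(pred_steps) and pred_steps[-1] == gold_steps[-1])

def path_accuracy(pred_steps, gold_steps):
    """PAcc: the chain resolves exactly the gold quantities, in order, with no distractors."""
    return int([q for q, _ in pred_steps] == [q for q, _ in gold_steps])

def step_accuracy(pred_steps, gold_steps):
    """SAcc: strictest; the gold path must be followed with every value correct."""
    return int(pred_steps == gold_steps)

gold = [("q1", 7), ("q2", 12), ("q3", 9)]
pred = [("q1", 7), ("d0", 3), ("q2", 12), ("q3", 9)]   # detours through a distractor
print(step_accuracy(pred, gold), path_accuracy(pred, gold),
      extraction_accuracy(pred, gold))                 # -> 0 0 1

The toy example already shows why the final answer alone can be misleading: it is correct even though the model wandered through a distractor, which is exactly the gap SAcc and PAcc expose.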

Universal Laws from Controlled Experiments

Seven robust findings distilled from IC-controlled evaluations (GSM-DC)

0

LLM reasoning is easily distracted by irrelevant context

Even small IC shifts attention and induces errors on simple problems.

1

Performance decreases monotonically with IC intensity

Error rises as distractors increase; GSM-DC shows consistent drops across models and reasoning-step counts (rs).

2

Degradation steepens at greater reasoning depth

Error roughly follows a power law, E(m; rs) ∝ m^δ(rs), with δ growing with rs (deeper chains amplify distraction); a fitting sketch appears below this list.

3

Continued pretraining enhances robustness even without IC

Full fine-tuning (continued pretraining) outperforms LoRA on clean data across reasoning-step counts (rs).

4

Training with IC is the most effective route to robustness

Exposure teaches models to down-weight distractors (higher SAcc/PAcc).

5

Hard IC during training ⇒ strongest OOD generalization

HARD-IC outperforms CLEAN / LIGHT / MEDIUM / MIX across ID & OOD; variety helps less than difficulty.

6

PRM-guided Tree Search boosts reliability on hard IC & OOD

Stepwise ToT + PRM improves OOD SAcc (reported up to ~6.29%) while preserving ID accuracy.

Note: EAcc can overestimate robustness; SAcc and PAcc better capture distraction-free reasoning.
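
For finding 2, the exponent δ(rs) can be estimated from measured error rates with an ordinary least-squares fit in log-log space, as in the minimal sketch below; the error values are made-up placeholders, not the paper's numbers.

import math

def fit_power_law_exponent(m_values, error_rates):
    """Fit E ∝ m^δ by least squares on log E versus log m; returns the slope δ."""
    xs = [math.log(m) for m in m_values]
    ys = [math.log(e) for e in error_rates]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Placeholder error rates at increasing m for one fixed rs setting (illustrative only).
m_values = [2, 4, 8, 16]
error_rates = [0.05, 0.11, 0.24, 0.50]
print(round(fit_power_law_exponent(m_values, error_rates), 2))

Repeating the fit at each rs level traces out δ(rs); finding 2 says this exponent grows as reasoning chains get deeper.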

Training Insights

Discovering the path to robust reasoning

⚠️

Clean-only Training

Training only on clean data does NOT provide robustness against irrelevant context distractions.

Insufficient

Challenging IC Training

Training with challenging irrelevant context leads to the strongest robustness and out-of-distribution generalization (a data-mix sketch follows these cards).

Optimal
📈

Pretraining > LoRA

Continued pretraining with IC is more effective than LoRA fine-tuning for reasoning robustness.

Superior
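
A minimal sketch of how the IC-intensity training regimes named above (CLEAN, LIGHT, MEDIUM, HARD, MIX) could be assembled on top of a problem generator such as the pipeline sketch earlier; the distractor budgets per regime and the make_problem interface are illustrative placeholders, not the paper's settings.

import random

# Illustrative distractor budgets per training regime; not the paper's settings.
REGIMES = {"CLEAN": [0], "LIGHT": [1, 2], "MEDIUM": [3, 4], "HARD": [6, 8]}
REGIMES["MIX"] = sorted({n for counts in REGIMES.values() for n in counts})

def build_training_mix(regime, num_problems, make_problem, seed=0):
    """Sample problems whose distractor count is drawn from the regime's budget.

    make_problem(num_distractors, seed) is a hypothetical generator, e.g. a wrapper
    around the dependency-graph pipeline sketch, returning one realized problem.
    """
    rng = random.Random(seed)
    budget = REGIMES[regime]
    return [make_problem(rng.choice(budget), rng.randint(0, 10**9))
            for _ in range(num_problems)]

Under this framing, the "Challenging IC Training" card corresponds to continued pretraining on the HARD mix, the regime reported above as generalizing best out of distribution.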

Paper Poster

BibTeX

@misc{yang2025llmreasoningdistractedirrelevant,
      title={How Is LLM Reasoning Distracted by Irrelevant Context? An Analysis Using a Controlled Benchmark}, 
      author={Minglai Yang and Ethan Huang and Liang Zhang and Mihai Surdeanu and William Wang and Liangming Pan},
      year={2025},
      eprint={2505.18761},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.18761}
      }