How Is LLM Reasoning Distracted by Irrelevant Context? An Analysis Using a Controlled Benchmark
We introduce Grade School Math with Distracting Context (GSM-DC), a synthetic benchmark for evaluating Large Language Models' (LLMs) reasoning robustness against systematically controlled irrelevant context (IC). GSM-DC constructs symbolic reasoning graphs with precise distractor injection, enabling rigorous and reproducible evaluation. Our experiments show that LLMs are highly sensitive to IC, which degrades both reasoning path selection and arithmetic accuracy. Moreover, training models with strong distractors improves performance in both in-distribution and out-of-distribution scenarios. We further propose a stepwise tree search guided by a process reward model, which notably enhances robustness in out-of-distribution conditions.
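For intuition, here is a minimal sketch of how a GSM-DC-style problem could be built: a small dependency DAG of quantities defines the gold reasoning chain, and extra off-path quantities are injected as irrelevant context. This is an illustration under assumed conventions, not the released generator; names such as `build_chain` and `inject_distractors` are hypothetical.

```python
# Illustrative sketch of a GSM-DC-style generator (assumed names, not the official code).
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str                                      # quantity name, e.g. "q3"
    value: int                                     # resolved numeric value
    parents: list = field(default_factory=list)    # quantities this one depends on

def build_chain(depth: int, rng: random.Random) -> dict:
    """Gold reasoning chain: q_i depends on q_{i-1} (a simple path through the DAG)."""
    nodes = {"q0": Node("q0", rng.randint(1, 9))}
    for i in range(1, depth + 1):
        prev = nodes[f"q{i-1}"]
        nodes[f"q{i}"] = Node(f"q{i}", prev.value + rng.randint(1, 9), parents=[prev.name])
    return nodes

def inject_distractors(nodes: dict, k: int, rng: random.Random) -> list:
    """Add k off-path quantities that mention on-path names but never feed the answer."""
    distractors = []
    for j in range(k):
        anchor = rng.choice(list(nodes))           # reuse an on-path name so it looks relevant
        distractors.append(Node(f"d{j}", rng.randint(1, 9), parents=[anchor]))
    return distractors

rng = random.Random(0)
gold = build_chain(depth=4, rng=rng)
noise = inject_distractors(gold, k=3, rng=rng)
print("answer:", gold["q4"].value, "| gold steps:", len(gold) - 1, "| distractors:", len(noise))
```

Because the gold chain and the injected noise are generated separately, both the reasoning depth and the amount of irrelevant context can be varied independently and reproducibly.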
“Design benchmarks that measure how models behave, not just task performance… The goal isn’t to chase SOTA but to interrogate the unknown.”
→ We focus on the fundamental question: How do LLMs maintain reasoning capabilities under irrelevant context? Rather than designing benchmarks solely for performance, we aim to understand the underlying mechanisms at play.
GSM-DC enables systematic evaluation of LLM reasoning robustness through symbolic dependency graphs with precise distractor injection.
We reveal how irrelevant context affects both reasoning path selection and arithmetic accuracy across various complexity levels.
Our stepwise tree search with Process Reward Model guidance improves out-of-distribution performance by up to 6.29%.
How we evaluate reasoning under distractions
Step Accuracy (SAcc): every step correct in topological order - measures arithmetic validity at each reasoning step.
Path Accuracy (PAcc): follows the gold path without being diverted by distractors - evaluates the robustness of reasoning path selection.
Answer accuracy: evaluates only the final answer - allows redundant steps and shows where reasoning ultimately fails.
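A rough sketch of how these three checks could be computed from a model's extracted reasoning steps; the step representation (target variable, value) and the exact pass/fail rules are assumptions for illustration, not the benchmark's official evaluation code.

```python
# Hedged sketch of the three metrics over extracted steps, each step = (variable, value).
def step_accuracy(pred_steps, gold_steps) -> bool:
    """SAcc: every gold step reproduced with correct arithmetic, in topological order."""
    return pred_steps == gold_steps

def path_accuracy(pred_steps, gold_steps) -> bool:
    """PAcc: visits exactly the gold quantities in order (arithmetic aside),
    i.e. the model is not lured onto distractor quantities."""
    return [v for v, _ in pred_steps] == [v for v, _ in gold_steps]

def answer_accuracy(pred_steps, gold_answer) -> bool:
    """Final-answer check: redundant steps are allowed; only the last value matters."""
    return bool(pred_steps) and pred_steps[-1][1] == gold_answer

gold = [("q1", 5), ("q2", 8), ("q3", 12)]
pred = [("q1", 5), ("d0", 3), ("q2", 8), ("q3", 12)]   # detours through a distractor
print(step_accuracy(pred, gold), path_accuracy(pred, gold), answer_accuracy(pred, 12))
# -> False False True : the detour breaks SAcc and PAcc but the final answer is still right
```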
Seven robust findings distilled from IC-controlled evaluations (GSM-DC)
Even small amounts of IC shift the model's attention and induce errors on otherwise simple problems.
Error rises as the number of distractors increases; GSM-DC shows consistent accuracy drops across models and distractor levels (rs).
Error roughly follows a power law E(m; rs) ∝ m^δ(rs), with the exponent δ(rs) growing with rs (deeper reasoning chains amplify distraction).
Full fine-tuning (continued pretraining) outperforms LoRA when trained on clean data, across all distractor levels (rs).
Exposure to distractors during training teaches models to down-weight them (higher SAcc/PAcc).
Training on HARD-IC outperforms CLEAN / LIGHT / MEDIUM / MIX across both ID and OOD settings; distractor variety helps less than distractor difficulty.
Stepwise tree search (ToT) guided by a process reward model (PRM) improves OOD SAcc by up to 6.29% while preserving ID accuracy (sketched below).
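To make the last finding concrete, here is a minimal beam-style sketch of PRM-guided stepwise search as we understand it: the model proposes several candidate next steps, a process reward model scores each partial solution, and only the top-scoring beams are expanded. The `propose_steps` and `prm_score` callables are hypothetical placeholders for a generator LLM and a trained PRM, not real APIs, and this is a sketch rather than the paper's exact procedure.

```python
# Minimal sketch of stepwise tree search guided by a process reward model (PRM).
from typing import Callable, List, Tuple

def prm_guided_search(
    question: str,
    propose_steps: Callable[[str, List[str]], List[str]],   # (question, steps so far) -> candidate next steps
    prm_score: Callable[[str, List[str]], float],            # (question, steps so far) -> process reward
    beam_width: int = 4,
    branch: int = 4,
    max_depth: int = 12,
    is_final: Callable[[str], bool] = lambda s: "answer" in s.lower(),  # placeholder stop check
) -> List[str]:
    beams: List[Tuple[float, List[str]]] = [(0.0, [])]
    for _ in range(max_depth):
        candidates: List[Tuple[float, List[str]]] = []
        for _, steps in beams:
            if steps and is_final(steps[-1]):
                candidates.append((prm_score(question, steps), steps))   # keep finished beams
                continue
            for step in propose_steps(question, steps)[:branch]:
                new_steps = steps + [step]
                candidates.append((prm_score(question, new_steps), new_steps))
        if not candidates:
            break
        # keep only the highest-reward partial solutions
        beams = sorted(candidates, key=lambda x: x[0], reverse=True)[:beam_width]
        if all(steps and is_final(steps[-1]) for _, steps in beams):
            break
    return beams[0][1]   # best-scored reasoning chain
```

Scoring partial chains step by step is what lets the search prune branches that have wandered onto distractor quantities before they reach a final answer.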
Dataset construction, key findings, and insights from our controlled experiments
Discovering the path to robust reasoning
Training only on clean data does NOT provide robustness against irrelevant context distractions.
Training with challenging irrelevant context leads to the strongest robustness and out-of-distribution generalization.
Continued pretraining with IC is more effective than LoRA fine-tuning for reasoning robustness.
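As a rough illustration of the training-side finding (not the authors' training pipeline), one way to build IC-controlled fine-tuning mixtures is to hold the reasoning problems fixed and vary only how much irrelevant context is interleaved. The level names mirror the CLEAN / LIGHT / MEDIUM / HARD splits above; the distractor counts, field names, and helpers below are assumptions.

```python
# Hypothetical sketch: CLEAN / LIGHT / MEDIUM / HARD training splits that differ only
# in the amount of irrelevant context injected per problem (counts are assumed).
import random

LEVELS = {"clean": 0, "light": 2, "medium": 4, "hard": 8}

def render_example(problem: dict, n_distractors: int, rng: random.Random) -> str:
    """Interleave the gold problem sentences with n_distractors irrelevant sentences."""
    sentences = list(problem["gold_sentences"])
    pool = problem["distractor_pool"]
    for s in rng.sample(pool, k=min(n_distractors, len(pool))):
        sentences.insert(rng.randrange(len(sentences) + 1), s)
    return " ".join(sentences) + f"\nSolution: {problem['solution']}"

def build_split(problems: list, level: str, seed: int = 0) -> list:
    rng = random.Random(seed)
    return [render_example(p, LEVELS[level], rng) for p in problems]
```

Keeping the gold chains identical across splits isolates distractor difficulty as the only training variable, which is what makes the CLEAN-vs-HARD-IC comparison above meaningful.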
@misc{yang2025llmreasoningdistractedirrelevant,
title={How Is LLM Reasoning Distracted by Irrelevant Context? An Analysis Using a Controlled Benchmark},
author={Minglai Yang and Ethan Huang and Liang Zhang and Mihai Surdeanu and William Wang and Liangming Pan},
year={2025},
eprint={2505.18761},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.18761}
}