Base Models Know How to Reason,
Thinking Models Learn When

Constantin Venhoff *1, Iván Arcuschin *2, Philip Torr 1, Arthur Conmy, and Neel Nanda

* Equal Contribution
1 University of Oxford, UK    2 University of Buenos Aires, Argentina

Why do "thinking" language models like DeepSeek R1 outperform their base counterparts? Despite consistent performance gains, the underlying mechanism, i.e., what differs between base and thinking models, remains unclear.

In this work, we show that, by deriving mechanism-specific steering vectors and activating the right vector at the right time in base models, we can elicit thinking-model-level reasoning chains. To ground our analysis, we introduce an unsupervised, bottom-up approach for uncovering human-interpretable reasoning mechanisms in thinking models. This approach provides a principled way to discover reasoning mechanisms without imposing manual or LLM-derived assumptions about which mechanisms should exist, reducing bias.

Across three base and four thinking models, on GSM8K and MATH500, our combined approach substantially lifts base-model accuracy, recovering up to 91% of the gap to thinking models without any weight updates. This suggests that reinforcement learning with verifiable rewards mainly teaches thinking models when to fire pre-trained skills, rather than how to execute them, enabling the model to productively use its inference-time compute. Our work reframes our understanding of the performance gains of thinking models and may shed light on why distillation and reinforcement learning are so effective for teaching reasoning to small LLMs.

Mechanistic Interpretability Workshop at NeurIPS 2025

Introduction

Large Language Models (LLMs) have recently demonstrated remarkable capabilities in reasoning tasks when given additional inference time to think through problems step-by-step. Thinking models, also known as reasoning models or inference-time-compute models, are language models designed to generate long chains of reasoning before arriving at a final answer. Examples include Anthropic's Claude 3.7 Sonnet, OpenAI's o3, Google's Gemini 2.5 Pro, DeepSeek's R1, and Qwen's QwQ-32B.

All these thinking models significantly outperform their base counterparts on challenging reasoning benchmarks. However, a fundamental question remains: What is the difference between base and thinking models that allows the latter to achieve superior performance?

Prior work has suggested several hypotheses: (1) thinking models acquire entirely new reasoning capabilities through specialized training; (2) reinforcement learning (RL) teaches them to structure their reasoning more effectively; (3) RL teaches them to repurpose pre-existing base model representations for new mechanisms; or (4) the additional inference time simply allows more computation to be applied to difficult problems. In this paper, we present evidence for a more nuanced explanation: base models already possess the fundamental reasoning capabilities, and thinking models learn when to deploy these capabilities in a structured sequence.

Key Contributions

To support our analysis, we develop an unsupervised clustering methodology to derive an interpretable taxonomy of reasoning mechanisms that thinking models employ during their chains of thought, forming the building blocks of complex problem-solving. We demonstrate that base models can perform each reasoning behavior when appropriate steering vectors are added to their activations. By identifying and applying the right vector at the right step, we can guide pretrained base models to reproduce complete reasoning chains of thinking models.

Unsupervised Taxonomy of Reasoning Mechanisms

Recent work on thinking models has primarily relied on manual inspection of reasoning traces to identify underlying mechanisms. While insightful, such approaches are inherently subjective and may overlook subtle patterns. We develop an unsupervised, bottom-up methodology to discover human-interpretable reasoning mechanisms in thinking models. Our goal is to construct a taxonomy that is: (1) interpretable, with each mechanism having a clear description of its cognitive function; (2) complete, covering the full range of reasoning steps the model uses; and (3) independent, with categories corresponding to distinct cognitive functions with minimal overlap.

We use Top-K Sparse Autoencoders (SAEs) to cluster sentence-level activations from reasoning traces. We train our SAEs on sentence-level activations from 12,102 MMLU-Pro prompts (430,122 sentences total). The SAE configuration maps directly onto our hypotheses: the dictionary size corresponds to the number of distinct reasoning mechanisms we hypothesize exist, while the parameter K constrains how many mechanisms can be simultaneously active in a single sentence. We deliberately restrict the latent dimension to the range [5, 50], far smaller than the residual stream size, forcing the SAE to identify the most fundamental dimensions of reasoning variation.
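As a concrete illustration, the sketch below shows a minimal Top-K SAE of the kind described above, trained with a plain reconstruction loss on sentence-level activations. All hyperparameters (d_model, dictionary size, K, learning rate) and the synthetic data are placeholders, not our exact configuration.

```python
# Minimal Top-K SAE sketch (placeholder hyperparameters, synthetic data).
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

class TopKSAE(nn.Module):
    def __init__(self, d_model: int, dict_size: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, dict_size)
        self.decoder = nn.Linear(dict_size, d_model)

    def forward(self, x):
        z = F.relu(self.encoder(x))
        # Keep only the K largest latents per sentence; zero out the rest.
        topk = torch.topk(z, self.k, dim=-1)
        z_sparse = torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)
        return self.decoder(z_sparse), z_sparse

# Stand-in for real sentence-level residual-stream activations.
acts = torch.randn(4_096, 2_048)                  # [n_sentences, d_model]
loader = DataLoader(TensorDataset(acts), batch_size=256, shuffle=True)

sae = TopKSAE(d_model=2_048, dict_size=20, k=3)   # dictionary size within [5, 50]
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

for (batch,) in loader:
    recon, _ = sae(batch)
    loss = F.mse_loss(recon, batch)               # plain reconstruction objective
    opt.zero_grad(); loss.backward(); opt.step()
```

Each latent then corresponds to one candidate reasoning mechanism, and the sentences that activate it most strongly serve as exemplars for labeling, as described next.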

For each discovered cluster, we use an LLM to analyze top exemplar sentences and generate interpretable category titles and descriptions. We evaluate the proposed taxonomies using three metrics: consistency (how well sentences can be classified into categories), completeness (confidence in classification), and independence (semantic distinctness between categories). Our grid search across five models reveals that reasoning mechanisms are well-represented using 10-20 categories.
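For intuition, the snippet below sketches one plausible way to operationalize the independence metric: embed each category description and report one minus the mean pairwise cosine similarity. This is an illustrative proxy, not necessarily our exact implementation, and the example descriptions are invented.

```python
# Illustrative independence score over category descriptions.
from itertools import combinations
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def independence(descriptions: list[str]) -> float:
    # Higher is better: 1 minus mean pairwise cosine similarity of description embeddings.
    emb = encoder.encode(descriptions, normalize_embeddings=True)
    sims = [float(np.dot(emb[i], emb[j])) for i, j in combinations(range(len(emb)), 2)]
    return 1.0 - float(np.mean(sims))

print(independence([
    "Restates the problem and identifies given quantities",
    "Performs an arithmetic computation step",
    "Checks an intermediate result for consistency",
]))
```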

Hybrid Model Architecture

To investigate whether base models already possess reasoning mechanisms, we propose a hybrid approach that combines base models with steering vectors—directions in activation space that, when added to intermediate activations, induce target behaviors. We leverage the reasoning taxonomies from our SAE analysis to identify steering vectors corresponding to each reasoning mechanism. Since SAEs identify variance-explaining rather than causally important directions, we use steering vector optimization to find the causal directions.

For each reasoning category, we: (1) identify sentences with top activation scores for that category, (2) extract prefix-completion pairs, and (3) optimize a steering vector that, when added to the base model's activations, minimizes its prediction loss on the thinking model's completion while lowering the likelihood of the base model's own completion. We select layers at ~37% of model depth (shown to be most causal in prior work) and train steering vectors at all token positions. Additionally, we train a general bias vector on randomly sampled thinking rollouts to capture general patterns such as first-person phrasing, which we apply frozen alongside the category-specific vectors.
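As a rough picture of step (3), the sketch below optimizes a single steering vector against a frozen Hugging Face-style causal LM. The contrastive loss form, learning rate, and hook placement are simplifying assumptions, and `pairs` is a hypothetical iterator over prefix-completion pairs.

```python
import torch

def train_steering_vector(base_model, pairs, layer_frac=0.37, lr=1e-3):
    """Optimize one category's steering vector on a frozen causal LM.

    Assumes a Hugging Face-style model; `pairs` yields tensors of
    (prefix_ids, thinking_completion_ids, base_completion_ids).
    """
    base_model.requires_grad_(False)                   # keep the base model frozen
    d_model = base_model.config.hidden_size
    layer_idx = int(layer_frac * base_model.config.num_hidden_layers)  # ~37% of depth
    steer = torch.nn.Parameter(torch.zeros(d_model))
    opt = torch.optim.Adam([steer], lr=lr)

    def add_steer(module, inputs, output):
        # Add the steering vector to the residual stream at every token position.
        if isinstance(output, tuple):
            return (output[0] + steer,) + output[1:]
        return output + steer

    handle = base_model.model.layers[layer_idx].register_forward_hook(add_steer)

    def completion_nll(prefix_ids, completion_ids):
        # Negative log-likelihood of the completion given the prefix.
        ids = torch.cat([prefix_ids, completion_ids], dim=-1)
        labels = ids.clone()
        labels[:, : prefix_ids.shape[-1]] = -100       # ignore prefix tokens in the loss
        return base_model(input_ids=ids, labels=labels).loss

    for prefix_ids, think_ids, base_ids in pairs:
        # Pull the steered base model toward the thinking model's completion
        # and away from its own original completion.
        loss = completion_nll(prefix_ids, think_ids) - completion_nll(prefix_ids, base_ids)
        opt.zero_grad(); loss.backward(); opt.step()

    handle.remove()
    return steer.detach()
```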

During generation, our hybrid model computes SAE activations at each token position to identify the most active reasoning category, then applies the corresponding steering vector. To calibrate steering strength, we test multiple coefficients and steering windows (the number of tokens before the current position over which steering is applied), and select the steered token with the lowest perplexity under the thinking model. This keeps the generated rollout in-distribution for the thinking model.
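The decoding loop can be pictured roughly as follows. The callables passed in (`classify_category`, `steered_next_token`, `thinking_ppl`) are hypothetical placeholders for the SAE classifier, hooked base-model decoding, and thinking-model scoring, and the coefficient and window grids are invented.

```python
# Rough shape of one hybrid decoding step (hypothetical helper callables).
import itertools

COEFFS = [2.0, 4.0, 8.0]   # assumed grid of steering coefficients
WINDOWS = [4, 8, 16]       # assumed steering windows (tokens before the current position)

def hybrid_step(rollout_ids, steering_vectors, bias_vector,
                classify_category, steered_next_token, thinking_ppl):
    # 1. Detect which reasoning mechanism should fire next on the current rollout.
    category = classify_category(rollout_ids)
    vector = steering_vectors[category] + bias_vector   # frozen general bias

    # 2. Generate a candidate next token for each (coefficient, window) pair and
    #    keep the one the thinking model finds least surprising.
    best_ppl, best_token = float("inf"), None
    for coeff, window in itertools.product(COEFFS, WINDOWS):
        token = steered_next_token(rollout_ids, coeff * vector, window)
        ppl = thinking_ppl(rollout_ids, token)
        if ppl < best_ppl:
            best_ppl, best_token = ppl, token
    return best_token
```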

Importantly, this approach requires no parameter updates to the base model. The effectiveness with only 15 distinct steering vectors rules out the explanation that steering simply biases toward specific output tokens—there is insufficient information to generate appropriate outputs across hundreds of diverse problems through token-level manipulation alone. Instead, our results suggest steering activates latent reasoning modes within the base model.
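A quick back-of-envelope count makes the information argument concrete; the hidden size used here assumes a Qwen2.5-32B-scale model and is only indicative.

```python
# Rough capacity of the intervention (hidden size assumed, for illustration only).
d_model = 5_120                # assumed residual-stream width
n_vectors = 15 + 1             # category-specific vectors + general bias
print(n_vectors * d_model)     # 81,920 floats in total
# Far too few degrees of freedom to encode token-level answers to hundreds of
# diverse problems, supporting the "latent reasoning mode" reading.
```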

Hybrid Models Unlock Reasoning Model-Level Behavior with Minimal Intervention. Overview of our approach for steering base language models to reason like thinking models. (A) We use the base model as the primary generator of tokens in the output sequence. (B) For each token position, we evaluate the current rollout in a target thinking model and use a "thinking model activation classifier" to detect the reasoning mechanism that should be applied next. (C) When the classifier detects a reasoning step, we apply a corresponding steering vector to the base model's activations, triggering structured reasoning behavior. This approach shows that base models already possess latent reasoning abilities, and that these can be reliably activated without any parameter updates, bridging much of the gap to full reasoning models with minimal extra machinery.

Results

We evaluate our steered base models against thinking models spanning diverse architectures and parameter scales, including models trained via distillation (the DeepSeek-R1-Distill series) and models trained directly with reinforcement learning with verifiable rewards (RLVR), such as QwQ-32B. The results show that our approach substantially lifts base-model performance, recovering up to 91% of the gap between base and thinking models on the GSM8K and MATH500 benchmarks without any weight updates.

This finding provides strong evidence that the RLVR training used to produce thinking models primarily teaches them when to activate pre-existing skills rather than how to execute those skills. This perspective has direct implications for more efficient training of reasoning in future language models.

Ablation Studies

To assess the hybrid model's components, we ablate three factors: the specificity of the learned steering vectors, the timing of their application, and the contribution of the bias vector. We run these ablations on Qwen2.5-32B as the base model with QwQ-32B (RLVR) as the thinking model on MATH500, the setting with the strongest hybrid performance (91% gap recovery). We compare three ablations against the full hybrid model: Only-bias (apply only the general bias vector, with no category-specific steering), Random-firing (apply the learned vectors at random times rather than when their category fires), and Random-vectors (apply random directions in place of the learned vectors, with the correct timing).

The results show that Only-bias achieves 76.8%, indicating that the bias helps but category-specific steering vectors are necessary; Random-firing reaches 77.8%, showing that correct timing of activation is crucial; and Random-vectors achieves 77.2%, confirming that the learned directions are category-specific rather than generic. Together, these findings indicate that the hybrid model's effectiveness comes from specific learned directions applied with correct, category-timed activation, consistent with our claim that thinking models primarily learn when to deploy reasoning mechanisms.

Citation

If you find our work useful, please consider citing:

@inproceedings{venhoff2025thinking,
  title={Base Models Know How to Reason, Thinking Models Learn When},
  author={Venhoff, Constantin and Arcuschin, Iv{\'a}n and Torr, Philip and Conmy, Arthur and Nanda, Neel},
  booktitle={Mechanistic Interpretability Workshop at NeurIPS},
  year={2025}
}