LLM Reasoning
Inducing System 2 thinking. From in-context learning to chain-of-thought and learned tool-use architectures.
- Step 1: GPT-3 showed that a 175B-parameter autoregressive Transformer can perform many tasks from natural-language instructions and a few demonstrations in the prompt, without gradient updates or task-specific fine-tuning. That result made in-context learning a central paradigm and showed that scale alone could unlock strong few-shot behavior.
- Step 2: In-context learning is the ability of a model to adapt its behavior from instructions or examples placed in the prompt, without changing its weights. The model remains frozen; the adaptation happens within the forward pass through pattern recognition over the context.
- Step 3: Chain of thought is a prompting strategy that elicits intermediate reasoning steps before the final answer. It often improves performance on multi-step tasks because the model can use the generated text as an external scratchpad rather than compressing all reasoning into a single token prediction.
- Step 4: Chain-of-thought monitorability is the safety claim that when a model needs explicit reasoning to complete a task, its written chain of thought can be monitored for harmful intent or deception. The key property is monitorability rather than perfect faithfulness: hiding the reasoning tends to become harder when the reasoning itself is load-bearing for success.
- Step 5: Toolformer trains a language model to decide when to call external tools and what arguments to pass, without requiring human demonstrations of tool use. It keeps a tool call only when the returned result improves the continuation, turning tool use into a self-supervised learning signal.
- Step 6: BFCL (Berkeley Function Calling Leaderboard, 2024) and τ-bench (2024) evaluate LLMs' ability to select, parameterise, and sequence API calls. BFCL measures single-turn function-call accuracy; τ-bench tests multi-turn dialogue in simulated customer-service environments with realistic state and policy constraints.
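The mechanics of Steps 2–3 are easy to see in prompt construction: the model's weights never change, so all the "learning" lives in the demonstrations placed before the new question. Below is a minimal sketch; the function name `build_cot_prompt` and the example problems are illustrative (the tennis-ball demonstration is the classic one from the chain-of-thought literature), not a specific library's API.

```python
# Few-shot chain-of-thought prompt construction: worked demonstrations,
# each containing intermediate reasoning, followed by the new question.
# The frozen model imitates the reasoning pattern it sees in context.
COT_DEMOS = [
    (
        "Roger has 5 tennis balls. He buys 2 more cans of 3 balls each. "
        "How many balls does he have now?",
        "Roger starts with 5 balls. 2 cans of 3 balls is 6 balls. "
        "5 + 6 = 11. The answer is 11.",
    ),
]

def build_cot_prompt(question: str, demos=COT_DEMOS) -> str:
    """Concatenate Q/A demonstrations, then the new question.

    The written-out reasoning in each demo is what elicits step-by-step
    output; removing it gives an ordinary few-shot (non-CoT) prompt.
    """
    parts = [f"Q: {q}\nA: {a}" for q, a in demos]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

prompt = build_cot_prompt(
    "A cafeteria had 23 apples. They used 20 and bought 6 more. "
    "How many apples do they have?"
)
```

The prompt ends at `A:`, so the model's continuation begins with its own reasoning chain rather than a bare answer.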
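Step 5's filtering rule can be sketched in a few lines: a candidate tool call is kept only if conditioning on its result lowers the loss of the following text by more than a threshold. This is a toy sketch, not Toolformer's implementation — the real method uses the language model's own per-token losses, whereas `toy_loss` here is an invented surrogate for demonstration.

```python
def filter_tool_calls(candidates, loss_fn, tau=1.0):
    """Keep (call, result) pairs whose result helps predict the continuation.

    Toolformer-style criterion: retain a call only when
    loss_without_result - loss_with_result > tau.
    """
    kept = []
    for call, result, continuation in candidates:
        loss_with = loss_fn(continuation, context=result)
        loss_without = loss_fn(continuation, context=None)
        if loss_without - loss_with > tau:
            kept.append((call, result))
    return kept

def toy_loss(continuation, context):
    # Invented surrogate for a real LM loss: cheaper to "predict" the
    # continuation when one of its tokens appears in the tool result.
    if context is not None and any(tok in context for tok in continuation.split()):
        return 1.0
    return 5.0

candidates = [
    ("Calculator(400/1400)", "0.29", "is 0.29 of the total"),
    ("Calculator(1+1)", "2", "the weather was sunny"),
]
kept = filter_tool_calls(candidates, toy_loss, tau=1.0)
# Only the first call survives: its result actually helps the continuation.
```

Calls that survive the filter become training data, which is what makes the signal self-supervised: usefulness is measured by the model's own loss, not by human labels.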