Reinforcement Learning Fundamentals

Reinforcement learning (RL) trains agents to take actions in environments to maximize cumulative reward. Unlike supervised learning, there are no labels, only feedback in the form of rewards.

RL has produced spectacular successes (game playing, robotics) and quiet successes (recommendation systems, ad bidding) but is harder to apply than supervised learning.

The setup

Components:

Agent: makes decisions
Environment: the world the agent acts in; holds the current state
Action: what the agent does
Reward: feedback signal
Policy: agent's decision rule
Value function: expected cumulative future reward

Loop:

Observe state
Choose action (according to policy)
Environment transitions to new state, returns reward
Update policy/value function
Repeat
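
The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the Corridor environment, its reset/step interface, and both policies are hypothetical names made up for this example.

```python
import random


class Corridor:
    """Hypothetical toy environment: cells 0..4, goal at the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action 1 moves right, anything else moves left; walls clamp the move
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    state = env.reset()                         # observe state
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # choose action according to policy
        state, reward, done = env.step(action)  # environment transitions, returns reward
        total_reward += reward
        # a learning agent would update its policy/value function here
        if done:
            break
    return total_reward


rng = random.Random(0)
random_policy = lambda state: rng.choice([0, 1])
always_right = lambda state: 1

print(run_episode(Corridor(), always_right))  # 1.0
print(run_episode(Corridor(), random_policy))
```

Neither policy learns anything here; the sketch only shows where observation, action, reward, and update sit in the loop.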

Markov Decision Processes (MDPs)

The mathematical framework:

State space S
Action space A
Transition function P(s'|s,a)
Reward function R(s,a,s')
Discount factor γ ∈ [0, 1)

Markov property: the next state and reward depend only on the current state and action, not on the history before them.

Many real problems aren't truly Markov but are approximated as such.
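
As a rough sketch, a small finite MDP can be written down as plain data. The two-state example below (state names, actions, probabilities, and rewards) is invented for illustration; only the structure mirrors the components listed above.

```python
import random

GAMMA = 0.9  # discount factor; an arbitrary illustrative value in [0, 1)

# MDP[s][a] is a list of (probability, next_state, reward) triples,
# i.e. P(s'|s,a) and R(s,a,s') stored together.
MDP = {
    "cool": {
        "slow": [(1.0, "cool", 1.0)],
        "fast": [(0.5, "cool", 2.0), (0.5, "hot", 2.0)],
    },
    "hot": {
        "slow": [(1.0, "cool", 0.0)],
        "fast": [(1.0, "hot", -1.0)],
    },
}


def is_valid(mdp):
    """Check that the outcome probabilities of every (s, a) pair sum to 1."""
    for actions in mdp.values():
        for outcomes in actions.values():
            if abs(sum(p for p, _, _ in outcomes) - 1.0) > 1e-9:
                return False
    return True


def sample_transition(mdp, state, action, rng):
    """Draw (next_state, reward) using only the current state and action."""
    outcomes = mdp[state][action]
    weights = [p for p, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward


rng = random.Random(0)
print(is_valid(MDP))
print(sample_transition(MDP, "cool", "fast", rng))
```

Note that sample_transition takes no history argument: that is the Markov property expressed in the function signature.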

The objective

Maximize expected discounted return: G = R₁ + γR₂ + γ²R₃ + ...

Discount factor γ trades immediate vs future reward. γ near 0: myopic. γ near 1: far-sighted.
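
To make the effect of γ concrete, here is a small sketch that computes G for a reward sequence. The function name and the example rewards are illustrative choices, not part of any library.

```python
def discounted_return(rewards, gamma):
    """Compute R1 + gamma*R2 + gamma^2*R3 + ... by folding from the end."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# A single reward of 1 that arrives on the fourth step.
delayed = [0.0, 0.0, 0.0, 1.0]

print(discounted_return(delayed, 0.1))   # about 0.001: a myopic agent barely values it
print(discounted_return(delayed, 0.99))  # about 0.970299: a far-sighted agent values it highly
```

Folding from the end avoids recomputing powers of γ and is the same recursion that the Bellman equation expresses for values.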

Value functions

State value V(s)

Expected return from state s, following policy π.

Action value Q(s, a)

Expected return from state s, taking action a, then following π.

The Bellman equation relates values across states: V(s) = E[R + γV(s')], where the expectation is over the action chosen by π and the resulting transition to s'.

This recursive structure underlies most RL algorithms.
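
One way to see the recursion at work is iterative policy evaluation: apply the Bellman equation as an update until the values stop changing. The sketch below assumes a made-up two-state process under a fixed policy; the states, probabilities, and rewards are illustrative only.

```python
GAMMA = 0.5

# Under a fixed policy, transitions[s] lists (probability, next_state, reward).
transitions = {
    "A": [(1.0, "B", 0.0)],
    "B": [(0.5, "A", 1.0), (0.5, "B", 1.0)],
}


def bellman_backup(V, s):
    """Right-hand side of the Bellman equation: E[R + gamma * V(s')]."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in transitions[s])


def evaluate_policy(tol=1e-10):
    V = {s: 0.0 for s in transitions}
    while True:
        new_V = {s: bellman_backup(V, s) for s in transitions}
        if max(abs(new_V[s] - V[s]) for s in V) < tol:
            return new_V
        V = new_V


V = evaluate_policy()
print(V)  # close to V(A) = 0.8, V(B) = 1.6
```

Solving the two equations by hand gives the same answer: V(A) = 0.5 V(B) and V(B) = 1 + 0.25 V(A) + 0.25 V(B), so V(B) = 1.6 and V(A) = 0.8.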

Exploration vs exploitation

The fundamental tension:

Exploit: take known-good actions
Explore: try new actions to learn

Pure exploitation gets stuck in local optima. Pure exploration never benefits from learning.

Strategies:

ε-greedy: random action with probability ε, else best
UCB: balance estimated value with uncertainty
Thompson sampling: sample value estimates from a posterior, act greedily on the sample
Entropy bonuses: reward exploration directly
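
The first two strategies can be sketched on a multi-armed bandit, the simplest setting with an exploration problem. The arm means, step count, ε, and function names below are arbitrary choices for illustration; the UCB rule shown is the common UCB1 form.

```python
import math
import random


def epsilon_greedy(q, epsilon, rng):
    if rng.random() < epsilon:
        return rng.randrange(len(q))              # explore: uniform random arm
    return max(range(len(q)), key=lambda a: q[a])  # exploit: best estimate


def ucb1(q, counts, t, c=2.0):
    for a, n in enumerate(counts):
        if n == 0:
            return a                               # try every arm once first
    # estimated value plus an uncertainty bonus that shrinks with more pulls
    return max(range(len(q)),
               key=lambda a: q[a] + math.sqrt(c * math.log(t) / counts[a]))


def run_bandit(select, true_means, steps=2000, seed=0):
    rng = random.Random(seed)
    q = [0.0] * len(true_means)   # running mean reward per arm
    n = [0] * len(true_means)     # pull count per arm
    for t in range(1, steps + 1):
        a = select(q, n, t, rng)
        reward = 1.0 if rng.random() < true_means[a] else 0.0
        n[a] += 1
        q[a] += (reward - q[a]) / n[a]  # incremental mean update
    return q, n


means = [0.2, 0.5, 0.8]
_, n_eps = run_bandit(lambda q, n, t, rng: epsilon_greedy(q, 0.1, rng), means)
_, n_ucb = run_bandit(lambda q, n, t, rng: ucb1(q, n, t), means)
print("epsilon-greedy pulls per arm:", n_eps)
print("UCB1 pulls per arm:", n_ucb)
```

Both strategies should end up pulling the best arm most often. The difference is where the exploratory pulls go: ε-greedy spreads them uniformly forever, while UCB directs them to arms whose value is still uncertain.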

Algorithm families

Value-based

Learn Q(s, a); act greedily.

Q-learning: classic; off-policy
DQN: Q-learning with neural networks; Atari breakthrough
Rainbow: DQN with many improvements
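
A minimal tabular Q-learning sketch follows, assuming a hypothetical five-cell corridor with a reward of 1 at the right end. The environment, hyperparameters, and names are illustrative; DQN replaces the table with a neural network and adds further machinery not shown here.

```python
import random

N_STATES, GOAL = 5, 4
ACTIONS = (0, 1)  # 0 = left, 1 = right


def step(state, action):
    next_state = min(max(state + (1 if action == 1 else -1), 0), N_STATES - 1)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done


def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore
            else:
                best = max(Q[state])          # exploit, breaking ties at random
                action = rng.choice([a for a in ACTIONS if Q[state][a] == best])
            next_state, reward, done = step(state, action)
            # Off-policy target: bootstrap from the best next action,
            # regardless of which action the behavior policy takes next.
            target = reward + (0.0 if done else gamma * max(Q[next_state]))
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q


Q = q_learning()
greedy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(GOAL)]
print(greedy)  # the learned greedy action in each non-goal cell
```

The max over next actions in the target is what makes the method off-policy: the agent explores with ε-greedy but learns the values of the greedy policy.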

Policy-based

Learn policy directly.

REINFORCE: vanilla policy gradient
PPO: proximal policy optimization; standard workhorse
TRPO: trust region policy optimization; PPO's more theoretically grounded predecessor
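
To show what learning the policy directly looks like, here is a sketch of REINFORCE on a two-armed bandit with a softmax policy. One-step episodes keep it short; the arm means, learning rate, and names are assumptions for illustration, and no baseline or other variance reduction is used.

```python
import math
import random


def softmax(prefs):
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(true_means=(0.2, 0.8), episodes=3000, lr=0.1, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * len(true_means)  # policy parameters: one preference per arm
    for _ in range(episodes):
        probs = softmax(theta)
        action = rng.choices(range(len(theta)), weights=probs, k=1)[0]
        ret = 1.0 if rng.random() < true_means[action] else 0.0  # episode return
        for k in range(len(theta)):
            # gradient of log pi(action) with respect to theta[k] for a softmax policy
            grad_log_pi = (1.0 if k == action else 0.0) - probs[k]
            theta[k] += lr * ret * grad_log_pi
    return softmax(theta)


probs = reinforce_bandit()
print(probs)  # probability mass should concentrate on the better arm
```

There is no value table anywhere: the update raises the log-probability of actions in proportion to the return that followed them.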

Actor-critic

Combine policy (actor) and value (critic).

A3C / A2C: parallel actor-critic
PPO: typically trained with a learned value function, so technically actor-critic
SAC: soft actor-critic; strong for continuous control

Model-based

Learn a model of the environment; plan with it.

MuZero: learns model and plans (AlphaZero descendant)
Dreamer: world models for high-dim observations

More sample efficient than model-free methods, but hard to do well: errors in the learned model compound during planning.

Modern practice

For most applied RL:

PPO: discrete or continuous, easy to tune, robust
SAC: continuous control, sample efficient
DQN variants: Atari-style discrete

Off-the-shelf libraries: Stable-Baselines3, RLlib, CleanRL.

Sample efficiency

RL needs many environment interactions. Often millions to billions.

In simulation: tractable. In the real world: often prohibitive.

Approaches:

Sim-to-real transfer (train in simulation, deploy in real)
Offline RL (learn from logged data)
Meta-learning (transfer across tasks)

Reward design

Probably the hardest part of RL.

Sparse rewards

Reward only at goal. Agent rarely succeeds; learning is slow.

Reward shaping

Intermediate rewards guide learning. Risky: can be exploited.
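
One well-known way to reduce the exploitation risk is potential-based shaping, where the bonus for a transition is γΦ(s') - Φ(s) for some potential function Φ. The sketch below uses a hypothetical potential (negative distance to a goal cell) and shows that the discounted bonus collected on the way to the goal is the same for a direct path and a path with a detour, so looping earns nothing extra.

```python
GAMMA = 0.9
GOAL = 4


def potential(state):
    """Hypothetical potential: negative distance to the goal, zero at the goal."""
    return -float(abs(GOAL - state))


def shaping_bonus(state, next_state):
    """Potential-based shaping term added to the environment reward."""
    return GAMMA * potential(next_state) - potential(state)


def discounted_bonus(path):
    """Discounted sum of shaping bonuses along a sequence of states."""
    steps = zip(path, path[1:])
    return sum(GAMMA ** t * shaping_bonus(s, s2) for t, (s, s2) in enumerate(steps))


direct = [0, 1, 2, 3, 4]
detour = [0, 1, 0, 1, 2, 3, 4]

print(discounted_bonus(direct))  # about 4.0
print(discounted_bonus(detour))  # about 4.0 as well: the detour gains nothing
```

The sums telescope to γ^T Φ(s_T) - Φ(s_0), which depends only on the start and end states. A hand-written bonus such as "+1 for every step toward the goal" has no such guarantee and can be farmed by stepping back and forth.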

Reward hacking

Agents find unintended ways to get reward. Common and creative.

Examples: agent that pauses game forever; cleaning robot that breaks plates to "clean" more.

Reward from human feedback (RLHF)

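Humans rate or compare model outputs; a reward model is trained to predict those judgments; the policy is then optimized against the reward model. This is the approach behind ChatGPT and similar assistants.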
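
As a sketch of the reward-model step, assume the human feedback arrives as pairwise comparisons (one output preferred over another). A common choice of loss in that case is the Bradley-Terry form, the negative log-sigmoid of the score difference. The function names and the example scores below are illustrative, and a real reward model would be a neural network producing these scores.

```python
import math


def preference_loss(r_chosen, r_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), computed in a numerically stable way."""
    d = r_chosen - r_rejected
    if d >= 0:
        return math.log1p(math.exp(-d))
    return -d + math.log1p(math.exp(d))


def batch_loss(score_pairs):
    """Average loss over (chosen_score, rejected_score) pairs."""
    return sum(preference_loss(c, r) for c, r in score_pairs) / len(score_pairs)


# Scores a hypothetical reward model assigned to the preferred and rejected outputs.
agrees_with_humans = [(2.0, 0.0), (1.5, -0.5)]
disagrees_with_humans = [(0.0, 2.0), (-0.5, 1.5)]

print(batch_loss(agrees_with_humans))     # low loss
print(batch_loss(disagrees_with_humans))  # high loss
```

Minimizing this loss pushes the reward model to score preferred outputs above rejected ones; only the difference between the two scores matters, not their absolute values.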

Stability

RL is notoriously unstable:

Hyperparameter-sensitive
Variance across random seeds
Hard to debug

Best practices:

Multiple seeds
Compare to baselines
Track many metrics
Fix bugs before tuning hyperparameters
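
The multiple-seeds practice can be as simple as the following sketch. Here train_and_evaluate is a hypothetical stand-in for a full training run, and the scores it returns are synthetic noise, not results from any real experiment.

```python
import random
import statistics


def train_and_evaluate(seed):
    """Hypothetical placeholder: a real version would train an agent with this seed
    and return its final evaluation score."""
    rng = random.Random(seed)
    return 100.0 + rng.gauss(0.0, 15.0)


seeds = [0, 1, 2, 3, 4]
scores = [train_and_evaluate(s) for s in seeds]

mean = statistics.mean(scores)
std = statistics.stdev(scores)  # sample standard deviation across seeds
print(f"score: {mean:.1f} +/- {std:.1f} over {len(seeds)} seeds")
```

Reporting the spread alongside the mean makes it possible to tell a real improvement from seed-to-seed noise.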

Distributional / safe RL

For safety-critical applications:

Constrained RL (constraints on actions/states)
Distributional RL (model return distribution, not just mean)
Risk-sensitive policies

Active research area.

Major RL successes

Game playing

Atari (DQN, 2013)
Go (AlphaGo, 2016)
StarCraft, Dota (2018-2019)
Diplomacy (Cicero, 2022)

Robotics

Manipulation (OpenAI's robot hand)
Locomotion (quadrupeds, humanoids)
Sim-to-real successes

Industrial

Data center cooling (Google)
Chip design (Google)
Recommender systems (many companies)

LLM training

RLHF for alignment (ChatGPT, Claude)
Constitutional AI (Anthropic)

When NOT to use RL

If supervised learning would work:

You have labeled data
Action consequences are immediate
One-shot decisions, not sequential

Don't reach for RL out of fashion.

Common failure patterns

Reward hacking

Rewards optimized in unintended ways.

Distribution shift

Policy trained in one distribution fails in another.

Catastrophic forgetting

Continuing to train can degrade earlier behavior.

Bad reward function

If you can't write a good reward, RL won't save you.

Insufficient exploration

Stuck in local optima.

Overfitting to simulation

Sim-to-real gap. Domain randomization helps.

Practical advice

Start with the simplest possible setup
Get a baseline working before optimizing
Check for bugs before tuning hyperparameters
Multiple random seeds; report variance
If RL isn't working, ask whether the problem is RL-shaped at all
