Reinforcement Learning Fundamentals

Reinforcement learning (RL) trains agents to take actions in environments to maximize cumulative reward. Unlike supervised learning, there are no labels — only feedback in the form of rewards.

RL has produced spectacular successes (game playing, robotics) and quiet successes (recommendation systems, ad bidding) but is harder to apply than supervised learning.

The setup

Components:

- **Agent**: makes decisions

- **Environment**: the world the agent interacts with; it holds the state and emits rewards

- **Action**: what the agent does

- **Reward**: feedback signal

- **Policy**: agent's decision rule

- **Value function**: expected cumulative future reward (the return) from a state or state-action pair

Loop:

1. Observe state

2. Choose action (according to policy)

3. Environment transitions to new state, returns reward

4. Update policy/value function

5. Repeat
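The loop above can be sketched in a few lines. This is a minimal illustration, not a standard API: `CoinFlipEnv`, `run_episode`, and the `reset`/`step` return values are hypothetical names chosen for this example, and the policy is fixed, so step 4 (the update) is only marked by a comment.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess the outcome of a biased coin."""

    def __init__(self, p_heads=0.8, episode_length=10, seed=0):
        self.p_heads = p_heads
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # the state is just the time step

    def step(self, action):
        # action: 1 = guess heads, 0 = guess tails
        outcome = 1 if self.rng.random() < self.p_heads else 0
        reward = 1.0 if action == outcome else 0.0
        self.t += 1
        done = self.t >= self.episode_length
        return self.t, reward, done


def run_episode(env, policy):
    state = env.reset()                          # 1. observe state
    total_reward, done = 0.0, False
    while not done:
        action = policy(state)                   # 2. choose action
        state, reward, done = env.step(action)   # 3. new state and reward
        total_reward += reward                   # 4. a learning agent would update here
    return total_reward                          # 5. the while loop repeats until done


def always_heads(state):
    return 1


env = CoinFlipEnv()
episode_return = run_episode(env, always_heads)
print(episode_return)
```

Real environments differ mainly in how rich the state is and how long the consequences of an action take to show up in the reward.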

Markov Decision Processes (MDPs)

The mathematical framework:

- State space S

- Action space A

- Transition function P(s'|s,a)

- Reward function R(s,a,s')

- Discount factor γ ∈ [0, 1)

Markov property: the next state and reward depend only on the current state and action, not on the earlier history.

Many real problems aren't truly Markov but are approximated as such.

The objective

Maximize expected discounted return:

G = R₁ + γR₂ + γ²R₃ + ...

The discount factor γ trades off immediate against future reward. γ near 0: myopic. γ near 1: far-sighted.
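To make the effect of γ concrete, here is a small sketch that computes the return for a finite list of rewards. The function name `discounted_return` and the reward values are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute G = r1 + gamma*r2 + gamma^2*r3 + ... for a finite reward list."""
    g = 0.0
    # Work backwards so each step is one multiply and one add.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, 0.0))  # 1.0: only the first reward counts
print(discounted_return(rewards, 0.5))  # 1.75 = 1 + 0.5 + 0.25
print(discounted_return(rewards, 0.9))  # close to the undiscounted sum of 3
```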

Value functions

State value V(s)

Expected return from state s, following policy π.

Action value Q(s, a)

Expected return from state s, taking action a, then following π.

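The Bellman equation relates the value of a state to the values of its successor states:

V(s) = E[R + γV(s') | s]

where the expectation is over the action chosen by π and the resulting transition to s'.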

This recursive structure underlies most RL algorithms.
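As an illustration of that recursion, the sketch below repeatedly applies the Bellman equation as an update rule until the values stop changing (iterative policy evaluation). The two-state chain, its rewards, and the name `evaluate_policy` are assumptions made up for this example; the policy is taken as already folded into the transition table.

```python
# Hypothetical two-state chain under a fixed policy:
# transitions[s] = list of (probability, next_state, reward)
transitions = {
    "A": [(1.0, "B", 0.0)],
    "B": [(0.5, "A", 1.0), (0.5, "B", 2.0)],
}
gamma = 0.9


def evaluate_policy(transitions, gamma, tol=1e-10, max_sweeps=10_000):
    values = {s: 0.0 for s in transitions}
    for _ in range(max_sweeps):
        delta = 0.0
        for s, outcomes in transitions.items():
            # Bellman backup: V(s) = E[R + gamma * V(s')]
            new_v = sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)
            delta = max(delta, abs(new_v - values[s]))
            values[s] = new_v
        if delta < tol:  # values have stopped changing
            break
    return values


values = evaluate_policy(transitions, gamma)
print(values)
```

Because γ < 1, each sweep shrinks the error by at least a factor of γ, which is why the loop converges.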

Exploration vs exploitation

The fundamental tension:

- Exploit: take known-good actions

- Explore: try new actions to learn

Pure exploitation gets stuck in local optima. Pure exploration never benefits from learning.

Strategies:

- **ε-greedy**: random action with probability ε, else best

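- **UCB**: pick the action with the highest estimated value plus an uncertainty bonus

- **Thompson sampling**: sample value estimates from a posterior and act greedily on the sample

- **Entropy bonuses**: add policy entropy to the objective so that exploration is rewarded directly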
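The simplest of these strategies, ε-greedy, can be sketched on a multi-armed bandit. This is a minimal example under assumed settings: the arm means, the Gaussian reward noise, and the function name `epsilon_greedy_bandit` are all hypothetical.

```python
import random


def epsilon_greedy_bandit(true_means, epsilon=0.1, steps=5000, seed=0):
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # pulls per arm
    estimates = [0.0] * n_arms   # running mean reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                            # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])   # exploit
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        # Incremental update of the sample mean for the pulled arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates


counts, estimates = epsilon_greedy_bandit([0.2, 0.5, 1.0])
print(counts, estimates)
```

With ε = 0 the agent can lock onto whichever arm looked good first; with ε = 1 it pulls arms uniformly and never uses what it has learned.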

Algorithm families

Value-based

Learn Q(s, a); act greedily.

- **Q-learning**: classic; off-policy

- **DQN**: Q-learning with neural networks; Atari breakthrough

- **Rainbow**: DQN with many improvements
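A tabular version of Q-learning fits in a short sketch. The corridor environment, the hyperparameters, and the names `step` and `q_learning` are assumptions for illustration; DQN replaces the table with a neural network and adds further machinery not shown here.

```python
import random

N_STATES, GOAL = 5, 4   # hypothetical corridor: states 0..4, goal at 4
ACTIONS = (-1, +1)      # index 0 moves left, index 1 moves right


def step(state, action_idx):
    next_state = min(max(state + ACTIONS[action_idx], 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL


def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                a = rng.randrange(2)                        # explore or break ties
            else:
                a = 0 if q[state][0] > q[state][1] else 1   # exploit
            next_state, reward, done = step(state, a)
            # Off-policy target: bootstrap from the best next action,
            # whatever the behavior policy actually does next.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q


q = q_learning()
greedy_policy = [0 if left > right else 1 for left, right in q[:GOAL]]
print(greedy_policy)  # [1, 1, 1, 1]: move right in every non-goal state
```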

Policy-based

Learn policy directly.

- **REINFORCE**: vanilla policy gradient

- **PPO**: proximal policy optimization; standard workhorse

- **TRPO**: trust region policy optimization; PPO's theoretical predecessor
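The policy-gradient idea behind REINFORCE can be sketched on a two-armed bandit with a softmax policy. The arm means, learning rate, and function names are assumptions for this example, and the running-average baseline is a common variance-reduction addition rather than part of the most basic form.

```python
import math
import random


def softmax(prefs):
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(true_means, lr=0.1, steps=2000, seed=0):
    rng = random.Random(seed)
    prefs = [0.0] * len(true_means)   # policy parameters (logits)
    baseline = 0.0                    # running average of observed rewards
    for t in range(1, steps + 1):
        probs = softmax(prefs)
        action = rng.choices(range(len(probs)), weights=probs)[0]
        reward = rng.gauss(true_means[action], 1.0)
        baseline += (reward - baseline) / t
        advantage = reward - baseline
        # Gradient of log pi(action) w.r.t. prefs[k] is 1[k == action] - probs[k].
        for k in range(len(prefs)):
            grad_log = (1.0 if k == action else 0.0) - probs[k]
            prefs[k] += lr * advantage * grad_log
    return softmax(prefs)


final_probs = reinforce_bandit([0.0, 1.0])
print(final_probs)
```

Actions followed by above-baseline reward become more likely. In this setup the policy should typically shift probability toward the higher-mean arm, though individual runs vary with the seed.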

Actor-critic

Combine policy (actor) and value (critic).

- **A3C / A2C**: parallel actor-critic

- **PPO with value function**: PPO is technically actor-critic

- **SAC**: soft actor-critic; strong for continuous control

Model-based

Learn a model of the environment; plan with it.

- **MuZero**: learns model and plans (AlphaZero descendant)

- **Dreamer**: world models for high-dim observations

Advantage: better sample efficiency. Drawback: errors in the learned model compound during planning, so it is hard to do well.

Modern practice

For most applied RL:

- **PPO**: discrete or continuous, relatively easy to tune, robust

- **SAC**: continuous control, sample efficient

- **DQN variants**: Atari-style discrete

Off-the-shelf libraries: Stable-Baselines3, RLlib, CleanRL.

Sample efficiency

RL needs many environment interactions. Often millions to billions.

In simulation: usually tractable.

In the real world: often prohibitive.

Approaches:

- Sim-to-real transfer (train in simulation, deploy in real)

- Offline RL (learn from logged data)

- Meta-learning (transfer across tasks)

Reward design

Probably the hardest part of RL.

Sparse rewards

Reward only at goal. Agent rarely succeeds; learning is slow.

Reward shaping

Intermediate rewards guide learning. Risky: the agent can exploit the shaped reward instead of solving the task.
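One well-known way to limit that risk is potential-based shaping, where the bonus for a transition is γΦ(s') − Φ(s) for some potential function Φ. The sketch below assumes a hypothetical corridor with the goal at state 4 and a distance-based potential; it shows that the discounted bonuses along a trajectory ending at the goal telescope to a constant, so detours cannot farm extra shaped reward.

```python
GOAL = 4
GAMMA = 0.9


def potential(state):
    # Hypothetical potential: negative distance to the goal (zero at the goal).
    return -abs(GOAL - state)


def shaping_bonus(state, next_state, gamma=GAMMA):
    # Potential-based shaping term added to the environment reward.
    return gamma * potential(next_state) - potential(state)


# A trajectory with a detour (0 -> 1 -> 0 -> 1 ...) before reaching the goal.
trajectory = [0, 1, 0, 1, 2, 3, 4]
bonuses = [shaping_bonus(s, s2) for s, s2 in zip(trajectory, trajectory[1:])]
discounted_bonus = sum(GAMMA ** t * b for t, b in enumerate(bonuses))

# Telescopes to gamma^T * potential(goal) - potential(start) = 0 - (-4).
print(discounted_bonus)
```

Ad hoc bonuses that lack this form (for example, a fixed reward for each step toward the goal) can be collected repeatedly by moving back and forth.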

Reward hacking

Agents find unintended ways to get reward. Common and creative.

Examples: an agent that pauses the game forever to avoid losing; a cleaning robot that breaks plates so it has more to "clean".

Reward from human feedback (RLHF)

Humans rate or rank model outputs; a reward model is trained on those judgments; the policy is then optimized against the reward model. Powers ChatGPT and similar.
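The reward-model step is often trained on pairwise comparisons. The sketch below assumes one common formulation, a logistic (Bradley-Terry style) loss on the difference between the scores of the preferred and the rejected output; the function name and the example scores are hypothetical, and real systems compute the scores with a neural network.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood that the chosen output beats the rejected one."""
    margin = reward_chosen - reward_rejected
    # Equals -log(sigmoid(margin)); fine for moderate margins, not overflow-safe.
    return math.log1p(math.exp(-margin))


print(preference_loss(2.0, 0.0))  # small loss: the model agrees with the human
print(preference_loss(0.0, 0.0))  # log(2): the model is indifferent
print(preference_loss(0.0, 2.0))  # large loss: the model disagrees
```

Minimizing this loss pushes the reward model to score human-preferred outputs higher, and the policy is then trained to maximize that learned score.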

Stability

RL is notoriously unstable:

- Hyperparameter-sensitive

- Variance across random seeds

- Hard to debug

Best practices:

- Multiple seeds

- Compare to baselines

- Track many metrics

- Fix bugs before tuning hyperparameters
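A minimal sketch of the multiple-seeds practice: run the same configuration under several seeds and report the mean and spread rather than a single number. `train_and_evaluate` is a hypothetical stand-in that returns a noisy score; in practice it would be a full training run.

```python
import random
import statistics


def train_and_evaluate(seed):
    """Hypothetical stand-in for a full training run; returns a final score."""
    rng = random.Random(seed)
    return 100.0 + rng.gauss(0.0, 15.0)


seeds = [0, 1, 2, 3, 4]
scores = [train_and_evaluate(s) for s in seeds]
mean = statistics.mean(scores)
std = statistics.stdev(scores)  # sample standard deviation across seeds
print(f"score over {len(seeds)} seeds: {mean:.1f} +/- {std:.1f}")
```

If the gap between two methods is smaller than the spread across seeds, the comparison is not yet conclusive.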

Distributional / safe RL

For safety-critical applications:

- Constrained RL (constraints on actions/states)

- Distributional RL (model return distribution, not just mean)

- Risk-sensitive policies

Active research area.

Major RL successes

Game playing

- Atari (DQN, 2013)

- Go (AlphaGo, 2016)

- StarCraft, Dota (2018-2019)

- Diplomacy (Cicero, 2022)

Robotics

- Manipulation (Boston Dynamics, OpenAI's hand)

- Locomotion (quadrupeds, humanoids)

- Sim-to-real successes

Industrial

- Data center cooling (Google)

- Chip design (Google)

- Recommender systems (many companies)

LLM training

- RLHF for alignment (ChatGPT, Claude)

- Constitutional AI (Anthropic)

When NOT to use RL

If supervised learning would work, use it. Typical signs:

- You have labeled data

- Action consequences are immediate

- One-shot decisions, not sequential

Don't reach for RL out of fashion.

Common failure patterns

Reward hacking

Rewards optimized in unintended ways.

Distribution shift

Policy trained in one distribution fails in another.

Catastrophic forgetting

Continuing to train can degrade earlier behavior.

Bad reward function

If you can't write a good reward, RL won't save you.

Insufficient exploration

The agent never tries better actions and stays stuck in a local optimum.

Overfitting to simulation

The policy exploits simulator quirks and fails on real hardware (the sim-to-real gap). Domain randomization helps.
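Domain randomization amounts to resampling simulator parameters for each episode so the policy cannot rely on any single setting. The sketch below is illustrative only: `SimParams`, its fields, and the ranges are hypothetical and would depend on the simulator and the robot.

```python
import random
from dataclasses import dataclass


@dataclass
class SimParams:
    """Hypothetical physics parameters to randomize per episode."""
    friction: float
    mass_kg: float
    sensor_noise_std: float


def sample_sim_params(rng):
    # Ranges are made up; in practice they should bracket the real system.
    return SimParams(
        friction=rng.uniform(0.5, 1.5),
        mass_kg=rng.uniform(0.8, 1.2),
        sensor_noise_std=rng.uniform(0.0, 0.05),
    )


rng = random.Random(0)
episode_params = [sample_sim_params(rng) for _ in range(3)]
for params in episode_params:
    print(params)  # each training episode would configure the simulator with these
```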

Practical advice

1. Start with the simplest possible setup

2. Get a baseline working before optimizing

3. Check for bugs before tuning hyperparameters

4. Run multiple random seeds; report variance

5. If RL isn't working, ask whether the problem is RL-shaped at all

Further Reading

- [TreeBasedModels](TreeBasedModels) — Different ML paradigm

- [TransformerArchitecture](TransformerArchitecture) — Often the policy network

- [ML Hub](MLHub) — Cluster index