Reinforcement Learning
See how agents learn to make optimal decisions through interaction with their environment.
Level: Intermediate
Time: ~20 min
Topics: 6 Core Concepts
1. What is Reinforcement Learning?
Reinforcement Learning (RL) is a paradigm in machine learning where an agent learns to make optimal decisions by interacting with an environment. Unlike supervised learning, the agent doesn't receive labeled examples—instead, it receives rewards or penalties for its actions.
Key Concepts:
- Agent: The learner or decision-maker that interacts with the environment
- Environment: The world the agent interacts with, providing states and rewards
- Reward Signal: Feedback that indicates how good an action was in a given state
- Policy: A strategy mapping states to actions that the agent learns
Quick check
What is the primary difference between reinforcement learning and supervised learning?
2. The Agent-Environment Loop
At each timestep, the agent observes the current state, selects an action, receives a reward, and transitions to a new state. This loop repeats until the episode ends.
The RL Loop:
1. Observe State (s): Agent perceives the current state of the environment
2. Select Action (a): Agent chooses an action based on its policy
3. Receive Reward (r) & Next State (s'): Environment responds with reward and new state
4. Update Knowledge: Agent updates its understanding (Q-values) based on experience
5. Repeat: Return to step 1 with the new state
This fundamental loop is the basis of all reinforcement learning algorithms. The agent learns by trial and error, refining its policy over many interactions.
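The loop above can be sketched in a few lines of Python. This is only an illustration: `CoinFlipEnv`, its `reset`/`step` methods, and `run_episode` are hypothetical names invented for this sketch, and the agent here follows a fixed policy rather than learning.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess a coin flip once per step."""

    def __init__(self, horizon=5, seed=0):
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # the state is simply the current timestep

    def step(self, action):
        coin = self.rng.randint(0, 1)
        reward = 1.0 if action == coin else 0.0
        self.t += 1
        done = self.t >= self.horizon
        return self.t, reward, done


def run_episode(env, policy):
    state = env.reset()                               # 1. observe state s
    total_reward, done = 0.0, False
    while not done:
        action = policy(state)                        # 2. select action a
        next_state, reward, done = env.step(action)   # 3. receive r and s'
        total_reward += reward                        # 4. a learning agent would update here
        state = next_state                            # 5. repeat from the new state
    return total_reward


total = run_episode(CoinFlipEnv(), policy=lambda state: 0)
```

Swapping the `policy` argument is enough to try a different decision rule; a learning agent would also need an update at step 4.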
Quick check
In the RL loop, what role does the reward signal play?
Reinforcement learning problems are formally modeled as Markov Decision Processes (MDPs). An MDP is defined by:
- S: A set of states
- A: A set of actions
- P(s'|s,a): Transition probabilities
- R(s,a,s'): Reward function
- γ: Discount factor
The key property of MDPs is the Markov property: the next state and reward depend only on the current state and action, not on the earlier history. This lets an algorithm base its decisions and value estimates on the current state alone.
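To make the five components concrete, here is a small sketch of an MDP written as plain Python data. The two states, two actions, probabilities, and rewards are all invented for illustration; only the structure (S, A, P, R, γ) comes from the definition above.

```python
import random

# Hypothetical two-state MDP, written as plain dictionaries.
STATES = ["cool", "hot"]          # S
ACTIONS = ["slow", "fast"]        # A
GAMMA = 0.9                       # discount factor

# P[(s, a)] lists (next_state, probability) pairs; each list sums to 1.
P = {
    ("cool", "slow"): [("cool", 1.0)],
    ("cool", "fast"): [("cool", 0.5), ("hot", 0.5)],
    ("hot", "slow"): [("cool", 0.5), ("hot", 0.5)],
    ("hot", "fast"): [("hot", 1.0)],
}


def reward(s, a, s_next):
    """R(s, a, s'): going fast pays more, but ending up hot is penalized."""
    base = 2.0 if a == "fast" else 1.0
    return base - (3.0 if s_next == "hot" else 0.0)


def sample_step(s, a, rng):
    """Sample s' from P(.|s, a). Only the current (s, a) is used: the Markov property."""
    next_states, probs = zip(*P[(s, a)])
    s_next = rng.choices(next_states, weights=probs, k=1)[0]
    return s_next, reward(s, a, s_next)


def discounted_return(rewards, gamma=GAMMA):
    """Sum of gamma**t * r_t over a sequence of rewards."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```

Note that `sample_step` takes no history argument, which is exactly what the Markov property permits.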
3. Guided Reinforcement Learning Walkthrough
Step 1: Agent-Environment Loop
An RL agent interacts with an environment. At each step, the agent observes the current state, takes an action, and receives feedback.
Interactive Grid World
Watch the agent navigate a 5x5 grid. It must reach the goal ('⭐') while avoiding walls ('🧱'). The agent learns from each step taken.
Legend:
🤖 = Agent (current position)
⭐ = Goal (reward +100)
🧱 = Wall (impassable)
Light blue = Recent path
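A grid world like the one described can be sketched as a small class. The 5x5 size, impassable walls, and +100 goal reward follow the description above; the start cell, goal cell, wall positions, and the -1 per-step penalty are assumptions made for this sketch, as are the class and method names.

```python
class GridWorld:
    """5x5 grid with walls and a goal. Layout and step penalty are assumptions."""

    # Action index -> (row change, column change): up, down, left, right
    MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}

    def __init__(self, size=5, start=(0, 0), goal=(4, 4),
                 walls=((1, 1), (2, 3), (3, 1))):
        self.size, self.start, self.goal = size, start, goal
        self.walls = set(walls)
        self.pos = start

    def reset(self):
        self.pos = self.start
        return self.pos

    def step(self, action):
        dr, dc = self.MOVES[action]
        r, c = self.pos[0] + dr, self.pos[1] + dc
        inside = 0 <= r < self.size and 0 <= c < self.size
        if inside and (r, c) not in self.walls:
            self.pos = (r, c)              # legal move
        # otherwise the agent hits a wall or the edge and stays in place
        if self.pos == self.goal:
            return self.pos, 100.0, True   # goal reward from the legend
        return self.pos, -1.0, False       # assumed per-step penalty
```

The per-step penalty is one common way to encourage short paths; with a reward only at the goal, the discount factor alone would play that role.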
Quick check
What is the purpose of epsilon (ε) in the epsilon-greedy strategy?
4. Q-Learning Algorithm
Q-Learning is one of the most popular RL algorithms. It learns the Q-function Q(s,a), which estimates the expected cumulative discounted reward from taking action a in state s and acting optimally afterward.
The Bellman Equation (Core Update Rule):
Q(s,a) ← Q(s,a) + α[r + γ·max_a' Q(s',a') - Q(s,a)]
Q(s,a) = Estimated value of action a in state s
α = Learning rate (0 to 1)
r = Immediate reward
γ = Discount factor (0 to 1)
s' = Next state
a' = Candidate action in the next state (the max is taken over all of them)
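As a minimal sketch, the update rule can be written as one function over a dictionary-based Q-table. The function name `q_update`, the `(state, action)` key layout, and the `done` flag (which treats terminal states as having zero future value) are assumptions of this sketch, not part of the lesson.

```python
from collections import defaultdict


def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9, done=False):
    """Apply one Q-learning update to Q in place and return the TD error."""
    # Assumed convention: no future value after a terminal transition.
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    td_error = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error


Q = defaultdict(float)              # unseen (state, action) pairs default to 0.0
Q[("s1", "right")] = 10.0
td = q_update(Q, "s0", "right", r=1.0, s_next="s1",
              actions=["left", "right"], alpha=0.5, gamma=0.9)
# Target is 1.0 + 0.9 * 10.0 = 10.0, so Q[("s0", "right")] moves
# halfway (alpha = 0.5) from 0.0 toward 10.0.
```

With α = 1 the estimate would jump straight to the target; smaller values average over many noisy targets.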
High Learning Rate (α)
Fast learning but may overshoot optimal values and be unstable
High Discount Factor (γ)
Agent cares more about long-term rewards. With γ=0.99, a reward 100 steps ahead is still weighted by 0.99^100 ≈ 0.37, so the far future still matters
Quick check
What does the Bellman equation in Q-Learning represent?
Q-Learning is proven to converge to the optimal Q-function Q*(s,a) under certain conditions:
- Every state-action pair must be visited infinitely often as episodes go to infinity
- The learning rate α must decay appropriately (sum to infinity, sum of squares finite)
- The state and action spaces must be finite, with bounded rewards; transitions may be stochastic
Once converged, the optimal policy is to always choose the action with the highest Q-value: π*(s) = argmax_a Q*(s,a)
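Reading the greedy policy off a Q-table is a one-liner, sketched below with made-up Q-values. The helper names are hypothetical. The `learning_rate` function shows one schedule that meets the decay condition listed above, assuming the count is tracked per state-action pair.

```python
def greedy_policy(Q, states, actions):
    """Return pi(s) = argmax_a Q(s, a) for every state, as a dict."""
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}


def learning_rate(visit_count):
    """alpha_n = 1/n: the sum diverges while the sum of squares stays finite."""
    return 1.0 / visit_count


# Made-up Q-values for two states, for illustration only.
Q_star = {
    ("A", "left"): 1.0, ("A", "right"): 4.0,
    ("B", "left"): 3.0, ("B", "right"): 2.0,
}
policy = greedy_policy(Q_star, states=["A", "B"], actions=["left", "right"])
# policy == {"A": "right", "B": "left"}
```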
5. Exploration vs Exploitation Tradeoff
The core dilemma in RL: should the agent try new things (explore) or use what it already knows works (exploit)?
Exploration
- Benefit: Discovers new rewards and strategies
- Cost: May take suboptimal actions and give up reward in the short term
- Example: Trying a random restaurant
Exploitation
- Benefit: Maximizes reward according to current estimates
- Cost: May miss better options
- Example: Going to your favorite restaurant
Epsilon-Greedy Strategy
The simplest approach: with probability ε, take a random action (explore). With probability 1-ε, take the best known action (exploit).
if random() < ε:
    action = random_action()
else:
    action = argmax_a Q(s,a)
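A runnable version of the rule above might look like the following sketch. The dictionary Q-table layout and both function names are assumptions. The `decayed_epsilon` helper is an optional extra showing one common way to reduce exploration over time, with illustrative default values.

```python
import random


def epsilon_greedy(Q, state, actions, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the best known one."""
    if rng.random() < epsilon:
        return rng.choice(actions)                              # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))   # exploit


def decayed_epsilon(episode, start=1.0, end=0.05, decay=0.99):
    """Hypothetical schedule: shrink epsilon geometrically, never below `end`."""
    return max(end, start * decay ** episode)


Q = {("s0", "left"): 0.2, ("s0", "right"): 0.7}
greedy = epsilon_greedy(Q, "s0", ["left", "right"], epsilon=0.0)  # always exploits
```

Passing a seeded `random.Random` instance as `rng` makes runs reproducible.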
Quick check
Which exploration strategy slowly decreases exploration over time?
6. Interactive RL Playground
Now it's your turn! Configure hyperparameters and train your own agent. Experiment with different values to see how they affect learning.
Configure & Train Your Own Agent
- Learning rate (α): How fast to learn from new data
- Discount factor (γ): Importance of future rewards
- Epsilon (ε): Probability of random exploration
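For readers without the interactive playground, the sketch below trains a tabular Q-learning agent with the same three hyperparameters. To stay short it uses a hypothetical 5-cell corridor instead of the grid: the agent starts at cell 0, the goal is the last cell, the +100 goal reward mirrors the grid world, and the -1 step penalty and 200-step cap are assumptions.

```python
import random


def train_corridor(alpha=0.1, gamma=0.9, epsilon=0.1,
                   episodes=500, length=5, seed=0):
    """Tabular Q-learning on a 1-D corridor; returns the learned Q-table."""
    rng = random.Random(seed)
    actions = (-1, +1)                       # move left or right
    goal = length - 1
    Q = {(s, a): 0.0 for s in range(length) for a in actions}
    for _ in range(episodes):
        s = 0
        for _ in range(200):                 # assumed cap on steps per episode
            if rng.random() < epsilon:
                a = rng.choice(actions)      # explore
            else:
                best = max(Q[(s, b)] for b in actions)
                a = rng.choice([b for b in actions if Q[(s, b)] == best])  # exploit, random tie-break
            s_next = min(max(s + a, 0), goal)    # the corridor ends act as walls
            done = s_next == goal
            r = 100.0 if done else -1.0
            target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
            if done:
                break
    return Q


Q = train_corridor()
policy = {s: max((-1, +1), key=lambda a: Q[(s, a)]) for s in range(4)}
```

Calling `train_corridor` with a different `alpha`, `gamma`, or `epsilon` and comparing the resulting Q-tables is a simple way to see each setting's effect.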
Quick check
If you increase the learning rate (α), what typically happens?
Q-Learning is just one of many RL algorithms:
- SARSA: On-policy variant that uses the actual next action taken
- Policy Gradient: Directly optimize the policy instead of the value function
- Actor-Critic: Combines value function and policy learning
- Deep Q-Networks (DQN): Uses neural networks to approximate Q-values
Methods that pair these ideas with function approximation, such as DQN and neural-network actor-critic, let RL work in high-dimensional spaces like images and real-world robotics.
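The difference between Q-Learning and SARSA comes down to the bootstrap target, which the short sketch below contrasts. The function names and Q-values are made up for illustration.

```python
def q_learning_target(Q, r, s_next, actions, gamma):
    """Off-policy target: bootstrap from the best action available in s'."""
    return r + gamma * max(Q[(s_next, a)] for a in actions)


def sarsa_target(Q, r, s_next, a_next, gamma):
    """On-policy target: bootstrap from the action actually taken in s'."""
    return r + gamma * Q[(s_next, a_next)]


Q = {("s1", "left"): 2.0, ("s1", "right"): 5.0}
q_target = q_learning_target(Q, r=1.0, s_next="s1",
                             actions=["left", "right"], gamma=0.5)
s_target = sarsa_target(Q, r=1.0, s_next="s1", a_next="left", gamma=0.5)
# q_target uses Q[("s1", "right")] = 5.0; s_target uses the exploratory "left" action.
```

When the next action happens to be the greedy one, the two targets coincide; they differ only on exploratory steps.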
Key Takeaways
RL agents learn by trial and error through interaction with their environment.
The agent-environment loop with reward signals is fundamental to RL.
Q-Learning uses the Bellman equation to iteratively improve value estimates.
The exploration-exploitation tradeoff is central; epsilon-greedy balances it.
Hyperparameters like learning rate, discount factor, and epsilon significantly impact learning.
When the convergence conditions hold, the learned Q-values approach Q*, and acting greedily on them gives a policy that maximizes expected cumulative reward.
Quick check
Which statement best describes the goal of reinforcement learning?