Vamsi Krishna Sankarayogi — Technologist at Heart

1. What is Reinforcement Learning?

Reinforcement Learning (RL) is a paradigm in machine learning where an agent learns to make good decisions by interacting with an environment. Unlike supervised learning, the agent doesn't receive labeled examples; instead, it receives rewards or penalties for its actions.

Key Concepts:

Agent: The learner or decision-maker that interacts with the environment
Environment: The world the agent interacts with, providing states and rewards
Reward Signal: Feedback that indicates how good an action was in a given state
Policy: A strategy mapping states to actions that the agent learns

Quick check

What is the primary difference between reinforcement learning and supervised learning?

2. The Agent-Environment Loop

At each timestep, the agent observes the current state, selects an action, receives a reward, and transitions to a new state. This loop repeats until the episode ends.

The RL Loop:

1. Observe State (s): Agent perceives the current state of the environment
2. Select Action (a): Agent chooses an action based on its policy
3. Receive Reward (r) & Next State (s'): Environment responds with reward and new state
4. Update Knowledge: Agent updates its understanding (Q-values) based on experience
5. Repeat: Return to step 1 with the new state

This fundamental loop is the basis of all reinforcement learning algorithms. The agent learns by trial and error, refining its policy over many interactions.
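The loop can be sketched in a few lines of Python. The `CorridorEnv` class below is a hypothetical toy environment (a five-cell corridor with the goal at the right end), not the grid world used in this lesson, and the agent acts at random without a learning step. The sketch only shows the observe, act, and receive cycle.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: cells 0..4, goal at cell 4."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clamped to the corridor.
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0  # assumed reward: +1 only on reaching the goal
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and return the list of (s, a, r, s') transitions."""
    transitions = []
    state = env.reset()                              # 1. observe state
    for _ in range(max_steps):
        action = policy(state)                       # 2. select action
        next_state, reward, done = env.step(action)  # 3. reward and next state
        transitions.append((state, action, reward, next_state))  # 4. a learner would update here
        state = next_state                           # 5. repeat from the new state
        if done:
            break
    return transitions


random.seed(0)
episode = run_episode(CorridorEnv(), policy=lambda s: random.choice([-1, 1]))
print(f"{len(episode)} steps, total reward {sum(r for _, _, r, _ in episode)}")
```

Passing `policy=lambda s: 1` instead makes the agent walk straight to the goal in four steps, which is a handy sanity check for the environment.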

Quick check

In the RL loop, what role does the reward signal play?

Reinforcement learning problems are formally modeled as Markov Decision Processes (MDPs). An MDP is defined by:

S: A set of states
A: A set of actions
P(s'|s,a): Transition probabilities
R(s,a,s'): Reward function
γ: Discount factor

The key property of MDPs is the Markov property: the next state and reward depend only on the current state and action, not on the earlier history. This simplifies learning algorithms, because the agent can decide using the current state alone.
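As a concrete illustration, a small MDP can be written down as plain data. The two-state example below is invented for this sketch (the state names, action names, probabilities, and rewards are all assumptions). It shows how S, A, P, R, and γ fit together and how sampling a transition needs only the current state and action.

```python
import random

# Hypothetical two-state MDP, written as plain data.
STATES = ["cool", "hot"]
ACTIONS = ["slow", "fast"]
GAMMA = 0.9  # discount factor; listed for completeness, not used when sampling

# P[(s, a)] is a list of (next_state, probability, reward) triples,
# which encodes both P(s'|s,a) and R(s,a,s').
P = {
    ("cool", "slow"): [("cool", 1.0, 1.0)],
    ("cool", "fast"): [("cool", 0.5, 2.0), ("hot", 0.5, 2.0)],
    ("hot", "slow"): [("cool", 0.5, 1.0), ("hot", 0.5, 1.0)],
    ("hot", "fast"): [("hot", 1.0, -10.0)],
}


def sample_transition(state, action, rng=random):
    """Sample (next_state, reward) using only the current state and action."""
    outcomes = P[(state, action)]
    weights = [prob for _, prob, _ in outcomes]
    next_state, _, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward


def expected_reward(state, action):
    """Expected immediate reward: sum over s' of P(s'|s,a) * R(s,a,s')."""
    return sum(prob * reward for _, prob, reward in P[(state, action)])


print(sample_transition("cool", "fast"))
print(expected_reward("hot", "slow"))
```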

3. Guided Reinforcement Learning Walkthrough

Step 1: Agent-Environment Loop

An RL agent interacts with an environment. At each step, the agent observes the current state, takes an action, and receives feedback.

Interactive Grid World

Watch the agent navigate a 5x5 grid. It must reach the goal ('⭐') while avoiding walls ('🧱'). The agent learns from each step taken.
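A grid world like this one can be modeled with a single transition function. In the sketch below, only the 5x5 size and the goal reward of +100 (from the lesson's legend) come from the lesson. The wall positions, start and goal cells, the per-step cost of -1, and the rule that a blocked move leaves the agent in place are all assumptions.

```python
# Assumed layout: the lesson does not specify where walls, start, or goal are.
WIDTH, HEIGHT = 5, 5
START = (0, 0)
GOAL = (4, 4)
WALLS = {(1, 1), (2, 1), (3, 3)}

# Actions as (dx, dy) offsets; y grows downward.
ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}


def step(state, action):
    """Return (next_state, reward, done) for one move in the grid."""
    dx, dy = ACTIONS[action]
    x, y = state[0] + dx, state[1] + dy
    # Moves off the grid or into a wall leave the agent where it was.
    blocked = not (0 <= x < WIDTH and 0 <= y < HEIGHT) or (x, y) in WALLS
    next_state = state if blocked else (x, y)
    if next_state == GOAL:
        return next_state, 100.0, True   # goal reward from the legend
    return next_state, -1.0, False       # assumed per-step cost


print(step(START, "right"))
```

A small negative step reward is one common way to encourage short paths; a step reward of zero would also work but gives the agent less signal early on.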

[Interactive grid world: shows the agent (🤖), walls (🧱), and the goal (⭐), with counters for Episode, Steps, and Episode Reward and a speed control.]

Legend:

🤖 = Agent (current position)

⭐ = Goal (reward +100)

🧱 = Wall (impassable)

Light blue = Recent path

Quick check

What is the purpose of epsilon (ε) in the epsilon-greedy strategy?

4. Q-Learning Algorithm

Q-Learning is one of the most popular RL algorithms. It learns the Q-function, which estimates the expected cumulative discounted reward from taking a specific action in a given state and acting optimally afterward.

The Bellman Equation (Core Update Rule):

Q(s,a) ← Q(s,a) + α[r + γ·max_a' Q(s',a') - Q(s,a)]

Q(s,a) = Estimated value of action a in state s

α = Learning rate (0 to 1)

r = Immediate reward

γ = Discount factor (0 to 1)

s' = Next state

a' = Action considered in the next state (the max is taken over all of them)
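The update rule translates almost directly into code. The sketch below assumes a tabular Q stored in a dictionary keyed by (state, action). The `done` flag, which skips bootstrapping from a terminal state, is a common convention added here and is not part of the formula above.

```python
from collections import defaultdict


def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99, done=False):
    """Apply one Q-learning update in place and return the TD error."""
    # No bootstrapping from a terminal state: its future value is zero.
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    td_error = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error


Q = defaultdict(float)          # unseen (state, action) pairs start at 0
actions = ["left", "right"]
Q[("B", "right")] = 10.0        # hypothetical existing estimate
q_update(Q, "A", "right", r=1.0, s_next="B", actions=actions, alpha=0.5, gamma=0.9)
print(Q[("A", "right")])
```

In this example the bracketed term is 1 + 0.9·10 - 0 = 10, and with α = 0.5 the estimate for ("A", "right") moves halfway toward it, to about 5.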

High Learning Rate (α): Learning is fast, but estimates may overshoot the optimal values and become unstable.

High Discount Factor (γ): The agent weights long-term rewards more heavily. With γ = 0.99, a reward 100 steps away still counts for about 37% of its value (0.99^100 ≈ 0.37).
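To see the effect of γ numerically, the sketch below computes the discounted return of the same reward sequence under three discount factors. The reward sequence (a single reward of 100 after 20 zero-reward steps) is made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a reward sequence."""
    total = 0.0
    # Work backward so each step is one multiply and one add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# A reward of 100 that arrives after 20 zero-reward steps.
rewards = [0.0] * 20 + [100.0]
for gamma in (0.5, 0.9, 0.99):
    print(gamma, round(discounted_return(rewards, gamma), 4))
```

The delayed reward is worth roughly 0.0001 at γ = 0.5, about 12.2 at γ = 0.9, and about 81.8 at γ = 0.99, which is why a high discount factor is needed when the goal is many steps away.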

Quick check

What does the Bellman equation in Q-Learning represent?

Q-Learning is proven to converge to the optimal Q-function Q*(s,a) under certain conditions:

Every state-action pair must be visited infinitely often as episodes go to infinity
The learning rate α must decay appropriately (its values sum to infinity while their squares sum to a finite value)
The state and action spaces must be finite and the rewards bounded (the environment may be stochastic)

Once converged, the optimal policy is to always choose the action with the highest Q-value: π*(s) = argmax_a Q*(s,a)
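Reading the greedy policy out of a Q-table is a one-line argmax per state. The Q-values below are hypothetical numbers chosen for illustration, not the result of training.

```python
def greedy_policy(Q, states, actions):
    """Return {state: best action} by taking argmax_a Q[(state, a)]."""
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}


# Hypothetical Q-values for two states and two actions.
Q = {
    ("s0", "left"): 0.2, ("s0", "right"): 0.8,
    ("s1", "left"): 0.5, ("s1", "right"): 0.1,
}
policy = greedy_policy(Q, states=["s0", "s1"], actions=["left", "right"])
print(policy)  # {'s0': 'right', 's1': 'left'}
```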

5. Exploration vs Exploitation Tradeoff

The core dilemma in RL: should the agent try new things (explore) or use what it already knows works (exploit)?

Exploration

Benefit: Discovers new rewards and strategies
Cost: Spends time on actions that may turn out to be suboptimal
Example: Trying a random restaurant

Exploitation

Benefit: Maximizes immediate reward
Cost: May miss better options
Example: Going to your favorite restaurant

Epsilon-Greedy Strategy

The simplest approach: with probability ε, take a random action (explore). With probability 1-ε, take the best known action (exploit).

if random() < ε:
    action = random_action()
else:
    action = argmax_a Q(s,a)
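A runnable version of the pseudocode might look like the sketch below. It assumes a dictionary Q keyed by (state, action) with missing entries treated as 0, and it breaks ties between equally good actions at random. Both choices are assumptions rather than part of the lesson.

```python
import random


def epsilon_greedy(Q, state, actions, epsilon, rng=random):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)                      # explore
    best = max(Q.get((state, a), 0.0) for a in actions)
    # Break ties at random so the first-listed action is not always favored.
    return rng.choice([a for a in actions if Q.get((state, a), 0.0) == best])  # exploit


Q = {("s", "a"): 1.0, ("s", "b"): 2.0}
print(epsilon_greedy(Q, "s", ["a", "b"], epsilon=0.0))  # b
```

With `epsilon=0.0` the function is purely greedy, and with `epsilon=1.0` it is purely random, so the two extremes are easy to check by hand.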

Quick check

Which exploration strategy slowly decreases exploration over time?

6. Interactive RL Playground

Now it's your turn! Configure hyperparameters and train your own agent. Experiment with different values to see how they affect learning.

Configure & Train Your Own Agent

Grid Width: 5
Grid Height: 5
Learning Rate (α): 0.10 (how fast to learn from new data)
Discount Factor (γ): 0.99 (importance of future rewards)
Exploration Rate (ε): 0.30 (probability of random exploration)
Episodes to Train: 50
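For readers without access to the interactive playground, the sketch below trains a tabular Q-learning agent offline with the default settings listed above. The grid layout, start and goal cells, the -1 step cost, and the 200-step episode limit are assumptions; only the grid size, α, γ, ε, the episode count, and the +100 goal reward come from the lesson.

```python
import random
from collections import defaultdict

# Settings copied from the playground defaults above.
WIDTH, HEIGHT = 5, 5
ALPHA, GAMMA, EPSILON, EPISODES = 0.10, 0.99, 0.30, 50

# Assumptions (not specified in the lesson): layout, step cost, step limit.
START, GOAL = (0, 0), (4, 4)
WALLS = {(1, 1), (2, 1), (3, 3)}
STEP_REWARD, GOAL_REWARD, MAX_STEPS = -1.0, 100.0, 200
MOVES = [(0, -1), (0, 1), (-1, 0), (1, 0)]  # up, down, left, right


def step(state, move):
    x, y = state[0] + move[0], state[1] + move[1]
    if not (0 <= x < WIDTH and 0 <= y < HEIGHT) or (x, y) in WALLS:
        x, y = state                      # blocked: stay in place
    if (x, y) == GOAL:
        return (x, y), GOAL_REWARD, True
    return (x, y), STEP_REWARD, False


def train(episodes=EPISODES, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)                # Q[(state, move)], initialised to 0
    episode_rewards = []
    for _ in range(episodes):
        state, total = START, 0.0
        for _ in range(MAX_STEPS):
            if rng.random() < EPSILON:    # explore
                move = rng.choice(MOVES)
            else:                         # exploit (first best move on ties)
                move = max(MOVES, key=lambda m: Q[(state, m)])
            next_state, reward, done = step(state, move)
            best_next = 0.0 if done else max(Q[(next_state, m)] for m in MOVES)
            Q[(state, move)] += ALPHA * (reward + GAMMA * best_next - Q[(state, move)])
            state, total = next_state, total + reward
            if done:
                break
        episode_rewards.append(total)
    return Q, episode_rewards


Q, episode_rewards = train()
print("first 5 episodes:", episode_rewards[:5])
print("last 5 episodes: ", episode_rewards[-5:])
```

Changing `ALPHA`, `GAMMA`, or `EPSILON` and comparing the early and late episode rewards is a rough stand-in for the experiments the playground invites. Because ε stays fixed at 0.30, the agent keeps taking random moves even after its estimates improve, so late-episode rewards will still fluctuate.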

Quick check

If you increase the learning rate (α), what typically happens?

Q-Learning is just one of many RL algorithms:

SARSA: On-policy variant that uses the actual next action taken
Policy Gradient: Directly optimize the policy instead of the value function
Actor-Critic: Combines value function and policy learning
Deep Q-Networks (DQN): Uses neural networks to approximate Q-values

Methods that approximate values or policies with neural networks, such as DQN and many actor-critic methods, extend RL to high-dimensional inputs such as images and to real-world robotics.
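The difference between Q-Learning and SARSA comes down to the target each one bootstrapps from. The sketch below computes both targets for the same transition; the Q-values and the choice of explored action are hypothetical.

```python
def q_learning_target(Q, r, s_next, actions, gamma):
    """Off-policy target: bootstrap from the best action in s_next."""
    return r + gamma * max(Q[(s_next, a)] for a in actions)


def sarsa_target(Q, r, s_next, a_next, gamma):
    """On-policy target: bootstrap from the action actually taken in s_next."""
    return r + gamma * Q[(s_next, a_next)]


# Hypothetical values: "right" looks best in s1, but the agent explored "left".
Q = {("s1", "left"): 2.0, ("s1", "right"): 5.0}
print(q_learning_target(Q, r=1.0, s_next="s1", actions=["left", "right"], gamma=0.5))  # 3.5
print(sarsa_target(Q, r=1.0, s_next="s1", a_next="left", gamma=0.5))                   # 2.0
```

The two targets agree whenever the next action is the greedy one, and differ only on exploratory steps.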

Key Takeaways

✓ RL agents learn by trial and error through interaction with their environment.

✓ The agent-environment loop with reward signals is fundamental to RL.

✓ Q-Learning uses the Bellman equation to iteratively improve value estimates.

✓ The exploration-exploitation tradeoff is central; epsilon-greedy balances it.

✓ Hyperparameters like learning rate, discount factor, and epsilon significantly impact learning.

✓ Once learning converges, the agent's policy is optimal: it maximizes expected cumulative reward.

Quick check

Which statement best describes the goal of reinforcement learning?
