Building Intelligent Systems with Reinforcement Learning


Introduction to Reinforcement Learning

Reinforcement Learning (RL) is one of the most exciting areas of machine learning, where agents learn optimal strategies through trial and error. To demonstrate these concepts, I built a comprehensive Blackjack AI that masters the game through self-play.

Why Blackjack?

Blackjack is an ideal testbed for RL because it:

  • Has a well-defined state space (card combinations)
  • Involves both deterministic and stochastic elements
  • Requires strategic decision-making (hit, stand, double down)
  • Has a known optimal strategy (basic strategy)

This allows us to compare learned strategies against the theoretically optimal approach.

Three Learning Approaches

1. Monte Carlo Methods

Monte Carlo learning estimates value functions by averaging returns from complete episodes:

import numpy as np

def monte_carlo_update(episode, values, returns, gamma=1.0):
    # Walk the episode backwards, accumulating the discounted return G.
    G = 0.0
    for state, action, reward in reversed(episode):
        G = gamma * G + reward
        # Every-visit MC: record the return and re-average it for this state.
        returns[state].append(G)
        values[state] = np.mean(returns[state])

Pros:

  • Simple to implement
  • Works well for episodic tasks
  • No bootstrapping assumptions

Cons:

  • Requires complete episodes
  • High variance
  • Slow convergence
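
Because the update needs complete trajectories, training collects a full hand before learning from it. Below is a minimal sketch of that loop; it assumes the Gymnasium Blackjack-v1 environment and a simple fixed policy (stick on 20 or 21), which may differ from the project's own simulator and policy:

import gymnasium as gym                 # assumed environment
from collections import defaultdict

env = gym.make("Blackjack-v1")
values, returns = defaultdict(float), defaultdict(list)

for _ in range(100_000):
    state, _ = env.reset()
    episode, done = [], False
    while not done:
        action = 0 if state[0] >= 20 else 1          # fixed policy: hit below 20
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        episode.append((state, action, reward))
        state = next_state
    # Only now, with the whole episode in hand, can the values be updated.
    monte_carlo_update(episode, values, returns)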

2. Temporal Difference (TD) Learning

TD learning updates estimates based on other estimates, allowing learning from incomplete episodes:

def td_update(V, state, next_state, reward, done=False, alpha=0.1, gamma=0.99):
    # Bootstrap: the target uses the current estimate of the next state's value;
    # a terminal next state contributes nothing beyond the immediate reward.
    next_value = 0.0 if done else V[next_state]
    td_target = reward + gamma * next_value
    td_error = td_target - V[state]
    V[state] += alpha * td_error

Advantages:

  • Online learning (updates during episodes)
  • Lower variance than Monte Carlo
  • More sample efficient
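
In a training loop this means the value table is refined after every single step of the hand, not just at the end. A minimal sketch, again assuming the Gymnasium Blackjack-v1 environment and the same fixed evaluation policy as above:

import gymnasium as gym                 # assumed environment
from collections import defaultdict

env = gym.make("Blackjack-v1")
V = defaultdict(float)

for _ in range(50_000):
    state, _ = env.reset()
    done = False
    while not done:
        action = 0 if state[0] >= 20 else 1
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        td_update(V, state, next_state, reward, done)   # update mid-episode
        state = next_state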

3. Q-Learning

Q-Learning learns action-value functions, enabling off-policy learning:

def q_learning_update(Q, state, action, reward, next_state, done, alpha=0.1, gamma=0.99):
    # Off-policy target: bootstrap from the best action in the next state,
    # or from 0 when the hand is over or the next state is still unvisited.
    max_next_q = 0.0 if done else max(Q[next_state].values(), default=0.0)
    td_target = reward + gamma * max_next_q
    Q[state][action] += alpha * (td_target - Q[state][action])

Key Features:

  • Off-policy (learns optimal policy while following exploratory policy)
  • Convergence guarantees under certain conditions
  • Foundation for modern deep RL (DQN)
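
Putting the pieces together, a Q-Learning training loop can look like the sketch below. It assumes the Gymnasium Blackjack-v1 environment and the ε-greedy select_action helper shown later in the exploration section; ε is kept fixed here for brevity:

import gymnasium as gym                 # assumed environment
from collections import defaultdict

env = gym.make("Blackjack-v1")
Q = defaultdict(lambda: defaultdict(float))
epsilon = 0.1                           # fixed here; a decaying schedule comes later

for _ in range(30_000):
    state, _ = env.reset()
    done = False
    while not done:
        # The behaviour policy is exploratory, but the update targets the
        # greedy policy, which is exactly what makes Q-Learning off-policy.
        action = select_action(Q, state, epsilon)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        q_learning_update(Q, state, action, reward, next_state, done)
        state = next_state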

Implementation Highlights

State Representation

I represented Blackjack states as tuples:

state = (player_sum, dealer_card, usable_ace)

This compact representation captures all relevant information for decision-making.
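
For illustration, here is how such a tuple can be built from a raw hand. The ace handling follows standard Blackjack rules; the helper itself is a hypothetical sketch, not necessarily how the project derives its states:

def hand_to_state(player_cards, dealer_upcard):
    # Cards are face values: ace = 1, face cards = 10.
    total = sum(player_cards)
    # An ace is "usable" if counting it as 11 does not bust the hand.
    usable_ace = 1 in player_cards and total + 10 <= 21
    player_sum = total + 10 if usable_ace else total
    return (player_sum, dealer_upcard, usable_ace)

hand_to_state([1, 6], 10)   # -> (17, 10, True): soft 17 against a dealer ten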

Exploration Strategy

I used ε-greedy exploration with a decaying ε:

import random

ACTIONS = [0, 1]   # assumed action encoding: 0 = stand, 1 = hit

def select_action(Q, state, epsilon):
    # Explore with probability epsilon; otherwise exploit the current Q-values.
    if random.random() < epsilon or not Q[state]:
        return random.choice(ACTIONS)
    return max(Q[state], key=Q[state].get)
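
The decay itself can be as simple as shrinking ε toward a small floor after every episode, so the agent explores heavily at first and exploits what it has learned later on. A minimal sketch; the constants are illustrative rather than the tuned values from the project:

epsilon = 1.0          # start fully exploratory
epsilon_min = 0.05     # never stop exploring entirely
decay = 0.9995         # multiplicative decay applied once per episode

for episode in range(30_000):
    ...                # play one hand, updating Q along the way
    epsilon = max(epsilon_min, epsilon * decay)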

Real-Time Visualization

One of the most exciting features is real-time visualization of:

  • State values: Heatmaps showing the value of different card combinations
  • Policy decisions: Visual representation of hit/stand decisions
  • Learning curves: Tracking performance over training episodes
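
As a rough illustration of the first item, a state-value heatmap over (player sum, dealer upcard) can be drawn with matplotlib. This is a minimal sketch using the V table from the TD section, not the project's actual plotting code:

import numpy as np
import matplotlib.pyplot as plt

# Arrange V(s) for hands without a usable ace into a grid:
# rows = player sum 12..21, columns = dealer upcard 1..10 (ace shown as 1).
grid = np.array([[V[(player, dealer, False)] for dealer in range(1, 11)]
                 for player in range(12, 22)])

fig, ax = plt.subplots()
im = ax.imshow(grid, cmap="RdYlGn", origin="lower",
               extent=[0.5, 10.5, 11.5, 21.5], aspect="auto")
ax.set_xlabel("Dealer upcard")
ax.set_ylabel("Player sum")
fig.colorbar(im, label="Estimated state value")
plt.show()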

Results and Insights

Convergence Comparison

After training all three algorithms:

  • Monte Carlo: Converged after ~100,000 episodes
  • TD Learning: Converged after ~50,000 episodes
  • Q-Learning: Converged after ~30,000 episodes

Learned Strategy vs. Basic Strategy

The learned strategies closely matched basic Blackjack strategy:

  • Hit on 11 or below (always)
  • Stand on 17 or above (usually)
  • Consider dealer’s upcard for 12-16 range
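
The hit/stand portion of basic strategy for hard totals is compact enough to write down directly, which makes the comparison easy to automate: count how often the learned greedy action max(Q[state], key=Q[state].get) agrees with the rule. A simplified sketch (hard totals only, no doubling or splitting):

def basic_strategy_hard(player_sum, dealer_upcard):
    # Returns 1 to hit, 0 to stand, for hands without a usable ace.
    if player_sum <= 11:
        return 1                                   # always hit 11 or below
    if player_sum >= 17:
        return 0                                   # always stand on 17 or above
    if player_sum == 12:
        return 0 if dealer_upcard in (4, 5, 6) else 1
    # 13-16: stand against a weak dealer upcard (2-6), otherwise hit
    return 0 if 2 <= dealer_upcard <= 6 else 1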

Interesting Discoveries

  1. Soft Hands Matter: Hands with usable aces require different strategies
  2. Dealer’s Card is Crucial: The dealer’s visible card significantly affects optimal play
  3. Variance is Real: Even with optimal play, short-term results vary widely

Technical Challenges

Memory Efficiency

Storing Q-values for all state-action pairs required careful memory management:

from collections import defaultdict
# Q[state][action] entries are created lazily (defaulting to 0.0),
# so only states the agent actually visits take up memory.
Q = defaultdict(lambda: defaultdict(float))

Visualization Performance

Real-time updates needed optimization:

  • Batch rendering updates
  • Only redraw changed elements
  • Use efficient data structures
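
In practice, "only redraw changed elements" means keeping a reference to each plot artist and updating its data in place rather than rebuilding the figure. A minimal matplotlib sketch of that pattern; the update cadence and figure layout are illustrative:

import numpy as np
import matplotlib.pyplot as plt

plt.ion()                                    # interactive mode: non-blocking updates
fig, ax = plt.subplots()
im = ax.imshow(np.zeros((10, 10)), vmin=-1, vmax=1, cmap="RdYlGn")

def refresh(value_grid, episode, every=1000):
    # Batch the rendering: touch the canvas only every `every` episodes,
    # and update the existing image artist instead of calling imshow again.
    if episode % every == 0:
        im.set_data(value_grid)
        fig.canvas.draw_idle()
        plt.pause(0.001)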

Hyperparameter Tuning

Finding the right learning rates, discount factors, and exploration schedules required extensive experimentation.
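
A straightforward way to structure that experimentation is a small grid search that retrains the agent for each combination and compares average reward on a separate evaluation run. A sketch of the scaffolding, assuming hypothetical train and evaluate helpers (they are not part of the snippets above):

import itertools

alphas = [0.01, 0.05, 0.1]
gammas = [0.9, 0.95, 1.0]
decays = [0.999, 0.9995, 0.9999]

results = {}
for alpha, gamma, decay in itertools.product(alphas, gammas, decays):
    # train() and evaluate() are assumed helpers: train returns a Q-table,
    # evaluate returns the average reward per hand under greedy play.
    Q = train(alpha=alpha, gamma=gamma, epsilon_decay=decay, episodes=30_000)
    results[(alpha, gamma, decay)] = evaluate(Q, hands=10_000)

best = max(results, key=results.get)
print("Best (alpha, gamma, decay):", best, "avg reward:", results[best])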

Beyond Blackjack

The techniques demonstrated here extend to many domains:

  • Game Playing: Chess, Go, video games
  • Robotics: Motion planning, manipulation
  • Finance: Trading strategies, portfolio optimization
  • Resource Management: Cloud computing, energy systems

Key Takeaways

  1. Different RL algorithms have different trade-offs: Monte Carlo is simple but slow; Q-Learning is efficient but complex
  2. Visualization aids understanding: Seeing how values evolve provides intuition about learning
  3. Theoretical knowledge transfers to practice: Basic Blackjack strategy emerges naturally from RL
  4. Exploration is critical: Without proper exploration, agents get stuck in local optima

Future Enhancements

I’m planning to extend this project with:

  • Deep Q-Networks (DQN): Neural network function approximation
  • Policy Gradient Methods: Direct policy optimization
  • Multi-Agent Learning: Multiple players at the table
  • Tournament Simulation: Testing against various strategies

Check out the complete implementation on GitHub with interactive visualizations and detailed documentation!