Building Intelligent Systems with Reinforcement Learning


Introduction to Reinforcement Learning

Reinforcement Learning (RL) is one of the most exciting areas of machine learning, where agents learn optimal strategies through trial and error. To demonstrate these concepts, I built a comprehensive Blackjack AI that masters the game through self-play.

Why Blackjack?

Blackjack is an ideal testbed for RL because it:

  • Has a well-defined state space (card combinations)
  • Involves both deterministic and stochastic elements
  • Requires strategic decision-making (hit, stand, double down)
  • Has a known optimal strategy (basic strategy)

This allows us to compare learned strategies against the theoretically optimal approach.

Three Learning Approaches

1. Monte Carlo Methods

Monte Carlo learning estimates value functions by averaging returns from complete episodes:

import numpy as np

def monte_carlo_update(episode, values, returns, gamma=1.0):
    # Walk the episode backwards, accumulating the discounted return G.
    G = 0.0
    for state, action, reward in reversed(episode):
        G = gamma * G + reward
        # Every-visit MC: record the return and re-average it for this state.
        returns[state].append(G)
        values[state] = np.mean(returns[state])

Pros:

  • Simple to implement
  • Works well for episodic tasks
  • No bootstrapping assumptions

Cons:

  • Requires complete episodes
  • High variance
  • Slow convergence
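
Because the update needs complete trajectories, training collects a full hand before learning from it. Below is a minimal sketch of that loop; it assumes the Gymnasium Blackjack-v1 environment and a simple fixed policy (stick on 20 or 21), which may differ from the project's own simulator and policy:

import gymnasium as gym                 # assumed environment
from collections import defaultdict

env = gym.make("Blackjack-v1")
values, returns = defaultdict(float), defaultdict(list)

for _ in range(100_000):
    state, _ = env.reset()
    episode, done = [], False
    while not done:
        action = 0 if state[0] >= 20 else 1          # fixed policy: hit below 20
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        episode.append((state, action, reward))
        state = next_state
    # Only now, with the whole episode in hand, can the values be updated.
    monte_carlo_update(episode, values, returns)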

2. Temporal Difference (TD) Learning

TD learning updates estimates based on other estimates, allowing learning from incomplete episodes:

def td_update(V, state, next_state, reward, done=False, alpha=0.1, gamma=0.99):
    # Bootstrap: the target uses the current estimate of the next state's value;
    # a terminal next state contributes nothing beyond the immediate reward.
    next_value = 0.0 if done else V[next_state]
    td_target = reward + gamma * next_value
    td_error = td_target - V[state]
    V[state] += alpha * td_error

Advantages:

  • Online learning (updates during episodes)
  • Lower variance than Monte Carlo
  • More sample efficient
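
In a training loop this means the value table is refined after every single step of the hand, not just at the end. A minimal sketch, again assuming the Gymnasium Blackjack-v1 environment and the same fixed evaluation policy as above:

import gymnasium as gym                 # assumed environment
from collections import defaultdict

env = gym.make("Blackjack-v1")
V = defaultdict(float)

for _ in range(50_000):
    state, _ = env.reset()
    done = False
    while not done:
        action = 0 if state[0] >= 20 else 1
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        td_update(V, state, next_state, reward, done)   # update mid-episode
        state = next_state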

3. Q-Learning

Q-Learning learns action-value functions, enabling off-policy learning:

def q_learning_update(Q, state, action, reward, next_state, done, alpha=0.1, gamma=0.99):
    # Off-policy target: bootstrap from the best action in the next state,
    # or from 0 when the hand is over or the next state is still unvisited.
    max_next_q = 0.0 if done else max(Q[next_state].values(), default=0.0)
    td_target = reward + gamma * max_next_q
    Q[state][action] += alpha * (td_target - Q[state][action])

Key Features:

  • Off-policy (learns optimal policy while following exploratory policy)
  • Convergence guarantees under certain conditions
  • Foundation for modern deep RL (DQN)
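
Putting the pieces together, a Q-Learning training loop can look like the sketch below. It assumes the Gymnasium Blackjack-v1 environment and the ε-greedy select_action helper shown later in the exploration section; ε is kept fixed here for brevity:

import gymnasium as gym                 # assumed environment
from collections import defaultdict

env = gym.make("Blackjack-v1")
Q = defaultdict(lambda: defaultdict(float))
epsilon = 0.1                           # fixed here; a decaying schedule comes later

for _ in range(30_000):
    state, _ = env.reset()
    done = False
    while not done:
        # The behaviour policy is exploratory, but the update targets the
        # greedy policy, which is exactly what makes Q-Learning off-policy.
        action = select_action(Q, state, epsilon)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        q_learning_update(Q, state, action, reward, next_state, done)
        state = next_state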

Implementation Highlights

State Representation

I represented Blackjack states as tuples:

state = (player_sum, dealer_card, usable_ace)

This compact representation captures all relevant information for decision-making.
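
For illustration, here is how such a tuple can be built from a raw hand. The ace handling follows standard Blackjack rules; the helper itself is a hypothetical sketch, not necessarily how the project derives its states:

def hand_to_state(player_cards, dealer_upcard):
    # Cards are face values: ace = 1, face cards = 10.
    total = sum(player_cards)
    # An ace is "usable" if counting it as 11 does not bust the hand.
    usable_ace = 1 in player_cards and total + 10 <= 21
    player_sum = total + 10 if usable_ace else total
    return (player_sum, dealer_upcard, usable_ace)

hand_to_state([1, 6], 10)   # -> (17, 10, True): soft 17 against a dealer ten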

Exploration Strategy

I used ε-greedy exploration with a decaying ε:

import random

ACTIONS = [0, 1]   # assumed action encoding: 0 = stand, 1 = hit

def select_action(Q, state, epsilon):
    # Explore with probability epsilon; otherwise exploit the current Q-values.
    if random.random() < epsilon or not Q[state]:
        return random.choice(ACTIONS)
    return max(Q[state], key=Q[state].get)
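
The decay itself can be as simple as shrinking ε toward a small floor after every episode, so the agent explores heavily at first and exploits what it has learned later on. A minimal sketch; the constants are illustrative rather than the tuned values from the project:

epsilon = 1.0          # start fully exploratory
epsilon_min = 0.05     # never stop exploring entirely
decay = 0.9995         # multiplicative decay applied once per episode

for episode in range(30_000):
    ...                # play one hand, updating Q along the way
    epsilon = max(epsilon_min, epsilon * decay)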

Real-Time Visualization

One of the most exciting features is real-time visualization of:

  • State values: Heatmaps showing the value of different card combinations
  • Policy decisions: Visual representation of hit/stand decisions
  • Learning curves: Tracking performance over training episodes
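
As a rough illustration of the first item, a state-value heatmap over (player sum, dealer upcard) can be drawn with matplotlib. This is a minimal sketch using the V table from the TD section, not the project's actual plotting code:

import numpy as np
import matplotlib.pyplot as plt

# Arrange V(s) for hands without a usable ace into a grid:
# rows = player sum 12..21, columns = dealer upcard 1..10 (ace shown as 1).
grid = np.array([[V[(player, dealer, False)] for dealer in range(1, 11)]
                 for player in range(12, 22)])

fig, ax = plt.subplots()
im = ax.imshow(grid, cmap="RdYlGn", origin="lower",
               extent=[0.5, 10.5, 11.5, 21.5], aspect="auto")
ax.set_xlabel("Dealer upcard")
ax.set_ylabel("Player sum")
fig.colorbar(im, label="Estimated state value")
plt.show()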

Results and Insights

Convergence Comparison

After training all three algorithms:

  • Monte Carlo: Converged after ~100,000 episodes
  • TD Learning: Converged after ~50,000 episodes
  • Q-Learning: Converged after ~30,000 episodes

Learned Strategy vs. Basic Strategy

The learned strategies closely matched basic Blackjack strategy:

  • Hit on 11 or below (always)
  • Stand on 17 or above (usually)
  • Consider dealer’s upcard for 12-16 range
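
The hit/stand portion of basic strategy for hard totals is compact enough to write down directly, which makes the comparison easy to automate: count how often the learned greedy action max(Q[state], key=Q[state].get) agrees with the rule. A simplified sketch (hard totals only, no doubling or splitting):

def basic_strategy_hard(player_sum, dealer_upcard):
    # Returns 1 to hit, 0 to stand, for hands without a usable ace.
    if player_sum <= 11:
        return 1                                   # always hit 11 or below
    if player_sum >= 17:
        return 0                                   # always stand on 17 or above
    if player_sum == 12:
        return 0 if dealer_upcard in (4, 5, 6) else 1
    # 13-16: stand against a weak dealer upcard (2-6), otherwise hit
    return 0 if 2 <= dealer_upcard <= 6 else 1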

Interesting Discoveries

  1. Soft Hands Matter: Hands with usable aces require different strategies
  2. Dealer’s Card is Crucial: The dealer’s visible card significantly affects optimal play
  3. Variance is Real: Even with optimal play, short-term results vary widely

Technical Challenges

Memory Efficiency

Storing Q-values for all state-action pairs required careful memory management:

from collections import defaultdict
# Q[state][action] entries are created lazily (defaulting to 0.0),
# so only states the agent actually visits take up memory.
Q = defaultdict(lambda: defaultdict(float))

Visualization Performance

Real-time updates needed optimization:

  • Batch rendering updates
  • Only redraw changed elements
  • Use efficient data structures
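
In practice, "only redraw changed elements" means keeping a reference to each plot artist and updating its data in place rather than rebuilding the figure. A minimal matplotlib sketch of that pattern; the update cadence and figure layout are illustrative:

import numpy as np
import matplotlib.pyplot as plt

plt.ion()                                    # interactive mode: non-blocking updates
fig, ax = plt.subplots()
im = ax.imshow(np.zeros((10, 10)), vmin=-1, vmax=1, cmap="RdYlGn")

def refresh(value_grid, episode, every=1000):
    # Batch the rendering: touch the canvas only every `every` episodes,
    # and update the existing image artist instead of calling imshow again.
    if episode % every == 0:
        im.set_data(value_grid)
        fig.canvas.draw_idle()
        plt.pause(0.001)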

Hyperparameter Tuning

Finding the right learning rates, discount factors, and exploration schedules required extensive experimentation.
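
A straightforward way to structure that experimentation is a small grid search that retrains the agent for each combination and compares average reward on a separate evaluation run. A sketch of the scaffolding, assuming hypothetical train and evaluate helpers (they are not part of the snippets above):

import itertools

alphas = [0.01, 0.05, 0.1]
gammas = [0.9, 0.95, 1.0]
decays = [0.999, 0.9995, 0.9999]

results = {}
for alpha, gamma, decay in itertools.product(alphas, gammas, decays):
    # train() and evaluate() are assumed helpers: train returns a Q-table,
    # evaluate returns the average reward per hand under greedy play.
    Q = train(alpha=alpha, gamma=gamma, epsilon_decay=decay, episodes=30_000)
    results[(alpha, gamma, decay)] = evaluate(Q, hands=10_000)

best = max(results, key=results.get)
print("Best (alpha, gamma, decay):", best, "avg reward:", results[best])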

Beyond Blackjack

The techniques demonstrated here extend to many domains:

  • Game Playing: Chess, Go, video games
  • Robotics: Motion planning, manipulation
  • Finance: Trading strategies, portfolio optimization
  • Resource Management: Cloud computing, energy systems

Key Takeaways

  1. Different RL algorithms have different trade-offs: Monte Carlo is simple but slow; Q-Learning is efficient but complex
  2. Visualization aids understanding: Seeing how values evolve provides intuition about learning
  3. Theoretical knowledge transfers to practice: Basic Blackjack strategy emerges naturally from RL
  4. Exploration is critical: Without proper exploration, agents get stuck in local optima

Future Enhancements

I’m planning to extend this project with:

  • Deep Q-Networks (DQN): Neural network function approximation
  • Policy Gradient Methods: Direct policy optimization
  • Multi-Agent Learning: Multiple players at the table
  • Tournament Simulation: Testing against various strategies

Check out the complete implementation on GitHub with interactive visualizations and detailed documentation!