---
tags:
- Pixelcopter-PLE-v0
- reinforce
- reinforcement-learning
- custom-implementation
- deep-rl-class
model-index:
- name: Pixelcopter-RL
  results:
  - task:
      type: reinforcement-learning
      name: reinforcement-learning
    dataset:
      name: Pixelcopter-PLE-v0
      type: Pixelcopter-PLE-v0
    metrics:
    - type: mean_reward
      value: 13.10 +/- 6.89
      name: mean_reward
      verified: false
---
# REINFORCE Agent for Pixelcopter-PLE-v0

## Model Description

This repository contains a trained REINFORCE (policy gradient) reinforcement learning agent for Pixelcopter-PLE-v0, a helicopter navigation game from the PyGame Learning Environment (PLE). The agent learns its flight control policy through trial and error, using Monte Carlo policy gradient updates.

### Model Details

- **Algorithm**: REINFORCE (Monte Carlo Policy Gradient)
- **Environment**: Pixelcopter-PLE-v0 (PyGame Learning Environment)
- **Framework**: Custom implementation following Deep RL Course guidelines
- **Task Type**: Discrete Control (Binary Actions)
- **Action Space**: Discrete (2 actions: do nothing or thrust up)
- **Observation Space**: Visual/pixel-based or feature-based state representation
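
The two discrete actions still have to be translated into what PLE expects: `PLE.act()` takes an entry of `getActionSet()` rather than a raw index. The following is a minimal sketch of that mapping, assuming the action set is `[119, None]` (the key code for `w` plus a no-op). The helper name `index_to_ple_action` and the hard-coded list are illustrative stand-ins, so check `env.getActionSet()` in your own setup.

```python
# Stand-in for env.getActionSet(); assumed values, verify against your environment.
ASSUMED_ACTION_SET = [119, None]  # 119 = pygame key code for "w" (thrust), None = no-op


def index_to_ple_action(index, action_set=ASSUMED_ACTION_SET):
    """Translate a policy output index into the value passed to PLE.act()."""
    if not 0 <= index < len(action_set):
        raise IndexError(f"action index {index} outside 0..{len(action_set) - 1}")
    return action_set[index]


print(index_to_ple_action(0))  # 119
print(index_to_ple_action(1))  # None
```

Whatever order the action set has, the index-to-action mapping used at inference should be the same one used during training.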

### Environment Overview

Pixelcopter-PLE-v0 is a classic helicopter control game where:
- **Objective**: Navigate a helicopter through obstacles without crashing
- **Challenge**: Requires precise timing and control to avoid ceiling, floor, and obstacles
- **Physics**: Gravity constantly pulls the helicopter down; player must apply thrust to maintain altitude
- **Scoring**: Points are awarded for surviving longer and successfully navigating through gaps
- **Difficulty**: Requires learning temporal dependencies and precise action timing

## Performance

The trained REINFORCE agent achieves the following performance metrics:

- **Mean Reward**: 13.10 ± 6.89
- **Performance Analysis**: This represents solid performance for this challenging environment
- **Consistency**: The standard deviation indicates moderate variability, which is expected for policy gradient methods

### Performance Context

The mean reward of 13.10 demonstrates that the agent has successfully learned to:
- Navigate through multiple obstacles before crashing
- Balance altitude control against obstacle avoidance
- Develop timing strategies for thrust application
- Achieve consistent survival beyond random baseline performance

The variability (±6.89) is characteristic of policy gradient methods and reflects the stochastic nature of the learned policy: actions are sampled from the policy's output distribution, so the same policy produces different episode outcomes from run to run.
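
For reference, a "mean ± std" figure of this kind can be computed from a list of per-episode rewards as sketched below. The reward values are hypothetical and chosen only to make the arithmetic easy to follow; they are not this model's evaluation data.

```python
import statistics

# Hypothetical per-episode rewards, for illustration only (not this model's data).
episode_rewards = [5.0, 12.0, 20.0, 9.0, 19.0]

mean_reward = statistics.fmean(episode_rewards)
# Population standard deviation, matching NumPy's default np.std (ddof=0).
std_reward = statistics.pstdev(episode_rewards)

print(f"{mean_reward:.2f} +/- {std_reward:.2f}")  # 13.00 +/- 5.76
```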

## Algorithm: REINFORCE

REINFORCE is a foundational policy gradient algorithm that:
- **Direct Policy Learning**: Learns a parameterized policy directly (no value function)
- **Monte Carlo Updates**: Uses complete episode returns for policy updates
- **Policy Gradient**: Updates policy parameters in direction of higher expected returns
- **Stochastic Policy**: Learns probabilistic action selection for exploration
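
To make the update rule concrete, here is a minimal sketch of REINFORCE on a toy two-armed bandit rather than on Pixelcopter. Each "episode" is a single step, so the return equals the reward. The reward scheme (arm 1 pays 1, arm 0 pays 0), the function names, and the step size are all assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit(steps=2000, lr=0.1, seed=0):
    """Run REINFORCE on a 2-armed bandit and return the final action probabilities."""
    rng = random.Random(seed)
    logits = [0.0, 0.0]  # policy parameters, one logit per action
    for _ in range(steps):
        probs = softmax(logits)
        action = 0 if rng.random() < probs[0] else 1  # sample from the policy
        reward = 1.0 if action == 1 else 0.0          # assumed toy reward
        # For a softmax policy, d log pi(a) / d logit_k = 1[k == a] - probs[k].
        for k in range(len(logits)):
            grad_log_prob = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += lr * reward * grad_log_prob  # gradient ascent on return
    return softmax(logits)


final_probs = train_bandit()
print(final_probs)  # probability mass should concentrate on arm 1
```

The same three ingredients (sample an action, observe a return, scale the log-probability gradient by that return) carry over to the multi-step setting, where the return is a discounted sum over the rest of the episode.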

### Key Advantages
- Simple and intuitive policy gradient approach
- Works well with discrete and continuous action spaces
- No need for value function approximation
- Good educational foundation for understanding policy gradients

## Usage

### Installation Requirements

```bash
# Core dependencies
pip install torch torchvision
pip install gymnasium
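pip install pygame
# PLE is typically installed from its GitHub repository
pip install git+https://github.com/ntasfi/PyGame-Learning-Environment.git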
pip install numpy matplotlib

# For visualization and analysis
pip install pillow
pip install imageio  # for gif creation
```

### Loading and Using the Model

```python
import torch
import gymnasium as gym
from ple import PLE
from ple.games.pixelcopter import Pixelcopter
import numpy as np

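# Load the trained model
# Note: Adjust path based on your model file structure
# The file holds the whole pickled module, so the PolicyNetwork class (defined
# in the training section below) must be importable here, and PyTorch 2.6+
# needs weights_only=False to unpickle it; only do this for files you trust
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.load("pixelcopter_reinforce_model.pth", map_location=device, weights_only=False)
model.eval()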

# Create the environment
def create_pixelcopter_env():
    game = Pixelcopter()
    env = PLE(game, fps=30, display_screen=True)  # Set display_screen=False for headless
    return env

# Initialize environment
env = create_pixelcopter_env()
env.init()

# Run trained agent
def run_agent(model, env, episodes=5):
    total_rewards = []
    
    for episode in range(episodes):
        env.reset_game()
        episode_reward = 0
        
        while not env.game_over():
            # Get current state
            state = env.getScreenRGB()  # or env.getGameState() if using features
            state = preprocess_state(state)  # Apply your preprocessing
            
            # Convert to tensor
            state_tensor = torch.FloatTensor(state).unsqueeze(0).to(device)
            
            # Get action probabilities
            with torch.no_grad():
                action_probs = model(state_tensor)
                action = torch.multinomial(action_probs, 1).item()
            
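            # Execute action: PLE.act() expects an entry of env.getActionSet(),
            # not a raw index, so map the index (order must match training)
            reward = env.act(env.getActionSet()[action])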
            episode_reward += reward
        
        total_rewards.append(episode_reward)
        print(f"Episode {episode + 1}: Reward = {episode_reward:.2f}")
    
    mean_reward = np.mean(total_rewards)
    std_reward = np.std(total_rewards)
    print(f"\nAverage Performance: {mean_reward:.2f} ± {std_reward:.2f}")
    
    return total_rewards

# Preprocessing function (adjust based on your model's input requirements)
def preprocess_state(state):
    """
    Preprocess the game state for the neural network
    This should match the preprocessing used during training
    """
    if isinstance(state, np.ndarray) and len(state.shape) == 3:
        # If using image input
        state = np.transpose(state, (2, 0, 1))  # Channel first
        state = state / 255.0  # Normalize pixels
        return state.flatten()  # or keep as image depending on model
    else:
        # If using game state features
        return np.array(list(state.values()))

# Run the agent
rewards = run_agent(model, env, episodes=10)
```
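
When the feature-based branch of `preprocess_state` is used, the vector layout depends on the order of `state.values()`, which follows the dict's insertion order. A hedged alternative is to pin the order explicitly so training and inference always agree. In the sketch below, the helper `state_to_vector`, the key names, and the values are illustrative; inspect `env.getGameState()` for the real keys.

```python
def state_to_vector(game_state, feature_order):
    """Flatten a feature dict into a list using an explicit, fixed key order."""
    missing = [k for k in feature_order if k not in game_state]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [float(game_state[k]) for k in feature_order]


# Illustrative keys and values; inspect env.getGameState() for the real ones.
FEATURE_ORDER = ["player_y", "player_vel", "next_gate_dist_to_player"]
example_state = {"next_gate_dist_to_player": 30, "player_vel": -1.5, "player_y": 24}

print(state_to_vector(example_state, FEATURE_ORDER))  # [24.0, -1.5, 30.0]
```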

### Training Your Own Agent

```python
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from collections import deque

class PolicyNetwork(nn.Module):
    def __init__(self, state_size, action_size, hidden_size=64):
        super(PolicyNetwork, self).__init__()
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, action_size)
        self.softmax = nn.Softmax(dim=1)
        
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return self.softmax(x)

class REINFORCEAgent:
    def __init__(self, state_size, action_size, lr=0.001):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.policy_net = PolicyNetwork(state_size, action_size).to(self.device)
        self.optimizer = optim.Adam(self.policy_net.parameters(), lr=lr)
        
        self.saved_log_probs = []
        self.rewards = []
        
    def select_action(self, state):
        state = torch.FloatTensor(state).unsqueeze(0).to(self.device)
        probs = self.policy_net(state)
        action = torch.multinomial(probs, 1)
        
        self.saved_log_probs.append(torch.log(probs.squeeze(0)[action]))
        return action.item()
    
    def update_policy(self, gamma=0.99):
        # Calculate discounted rewards
        discounted_rewards = []
        R = 0
        
        for r in reversed(self.rewards):
            R = r + gamma * R
            discounted_rewards.insert(0, R)
        
        # Normalize rewards
        discounted_rewards = torch.FloatTensor(discounted_rewards).to(self.device)
        discounted_rewards = (discounted_rewards - discounted_rewards.mean()) / (discounted_rewards.std() + 1e-8)
        
        # Calculate policy loss
        policy_loss = []
        for log_prob, reward in zip(self.saved_log_probs, discounted_rewards):
            policy_loss.append(-log_prob * reward)
        
        # Update policy
        self.optimizer.zero_grad()
        policy_loss = torch.cat(policy_loss).sum()
        policy_loss.backward()
        self.optimizer.step()
        
        # Clear episode data
        self.saved_log_probs.clear()
        self.rewards.clear()
        
        return policy_loss.item()

def train_agent(episodes=2000):
    env = create_pixelcopter_env()
    env.init()
    
    # Determine state size based on your preprocessing
    state_size = len(preprocess_state(env.getScreenRGB()))  # Adjust as needed
    action_size = 2  # do nothing, thrust
    
    agent = REINFORCEAgent(state_size, action_size)
    
    scores = deque(maxlen=100)
    
    for episode in range(episodes):
        env.reset_game()
        episode_reward = 0
        
        while not env.game_over():
            state = preprocess_state(env.getScreenRGB())
            action = agent.select_action(state)
            
            # PLE.act() expects an entry of env.getActionSet(), not a raw index
            reward = env.act(env.getActionSet()[action])
            agent.rewards.append(reward)
            episode_reward += reward
        
        # Update policy after each episode
        loss = agent.update_policy()
        scores.append(episode_reward)
        
        if episode % 100 == 0:
            avg_score = np.mean(scores)
            print(f"Episode {episode}, Average Score: {avg_score:.2f}, Loss: {loss:.4f}")
    
    # Save the trained model
    torch.save(agent.policy_net, "pixelcopter_reinforce_model.pth")
    return agent

# Train a new agent
# trained_agent = train_agent()
```
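
The discounted-return loop inside `update_policy` is easy to check by hand on a short episode. The sketch below isolates that step in plain Python with a hypothetical three-step reward sequence and a deliberately small `gamma` so the numbers are simple. The function name `discounted_returns` is illustrative.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step t of one episode."""
    returns = []
    running = 0.0
    # Walk backwards so each return reuses the one computed for the next step.
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()  # restore chronological order
    return returns


print(discounted_returns([1.0, 0.0, 1.0], gamma=0.5))  # [1.25, 0.5, 1.0]
```

Appending and reversing once gives the same result as repeated `insert(0, R)` while avoiding the quadratic cost on long episodes.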

### Evaluation and Analysis

```python
import matplotlib.pyplot as plt

def evaluate_agent_detailed(model, env, episodes=50):
    """Detailed evaluation with statistics and visualization"""
    rewards = []
    episode_lengths = []
    
    for episode in range(episodes):
        env.reset_game()
        episode_reward = 0
        steps = 0
        
        while not env.game_over():
            state = preprocess_state(env.getScreenRGB())
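            # Keep the input on the same device as the model
            state_tensor = torch.FloatTensor(state).unsqueeze(0).to(next(model.parameters()).device)
            
            with torch.no_grad():
                action_probs = model(state_tensor)
                action = torch.multinomial(action_probs, 1).item()
            
            # PLE.act() expects an entry of env.getActionSet(), not a raw index
            reward = env.act(env.getActionSet()[action])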
            episode_reward += reward
            steps += 1
        
        rewards.append(episode_reward)
        episode_lengths.append(steps)
        
        if (episode + 1) % 10 == 0:
            print(f"Episodes {episode + 1}/{episodes} completed")
    
    # Statistical analysis
    mean_reward = np.mean(rewards)
    std_reward = np.std(rewards)
    median_reward = np.median(rewards)
    max_reward = np.max(rewards)
    min_reward = np.min(rewards)
    
    mean_length = np.mean(episode_lengths)
    
    print(f"\n--- Evaluation Results ---")
    print(f"Episodes: {episodes}")
    print(f"Mean Reward: {mean_reward:.2f} ± {std_reward:.2f}")
    print(f"Median Reward: {median_reward:.2f}")
    print(f"Max Reward: {max_reward:.2f}")
    print(f"Min Reward: {min_reward:.2f}")
    print(f"Mean Episode Length: {mean_length:.1f} steps")
    
    # Visualization
    plt.figure(figsize=(12, 4))
    
    plt.subplot(1, 2, 1)
    plt.plot(rewards)
    plt.axhline(y=mean_reward, color='r', linestyle='--', label=f'Mean: {mean_reward:.2f}')
    plt.title('Episode Rewards')
    plt.xlabel('Episode')
    plt.ylabel('Reward')
    plt.legend()
    
    plt.subplot(1, 2, 2)
    plt.hist(rewards, bins=20, alpha=0.7)
    plt.axvline(x=mean_reward, color='r', linestyle='--', label=f'Mean: {mean_reward:.2f}')
    plt.title('Reward Distribution')
    plt.xlabel('Reward')
    plt.ylabel('Frequency')
    plt.legend()
    
    plt.tight_layout()
    plt.show()
    
    return {
        'rewards': rewards,
        'episode_lengths': episode_lengths,
        'stats': {
            'mean': mean_reward,
            'std': std_reward,
            'median': median_reward,
            'max': max_reward,
            'min': min_reward
        }
    }

# Run detailed evaluation
# results = evaluate_agent_detailed(model, env, episodes=100)
```

## Training Information

### Hyperparameters

The REINFORCE agent was trained with carefully tuned hyperparameters:
- **Learning Rate**: Optimized for stable policy gradient updates
- **Discount Factor (γ)**: Balances immediate vs. future rewards
- **Network Architecture**: Multi-layer perceptron with appropriate hidden dimensions
- **Training Length**: Enough episodes for the policy to learn temporal patterns

### Training Environment

- **State Representation**: Processed game screen or extracted features
- **Action Space**: Binary discrete actions (do nothing vs. thrust)
- **Reward Signal**: Game score progression with survival bonus
- **Training Episodes**: Extended training to achieve stable performance

### Algorithm Characteristics

- **Sample Efficiency**: Moderate (typical for policy gradient methods)
- **Stability**: Good convergence with proper hyperparameter tuning
- **Exploration**: Built-in through stochastic policy
- **Interpretability**: Clear policy learning through gradient ascent

## Limitations and Considerations

- **Sample Efficiency**: REINFORCE requires many episodes to learn effectively
- **Variance**: Policy gradient estimates can have high variance
- **Environment Specific**: Trained specifically for Pixelcopter game mechanics
- **Stochastic Performance**: Episode rewards vary due to policy stochasticity
- **Real-time Performance**: Inference speed suitable for real-time game play
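
One way to see where the run-to-run variation comes from is to compare sampling an action from the policy's output with always taking the most probable action. The sketch below does this for a single hypothetical probability vector. The helper names are illustrative, and whether greedy selection actually changes this model's scores has not been tested here.

```python
import random


def sample_action(probs, rng):
    """Draw an action index in proportion to its probability (stochastic)."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def greedy_action(probs):
    """Always pick the most probable action (deterministic)."""
    return max(range(len(probs)), key=lambda i: probs[i])


rng = random.Random(0)
probs = [0.3, 0.7]  # hypothetical policy output for one state

sampled = [sample_action(probs, rng) for _ in range(1000)]
print(sum(sampled) / len(sampled))  # fraction of draws that chose action 1, near 0.7
print(greedy_action(probs))         # 1
```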

## Related Work and Extensions

This model serves as an excellent educational example for:
- **Policy Gradient Methods**: Understanding direct policy optimization
- **Deep Reinforcement Learning**: Practical implementation of RL algorithms
- **Game AI**: Learning complex temporal control tasks
- **Baseline Comparisons**: Foundation for more advanced algorithms (A2C, PPO, etc.)

## Citation

If you use this model in your research or educational projects, please cite:

```bibtex
@misc{pixelcopter_reinforce_2024,
  title={REINFORCE Agent for Pixelcopter-PLE-v0},
  author={Adilbai},
  year={2024},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/Adilbai/Pixelcopter-RL}},
  note={Trained following Deep RL Course Unit 4}
}
```

## Educational Resources

This model was developed following the **Deep Reinforcement Learning Course Unit 4**:
- **Course Link**: [https://huggingface.co/deep-rl-course/unit4/introduction](https://huggingface.co/deep-rl-course/unit4/introduction)
- **Topic**: Policy Gradient Methods and REINFORCE
- **Learning Objectives**: Understanding policy-based RL algorithms

For comprehensive learning about REINFORCE and policy gradient methods, refer to the complete course materials.

## License

This model is distributed under the MIT License. The model is intended for educational and research purposes.