---
tags:
- Pixelcopter-PLE-v0
- reinforce
- reinforcement-learning
- custom-implementation
- deep-rl-class
model-index:
- name: Pixelcopter-RL
  results:
  - task:
      type: reinforcement-learning
      name: reinforcement-learning
    dataset:
      name: Pixelcopter-PLE-v0
      type: Pixelcopter-PLE-v0
    metrics:
    - type: mean_reward
      value: 13.10 +/- 6.89
      name: mean_reward
      verified: false
---

# REINFORCE Agent for Pixelcopter-PLE-v0

## Model Description

This repository contains a trained REINFORCE (policy gradient) reinforcement learning agent that has learned to play Pixelcopter-PLE-v0, a challenging helicopter navigation game from the PyGame Learning Environment (PLE). The agent uses policy gradient methods to learn flight control strategies through trial and error.

### Model Details

- **Algorithm**: REINFORCE (Monte Carlo Policy Gradient)
- **Environment**: Pixelcopter-PLE-v0 (PyGame Learning Environment)
- **Framework**: Custom implementation following Deep RL Course guidelines
- **Task Type**: Discrete control (binary actions)
- **Action Space**: Discrete (2 actions: do nothing or thrust up)
- **Observation Space**: Visual/pixel-based or feature-based state representation

### Environment Overview

Pixelcopter-PLE-v0 is a classic helicopter control game:

- **Objective**: Navigate a helicopter through obstacles without crashing
- **Challenge**: Requires precise timing and control to avoid the ceiling, floor, and obstacles
- **Physics**: Gravity constantly pulls the helicopter down; the player must apply thrust to maintain altitude
- **Scoring**: Points are awarded for surviving longer and successfully navigating through gaps
- **Difficulty**: Requires learning temporal dependencies and precise action timing

## Performance

The trained REINFORCE agent achieves the following performance:

- **Mean Reward**: 13.10 ± 6.89
- **Performance Analysis**: Solid performance for this challenging environment
- **Consistency**: The standard deviation indicates moderate variability, which is expected for policy gradient methods

### Performance Context

The mean reward of 13.10 demonstrates that the agent has learned to:

- Navigate through multiple obstacles before crashing
- Balance altitude control against obstacle avoidance
- Time thrust applications effectively
- Survive consistently longer than a random baseline

The variability (±6.89) is characteristic of policy gradient methods and reflects the stochastic nature of the learned policy: episode outcomes differ depending on the actions sampled.
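The two-action control scheme described above can be inspected directly in PLE before loading the model. A minimal sketch, assuming `ple` is installed as described in the Usage section below (the exact feature keys returned by `getGameState()` depend on the PLE version):

```python
from ple import PLE
from ple.games.pixelcopter import Pixelcopter

# Build the raw PLE environment without opening a window
game = Pixelcopter()
p = PLE(game, fps=30, display_screen=False)
p.init()

# Pixelcopter exposes two actions: a thrust key and None (the no-op)
print("Action set:", p.getActionSet())

# Feature-based game state (positions, velocity, distances to obstacles)
print("Game state:", p.getGameState())
```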
## Algorithm: REINFORCE

REINFORCE is a foundational policy gradient algorithm:

- **Direct Policy Learning**: Learns a parameterized policy directly, with no value function
- **Monte Carlo Updates**: Uses complete episode returns for policy updates
- **Policy Gradient**: Updates policy parameters in the direction of higher expected return
- **Stochastic Policy**: Learns probabilistic action selection, which provides exploration

### Key Advantages

- Simple and intuitive policy gradient approach
- Works with both discrete and continuous action spaces
- No need for value function approximation
- Good educational foundation for understanding policy gradients

## Usage

### Installation Requirements

```bash
# Core dependencies
pip install torch torchvision
pip install gymnasium
pip install pygame-learning-environment
pip install numpy matplotlib

# If PLE is not available from PyPI, install it from source instead:
# pip install git+https://github.com/ntasfi/PyGame-Learning-Environment.git

# For visualization and analysis
pip install pillow
pip install imageio  # for gif creation
```

### Loading and Using the Model

```python
import torch
import numpy as np
from ple import PLE
from ple.games.pixelcopter import Pixelcopter

# Load the trained model
# Note: adjust the path to match your model file structure.
# On newer PyTorch versions, unpickling a full module may require weights_only=False.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.load("pixelcopter_reinforce_model.pth", map_location=device)
model.eval()

# Create the environment
def create_pixelcopter_env():
    game = Pixelcopter()
    # PLE's flag is display_screen; set it to False for headless runs
    env = PLE(game, fps=30, display_screen=True)
    return env

# Initialize environment
env = create_pixelcopter_env()
env.init()

# PLE's act() expects the raw action values from getActionSet(), not indices.
# For Pixelcopter the set contains the thrust key and None (the no-op).
ACTIONS = env.getActionSet()

# Preprocessing function (adjust based on your model's input requirements)
def preprocess_state(state):
    """
    Preprocess the game state for the neural network.
    This should match the preprocessing used during training.
    """
    if isinstance(state, np.ndarray) and len(state.shape) == 3:
        # Image input: channel-first, normalized pixels
        state = np.transpose(state, (2, 0, 1))
        state = state / 255.0
        return state.flatten()  # or keep as an image, depending on the model
    else:
        # Feature-based game state (dict from env.getGameState())
        return np.array(list(state.values()), dtype=np.float32)

# Run the trained agent
def run_agent(model, env, episodes=5):
    total_rewards = []

    for episode in range(episodes):
        env.reset_game()
        episode_reward = 0

        while not env.game_over():
            # Get the current state
            state = env.getScreenRGB()  # or env.getGameState() if using features
            state = preprocess_state(state)

            # Convert to tensor
            state_tensor = torch.FloatTensor(state).unsqueeze(0).to(device)

            # Get action probabilities and sample an action index
            with torch.no_grad():
                action_probs = model(state_tensor)
            action = torch.multinomial(action_probs, 1).item()

            # Execute the action (index mapped onto PLE's action set)
            reward = env.act(ACTIONS[action])
            episode_reward += reward

        total_rewards.append(episode_reward)
        print(f"Episode {episode + 1}: Reward = {episode_reward:.2f}")

    mean_reward = np.mean(total_rewards)
    std_reward = np.std(total_rewards)
    print(f"\nAverage Performance: {mean_reward:.2f} ± {std_reward:.2f}")
    return total_rewards

# Run the agent
rewards = run_agent(model, env, episodes=10)
```
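Since `imageio` is listed above for GIF creation, here is a minimal sketch of recording one sampled episode as a GIF. It reuses the `model`, `env`, `ACTIONS`, `preprocess_state`, and `device` names from the block above; the frame orientation returned by `getScreenRGB()` may need adjusting depending on your PLE version.

```python
import imageio
import numpy as np
import torch

def record_episode(model, env, out_path="replay.gif"):
    """Play one episode with the trained policy and save the frames as a GIF."""
    frames = []
    env.reset_game()

    while not env.game_over():
        # pygame surfaces are typically (width, height, 3); transpose to
        # (height, width, 3) for imageio. Remove this if frames look rotated.
        frame = np.transpose(env.getScreenRGB(), (1, 0, 2))
        frames.append(frame.astype(np.uint8))

        state = preprocess_state(env.getScreenRGB())
        state_tensor = torch.FloatTensor(state).unsqueeze(0).to(device)
        with torch.no_grad():
            action_probs = model(state_tensor)
        action = torch.multinomial(action_probs, 1).item()
        env.act(ACTIONS[action])

    # Pass a duration argument to imageio.mimsave if you want to control playback speed
    imageio.mimsave(out_path, frames)
    print(f"Saved {len(frames)} frames to {out_path}")

# record_episode(model, env)
```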
### Training Your Own Agent

```python
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from collections import deque

class PolicyNetwork(nn.Module):
    def __init__(self, state_size, action_size, hidden_size=64):
        super(PolicyNetwork, self).__init__()
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, action_size)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return self.softmax(x)

class REINFORCEAgent:
    def __init__(self, state_size, action_size, lr=0.001):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.policy_net = PolicyNetwork(state_size, action_size).to(self.device)
        self.optimizer = optim.Adam(self.policy_net.parameters(), lr=lr)
        self.saved_log_probs = []
        self.rewards = []

    def select_action(self, state):
        state = torch.FloatTensor(state).unsqueeze(0).to(self.device)
        probs = self.policy_net(state)
        action = torch.multinomial(probs, 1)
        self.saved_log_probs.append(torch.log(probs.squeeze(0)[action]))
        return action.item()

    def update_policy(self, gamma=0.99):
        # Calculate discounted returns, accumulated backwards through the episode
        discounted_rewards = []
        R = 0
        for r in reversed(self.rewards):
            R = r + gamma * R
            discounted_rewards.insert(0, R)

        # Normalize returns to reduce gradient variance
        discounted_rewards = torch.FloatTensor(discounted_rewards).to(self.device)
        discounted_rewards = (discounted_rewards - discounted_rewards.mean()) / (discounted_rewards.std() + 1e-8)

        # Calculate the policy loss (negative log-probability weighted by return)
        policy_loss = []
        for log_prob, reward in zip(self.saved_log_probs, discounted_rewards):
            policy_loss.append(-log_prob * reward)

        # Update the policy
        self.optimizer.zero_grad()
        policy_loss = torch.cat(policy_loss).sum()
        policy_loss.backward()
        self.optimizer.step()

        # Clear episode data
        self.saved_log_probs.clear()
        self.rewards.clear()

        return policy_loss.item()

def train_agent(episodes=2000):
    env = create_pixelcopter_env()
    env.init()
    actions = env.getActionSet()

    # Determine state size based on your preprocessing
    state_size = len(preprocess_state(env.getScreenRGB()))  # adjust as needed
    action_size = 2  # do nothing, thrust

    agent = REINFORCEAgent(state_size, action_size)
    scores = deque(maxlen=100)

    for episode in range(episodes):
        env.reset_game()
        episode_reward = 0

        while not env.game_over():
            state = preprocess_state(env.getScreenRGB())
            action = agent.select_action(state)
            # Map the sampled index onto PLE's action set
            reward = env.act(actions[action])
            agent.rewards.append(reward)
            episode_reward += reward

        # Update the policy after each episode
        loss = agent.update_policy()
        scores.append(episode_reward)

        if episode % 100 == 0:
            avg_score = np.mean(scores)
            print(f"Episode {episode}, Average Score: {avg_score:.2f}, Loss: {loss:.4f}")

    # Save the trained model
    torch.save(agent.policy_net, "pixelcopter_reinforce_model.pth")
    return agent

# Train a new agent
# trained_agent = train_agent()
```
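The sketch above saves the entire `nn.Module` with `torch.save(agent.policy_net, ...)`, which requires the `PolicyNetwork` class to be importable at load time (and, on recent PyTorch versions, `weights_only=False` when unpickling). A more portable alternative, sketched here under the assumption that you control both the save and load side, is to store only the `state_dict`. The filename and `state_size` below are hypothetical placeholders; `PolicyNetwork` is reused from the training sketch above.

```python
import torch

# Saving: keep only the parameters instead of pickling the whole module
# torch.save(agent.policy_net.state_dict(), "pixelcopter_reinforce_state_dict.pth")

# Loading: rebuild the architecture first, then restore the parameters.
# state_size must match the preprocessing used during training (7 is a placeholder).
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
policy = PolicyNetwork(state_size=7, action_size=2)
policy.load_state_dict(
    torch.load("pixelcopter_reinforce_state_dict.pth", map_location=device)
)
policy.to(device)
policy.eval()
```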
### Evaluation and Analysis

```python
import matplotlib.pyplot as plt
import numpy as np
import torch

def evaluate_agent_detailed(model, env, episodes=50):
    """Detailed evaluation with statistics and visualization."""
    rewards = []
    episode_lengths = []

    for episode in range(episodes):
        env.reset_game()
        episode_reward = 0
        steps = 0

        while not env.game_over():
            state = preprocess_state(env.getScreenRGB())
            state_tensor = torch.FloatTensor(state).unsqueeze(0).to(device)

            with torch.no_grad():
                action_probs = model(state_tensor)
            action = torch.multinomial(action_probs, 1).item()

            # Map the sampled index onto PLE's action set
            reward = env.act(ACTIONS[action])
            episode_reward += reward
            steps += 1

        rewards.append(episode_reward)
        episode_lengths.append(steps)

        if (episode + 1) % 10 == 0:
            print(f"Episodes {episode + 1}/{episodes} completed")

    # Statistical analysis
    mean_reward = np.mean(rewards)
    std_reward = np.std(rewards)
    median_reward = np.median(rewards)
    max_reward = np.max(rewards)
    min_reward = np.min(rewards)
    mean_length = np.mean(episode_lengths)

    print("\n--- Evaluation Results ---")
    print(f"Episodes: {episodes}")
    print(f"Mean Reward: {mean_reward:.2f} ± {std_reward:.2f}")
    print(f"Median Reward: {median_reward:.2f}")
    print(f"Max Reward: {max_reward:.2f}")
    print(f"Min Reward: {min_reward:.2f}")
    print(f"Mean Episode Length: {mean_length:.1f} steps")

    # Visualization
    plt.figure(figsize=(12, 4))

    plt.subplot(1, 2, 1)
    plt.plot(rewards)
    plt.axhline(y=mean_reward, color='r', linestyle='--', label=f'Mean: {mean_reward:.2f}')
    plt.title('Episode Rewards')
    plt.xlabel('Episode')
    plt.ylabel('Reward')
    plt.legend()

    plt.subplot(1, 2, 2)
    plt.hist(rewards, bins=20, alpha=0.7)
    plt.axvline(x=mean_reward, color='r', linestyle='--', label=f'Mean: {mean_reward:.2f}')
    plt.title('Reward Distribution')
    plt.xlabel('Reward')
    plt.ylabel('Frequency')
    plt.legend()

    plt.tight_layout()
    plt.show()

    return {
        'rewards': rewards,
        'episode_lengths': episode_lengths,
        'stats': {
            'mean': mean_reward,
            'std': std_reward,
            'median': median_reward,
            'max': max_reward,
            'min': min_reward
        }
    }

# Run detailed evaluation
# results = evaluate_agent_detailed(model, env, episodes=100)
```

## Training Information

### Hyperparameters

The REINFORCE agent was trained with hyperparameters along the lines of the training sketch above:

- **Learning Rate**: Chosen for stable policy gradient updates
- **Discount Factor (γ)**: Balances immediate vs. future rewards (see the formulas below)
- **Network Architecture**: Multi-layer perceptron with appropriately sized hidden layers
- **Training Length**: Enough episodes to learn the game's temporal patterns
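For reference, these are the quantities the training sketch computes: the discounted return accumulated backwards in `update_policy` (with γ = 0.99 as its default), and the Monte Carlo policy-gradient estimate that the `-log_prob * reward` loss implements, with per-episode return normalization standing in for a learned baseline.

$$G_t = \sum_{k=t}^{T-1} \gamma^{\,k-t}\, r_{k+1}$$

$$\nabla_\theta J(\theta) \approx \sum_{t=0}^{T-1} \hat{G}_t \,\nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

Here $T$ is the episode length and $\hat{G}_t$ are the normalized returns.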
### Training Environment

- **State Representation**: Processed game screen or extracted features
- **Action Space**: Binary discrete actions (do nothing vs. thrust)
- **Reward Signal**: Game score progression with a survival bonus
- **Training Episodes**: Extended training to reach stable performance

### Algorithm Characteristics

- **Sample Efficiency**: Moderate (typical for policy gradient methods)
- **Stability**: Good convergence with proper hyperparameter tuning
- **Exploration**: Built in through the stochastic policy
- **Interpretability**: Clear policy learning through gradient ascent

## Limitations and Considerations

- **Sample Efficiency**: REINFORCE requires many episodes to learn effectively
- **Variance**: Policy gradient estimates can have high variance
- **Environment Specific**: Trained specifically for Pixelcopter game mechanics
- **Stochastic Performance**: Episode rewards vary because the policy samples its actions
- **Real-time Performance**: Inference is fast enough for real-time play

## Related Work and Extensions

This model serves as an educational example for:

- **Policy Gradient Methods**: Understanding direct policy optimization
- **Deep Reinforcement Learning**: Practical implementation of RL algorithms
- **Game AI**: Learning temporally extended control tasks
- **Baseline Comparisons**: A foundation for more advanced algorithms (A2C, PPO, etc.)

## Citation

If you use this model in your research or educational projects, please cite:

```bibtex
@misc{pixelcopter_reinforce_2024,
  title={REINFORCE Agent for Pixelcopter-PLE-v0},
  author={Adilbai},
  year={2024},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/Adilbai/Pixelcopter-RL}},
  note={Trained following Deep RL Course Unit 4}
}
```

## Educational Resources

This model was developed following **Unit 4 of the Deep Reinforcement Learning Course**:

- **Course Link**: [https://huggingface.co/deep-rl-course/unit4/introduction](https://huggingface.co/deep-rl-course/unit4/introduction)
- **Topic**: Policy Gradient Methods and REINFORCE
- **Learning Objectives**: Understanding policy-based RL algorithms

For a comprehensive treatment of REINFORCE and policy gradient methods, refer to the full course materials.

## License

This model is distributed under the MIT License and is intended for educational and research purposes.