Update README.md

README.md

@@ -5,7 +5,6 @@ tags:
 - deep-reinforcement-learning
 - reinforcement-learning
 - custom-implementation
-- deep-rl-course
 model-index:
 - name: PPO
   results:
@@ -21,41 +20,136 @@ model-index:
       name: mean_reward
       verified: false
 ---

# PPO Agent for LunarLander-v2

## Model Description

This is a Proximal Policy Optimization (PPO) agent trained to play the LunarLander-v2 environment from OpenAI Gym. The model was trained using a custom PyTorch implementation of the PPO algorithm.

## Model Details

- **Model Type**: Reinforcement Learning Agent (PPO)
- **Architecture**: Actor-Critic Neural Network
- **Framework**: PyTorch
- **Environment**: LunarLander-v2 (OpenAI Gym)
- **Algorithm**: Proximal Policy Optimization (PPO)
- **Training Library**: Custom PyTorch implementation

## Training Details

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Total Timesteps | 50,000 |
| Learning Rate | 0.00025 |
| Number of Environments | 4 |
| Steps per Environment | 128 |
| Batch Size | 512 |
| Minibatch Size | 128 |
| Number of Minibatches | 4 |
| Update Epochs | 4 |
| Discount Factor (γ) | 0.99 |
| GAE Lambda (λ) | 0.95 |
| Clip Coefficient | 0.2 |
| Value Function Coefficient | 0.5 |
| Entropy Coefficient | 0.01 |
| Max Gradient Norm | 0.5 |
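
For reference, the table above maps onto a configuration object like the sketch below; the field names are illustrative assumptions, not the exact names used in the original training script.

```python
from dataclasses import dataclass

@dataclass
class PPOConfig:
    """Hyperparameters from the table above (field names are illustrative)."""
    total_timesteps: int = 50_000
    learning_rate: float = 2.5e-4
    num_envs: int = 4
    num_steps: int = 128        # rollout steps per environment
    num_minibatches: int = 4
    update_epochs: int = 4
    gamma: float = 0.99         # discount factor
    gae_lambda: float = 0.95
    clip_coef: float = 0.2
    vf_coef: float = 0.5
    ent_coef: float = 0.01
    max_grad_norm: float = 0.5

    @property
    def batch_size(self) -> int:        # 4 envs * 128 steps = 512
        return self.num_envs * self.num_steps

    @property
    def minibatch_size(self) -> int:    # 512 / 4 minibatches = 128
        return self.batch_size // self.num_minibatches
```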

### Training Configuration

- **Seed**: 1 (for reproducibility)
- **Device**: CUDA enabled
- **Learning Rate Annealing**: Enabled
- **Generalized Advantage Estimation (GAE)**: Enabled
- **Advantage Normalization**: Enabled
- **Value Loss Clipping**: Enabled
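
Because GAE and advantage normalization are enabled, the advantage computation presumably follows the standard recursion sketched below. This is an illustrative reconstruction, not the original code; the tensor names (`rewards`, `values`, `dones`, `next_value`) are assumptions.

```python
import torch

def compute_gae(rewards, values, dones, next_value, gamma=0.99, gae_lambda=0.95):
    # rewards[t]: reward received after the action at step t
    # values[t]:  critic estimate V(s_t)
    # dones[t]:   1.0 if the episode ended at step t, else 0.0
    # next_value: critic estimate for the state after the final rollout step
    num_steps = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    last_gae = torch.zeros_like(next_value)
    for t in reversed(range(num_steps)):
        next_v = next_value if t == num_steps - 1 else values[t + 1]
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_v * nonterminal - values[t]
        last_gae = delta + gamma * gae_lambda * nonterminal * last_gae
        advantages[t] = last_gae
    returns = advantages + values
    return advantages, returns

# Usage on a rollout of shape (num_steps=128, num_envs=4):
#   advantages, returns = compute_gae(rewards, values, dones, next_value)
# Advantage normalization (enabled above) is then typically applied per minibatch:
#   adv = (adv - adv.mean()) / (adv.std() + 1e-8)
```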

## Performance

### Evaluation Results

- **Environment**: LunarLander-v2
- **Mean Reward**: -113.57 ± 74.63

The agent achieves a mean reward of -113.57 with a standard deviation of 74.63 over evaluation episodes.
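
The exact evaluation script is not included in this card; the sketch below shows one way such a mean and standard deviation could be computed, assuming a Gymnasium-style API and an `agent` object exposing a `get_action_and_value(obs)` method (see the actor-critic sketch under Technical Implementation).

```python
import gymnasium as gym
import numpy as np
import torch

def evaluate(agent, n_episodes=10, device="cpu"):
    """Roll out n_episodes greedily-sampled episodes and report mean/std return."""
    env = gym.make("LunarLander-v2")
    returns = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        done, ep_return = False, 0.0
        while not done:
            with torch.no_grad():
                action, _, _, _ = agent.get_action_and_value(
                    torch.as_tensor(obs, dtype=torch.float32, device=device)
                )
            obs, reward, terminated, truncated, _ = env.step(int(action))
            ep_return += reward
            done = terminated or truncated
        returns.append(ep_return)
    env.close()
    return float(np.mean(returns)), float(np.std(returns))
```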

## Usage

This model can be used for:
- Reinforcement learning research and experimentation
- Educational purposes to understand PPO implementation
- Baseline comparison for LunarLander-v2 experiments
- Fine-tuning starting point for similar control tasks
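
A possible loading sketch follows. The checkpoint filename (`ppo_lunarlander.pt`) and format (a plain PyTorch `state_dict`) are assumptions made purely for illustration; check the repository's file listing for the actual artifact name and format.

```python
# Hypothetical loading example -- the filename and checkpoint format below are
# assumptions, not confirmed by this model card.
import torch
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(repo_id="Adilbai/cLLv2", filename="ppo_lunarlander.pt")
state_dict = torch.load(ckpt_path, map_location="cpu")
# Load the weights into an actor-critic module matching the architecture
# described under "Technical Implementation" below, then act by sampling from
# (or taking the argmax of) the policy's action distribution.
```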

## Technical Implementation

### Architecture Details

The model uses an Actor-Critic architecture implemented in PyTorch:
- **Actor Network**: Outputs action probabilities for the discrete action space
- **Critic Network**: Estimates state values for advantage computation
- **Shared Features**: Common feature extraction layers (if applicable)
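
A minimal sketch of such an actor-critic module is shown below; the hidden layer sizes, activations, and the use of separate (rather than shared) actor and critic trunks are assumptions, since the card does not specify the exact layers.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class ActorCritic(nn.Module):
    def __init__(self, obs_dim=8, n_actions=4, hidden=64):
        super().__init__()
        # Actor head: maps observations to action logits (softmaxed by Categorical)
        self.actor = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )
        # Critic head: maps observations to a scalar state-value estimate
        self.critic = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def get_value(self, obs):
        return self.critic(obs).squeeze(-1)

    def get_action_and_value(self, obs, action=None):
        dist = Categorical(logits=self.actor(obs))
        if action is None:
            action = dist.sample()
        return action, dist.log_prob(action), dist.entropy(), self.get_value(obs)
```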

### PPO Algorithm Features

- **Clipped Surrogate Objective**: Prevents large policy updates
- **Value Function Clipping**: Stabilizes value function learning
- **Generalized Advantage Estimation**: Reduces variance in advantage estimates
- **Multiple Epochs**: Updates policy multiple times per batch of experience
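
Put together, the per-minibatch update typically combines these pieces as in the sketch below; this is an illustrative reconstruction with assumed variable names, not the original implementation.

```python
import torch

def ppo_loss(new_logprob, old_logprob, advantages, new_value, old_value, returns,
             entropy, clip_coef=0.2, vf_coef=0.5, ent_coef=0.01):
    ratio = (new_logprob - old_logprob).exp()
    # Clipped surrogate objective (a maximization target, so negated for descent)
    pg_loss = -torch.min(
        ratio * advantages,
        torch.clamp(ratio, 1 - clip_coef, 1 + clip_coef) * advantages,
    ).mean()
    # Value loss clipped around the old value estimate
    v_clipped = old_value + torch.clamp(new_value - old_value, -clip_coef, clip_coef)
    v_loss = 0.5 * torch.max(
        (new_value - returns) ** 2, (v_clipped - returns) ** 2
    ).mean()
    # Entropy bonus encourages exploration
    return pg_loss + vf_coef * v_loss - ent_coef * entropy.mean()
```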

## Environment Information

**LunarLander-v2** is a classic control task where an agent must learn to:
- Land a lunar lander safely on a landing pad
- Control thrust and rotation to manage descent
- Balance fuel efficiency with landing accuracy
- Handle continuous state space and discrete action space

**Action Space**: Discrete(4)
- 0: Do nothing
- 1: Fire left orientation engine
- 2: Fire main engine
- 3: Fire right orientation engine

**Observation Space**: Box(8) containing:
- Position (x, y)
- Velocity (x, y)
- Angle and angular velocity
- Left and right leg ground contact
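
These spaces can be checked directly, assuming the Gymnasium API:

```python
# Quick check of the action and observation spaces described above
# (requires the Box2D extra: pip install "gymnasium[box2d]").
import gymnasium as gym

env = gym.make("LunarLander-v2")
print(env.action_space)              # Discrete(4)
print(env.observation_space.shape)   # (8,)
```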

## Training Environment

- **Framework**: Custom PyTorch PPO implementation
- **Parallel Environments**: 4 concurrent environments for data collection
- **Total Training Budget**: 50,000 timesteps summed across all environments
- **Experience Collection**: On-policy learning with trajectory batches
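
Rollout collection with 4 parallel environments could look like the following sketch, assuming Gymnasium's `SyncVectorEnv`; the original wrapper choices are not documented in the card.

```python
# Illustrative setup of 4 parallel LunarLander-v2 environments for on-policy
# rollout collection; wrapper choices are assumptions.
import gymnasium as gym

def make_env(seed: int):
    def thunk():
        env = gym.make("LunarLander-v2")
        env = gym.wrappers.RecordEpisodeStatistics(env)  # tracks episodic returns
        env.reset(seed=seed)
        return env
    return thunk

envs = gym.vector.SyncVectorEnv([make_env(seed) for seed in range(4)])
obs, info = envs.reset(seed=1)   # obs has shape (4, 8): one row per environment
# Each envs.step(actions) call advances all 4 environments by one timestep,
# so 128 steps per environment yield a batch of 4 * 128 = 512 transitions.
```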

## Limitations and Considerations

- The agent has not solved the task: its mean reward (-113.57) is far below the 200-point average at which LunarLander-v2 is conventionally considered solved, and the reward variance is high
- Training was limited to 50,000 timesteps, which may be insufficient for optimal performance
- Performance may vary significantly across different episodes due to the stochastic nature of the environment
- The model has not been tested on variations of the LunarLander environment

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{cllv2_ppo_lunarlander,
  author    = {Adilbai},
  title     = {PPO Agent for LunarLander-v2},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/Adilbai/cLLv2}
}
```

## License

Please refer to the repository license for usage terms and conditions.

## Contact

For questions or issues regarding this model, please open an issue in the model repository or contact the model author.