Update README.md

README.md

@@ -5,7 +5,6 @@ tags:
 - deep-reinforcement-learning
 - reinforcement-learning
 - custom-implementation
-- deep-rl-course
 model-index:
 - name: PPO
   results:
@@ -21,41 +20,136 @@ model-index:
       name: mean_reward
       verified: false
 ---

# PPO Agent for LunarLander-v2

## Model Description

This is a Proximal Policy Optimization (PPO) agent trained to play the LunarLander-v2 environment from OpenAI Gym. The model was trained using a custom PyTorch implementation of the PPO algorithm.

## Model Details

- **Model Type**: Reinforcement Learning Agent (PPO)
- **Architecture**: Actor-Critic Neural Network
- **Framework**: PyTorch
- **Environment**: LunarLander-v2 (OpenAI Gym)
- **Algorithm**: Proximal Policy Optimization (PPO)
- **Training Library**: Custom PyTorch implementation

## Training Details

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Total Timesteps | 50,000 |
| Learning Rate | 0.00025 |
| Number of Environments | 4 |
| Steps per Environment | 128 |
| Batch Size | 512 |
| Minibatch Size | 128 |
| Number of Minibatches | 4 |
| Update Epochs | 4 |
| Discount Factor (γ) | 0.99 |
| GAE Lambda (λ) | 0.95 |
| Clip Coefficient | 0.2 |
| Value Function Coefficient | 0.5 |
| Entropy Coefficient | 0.01 |
| Max Gradient Norm | 0.5 |
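
For reference, the table above maps onto a configuration object like the sketch below; the field names are illustrative assumptions, not the exact names used in the original training script.

```python
from dataclasses import dataclass

@dataclass
class PPOConfig:
    """Hyperparameters from the table above (field names are illustrative)."""
    total_timesteps: int = 50_000
    learning_rate: float = 2.5e-4
    num_envs: int = 4
    num_steps: int = 128        # rollout steps per environment
    num_minibatches: int = 4
    update_epochs: int = 4
    gamma: float = 0.99         # discount factor
    gae_lambda: float = 0.95
    clip_coef: float = 0.2
    vf_coef: float = 0.5
    ent_coef: float = 0.01
    max_grad_norm: float = 0.5

    @property
    def batch_size(self) -> int:        # 4 envs * 128 steps = 512
        return self.num_envs * self.num_steps

    @property
    def minibatch_size(self) -> int:    # 512 / 4 minibatches = 128
        return self.batch_size // self.num_minibatches
```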

### Training Configuration

- **Seed**: 1 (for reproducibility)
- **Device**: CUDA enabled
- **Learning Rate Annealing**: Enabled
- **Generalized Advantage Estimation (GAE)**: Enabled
- **Advantage Normalization**: Enabled
- **Value Loss Clipping**: Enabled
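
Because GAE and advantage normalization are enabled, the advantage computation presumably follows the standard recursion sketched below. This is an illustrative reconstruction, not the original code; the tensor names (`rewards`, `values`, `dones`, `next_value`) are assumptions.

```python
import torch

def compute_gae(rewards, values, dones, next_value, gamma=0.99, gae_lambda=0.95):
    # rewards[t]: reward received after the action at step t
    # values[t]:  critic estimate V(s_t)
    # dones[t]:   1.0 if the episode ended at step t, else 0.0
    # next_value: critic estimate for the state after the final rollout step
    num_steps = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    last_gae = torch.zeros_like(next_value)
    for t in reversed(range(num_steps)):
        next_v = next_value if t == num_steps - 1 else values[t + 1]
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_v * nonterminal - values[t]
        last_gae = delta + gamma * gae_lambda * nonterminal * last_gae
        advantages[t] = last_gae
    returns = advantages + values
    return advantages, returns

# Usage on a rollout of shape (num_steps=128, num_envs=4):
#   advantages, returns = compute_gae(rewards, values, dones, next_value)
# Advantage normalization (enabled above) is then typically applied per minibatch:
#   adv = (adv - adv.mean()) / (adv.std() + 1e-8)
```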

## Performance

### Evaluation Results

- **Environment**: LunarLander-v2
- **Mean Reward**: -113.57 ± 74.63

The agent achieves a mean reward of -113.57 with a standard deviation of 74.63 over evaluation episodes.
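
The exact evaluation script is not included in this card; the sketch below shows one way such a mean and standard deviation could be computed, assuming a Gymnasium-style API and an `agent` object exposing a `get_action_and_value(obs)` method (see the actor-critic sketch under Technical Implementation).

```python
import gymnasium as gym
import numpy as np
import torch

def evaluate(agent, n_episodes=10, device="cpu"):
    """Roll out n_episodes greedily-sampled episodes and report mean/std return."""
    env = gym.make("LunarLander-v2")
    returns = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        done, ep_return = False, 0.0
        while not done:
            with torch.no_grad():
                action, _, _, _ = agent.get_action_and_value(
                    torch.as_tensor(obs, dtype=torch.float32, device=device)
                )
            obs, reward, terminated, truncated, _ = env.step(int(action))
            ep_return += reward
            done = terminated or truncated
        returns.append(ep_return)
    env.close()
    return float(np.mean(returns)), float(np.std(returns))
```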

## Usage

This model can be used for:
- Reinforcement learning research and experimentation
- Educational purposes to understand PPO implementation
- Baseline comparison for LunarLander-v2 experiments
- Fine-tuning starting point for similar control tasks
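
A possible loading sketch follows. The checkpoint filename (`ppo_lunarlander.pt`) and format (a plain PyTorch `state_dict`) are assumptions made purely for illustration; check the repository's file listing for the actual artifact name and format.

```python
# Hypothetical loading example -- the filename and checkpoint format below are
# assumptions, not confirmed by this model card.
import torch
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(repo_id="Adilbai/cLLv2", filename="ppo_lunarlander.pt")
state_dict = torch.load(ckpt_path, map_location="cpu")
# Load the weights into an actor-critic module matching the architecture
# described under "Technical Implementation" below, then act by sampling from
# (or taking the argmax of) the policy's action distribution.
```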

## Technical Implementation

### Architecture Details

The model uses an Actor-Critic architecture implemented in PyTorch:
- **Actor Network**: Outputs action probabilities for the discrete action space
- **Critic Network**: Estimates state values for advantage computation
- **Shared Features**: Common feature extraction layers (if applicable)
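
A minimal sketch of such an actor-critic module is shown below; the hidden layer sizes, activations, and the use of separate (rather than shared) actor and critic trunks are assumptions, since the card does not specify the exact layers.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class ActorCritic(nn.Module):
    def __init__(self, obs_dim=8, n_actions=4, hidden=64):
        super().__init__()
        # Actor head: maps observations to action logits (softmaxed by Categorical)
        self.actor = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )
        # Critic head: maps observations to a scalar state-value estimate
        self.critic = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def get_value(self, obs):
        return self.critic(obs).squeeze(-1)

    def get_action_and_value(self, obs, action=None):
        dist = Categorical(logits=self.actor(obs))
        if action is None:
            action = dist.sample()
        return action, dist.log_prob(action), dist.entropy(), self.get_value(obs)
```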

### PPO Algorithm Features

- **Clipped Surrogate Objective**: Prevents large policy updates
- **Value Function Clipping**: Stabilizes value function learning
- **Generalized Advantage Estimation**: Reduces variance in advantage estimates
- **Multiple Epochs**: Updates policy multiple times per batch of experience
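
Put together, the per-minibatch update typically combines these pieces as in the sketch below; this is an illustrative reconstruction with assumed variable names, not the original implementation.

```python
import torch

def ppo_loss(new_logprob, old_logprob, advantages, new_value, old_value, returns,
             entropy, clip_coef=0.2, vf_coef=0.5, ent_coef=0.01):
    ratio = (new_logprob - old_logprob).exp()
    # Clipped surrogate objective (a maximization target, so negated for descent)
    pg_loss = -torch.min(
        ratio * advantages,
        torch.clamp(ratio, 1 - clip_coef, 1 + clip_coef) * advantages,
    ).mean()
    # Value loss clipped around the old value estimate
    v_clipped = old_value + torch.clamp(new_value - old_value, -clip_coef, clip_coef)
    v_loss = 0.5 * torch.max(
        (new_value - returns) ** 2, (v_clipped - returns) ** 2
    ).mean()
    # Entropy bonus encourages exploration
    return pg_loss + vf_coef * v_loss - ent_coef * entropy.mean()
```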

## Environment Information

**LunarLander-v2** is a classic control task where an agent must learn to:
- Land a lunar lander safely on a landing pad
- Control thrust and rotation to manage descent
- Balance fuel efficiency with landing accuracy
- Handle continuous state space and discrete action space

**Action Space**: Discrete(4)
- 0: Do nothing
- 1: Fire left orientation engine
- 2: Fire main engine
- 3: Fire right orientation engine

**Observation Space**: Box(8) containing:
- Position (x, y)
- Velocity (x, y)
- Angle and angular velocity
- Left and right leg ground contact
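
These spaces can be checked directly, assuming the Gymnasium API:

```python
# Quick check of the action and observation spaces described above
# (requires the Box2D extra: pip install "gymnasium[box2d]").
import gymnasium as gym

env = gym.make("LunarLander-v2")
print(env.action_space)              # Discrete(4)
print(env.observation_space.shape)   # (8,)
```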

## Training Environment

- **Framework**: Custom PyTorch PPO implementation
- **Parallel Environments**: 4 concurrent environments for data collection
- **Total Training Budget**: 50,000 timesteps summed across all environments
- **Experience Collection**: On-policy learning with trajectory batches
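
Rollout collection with 4 parallel environments could look like the following sketch, assuming Gymnasium's `SyncVectorEnv`; the original wrapper choices are not documented in the card.

```python
# Illustrative setup of 4 parallel LunarLander-v2 environments for on-policy
# rollout collection; wrapper choices are assumptions.
import gymnasium as gym

def make_env(seed: int):
    def thunk():
        env = gym.make("LunarLander-v2")
        env = gym.wrappers.RecordEpisodeStatistics(env)  # tracks episodic returns
        env.reset(seed=seed)
        return env
    return thunk

envs = gym.vector.SyncVectorEnv([make_env(seed) for seed in range(4)])
obs, info = envs.reset(seed=1)   # obs has shape (4, 8): one row per environment
# Each envs.step(actions) call advances all 4 environments by one timestep,
# so 128 steps per environment yield a batch of 4 * 128 = 512 transitions.
```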

## Limitations and Considerations

- The agent has not solved the task: its mean reward (-113.57) is far below the 200-point average at which LunarLander-v2 is conventionally considered solved, and the reward variance is high
- Training was limited to 50,000 timesteps, which may be insufficient for optimal performance
- Performance may vary significantly across different episodes due to the stochastic nature of the environment
- The model has not been tested on variations of the LunarLander environment

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{cllv2_ppo_lunarlander,
  author    = {Adilbai},
  title     = {PPO Agent for LunarLander-v2},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/Adilbai/cLLv2}
}
```

## License

Please refer to the repository license for usage terms and conditions.

## Contact

For questions or issues regarding this model, please open an issue in the model repository or contact the model author.