Adilbai committed on
Commit 59b128b (verified)
1 Parent(s): 2c55a84

Update README.md

Files changed (1)
  1. README.md +132 -38
README.md CHANGED
@@ -5,7 +5,6 @@ tags:
  - deep-reinforcement-learning
  - reinforcement-learning
  - custom-implementation
- - deep-rl-course
  model-index:
  - name: PPO
    results:
@@ -21,41 +20,136 @@ model-index:
   name: mean_reward
   verified: false
  ---

- # PPO Agent Playing LunarLander-v2
-
- This is a trained model of a PPO agent playing LunarLander-v2.
-
- # Hyperparameters
- ```python
- {'exp_name': 'ppo'
- 'seed': 1
- 'torch_deterministic': True
- 'cuda': True
- 'track': False
- 'wandb_project_name': 'cleanRL'
- 'wandb_entity': None
- 'capture_video': False
- 'env_id': 'LunarLander-v2'
- 'total_timesteps': 50000
- 'learning_rate': 0.00025
- 'num_envs': 4
- 'num_steps': 128
- 'anneal_lr': True
- 'gae': True
- 'gamma': 0.99
- 'gae_lambda': 0.95
- 'num_minibatches': 4
- 'update_epochs': 4
- 'norm_adv': True
- 'clip_coef': 0.2
- 'clip_vloss': True
- 'ent_coef': 0.01
- 'vf_coef': 0.5
- 'max_grad_norm': 0.5
- 'target_kl': None
- 'repo_id': 'Adilbai/cLLv2'
- 'batch_size': 512
- 'minibatch_size': 128}
- ```
-
+ # PPO Agent for LunarLander-v2
+
+ ## Model Description
+
+ This is a Proximal Policy Optimization (PPO) agent trained to play the LunarLander-v2 environment from OpenAI Gym. The model was trained using a custom PyTorch implementation of the PPO algorithm.
+
+ ## Model Details
+
+ - **Model Type**: Reinforcement Learning Agent (PPO)
+ - **Architecture**: Actor-Critic Neural Network
+ - **Framework**: PyTorch
+ - **Environment**: LunarLander-v2 (OpenAI Gym)
+ - **Algorithm**: Proximal Policy Optimization (PPO)
+ - **Training Library**: Custom PyTorch implementation
+
+ ## Training Details
+
+ ### Hyperparameters
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Total Timesteps | 50,000 |
+ | Learning Rate | 0.00025 |
+ | Number of Environments | 4 |
+ | Steps per Environment | 128 |
+ | Batch Size | 512 |
+ | Minibatch Size | 128 |
+ | Number of Minibatches | 4 |
+ | Update Epochs | 4 |
+ | Discount Factor (γ) | 0.99 |
+ | GAE Lambda (λ) | 0.95 |
+ | Clip Coefficient | 0.2 |
+ | Value Function Coefficient | 0.5 |
+ | Entropy Coefficient | 0.01 |
+ | Max Gradient Norm | 0.5 |
+
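The batch figures in the table are derived quantities, not independent settings: a quick sketch of the arithmetic, assuming the usual convention (visible in the removed hyperparameter dump above) that the rollout buffer holds `num_envs * num_steps` transitions.

```python
# Derived batch sizes from the rollout settings above.
num_envs = 4           # parallel environments
num_steps = 128        # rollout steps per environment
num_minibatches = 4

batch_size = num_envs * num_steps               # 4 * 128 = 512
minibatch_size = batch_size // num_minibatches  # 512 // 4 = 128

print(batch_size, minibatch_size)  # 512 128
```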
+ ### Training Configuration
+
+ - **Seed**: 1 (for reproducibility)
+ - **Device**: CUDA enabled
+ - **Learning Rate Annealing**: Enabled
+ - **Generalized Advantage Estimation (GAE)**: Enabled
+ - **Advantage Normalization**: Enabled
+ - **Value Loss Clipping**: Enabled
+
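The GAE setting enabled above can be sketched as a standalone function; this is the textbook backward recursion with the card's γ and λ, using illustrative inputs and no done-flag handling, not code from the repository.

```python
# Generalized Advantage Estimation (GAE) — minimal pure-Python sketch.
def compute_gae(rewards, values, next_value, gamma=0.99, gae_lambda=0.95):
    """Return advantage estimates for one trajectory (ignores episode ends)."""
    advantages = [0.0] * len(rewards)
    last_gae = 0.0
    for t in reversed(range(len(rewards))):
        v_next = next_value if t == len(rewards) - 1 else values[t + 1]
        delta = rewards[t] + gamma * v_next - values[t]  # TD residual
        last_gae = delta + gamma * gae_lambda * last_gae
        advantages[t] = last_gae
    return advantages

# Illustrative 3-step trajectory (made-up rewards and value estimates).
adv = compute_gae([1.0, 0.0, -1.0], [0.5, 0.4, 0.3], next_value=0.2)
```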
+ ## Performance
+
+ ### Evaluation Results
+
+ - **Environment**: LunarLander-v2
+ - **Mean Reward**: -113.57 ± 74.63
+
+ The agent achieves a mean reward of -113.57 with a standard deviation of 74.63 over evaluation episodes.
+
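The reported score is a mean ± standard deviation over evaluation episodes; a minimal sketch of that reduction, with made-up episode returns since the card does not list the individual episodes:

```python
import statistics

# Illustrative episode returns — NOT the actual evaluation data.
episode_returns = [-50.0, -200.0, -90.0, -120.0]

mean_reward = statistics.mean(episode_returns)
std_reward = statistics.pstdev(episode_returns)  # population std (ddof=0)

print(f"{mean_reward:.2f} +/- {std_reward:.2f}")  # -115.00 +/- 55.00
```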
+ ## Usage
+
+ This model can be used for:
+ - Reinforcement learning research and experimentation
+ - Educational purposes to understand PPO implementation
+ - Baseline comparison for LunarLander-v2 experiments
+ - Fine-tuning starting point for similar control tasks
+
+ ## Technical Implementation
+
+ ### Architecture Details
+
+ The model uses an Actor-Critic architecture implemented in PyTorch:
+ - **Actor Network**: Outputs action probabilities for the discrete action space
+ - **Critic Network**: Estimates state values for advantage computation
+ - **Shared Features**: Common feature extraction layers (if applicable)
+
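A minimal sketch of such an actor-critic module, with separate actor and critic heads over the 8-dimensional observation; the hidden size of 64 is an assumption for illustration, since the card does not document the exact layer sizes.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Minimal actor-critic for LunarLander-v2 (obs dim 8, 4 discrete actions).

    Layer sizes are illustrative; the repository's exact architecture
    is not documented on this card.
    """

    def __init__(self, obs_dim: int = 8, n_actions: int = 4, hidden: int = 64):
        super().__init__()
        # Separate heads (no shared trunk) for clarity.
        self.actor = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),  # action logits
        )
        self.critic = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),          # state-value estimate
        )

    def forward(self, obs: torch.Tensor):
        logits = self.actor(obs)
        value = self.critic(obs).squeeze(-1)
        return logits, value

net = ActorCritic()
logits, value = net(torch.zeros(2, 8))  # batch of 2 dummy observations
```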
+ ### PPO Algorithm Features
+
+ - **Clipped Surrogate Objective**: Prevents large policy updates
+ - **Value Function Clipping**: Stabilizes value function learning
+ - **Generalized Advantage Estimation**: Reduces variance in advantage estimates
+ - **Multiple Epochs**: Updates policy multiple times per batch of experience
+
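For a single sample, the clipped surrogate objective above can be written out directly; a pure-Python sketch using the card's clip coefficient of 0.2 (the real implementation operates on batched tensors):

```python
# PPO clipped surrogate objective for one sample:
# L = min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A), maximized.
def clipped_surrogate(ratio: float, advantage: float, clip_coef: float = 0.2) -> float:
    clipped_ratio = max(1.0 - clip_coef, min(1.0 + clip_coef, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# With a positive advantage, pushing the ratio past 1 + clip_coef
# yields no extra objective — the update is "clipped".
print(clipped_surrogate(1.5, 2.0))  # capped at 1.2 * 2.0 = 2.4
print(clipped_surrogate(0.5, 2.0))  # not capped: 0.5 * 2.0 = 1.0
```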
+ ## Environment Information
+
+ **LunarLander-v2** is a classic control task where an agent must learn to:
+ - Land a lunar lander safely on a landing pad
+ - Control thrust and rotation to manage descent
+ - Balance fuel efficiency with landing accuracy
+ - Handle a continuous state space and a discrete action space
+
+ **Action Space**: Discrete(4)
+ - 0: Do nothing
+ - 1: Fire left orientation engine
+ - 2: Fire main engine
+ - 3: Fire right orientation engine
+
+ **Observation Space**: Box(8) containing:
+ - Position (x, y)
+ - Velocity (x, y)
+ - Angle and angular velocity
+ - Left and right leg ground contact
+
+ ## Training Environment
+
+ - **Framework**: Custom PyTorch PPO implementation
+ - **Parallel Environments**: 4 concurrent environments for data collection
+ - **Total Training Budget**: 50,000 timesteps across all environments
+ - **Experience Collection**: On-policy learning with trajectory batches
+
+ ## Limitations and Considerations
+
+ - The mean reward of -113.57 is well below the 200-point threshold at which LunarLander-v2 is considered solved, and episode rewards show high variance
+ - Training was limited to 50,000 timesteps, which may be insufficient for optimal performance
+ - Performance may vary significantly across episodes due to the stochastic nature of the environment
+ - The model has not been tested on variations of the LunarLander environment
+
+ ## Citation
+
+ If you use this model in your research, please cite:
+
+ ```bibtex
+ @misc{cllv2_ppo_lunarlander,
+   author    = {Adilbai},
+   title     = {PPO Agent for LunarLander-v2},
+   year      = {2024},
+   publisher = {Hugging Face},
+   url       = {https://huggingface.co/Adilbai/cLLv2}
+ }
+ ```
+
+ ## License
+
+ Please refer to the repository license for usage terms and conditions.
+
+ ## Contact
+
+ For questions or issues regarding this model, please open an issue in the model repository or contact the model author.