Two weeks ago, on 13 November, one of the most exciting RL competitions of this year finally ended. NIPS 2017: Learning to Run was really interesting and hard to solve. Nevertheless, my friend Mikhail Pavlov and I took 3rd place in the final round and were invited to the NIPS conference. For a more detailed description, you can read our article or our PyTorch/Theano source code.
Long story short, we used DDPG as our main agent and sped up the environment as much as we could. Our final agent was trained on 36 cores for about 5 days. We also used several observation improvements to correct the agent's input.
The competition was launched by the Stanford Neuromuscular Biomechanics Laboratory. As an environment, it uses the open-sourced OpenSim simulator. The main objective of this competition was to train a physiologically-based human model to navigate a complex obstacle course as quickly as possible. There was also a time limit of 1000 timesteps per simulation.
The environment was a musculoskeletal model that includes body segments for each leg, a pelvis segment, and a single segment to represent the upper half of the body (trunk, head, arms).
The environment state was represented as an R⁴¹ vector with the coordinates and velocities of various body parts and the obstacle locations.
As an action, the agent should output an R¹⁸ vector of muscle activations, 9 per leg, each in the [0, 1] range.
The reward function was defined as the change in the x coordinate of the pelvis, minus a small penalty for using ligament forces.
The simulation ends if the agent falls (pelvis y < 0.65) or after 1000 steps in the environment.
Finally, the simulation has additional stochasticity, such as the random strength of the psoas muscles and the random location and size of obstacles.
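The reward and termination rules above can be sketched in a few lines. Note that the penalty coefficient here is purely illustrative, not the value the environment actually uses:

```python
def step_reward(prev_pelvis_x, pelvis_x, ligament_force, penalty_coef=1e-4):
    # Reward = forward progress of the pelvis minus a small ligament penalty.
    # penalty_coef is an illustrative value, not the official one.
    return (pelvis_x - prev_pelvis_x) - penalty_coef * ligament_force

def episode_done(pelvis_y, t, max_steps=1000):
    # The episode ends when the agent falls (pelvis y < 0.65)
    # or when the step limit is reached.
    return pelvis_y < 0.65 or t >= max_steps
```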
Methods: on- vs off-policy algorithms
As a baseline in this competition, we used state-of-the-art RL policy gradient methods such as Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO). Unfortunately, these are on-policy optimization algorithms, which can only update the agent's behaviour using data from the current policy.
To be more concrete, in this optimization method:
- the agent plays one episode;
- updates its policy according to that episode;
- forgets everything about that episode;
- and plays the next one.
Why is this a problem?
The main reason is speed: OpenSim is ~2000 times slower than Roboschool. For example, the first-round simulation achieved ~5–15 timesteps per second, and the final round was even worse: ~0.05–5 timesteps per second.
All of this led us to the conclusion that we needed something more sample-efficient: something with a replay memory and the ability to learn from data collected by arbitrary policies. And that was…
Deep Deterministic Policy Gradient (DDPG)
Or, as you could say, a DQN variant for continuous action space environments. DDPG consists of actor and critic networks. The critic is trained on off-policy data using the Bellman equation:

Q(s, a) = E[r + γ·Q(s′, π(s′))],

where π is the actor policy. The actor is trained to maximize the critic's estimated Q-values by backpropagating through the critic and actor networks.
But what is really crucial: it uses a replay buffer. Yes, the environment is still very slow, but now we have a history.
We used some standard reinforcement learning techniques: action repeat (we repeated each action 5 times) and reward scaling (we chose a scale factor of 10).
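These two techniques can be sketched as an environment wrapper. This is a minimal illustration assuming a gym-style `step(action)` interface, not our actual implementation:

```python
class RepeatActionWrapper:
    """Repeat each action `repeat` times and scale the accumulated reward.

    `env` is any object with a gym-like step(action) -> (obs, reward, done, info).
    The values repeat=5 and reward_scale=10 are the ones mentioned in the text.
    """
    def __init__(self, env, repeat=5, reward_scale=10.0):
        self.env = env
        self.repeat = repeat
        self.reward_scale = reward_scale

    def step(self, action):
        total = 0.0
        for _ in range(self.repeat):
            obs, reward, done, info = self.env.step(action)
            total += reward
            if done:  # stop repeating if the episode ended mid-repeat
                break
        return obs, self.reward_scale * total, done, info
```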
For exploration, we used the Ornstein-Uhlenbeck (OU) process (Uhlenbeck and Ornstein, 1930) to generate temporally correlated noise, which is efficient for exploration in physical environments.
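A minimal OU-noise sketch; the theta and sigma values here are common defaults from the DDPG literature, not necessarily the ones we used:

```python
import numpy as np

class OrnsteinUhlenbeckNoise:
    """Temporally correlated exploration noise (Uhlenbeck & Ornstein, 1930).

    The state drifts toward `mu` at rate `theta` while being perturbed by
    Gaussian noise scaled by `sigma`, producing smooth, correlated samples.
    """
    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, dt=1.0, seed=None):
        self.mu = mu * np.ones(size)
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.rng = np.random.default_rng(seed)
        self.state = self.mu.copy()

    def sample(self):
        # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
        dx = (self.theta * (self.mu - self.state) * self.dt
              + self.sigma * np.sqrt(self.dt)
              * self.rng.normal(size=self.state.shape))
        self.state = self.state + dx
        return self.state
```

In practice, one such sample is added to the actor's output each step, and the process is reset at the start of every episode.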
Finally, we also used recently proposed improvements such as parameter noise and layer normalization.
Learning to speed up
Okay, we have a working agent. But how can we improve it? Just look at the environment!
The first thing we found was a kind of flaw in the environment state: all (x, y) coordinates were absolute. To improve the generalization of our agent and use data more efficiently, we modified the original environment, making all x coordinates relative to the x coordinate of the pelvis. We called this state transform.
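A sketch of the state transform. The index arguments here are hypothetical placeholders, since the real layout of the 41-dimensional observation is defined by the environment:

```python
import numpy as np

def relative_x_transform(state, pelvis_x_idx, x_indices):
    """Make x coordinates relative to the pelvis x coordinate.

    `pelvis_x_idx` and `x_indices` are placeholders for illustration:
    the actual positions of x coordinates in the observation vector
    depend on the environment's state layout.
    """
    state = np.asarray(state, dtype=float).copy()
    pelvis_x = state[pelvis_x_idx]
    state[x_indices] -= pelvis_x  # shift selected x coords into pelvis frame
    return state
```

Making positions relative to the pelvis means the agent sees the same observation at meter 5 and meter 40 of the track, so every sample teaches it about running in general rather than about one absolute location.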
After that, we also noticed a reflection symmetry between actions and states. The simulated model has bilateral body symmetry, so state components and actions can be mirrored to double the sample size. We called this feature flip state action.
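A sketch of flip state action. The state permutation is hypothetical, since it depends on the observation layout; the action flip, however, is just a swap of the two 9-muscle blocks described earlier:

```python
import numpy as np

def flip_state_action(state, action, state_swap_idx, n_muscles_per_leg=9):
    """Exploit bilateral symmetry: mirror left/right components to double the data.

    `state_swap_idx` is a hypothetical permutation that exchanges left-leg and
    right-leg entries of the observation; the real one depends on the layout.
    The action is an 18-dim vector of muscle activations, 9 per leg, so
    flipping it is a swap of the two halves.
    """
    flipped_state = np.asarray(state)[state_swap_idx]
    action = np.asarray(action)
    flipped_action = np.concatenate(
        [action[n_muscles_per_leg:], action[:n_muscles_per_leg]])
    return flipped_state, flipped_action
```

Each transition collected from the environment then yields two replay-buffer entries: the original and its mirror image.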
Even at this stage we had very impressive results, but we wanted more. So we came to the last hack: simulator recompilation. We took the same simulator, changed the accuracy of its calculations, and recompiled it.
The results were impressive: a 2x speedup! A side effect was a much noisier environment. You might think that was bad, but this noise actually improved the agent's generalization and made it more robust.
Combining all of this, we arrive at our final training pipeline.
Our DDPG implementation was parallelized as follows: N processes collected samples with fixed weights, all of which were processed by the learning process at the end of each episode, which then updated the samplers' weights. Since DDPG is an off-policy method, the staleness of the sampling weights did not hurt performance; on the contrary, giving each sampling process its own weights improved exploration.
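A simplified, single-process sketch of this sampler/learner loop, with stub `play_episode` and `learn_from` functions standing in for the real environment rollout and DDPG update (in the real setup, samplers run as separate processes and each keeps its own, possibly stale, copy of the weights):

```python
import queue

def run_pipeline(n_samplers, n_rounds, play_episode, learn_from):
    """Conceptual sampler/learner loop.

    Samplers roll out episodes with the weights they last received; the
    learner consumes every finished episode and broadcasts updated weights.
    Episodes collected with stale weights are fine because DDPG is off-policy.
    """
    weights = 0  # stand-in for network parameters
    episodes = queue.Queue()
    sampler_weights = [weights] * n_samplers
    for _ in range(n_rounds):
        # each "process" plays one episode with its current (stale) weights
        for i in range(n_samplers):
            episodes.put(play_episode(sampler_weights[i]))
        # the learner consumes all finished episodes and updates the weights
        while not episodes.empty():
            weights = learn_from(weights, episodes.get())
        # broadcast the fresh weights back to the samplers
        sampler_weights = [weights] * n_samplers
    return weights
```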
3rd place solution
This brings us to the end of the competition. Our best agent was trained on 36 cores for about 5 days. It reached up to 45 meters, but unfortunately, because of the environment's stochasticity, its final grade was only 38.42.
Nevertheless, compared to on-policy methods, it works pretty well.
Comparing DDPG improvements
After the end of the competition, we also evaluated the importance of each DDPG modification.
As you can see, our final method solves this environment really well.
Run, skeleton, run!
Thanks for reading, time for demonstrations!