Self-Driving Cars in the Browser

This project demonstrates how a computer can learn by itself how to control a car. Using reinforcement learning, the agent's neural network is trained to maximize its reward, which in this case is the speed of the car. The code runs in the browser and you can watch the cars drive below.

The source code is available on GitHub (see footnote 1).

I have been working on this project for quite some time now. These cars learned how to drive by themselves, powered by a neural network. Their only feedback on which actions are good and which are bad is a reward based on their current speed.

You can drag the mouse to draw obstacles, which the cars must avoid. Play around with this demo and get excited about machine learning!

The following is a rough description of how this works. You may stop reading here and just play with the demo if you’re not interested in the technical background!

Deep Deterministic Policy Gradient

These cars learn to drive using a modified version of Deep Deterministic Policy Gradient (DDPG for short; Lillicrap et al., 2015), a form of reinforcement learning.

The idea behind DDPG is essentially to learn two neural networks.

The first neural network is called the critic and its task is to predict the Q-value of a state and action pair. It can tell you how good an action $a$ is if you are in a specific state $s$.

The second neural network is called the actor. Its job is to find the action $a^\star$ that maximizes the Q-value as predicted by the critic for a given state $s$. In the optimal case, it receives the current state $s_t$ as input and outputs the best possible action $a^\star_t$, the one that maximizes the future reward.
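In symbols, the two roles can be written roughly as follows. The notation is borrowed from the usual DDPG presentation and is an assumption here, not something defined in this post: $\gamma \in [0, 1)$ is a discount factor, $r_t$ is the reward at time-step $t$, and $\mu_\theta$ is the actor with parameters $\theta$.

$$Q(s_t, a_t) \approx \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t, a_t\right], \qquad a^\star_t = \arg\max_{a} Q(s_t, a), \qquad \mu_\theta(s_t) \approx a^\star_t$$

The critic approximates the left-hand expression, and the actor is trained so that its output comes close to the maximizing action without having to search over all actions.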

Both of these networks are trained in parallel. The critic is trained on targets derived from the Bellman equation so that it outputs Q-values. At the same time, the actor uses gradients from the critic to learn the optimal action for each state. During training, DDPG makes use of a replay buffer, as introduced in the original Atari paper (Mnih et al., 2013). This means the experiences of the agent (tuples of state, action and reward) are stored in memory and are continuously used to train the neural networks.
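To make the replay buffer and the Bellman target a bit more concrete, here is a minimal Python sketch. It is not the code of the demo, which is written in JavaScript. The names `ReplayBuffer` and `bellman_target`, the discount factor `gamma=0.99`, and the choice to store the next state alongside state, action and reward are assumptions made for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size memory of past transitions; the oldest entries are dropped first."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        # Assumption: the next state is stored too, since the Bellman target needs it.
        self.memory.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling mixes old and new experiences in each training batch.
        pool = list(self.memory)
        return random.sample(pool, min(batch_size, len(pool)))

    def __len__(self):
        return len(self.memory)


def bellman_target(reward, next_state, actor, critic, gamma=0.99):
    """Regression target for the critic: r + gamma * Q(s', actor(s'))."""
    next_action = actor(next_state)
    return reward + gamma * critic(next_state, next_action)


# Toy usage with constant stand-ins for the two networks.
buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.add([float(t)], [0.0], 1.0, [float(t + 1)])

toy_actor = lambda state: [0.0]
toy_critic = lambda state, action: 2.0
target = bellman_target(1.0, [1.0], toy_actor, toy_critic, gamma=0.5)
```

The critic would then be fitted so that its prediction for the stored state and action moves towards `target`. In the DDPG paper the `actor` and `critic` passed in here are slowly updated copies of the networks; whether the demo does the same is not covered in this post.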

Now that you have a rough understanding of DDPG, in the following sections I describe how I adapted this algorithm for learning to drive. I start by outlining the state and action space. Then, I explain how my agent explores this space and how multi-agent learning plays a role.

Sensors

The state $s_t$ of the agent consists of observations from two time-steps, the current time-step $t$ and the previous time-step $t-1$. This lets the agent base its decisions on how things moved over time. Each observation contains information about the agent’s environment. This includes 19 distance sensors $\vec{d}_{t}$, which are arranged at different angles. You can think of these sensors as beams that stop when they hit an object. The shorter the beam, the higher the input to the agent: 0 for no hit, 1 for an object that is directly in front of the car. In addition, each observation contains the current speed of the agent $v_t$. In total, each observation contains $79$ variables, so with two observations the input to the neural networks is $2 \times 79 = 158$-dimensional.
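The sensor encoding and the stacking of two observations could look roughly like this in Python. This is a sketch under assumptions, not the demo's code: the beam length `MAX_RANGE`, the linear mapping from distance to input, the zero-filled previous observation at the first step, and the names `sensor_input` and `StateStacker` are all made up for illustration.

```python
OBS_SIZE = 79      # variables per observation, as stated in the text
MAX_RANGE = 10.0   # hypothetical beam length


def sensor_input(hit_distance, max_range=MAX_RANGE):
    """Map a beam to [0, 1]: 0 for no hit, 1 for an object right at the car."""
    if hit_distance is None or hit_distance >= max_range:
        return 0.0
    return 1.0 - max(hit_distance, 0.0) / max_range


class StateStacker:
    """Concatenates the current and the previous observation into one state."""

    def __init__(self, obs_size=OBS_SIZE):
        self.obs_size = obs_size
        # Assumption: before the first step the previous observation is all zeros.
        self.previous = [0.0] * obs_size

    def push(self, observation):
        if len(observation) != self.obs_size:
            raise ValueError("unexpected observation size")
        state = list(observation) + self.previous
        self.previous = list(observation)
        return state


stacker = StateStacker()
first = stacker.push([0.5] * OBS_SIZE)
second = stacker.push([0.25] * OBS_SIZE)
```

Each call to `push` returns a list of 158 numbers: the first 79 entries are the observation at time-step $t$, the last 79 the one at $t-1$.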

Imagine sitting in a room with a computer, looking at $158$ numbers on the screen and having to press left or right in order to increase some reward that you know nothing about. That’s exactly what this agent is doing. Isn’t that crazy?

Exploration

A major issue with DDPG is exploration. In regular Deep Q-Learning you have a discrete set of actions to choose from, so you can easily explore the state-action space by occasionally choosing a random action, using a so-called epsilon-greedy policy. In continuous action spaces (as is the case with DDPG) this is not as easy. For this project I used dropout as a way to explore. This means randomly dropping some neurons of the last layer of the actor network, which introduces variation in the actions.
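The idea of dropout as exploration noise can be sketched in a few lines of Python. This is an illustration under assumptions rather than the demo's implementation: dropout is applied to the activations feeding a linear output layer, surviving activations are rescaled (so-called inverted dropout), and the function name `actor_output` and its parameters are hypothetical.

```python
import random


def actor_output(hidden, weights, drop_prob=0.0, rng=random):
    """Compute actions from the last hidden layer, optionally with dropout.

    hidden:  activations of the last hidden layer
    weights: one row of weights per action dimension
    """
    # Each unit is dropped independently with probability drop_prob.
    keep = [0.0 if rng.random() < drop_prob else 1.0 for _ in hidden]
    # Inverted dropout: rescale survivors so the expected activation is unchanged.
    scale = 1.0 / (1.0 - drop_prob) if drop_prob < 1.0 else 0.0
    noisy = [h * k * scale for h, k in zip(hidden, keep)]
    return [sum(w * h for w, h in zip(row, noisy)) for row in weights]


hidden = [0.1, 0.2, 0.4, 0.8]
weights = [[1.0, 1.0, 1.0, 1.0]]

greedy = actor_output(hidden, weights)  # no dropout: always the same action
rng = random.Random(0)
explored = [actor_output(hidden, weights, drop_prob=0.5, rng=rng)[0] for _ in range(20)]
```

With `drop_prob=0.0` the actor is deterministic. With dropout switched on, the same state yields a spread of actions, which plays the role that random actions play in an epsilon-greedy policy.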

Multi-agent learning

In addition to applying dropout to the actor network, I put 4 agents into the virtual environment at the same time. All of these agents share the same value network (the critic) but have different actors, and therefore take different approaches to the same states. As a result, every agent explores a different area of the state-action space. All in all, this resulted in better and faster convergence.
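Structurally, the setup described above amounts to several agents holding a reference to one critic while each owns its own actor. The following Python sketch only shows that wiring. The classes `Critic`, `Actor` and `Agent` are hypothetical placeholders without any actual learning, and nothing here reflects how the demo organizes its code.

```python
class Critic:
    """Stand-in for the shared value network."""

    def __init__(self):
        self.updates = 0

    def train_on(self, transition):
        # A real critic would take a gradient step here; this one only counts.
        self.updates += 1


class Actor:
    """Stand-in for a per-agent policy network."""

    def __init__(self, seed):
        self.seed = seed  # different seeds stand for different initial weights


class Agent:
    def __init__(self, actor, critic):
        self.actor = actor
        self.critic = critic

    def learn(self, transition):
        # Every agent's experience improves the same critic.
        self.critic.train_on(transition)


def make_agents(count=4):
    shared_critic = Critic()
    return [Agent(Actor(seed=i), shared_critic) for i in range(count)]


agents = make_agents()
for agent in agents:
    agent.learn(("state", "action", 1.0))
```

After this loop the single critic has received one update from each of the four agents, while the four actors remain separate objects.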

1. The code for the demo above, along with the JavaScript library neurojs that I made specifically for this project, is available on GitHub.