Reinforcement Learning Research

RL Group Dresden

Observe and learn!

RL Conference Dresden 2022

On September 15-16, 2022, RL Dresden and the Chair of Econometrics and Statistics, esp. in the Transport Sector, were pleased to host the Conference on Reinforcement Learning 2022. Visit the conference website for more information.


Motivation

Optimal control for traffic dynamics

With an ever-growing interest in machine learning applications in a connected world, Reinforcement Learning (RL) occupies a unique position: it can learn and perform tasks in uncertain and complex circumstances. The goal is for an agent to maximize its long-term reward by performing actions in the environment. Typical real-world applications of RL include self-driving vessels, traffic flow optimization, and playing games such as StarCraft or Go at a superhuman level. RL has the potential to outsource error- and accident-prone human labor to a less risky and more time-efficient automated agent. We are driven by the motivation that RL brings advances in human labor efficiency and, more importantly, human safety.

Projects

HHOS: Steering an autonomous ship from Hamburg to Oslo

Current simulation-based research for autonomous surface vehicles (ASV) often makes unrealistic assumptions. The effects of environmental disturbances (wind, waves, and currents) are generally neglected, and other traffic participants are treated as non-reactive, linearly moving obstacles. Further, the control actions are regularly specified to act directly on the surge force and the yaw moment, neglecting their low-level translation into, e.g., the revolutions per second of a propeller and the rudder angle. Finally, the simulation environments do not consider realistic geometries such as rivers, the corresponding water depths, and the latter's impact on the ship's maneuverability.

We create a simulation environment free of these simplifications and train an RL agent to travel from Hamburg to Oslo based on actual geographical and hydrological data. We propose a two-level hierarchical structure for the agent, consisting of a local path planner and a local path follower. The first is responsible for waypoint generation, while the second provides the robustness to execute the planned path under consideration of the low-level control routine.
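A minimal sketch of how such a two-level hierarchy can be wired together is given below. All class names, the waypoint bookkeeping, and the simple proportional rudder rule are illustrative assumptions rather than the project's actual implementation.

    import numpy as np

    class LocalPathPlanner:
        """High level: proposes the next waypoint from the global route and local state."""
        def plan(self, state):
            # A trained policy would generate the waypoint; here we simply step along the route.
            return state["route"][state["next_idx"]]

    class LocalPathFollower:
        """Low level: tracks the waypoint and emits actuator commands."""
        def follow(self, state, waypoint):
            # A trained policy would map the tracking error to the low-level controls,
            # e.g., the propeller revolutions per second (rps) and the rudder angle.
            heading_error = np.arctan2(waypoint[1] - state["pos"][1],
                                       waypoint[0] - state["pos"][0]) - state["heading"]
            rudder_angle = np.clip(heading_error, -0.61, 0.61)  # roughly +/- 35 degrees in rad
            rps = 5.0                                           # constant demand in this sketch
            return rps, rudder_angle

    planner, follower = LocalPathPlanner(), LocalPathFollower()
    state = {"route": [(10.0, 0.0), (20.0, 5.0)], "next_idx": 0,
             "pos": (0.0, 0.0), "heading": 0.0}
    rps, rudder = follower.follow(state, planner.plan(state))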

COLREG-compliant autonomous surface vehicles through DRL

Maritime operations based on autonomous surface vehicles (ASV) require processing positional and motion information from several sources to successfully avoid collisions in multi-ship encounter situations. We propose a spatial-temporal recurrent neural network architecture for Deep Q-Networks to address this challenge and construct an end-to-end system from the AIS data input to the rudder angle. The reward function is defined to comply with the Convention on the International Regulations for Preventing Collisions at Sea (COLREG).
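A hedged sketch of such a network is shown below; the feature sizes, layer widths, the discrete rudder set, and the mean-pooling over target ships are our own illustrative assumptions, not the published architecture.

    import torch
    import torch.nn as nn

    class RecurrentDQN(nn.Module):
        """Recurrent Q-network: encodes each target ship's track over time,
        pools over ships, and outputs Q-values over discrete rudder angles."""
        def __init__(self, own_dim=5, ship_dim=6, hidden=64, n_rudder_angles=9):
            super().__init__()
            self.track_rnn = nn.LSTM(ship_dim, hidden, batch_first=True)  # temporal encoding
            self.own_fc = nn.Linear(own_dim, hidden)
            self.head = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                      nn.Linear(hidden, n_rudder_angles))

        def forward(self, own_state, target_tracks):
            # own_state: (B, own_dim); target_tracks: (B, n_ships, T, ship_dim)
            B, N, T, F = target_tracks.shape
            _, (h, _) = self.track_rnn(target_tracks.reshape(B * N, T, F))
            ship_emb = h[-1].reshape(B, N, -1).mean(dim=1)  # spatial pooling over ships
            x = torch.cat([torch.relu(self.own_fc(own_state)), ship_emb], dim=-1)
            return self.head(x)  # one Q-value per discrete rudder angle

    q = RecurrentDQN()
    print(q(torch.zeros(2, 5), torch.zeros(2, 3, 10, 6)).shape)  # torch.Size([2, 9])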

Further, we define a collision risk metric based on the concepts of the closest point of approach and the ship domain. The policy is validated on a custom set of newly created single-ship encounters, called Around the Clock problems, and on the commonly chosen Imazu problems, which comprise 18 multi-ship scenarios. Additionally, the framework shows robustness when deployed simultaneously in a multi-agent scenario.
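The closest point of approach underlying the risk metric is standard kinematics; the following sketch computes the distance and time to the closest point of approach (DCPA/TCPA) for two ships on constant course and speed. Variable names and the example values are illustrative.

    import numpy as np

    def cpa(own_pos, own_vel, tgt_pos, tgt_vel):
        r = np.asarray(tgt_pos, float) - np.asarray(own_pos, float)  # relative position
        v = np.asarray(tgt_vel, float) - np.asarray(own_vel, float)  # relative velocity
        v2 = float(np.dot(v, v))
        # Time of closest approach; clipped at 0 because the past is irrelevant.
        t_cpa = 0.0 if v2 < 1e-12 else max(0.0, -float(np.dot(r, v)) / v2)
        d_cpa = float(np.linalg.norm(r + v * t_cpa))
        return d_cpa, t_cpa

    # Own ship heading east at 5 m/s; target 300 m north, heading south at 4 m/s.
    print(cpa((0, 0), (5, 0), (0, 300), (0, -4)))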

Robust Sim2Real transfer with DRL for Autonomous vehicles

Training an RL agent directly in the real world is expensive and often infeasible; thus, simulation is an important tool for developing RL agents. However, when the trained agent is transferred to the real world, the discrepancies between simulation and reality cause a reality gap and make the transfer of RL policies a challenge.

To train the autonomous lane-following and overtaking agent, we separate the agent into two modules, namely a perception module and a control module. The perception module translates the image input into compact information about the environment. Afterward, the control module uses this information as part of its observation and outputs control commands for the vehicle. Thus, the trained agent can be transferred from simulation to the real world with only a minor modification of the perception module.
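The split can be pictured with the following sketch; the module names, the feature dimension, and the network sizes are assumptions for illustration only. Only the perception part has to be adapted when moving from simulation to the real vehicle.

    import torch
    import torch.nn as nn

    class Perception(nn.Module):
        """Compresses the camera image into a compact environment description."""
        def __init__(self, n_features=8):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
                nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(32, n_features))  # e.g., lateral offset, heading error, ...

        def forward(self, image):
            return self.net(image)

    class Control(nn.Module):
        """Maps the compact state to driving commands."""
        def __init__(self, n_features=8):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(),
                                     nn.Linear(64, 2), nn.Tanh())  # steering, throttle

        def forward(self, features):
            return self.net(features)

    perception, control = Perception(), Control()
    command = control(perception(torch.zeros(1, 3, 96, 96)))  # (steering, throttle)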

Picture of path-following validation

Robust path following on rivers for medium-sized cargo ships

Spatial restrictions due to waterway geometry and the resulting challenges, such as high flow velocities or shallow banks, require precise steering commands. This is achieved by augmenting the agent's perception with environmental information such as the current velocity and direction and the water depth.

A bootstrapped Q-learning algorithm is used in combination with a versatile training-environment generator to develop a robust and accurate rudder controller. Validation on the Lower and Middle Rhine in both directions indicates state-of-the-art performance despite stark environmental differences across the scenarios.
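For illustration, a minimal bootstrapped Q-ensemble could look as follows; the number of heads, the aggregation rule, and all sizes are assumptions rather than the trained controller's actual configuration. The spread across heads can be read as epistemic uncertainty.

    import torch
    import torch.nn as nn

    class BootstrappedQ(nn.Module):
        """Ensemble of Q-heads; each head is trained on its own bootstrap mask."""
        def __init__(self, obs_dim=10, n_actions=5, n_heads=10, hidden=64):
            super().__init__()
            self.heads = nn.ModuleList(
                nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                              nn.Linear(hidden, n_actions))
                for _ in range(n_heads))

        def forward(self, obs):
            return torch.stack([h(obs) for h in self.heads])  # (n_heads, B, n_actions)

    q = BootstrappedQ()
    q_values = q(torch.zeros(4, 10))
    greedy_action = q_values.mean(0).argmax(dim=-1)  # act on the ensemble mean
    uncertainty = q_values.std(0)                    # head disagreement per action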

Swing-up and Balancing of Cart Pole

In the classical problem of swinging up and balancing a cart pole (inverted pendulum), the goal is to train an agent to swing up and balance the pole by moving the cart left or right at different speeds. The unique feature of our setup is that we use a camera and computer vision to obtain the observations, whereas other studies rely on sensors such as gyroscopes or rotary encoders. The cart pole is a very unstable system, and it is not easy to replicate its real behavior in simulation due to various factors; hence, we train the agent directly on the physical model.
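The vision step could, for instance, look like the following sketch; the marker colors, HSV thresholds, and function names are hypothetical, the point being that the pole angle is recovered from the image alone before entering the RL observation.

    import cv2
    import numpy as np

    def marker_center(hsv, lo, hi):
        # Threshold one colored marker and return its centroid in pixel coordinates.
        mask = cv2.inRange(hsv, np.array(lo), np.array(hi))
        m = cv2.moments(mask)
        if m["m00"] == 0:
            return None
        return np.array([m["m10"] / m["m00"], m["m01"] / m["m00"]])

    def pole_angle(frame_bgr):
        hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
        tip = marker_center(hsv, (40, 80, 80), (80, 255, 255))  # hypothetical green tip marker
        base = marker_center(hsv, (0, 80, 80), (10, 255, 255))  # hypothetical red pivot marker
        if tip is None or base is None:
            return None
        dx, dy = tip - base
        return np.arctan2(dx, -dy)  # 0 rad = pole pointing straight up (image y points down)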

Picture of the two-dimensional traffic simulator

Two dimensional traffic simulator

The developed lane-bound and non-lane-bound traffic simulator uses a hierarchical reinforcement learning algorithm to guide the traffic agents. We decompose the control task into separate modules: longitudinal and lateral control policies that govern the movements, and a decision policy that decides on overtaking maneuvers.

Different RL algorithms are used for each module, such as the Deep Deterministic Policy Gradient (DDPG) or Twin Delayed DDPG (TD3). The models are trained in a synthetic environment and through semi-supervised learning with actual trajectories. The central goal is to develop a system of fully autonomous vehicles based on reinforcement learning methods.
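How the modules interact at every simulation step can be sketched as follows; the policy interfaces and the dummy policies in the usage example are illustrative assumptions.

    def step(obs, decision_policy, longitudinal_policy, lateral_policy):
        maneuver = decision_policy(obs)           # e.g., "keep_lane" or "overtake"
        acceleration = longitudinal_policy(obs)   # continuous longitudinal control (DDPG/TD3)
        steering = lateral_policy(obs, maneuver)  # lateral control conditioned on the maneuver
        return maneuver, acceleration, steering

    maneuver, acc, steer = step(
        obs={"gap_ahead": 12.0, "speed": 8.0},
        decision_policy=lambda o: "overtake" if o["gap_ahead"] < 15.0 else "keep_lane",
        longitudinal_policy=lambda o: 0.5,
        lateral_policy=lambda o, m: 0.1 if m == "overtake" else 0.0)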

Improving RL with real-world trajectories

When training an RL agent for traffic tasks, the agent's behavior depends entirely on a reward function designed from experience, which does not guarantee perfect actions. Human demonstrations are often used to improve the agent's behavior or to speed up the learning process. However, these demonstrations are typically generated and recorded in simulated or restricted real environments, which limits generalization and biases the agent towards the data.

We address a fundamental challenge in developing autonomous car-following RL agents by using real-world driving datasets instead of human demonstrations from simulated environments. We mix the human experience into the agent's replay buffer and let the agent improve its behavior by extracting knowledge from real-world human demonstrations.
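A minimal sketch of the mixing step is given below; the sampling ratio and the buffer representation are assumptions, the idea being that every gradient step sees both self-generated and human transitions.

    import random

    def sample_mixed_batch(agent_buffer, human_buffer, batch_size=64, human_fraction=0.25):
        # Draw a fixed share of transitions from the recorded human driving data
        # and fill the rest of the batch from the agent's own experience.
        n_human = int(batch_size * human_fraction)
        batch = random.sample(human_buffer, min(n_human, len(human_buffer)))
        batch += random.sample(agent_buffer, batch_size - len(batch))
        random.shuffle(batch)
        return batch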

Plots showing overestimation in RL

Tackling Overestimation Bias in RL

Even long-standing and widespread reinforcement learning algorithms like Q-learning have severe limitations under specific conditions. One particular issue is the use of the maximization operator in the target computation, since the algorithm instantiates the Bellman optimality equation in a sample-based procedure. This maximization frequently leads to exaggerated Q-values of state-action pairs, which are further propagated through the subsequent updates.

Early contributions in the literature recognized this limitation, and the most common method to combat the problem is Double Q-learning. However, this procedure introduces an underestimation bias, and more flexible specifications are needed. We are working on a new set of estimators for the underlying max-μ problem, i.e., estimating the maximum of several expected values, and simultaneously translate them to the case where deep neural networks serve as function approximators.
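The bias is easy to reproduce numerically. In the following sketch, all actions have a true value of zero, yet the sample maximum of noisy Q-estimates is biased upwards, while the double estimator (select the action with one sample set, evaluate it with another) is unbiased in this symmetric setting.

    import numpy as np

    rng = np.random.default_rng(0)
    n_actions, n_runs = 10, 100_000
    q1 = rng.normal(0.0, 1.0, size=(n_runs, n_actions))  # independent noisy estimates
    q2 = rng.normal(0.0, 1.0, size=(n_runs, n_actions))

    single = q1.max(axis=1)                            # max-operator target
    double = q2[np.arange(n_runs), q1.argmax(axis=1)]  # Double Q-learning target

    print(f"single estimator bias: {single.mean():+.3f}")  # clearly > 0
    print(f"double estimator bias: {double.mean():+.3f}")  # approximately 0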

Picture of heterogeneous traffic in India

Two-Dimensional Traffic Flow Modelling

Microscopic traffic flow models (driver behavior models for car-following and lane-changing) are the backbone of traffic simulators. Classical models such as the IDM, the MTM, and the Wiedemann model are built and calibrated to match observed real traffic behavior. We are building a reinforcement-learning-based two-dimensional traffic flow model; the classical microscopic models are vital for understanding and evaluating the performance of the developed RL algorithms, and we will compare the RL model with the classical models to find out if, when, and in which respects they are "better".
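As a reference point, the IDM car-following equation used as one of these classical baselines is reproduced below; the parameter values are typical textbook choices, not our calibration.

    import math

    def idm_acceleration(v, gap, dv, v0=33.3, T=1.6, a=1.5, b=2.0, s0=2.0, delta=4):
        """v: own speed [m/s], gap: bumper-to-bumper gap [m],
        dv: approach rate v - v_leader [m/s]."""
        # Desired dynamic gap: minimum gap + time headway + braking interaction term.
        s_star = s0 + max(0.0, v * T + v * dv / (2.0 * math.sqrt(a * b)))
        return a * (1.0 - (v / v0) ** delta - (s_star / gap) ** 2)

    # Following a slower leader: 25 m/s own speed, 30 m gap, closing at 3 m/s.
    print(idm_acceleration(v=25.0, gap=30.0, dv=3.0))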


Software

RL Dresden Algorithm Suite [GitHub]

This suite implements several model-free, off-policy deep reinforcement learning algorithms for discrete and continuous action spaces in PyTorch.

Mixed Traffic Web Simulators [mtreiber.de] [traffic-simulation.de]

A fully operational, 2D, JavaScript-based web simulator implementing the Mixed Traffic flow Model (MTM). The simulation is intended to demonstrate fully two-dimensional but directed traffic flow and to visualize 2D flow models.


Publications


Members

Prof. Dr. Ostap Okhrin

Ostap Okhrin is Professor of Statistics and Econometrics at the Department of Transportation at TU Dresden. His expertise lies in mathematical statistics and data science, with applications in transportation and economics.

Dr. Martin Treiber

Martin Treiber is a senior expert in traffic flow models, including human and automated driving as well as bicycle and pedestrian traffic. He also works on traffic data analysis and simulation (traffic-simulation.de, mtreiber.de/mixedTraffic).

Fabian Hart

Fabian Hart studied electrical engineering at TU Dresden. His main research focuses on modeling two-dimensional traffic based on deep reinforcement learning methods.

Martin Waltz

Martin Waltz studied Industrial Engineering and is now a research associate; his main research focus is (deep) reinforcement learning.

Niklas Paulig

Niklas Paulig is an RL Group research associate. His main field of research is the modeling of autonomous inland vessel traffic based on reinforcement learning methods, along with HPC implementations of the algorithms currently in use.

Dianzhao Li

Dianzhao Li is a research assistant in the RL group, focusing on trajectory planning for autonomous vehicles with reinforcement learning algorithms. He currently mixes human driving datasets with RL in simulated environments to achieve better performance of the vehicles.

Ankit Anil Chaudhari

Ankit Chaudhari is currently working on "Enhancing Traffic-Flow Understanding by Two-Dimensional Microscopic Models". His research interests include traffic flow modelling, traffic simulation, mixed traffic flow, machine learning, and reinforcement learning.
