Safety-Constrained Reinforcement Learning for MDPs

Many formal system models are inherently stochastic; consider, for instance, randomized distributed algorithms (where randomization breaks the symmetry between processes), security (e.g., key generation for encryption), systems biology (where species react randomly depending on their concentrations), or embedded systems (which interact with unknown and varying environments).

In contrast to verification of such systems, controller synthesis is a relatively new topic in this setting. Given a formal model of a controllable entity (e.g., a robot) and an environment, the goal is to synthesize a controller that satisfies certain requirements. Again, faithful models are often stochastic; consider, e.g., sensor imprecision in robots, message loss, or unpredictable environment behavior. Moreover, certain information, such as the cost caused by energy consumption, may not be known prior to exploration and observation.

Here, we abstract such problems as Markov decision processes in which the expected performance is measured using a cost function that is unknown prior to run-time exploration of the state space. Consider, for instance, a motion planning scenario in a grid world, where a robot wants to move to a certain position. Acting unpredictably, a janitor moves randomly through the grid. The robot reaches its goal safely if it moves according to a strategy that avoids the janitor. Moreover, each movement of the robot incurs a cost depending on the surface, which the robot only learns while physically executing actions in the environment.
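To make the scenario concrete, the following is a minimal sketch of such a grid-world MDP. All names, the grid size, and the cost range are illustrative assumptions, not part of the paper: the robot moves deterministically in its chosen direction, the janitor moves uniformly at random, and the per-cell movement cost is revealed only when a cell is entered.

```python
import random


class GridWorld:
    """Hypothetical grid world: robot vs. randomly moving janitor.
    Surface costs are fixed but hidden; the robot observes a cell's
    cost only upon entering it (run-time exploration)."""

    MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}

    def __init__(self, size=5, goal=(4, 4), seed=0):
        self.size, self.goal = size, goal
        self.rng = random.Random(seed)
        self.robot, self.janitor = (0, 0), (2, 2)
        # Per-cell movement costs, unknown a priori (assumed range 1..4).
        self._cost = {(x, y): self.rng.randint(1, 4)
                      for x in range(size) for y in range(size)}

    def _clip(self, pos):
        x, y = pos
        return (min(max(x, 0), self.size - 1),
                min(max(y, 0), self.size - 1))

    def step(self, action):
        # Robot moves deterministically; the janitor picks a uniformly
        # random direction (the stochastic part of the MDP).
        dx, dy = self.MOVES[action]
        self.robot = self._clip((self.robot[0] + dx, self.robot[1] + dy))
        jdx, jdy = self.rng.choice(sorted(self.MOVES.values()))
        self.janitor = self._clip((self.janitor[0] + jdx,
                                   self.janitor[1] + jdy))
        cost = self._cost[self.robot]          # learned only now
        crashed = self.robot == self.janitor   # safety violation
        done = self.robot == self.goal
        return self.robot, cost, crashed, done
```

A safe strategy for this model must avoid all runs on which `crashed` can become true, regardless of the janitor's random moves.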

Standard learning approaches synthesize cost-optimal strategies without guaranteeing safety properties. To remedy this, we first compute safe, permissive strategies. In contrast to standard strategies, which fix the next action to take at each point of a system run, permissive strategies allow several actions. Exploration is then constrained to these strategies and thereby meets the imposed safety requirements. Using an iterative procedure, the resulting strategy is both safety-constrained and optimal. We show correctness and completeness of the method and discuss several heuristics to increase its scalability. Moreover, we demonstrate the applicability of the approach by means of a prototype implementation.
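The constrained-exploration idea can be sketched as follows. This is not the paper's implementation; it is a minimal Q-learning loop, assuming a safe permissive strategy is already given as a mapping from states to sets of allowed actions, so that both exploration and exploitation never leave the safe action set.

```python
import random


def constrained_q_learning(states, actions, permissive, step_fn,
                           episodes=50, alpha=0.5, gamma=0.95,
                           epsilon=0.2, seed=0):
    """Q-learning restricted to a precomputed permissive strategy.

    `permissive[s]` is the set of actions the safe strategy allows in
    state s; `step_fn(s, a)` returns (next_state, cost, done). All
    names and parameters here are illustrative assumptions."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(episodes):
        s, done = states[0], False
        while not done:
            allowed = sorted(permissive[s])      # never leave the safe set
            if rng.random() < epsilon:
                a = rng.choice(allowed)          # safe exploration
            else:
                a = min(allowed, key=lambda a: Q[(s, a)])  # minimize cost
            s2, cost, done = step_fn(s, a)
            best_next = 0.0 if done else min(Q[(s2, b)]
                                             for b in permissive[s2])
            Q[(s, a)] += alpha * (cost + gamma * best_next - Q[(s, a)])
            s = s2
    return Q
```

Because actions outside `permissive[s]` are never sampled, every run during learning already satisfies the safety requirement; the learning only optimizes cost among the safe behaviors.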

Nils Jansen did his PhD with Erika Abraham at RWTH Aachen University, Germany, on the topic "Counterexamples in Probabilistic Verification". Afterwards, he started working as a postdoc with Joost-Pieter Katoen, also in Aachen. His research interests lie in formal verification, in particular of stochastic systems. Currently, he is mostly working on parametric Markov models. He is also interested in all kinds of applications of formal verification.

Creative Commons Attribution-NonCommercial 3.0
