Reinforcement Learning with Shield

Motivation

  1. A neural network performs well (i.e., it is good at gaining reward) in reinforcement learning tasks, but we cannot guarantee the safety of the neural network's output.
  2. Previous work, such as Safety Verification of Hybrid Systems Using Barrier Certificates, has proposed several effective ways to guarantee the safety of a linear-function controller in reinforcement learning.

Reinforcement learning with a shield combines the performance of a neural network with the safety of a linear function.

Workflow

Overview

Training Agent


The agent is a neural network (the yellow part of the figure above).
Its input is a state, and its output is an action.

The simulator plays the role of the environment in reinforcement learning.
Its input is an action (the action affects the environment), and its outputs are a state and a reward (the agent's state changes after it takes an action, and each action yields a corresponding reward).

The state St output by the simulator at step t becomes the agent's input at step t+1.
The reward is passed to optimizer 1.
Optimizer 1 tries to maximize the reward by adjusting the weights and biases of the agent.

The final goal of this stage is to get a well-trained neural network agent.
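
A minimal sketch of this training loop, assuming PyTorch, a toy differentiable one-dimensional simulator, and gradient ascent on the episode reward; the Agent network, simulator_step, and all hyperparameters are illustrative, not the exact setup from the figure.

```python
import torch
import torch.nn as nn


class Agent(nn.Module):
    """The neural-network agent: maps a state to a deterministic action."""

    def __init__(self, state_dim=2, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )

    def forward(self, state):
        return self.net(state)


def simulator_step(state, action):
    """Toy differentiable environment: a point mass pushed by the action.

    Reward is higher the closer the position stays to zero.
    """
    position, velocity = state[0], state[1]
    velocity = velocity + 0.1 * action.squeeze()
    position = position + 0.1 * velocity
    reward = -position.abs()
    return torch.stack([position, velocity]), reward


agent = Agent()
optimizer_1 = torch.optim.Adam(agent.parameters(), lr=1e-2)  # "optimizer 1"

for episode in range(200):
    state = torch.tensor([1.0, 0.0])
    total_reward = torch.tensor(0.0)
    for t in range(50):
        action = agent(state)                            # S_t -> A_t
        state, reward = simulator_step(state, action)    # A_t -> S_{t+1}, R_t
        total_reward = total_reward + reward
    loss = -total_reward            # maximizing reward = minimizing its negative
    optimizer_1.zero_grad()
    loss.backward()
    optimizer_1.step()
```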

Training Shield


The shield is a linear function.
Its input is a state, and its output is an action.
Depending on the dimension of the state, it need not be a line in 2-D space; it can also be a plane in 3-D space, and so on (written out below).
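
As a hedged bit of notation (w, b, and s are symbols introduced here, not taken from the original), the shield computes

```latex
\mathrm{Action}_{LF} = f(s) = w^{\top} s + b
```

where s is the state vector and w, b are the coefficients adjusted during training.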

Given the same input state at step t, the linear function and the neural network each output an action.
The neural network's output action is fed to the simulator, which produces the new state used as the input of both the neural network and the linear function at step t+1.
We want the two outputs to be similar, so we use optimizer 2 to minimize abs(ActionLF - ActionNN) by adjusting the coefficients of the linear function, as sketched below.
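
Continuing the sketch from the previous stage (again assuming the toy agent and simulator_step defined above), optimizer 2 might look like this; the single nn.Linear layer holds the coefficients and bias of the linear function.

```python
shield = nn.Linear(2, 1)                      # linear function: state -> action
optimizer_2 = torch.optim.Adam(shield.parameters(), lr=1e-2)  # "optimizer 2"

for episode in range(200):
    state = torch.tensor([1.0, 0.0])
    for t in range(50):
        action_nn = agent(state).detach()     # the same state S_t goes to both
        action_lf = shield(state)
        loss = (action_lf - action_nn).abs().mean()   # abs(ActionLF - ActionNN)
        optimizer_2.zero_grad()
        loss.backward()
        optimizer_2.step()
        with torch.no_grad():                 # the NN action drives the simulator
            state, _ = simulator_step(state, action_nn)
```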

Once we have a linear function, we can verify whether it is safe.
A linear function is safe if we can find a loop invariant that is a subset of the safe region:
for any state satisfying the loop invariant, controlling the agent with the linear function guarantees that the state never leaves the loop invariant.
If the linear function is safe, we keep it together with its loop invariant.
Otherwise, we drop the linear function and retrain it.
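
In symbols (Inv, Safe, f, and step are notation introduced here, not the original's), the verifier looks for a loop invariant satisfying

```latex
\mathit{Inv} \subseteq \mathit{Safe}
\qquad\text{and}\qquad
\forall s \in \mathit{Inv}:\ \mathit{step}\bigl(s,\ f(s)\bigr) \in \mathit{Inv}
```

where f is the linear function and step is one transition of the simulator; the second condition makes the invariant inductive, so a state inside it can never escape.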

Using Shield


Detector: judges whether the neural network's output keeps the state inside the loop invariant (see the sketch after this list).
In: use the neural network's output action, for higher reward.
Out: use the linear function, for safety (it keeps the state inside the loop invariant).
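
A minimal sketch of this runtime check, reusing agent, shield, and simulator_step from the earlier snippets; in_invariant is a hypothetical placeholder for the loop invariant produced by the verifier.

```python
def in_invariant(state):
    # Placeholder invariant: both state components stay inside (-1, 1).
    return bool(state.abs().max() < 1.0)


def shielded_action(state):
    with torch.no_grad():
        action_nn = agent(state)
        next_state, _ = simulator_step(state, action_nn)
        if in_invariant(next_state):   # detector: would the NN action stay inside?
            return action_nn           # in  -> keep the NN action (higher reward)
        return shield(state)           # out -> fall back to the linear function
```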

Example


Game Rules:
For a high reward, the stick should stay upright (perpendicular to the horizontal plane) and remain near the center of the screen for as long as possible.
The stick must not fall down (its angle with the horizontal must not drop below 0).
The cart must not move off the screen.

Training Agent: We obtain a neural network that plays the game well, like what we see in the gif above. The input of the neural network is the location of the cart and the angle of the stick. The output is a deterministic action, for example, move left by 3 units of length.

Training Shield: We make the output of the linear function similar to the output of the neural network when they receive the same input state, and then use the verifier to find a loop invariant that is a subset of the safe region. In this example, the loop invariant should be a subset of the upper part of the whole screen.

Using Shield: We now have a well-trained neural network, a linear function, and a loop invariant. Assuming the loop invariant is {30 < angle < 150, 0.2 screen < position < 0.8 screen}, if the neural network's action would take the state beyond this boundary, the detector catches it before the action is taken, and the linear function takes the action instead. In this way, every state stays bounded inside the loop invariant, and therefore inside the safe region. A rough illustration of this check is sketched below.
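
A hedged illustration of the detector check for this example (angle in degrees, position as a fraction of the screen width; predict_next_state is a hypothetical one-step model of the simulator, not part of the original description):

```python
def in_loop_invariant(angle, position):
    # The loop invariant quoted above: 30 < angle < 150, 0.2 screen < position < 0.8 screen.
    return 30.0 < angle < 150.0 and 0.2 < position < 0.8


def choose_action(state, agent, linear_function, predict_next_state):
    action_nn = agent(state)
    next_angle, next_position = predict_next_state(state, action_nn)
    if in_loop_invariant(next_angle, next_position):
        return action_nn              # inside the invariant: keep the NN action
    return linear_function(state)     # about to leave: the shield takes over
```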