IRL

The Inverse Reinforcement Learning (IRL) problem asks why some observed behaviour happens. Say I'm playing football: I run down the pitch and kick the ball towards the opponent's net. Given access only to such behaviours, and no other knowledge of the world, the IRL problem is to determine the reward structure that incentivizes this behaviour. In football, the reward is scoring a goal. As in many other cases, the reward is sparse (there are few goals per 90 minutes) and serves as a succinct representation of the task. Here I'm building a tool for testing different algorithms on a variety of environments.

Currently, I have one algorithm for the Acrobot-v1 environment. The algorithm is an LP formulation for finding an optimal reward function from a linearly parameterized family of reward functions.
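
Concretely, "linearly parameterized" means the reward is a weighted combination of fixed basis functions, and the LP searches only over the weights. A minimal sketch of this idea (the names reward, alpha and basis are illustrative, not the repository's):

import numpy as np

def reward(state, alpha, basis):
    # basis: a list of fixed functions phi_i(state); alpha: the learned weights
    return np.dot(alpha, [phi(state) for phi in basis])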

Setup

I have used rlpyt to find optimal policies for different reward functions. It is included as a git submodule, so you'll have to clone this repository recursively and then build rlpyt.

$ git clone --recurse-submodules https://github.com/Vrroom/IRL.git
$ cd IRL/rlpyt
$ python3 setup.py install
$ cd .. && python3 IRL.py

Example

Acrobot-v1 is a double-pendulum system in which the agent can apply clockwise or counterclockwise torque. The goal is to get the bottom link up to a particular height. I played with this environment and accumulated 100 trajectories, which were the inputs to the IRL algorithm.
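
For reference, trajectories of this form could be gathered roughly like this with OpenAI Gym (an illustration only, assuming the older Gym reset/step API; the 100 trajectories above were played by hand, not sampled randomly):

import gym

env = gym.make('Acrobot-v1')
trajectories = []
for _ in range(5):                            # a few episodes
    obs, done, episode = env.reset(), False, []
    while not done:
        action = env.action_space.sample()    # stand-in for human control
        next_obs, rew, done, info = env.step(action)
        episode.append((obs, action))
        obs = next_obs
    trajectories.append(episode)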

The reward function is a function of the angle that the top link makes with the vertical, theta1, and the angle that the bottom link makes with respect to the top link, theta2.
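
In Acrobot-v1 the observation is [cos(theta1), sin(theta1), cos(theta2), sin(theta2), theta1_dot, theta2_dot], so the two angles the reward basis is defined over can be recovered as follows (an illustrative helper, not necessarily how this repository does it):

import numpy as np

def link_angles(obs):
    theta1 = np.arctan2(obs[1], obs[0])  # angle of the top link with the vertical
    theta2 = np.arctan2(obs[3], obs[2])  # angle of the bottom link w.r.t. the top link
    return theta1, theta2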

[Figure: true reward function (left) and recovered reward function (right)]

The recovered reward function looks nothing like the true one, but importantly, the two match around the origin. The origin is the initial state, in which both links point downward. Both reward functions incentivize getting out of this region. Once the agent is out of it, it has sufficient kinetic energy to eventually finish the task, even if no further torque is applied.

This is one trajectory under the optimal policy for the recovered reward function. The trajectory also exhibits the observed behaviour.

[Animation: a sample acrobot trajectory under the recovered reward function]

Algorithm Overview

Assuming that the observed policy is optimal for the underlying task, the algorithm iteratively shrinks the space of candidate reward functions. Initially, a random reward function is guessed and an optimal policy for it is computed. A reward function remains a valid candidate only if, under it, the value of the states under this new policy is no greater than their value under the observed policy; each such constraint shrinks the space of valid candidates. From the valid candidates we pick the best one, namely the reward under which the observed policy beats the new policy by the largest margin. An optimal policy is then computed for this new candidate, and the cycle repeats. The outer loop described above can be found here. The reward-space shrinking step can be found here.
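
As a rough sketch of what one iteration of the reward-fitting step could look like: if values are estimated from sampled trajectories via discounted feature expectations, so that V^pi = alpha . mu(pi), then the weights alpha are chosen to make the observed policy beat every candidate policy found so far, penalizing violations twice as heavily (the piecewise-linear objective from reference 2). This is only an illustration of the formulation in reference 1, not the repository's code; all names below are made up.

import numpy as np
from scipy.optimize import linprog

def fit_reward_weights(mu_expert, mu_candidates):
    # Maximize sum_i p(alpha . (mu_expert - mu_i)) subject to |alpha_j| <= 1,
    # where p(x) = x if x >= 0 else 2x, i.e. p(x) = min(x, 2x).
    # mu_expert: (n,) feature expectations of the observed policy.
    # mu_candidates: (k, n) feature expectations of the candidate policies.
    k, n = mu_candidates.shape
    D = mu_expert[None, :] - mu_candidates             # rows d_i = mu_expert - mu_i
    # LP variables: x = [alpha (n), t (k)]; maximize sum_i t_i with
    # t_i <= alpha . d_i and t_i <= 2 * alpha . d_i.
    c = np.concatenate([np.zeros(n), -np.ones(k)])     # linprog minimizes, so negate
    A_ub = np.vstack([np.hstack([-D, np.eye(k)]),      # t_i - alpha . d_i <= 0
                      np.hstack([-2 * D, np.eye(k)])]) # t_i - 2 alpha . d_i <= 0
    b_ub = np.zeros(2 * k)
    bounds = [(-1, 1)] * n + [(None, None)] * k
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:n]

# Illustrative usage: 2 basis functions, 3 candidate policies (made-up numbers).
# In the actual loop, each candidate comes from running an RL solver (rlpyt here)
# on the reward given by the previous alpha.
alpha = fit_reward_weights(np.array([1.0, 0.5]),
                           np.array([[0.8, 0.4], [0.2, 0.9], [0.5, 0.5]]))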

Limitations

  1. The main bottleneck preventing generalization to other environments is that each new environment needs to be supplied with a custom reward basis.
  2. The motivation of the problem isn't addressed by its solution. The motivation was to recover a succinct representation of the task from observed behaviour, for example "score a goal" or "cross the grey line". In the example above, the recovered reward function explained the observed behaviour and yet wasn't interpretable in the same way as the true reward function.

Future Work

  1. Add Ziebart's Maximum Entropy Inverse Reinforcement Learning and the more recent Maximum Entropy Deep Inverse Reinforcement Learning by Wulfmeier et al.

References

  1. Algorithms for Inverse Reinforcement Learning
  2. Linear Programming for Piecewise Linear Objectives