Autonomous mobile robots, such as Automated Guided Vehicles or soccer robots, are powered by smart embedded software. This software lets the robot sense the world, share its beliefs about the world state with other robots, and select the best action to perform next based on those beliefs. The behavior of these robots emerges from the action selection policy coded in the embedded software. A well-designed, rational policy makes the robot maximize its expected utility.
Traditionally, action selection policies for robots are modelled and implemented using hierarchical state machines or behavior trees. Although these methods are very good at describing the hierarchical decomposition of actions into smaller actions, they are quite rigid in specifying how to choose among actions at runtime. The rigidity stems from the fact that actions are selected using hardcoded conditions on the world state. Such conditions are generally inflexible and suboptimal, as the sketch below illustrates.
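The following minimal sketch shows what such hardcoded action selection looks like in practice. It assumes a simplified soccer-robot world state; the names (WorldState, select_action) and the thresholds are purely illustrative, not taken from any existing code base.

```python
from dataclasses import dataclass


@dataclass
class WorldState:
    has_ball: bool
    distance_to_goal: float  # metres
    opponent_nearby: bool


def select_action(state: WorldState) -> str:
    # Each branch is a hand-coded condition on the world state; tuning these
    # thresholds by hand is exactly the rigidity discussed above.
    if state.has_ball and state.distance_to_goal < 2.0:
        return "shoot"
    if state.has_ball and state.opponent_nearby:
        return "pass"
    if state.has_ball:
        return "dribble"
    return "intercept"
```

Changing when the robot shoots or passes requires editing and re-tuning these conditions by hand, which does not scale to complex, dynamic environments.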
Our research deals with replacing hardcoded action selection with a more flexible mechanism. Rather than specifying action selection conditions, we want the robot to learn a good policy by training. To attain this goal we use reinforcement learning: the robot learns a non-linear policy function that maps a world state onto action utilities. A model of the robot explores a simulated but realistic environment, where it receives feedback on its actions in the form of rewards, and it gradually improves its policy function by maximizing the accumulated reward.
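As a rough illustration of this learning setup, the sketch below uses a small feed-forward network as the non-linear policy function and a toy stand-in for the simulated environment, trained with one-step Q-learning. The network size, the environment dynamics, and the reward are placeholder assumptions, not the actual robot model or training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATE, N_HIDDEN, N_ACTIONS = 3, 16, 4  # world-state features, hidden units, actions

# Two-layer network: world state -> estimated utility per action.
W1 = rng.normal(0, 0.1, (N_STATE, N_HIDDEN))
W2 = rng.normal(0, 0.1, (N_HIDDEN, N_ACTIONS))


def utilities(state):
    h = np.tanh(state @ W1)  # non-linear hidden layer
    return h @ W2, h


def step(state, action):
    """Toy stand-in for the simulated environment: next state and reward."""
    next_state = np.clip(state + rng.normal(0, 0.1, N_STATE), -1, 1)
    reward = 1.0 if action == int(np.argmax(state)) else -0.1
    return next_state, reward


alpha, gamma, epsilon = 0.01, 0.95, 0.1
state = rng.uniform(-1, 1, N_STATE)
for _ in range(10_000):
    q, h = utilities(state)
    # Epsilon-greedy exploration of the simulated environment.
    action = int(rng.integers(N_ACTIONS)) if rng.random() < epsilon else int(np.argmax(q))
    next_state, reward = step(state, action)
    # One-step temporal-difference target; the TD error drives the update.
    target = reward + gamma * np.max(utilities(next_state)[0])
    td_error = target - q[action]
    grad_W2 = td_error * h                                            # d q[action] / d W2[:, action]
    grad_W1 = td_error * np.outer(state, W2[:, action] * (1 - h**2))  # backprop through tanh
    W2[:, action] += alpha * grad_W2
    W1 += alpha * grad_W1
    state = next_state
```

Instead of hand-written thresholds, the mapping from world state to action utilities is now a trained function, and the exploration of the simulated environment replaces manual tuning.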