They present a framework that integrates DRL with hierarchical value functions (h-DQN), in which the agent is motivated to solve intrinsic goals to aid exploration of the state space. They propose that intrinsic motivation can help the agent with the problem of sparse rewards. They show very good results in some Atari games, such as Montezuma's Revenge, where previous approaches did not perform nearly as well.
- The agent learns value functions which are not only functions of the state s but also of the current goal g: V(s,g).
- The agent learns to achieve intrinsically generated goals and then it learns how to properly chain these policies together in order to achieve global (extrinsic) goals.
- They approximate the value function V with a neural network: V(s,g; θ) (see the sketch after this block of bullets).
- They have a hierarchical representation where a meta-controller takes the current state and proposes a new goal. The low-level controller is then responsible for reaching this goal by properly selecting actions. Once the goal is reached or the episode terminates, the meta-controller selects a new goal.
- The meta-controller runs at a slower time-scale than the controller. It collects one experience transition only after the controller has experienced several steps and performed several actions.
- Two DQNs are implemented, one for the meta-controller and one for the controller. Each network is trained in the standard way using its own experience replay buffer (sketched below).
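A minimal sketch of how these pieces might fit together in code (PyTorch): an MLP stand-in for the paper's convolutional networks, a one-hot goal encoding concatenated to the state to realize V(s,g; θ), one replay buffer per level, and a standard one-step DQN update. All names, layer sizes, buffer sizes, and the batch layout are my assumptions, not the paper's implementation.

```python
import random
from collections import deque

import torch
import torch.nn as nn


class QNet(nn.Module):
    """Plain MLP Q-network (a stand-in for the paper's conv nets).

    For the controller, the input is the state concatenated with a one-hot
    goal vector, so the same module realizes the goal-conditioned value
    V(s, g; theta); for the meta-controller, the input is the state alone
    and the outputs score goals instead of primitive actions.
    """

    def __init__(self, in_dim, out_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)


def with_goal(state, goal_idx, num_goals):
    """Append a one-hot goal encoding to the state so values depend on (s, g)."""
    goal = nn.functional.one_hot(goal_idx, num_goals).float()
    return torch.cat([state, goal], dim=-1)


# One experience replay buffer per level of the hierarchy.
controller_replay = deque(maxlen=1_000_000)  # (obs, action, r_intrinsic, next_obs, done)
meta_replay = deque(maxlen=50_000)           # (state, goal, f_extrinsic, next_state, done)


def sample(buffer, batch_size=32):
    """Uniformly sample a minibatch of stored transitions."""
    return random.sample(buffer, min(batch_size, len(buffer)))


def dqn_update(q_net, q_target, batch, optimizer, gamma=0.99):
    """Standard one-step Q-learning update, shared by both levels.

    For the controller, `obs` already contains the one-hot goal and `acts`
    are primitive actions; for the meta-controller, `obs` is the raw state
    and goals play the role of actions.
    """
    obs, acts, rewards, next_obs, dones = batch
    q_pred = q_net(obs).gather(1, acts.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = rewards + gamma * (1.0 - dones) * q_target(next_obs).max(dim=1).values
    loss = nn.functional.mse_loss(q_pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Concatenating a one-hot goal to the input is just one simple way to make the value depend on (s, g); the point is that both levels can reuse the same DQN machinery with different inputs and outputs.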
- The meta-controller receives a state st and chooses a goal gt ∈ G, where G denotes the set of all possible goals and is predefined by the authors according to the environment/task.
- The meta-controller's objective is to maximize the extrinsic reward received from the environment.
- It collects transitions (st,gt,ft,st+1) and stores them in a replay memory for training, where ft is the extrinsic reward received from the environment while goal gt was active (see the target sketched below).
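For concreteness, the meta-controller's learning target would then follow the usual one-step DQN rule with goals in place of actions; the names below carry over from the sketch above, and the use of a target network and the discount value are my assumptions.

```python
import torch


def meta_td_target(meta_q_target, f_t, next_state, done, gamma=0.99):
    """y = f_t + gamma * max_g' Q2(s_{t+N}, g'), computed with a target network.

    f_t is the extrinsic reward accumulated while the previous goal was being
    pursued; next_state is the state observed when that goal terminated.
    """
    with torch.no_grad():
        best_next = meta_q_target(next_state).max(dim=-1).values
    return f_t + gamma * (1.0 - done) * best_next
```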
- The controller takes the current state st and goal gt and selects an action at.
- The goal gt remains fixed for the next few steps, until it is either achieved or a terminal state is reached.
- The internal critic is in charge of evaluating whether the goal is achieved or not and provides an appropriate intrinsic reward rt(g).
- The controller's objective is to maximize this intrinsic reward.
- It collects transitions (st,at,gt,rt,st+1) and stores them in its own replay memory (the interaction loop is sketched below).
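A sketch of one meta-controller step seen from the controller's side, assuming a classic Gym-style environment, an epsilon-greedy controller policy, the `with_goal` helper from the earlier sketch, and a hypothetical `critic.goal_reached` interface; none of these names come from the paper.

```python
import random

import torch


def run_goal(env, state, goal, controller_q, critic, controller_replay,
             num_goals, eps=0.1):
    """Act with the controller until `goal` is reached or the episode ends.

    Returns the final state, the extrinsic reward f accumulated for the
    meta-controller, and the episode-termination flag. `state` is a
    1 x state_dim float tensor.
    """
    f = 0.0                                   # extrinsic reward for the meta-controller
    done, reached = False, False
    while not (done or reached):
        obs = with_goal(state, torch.tensor([goal]), num_goals)
        if random.random() < eps:             # epsilon-greedy exploration
            action = env.action_space.sample()
        else:
            action = controller_q(obs).argmax(dim=-1).item()
        next_state, ext_reward, done, _ = env.step(action)
        next_state = torch.as_tensor(next_state, dtype=torch.float32).unsqueeze(0)
        reached = critic.goal_reached(next_state, goal)
        r_int = 1.0 if reached else 0.0       # binary intrinsic reward from the critic
        next_obs = with_goal(next_state, torch.tensor([goal]), num_goals)
        controller_replay.append((obs, action, r_int, next_obs, done))
        f += ext_reward                       # handed back to the meta-controller
        state = next_state
    return state, f, done
```

The returned pair (final state, f) is exactly what the meta-controller needs to store its own transition at the slower timescale.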
- Use of entities or objects in the scene to parameterize goals.
- They built a custom object detector from images to provide object/goal candidates.
- The internal critic is defined in the space <entity1,relation,entity2>, where relation defines a configuration of the entities (for example: entity1 reaches entity2).
- The agent is free to choose entity2.
- They only handle one relation (entity reaches entity), but they propose to develop a parameterized intrinsic reward function given entities as future work (a toy example of the current critic is sketched below).
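A toy version of such an internal critic, hard-coding the single "reaches" relation as a proximity test on detected object positions. The detector interface, the "agent" entity, and the pixel threshold are my assumptions, not the paper's implementation; a real version would also map goal indices to entity names, whereas here the goal is passed directly as a name for simplicity.

```python
import math


class InternalCritic:
    """Evaluates goals of the form <agent, reaches, entity> and emits a
    binary intrinsic reward; other relations are left as future work."""

    def __init__(self, detector, threshold=8.0):
        self.detector = detector    # callable: frame -> {entity_name: (x, y)}
        self.threshold = threshold  # pixel distance that counts as "reached"

    def goal_reached(self, frame, goal_entity):
        positions = self.detector(frame)
        if "agent" not in positions or goal_entity not in positions:
            return False
        (ax, ay), (gx, gy) = positions["agent"], positions[goal_entity]
        return math.hypot(ax - gx, ay - gy) < self.threshold

    def intrinsic_reward(self, frame, goal_entity):
        return 1.0 if self.goal_reached(frame, goal_entity) else 0.0
```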
- I like both the idea of a hierarchical approach and the use of intrinsic motivation. Given that we are talking about reinforcement learning, however, it would be useful if the agent could learn how to set internal goals on its own, without the need to predefine a fixed set of goals.
- Moreover, the development of the internal critic looks like a really exciting field of research. Right now, a limited number of relations needs to be defined. It could be possible for the critic to evolve autonomously by learning more complex relations or higher-level abstractions: instead of evaluating whether <entity1 reaches entity2>, it could evaluate whether <entitydoor is now open>.