A simple Double DQN implementation that can learn solely from a list of trajectories, without requiring an environment. This can be helpful in offline reinforcement learning, where trajectories are generated by a lagging policy and training is therefore decoupled from experience collection.
Note that the policy used to generate the trajectories should be relatively up-to-date.
Since this is a private project and not intended for public use, the code is not the cleanest and performance may be lacking. The implementation is CPU-only.
An example is provided here; it converges to the optimal Q-values for the described scenario.
pip install git+https://github.com/webertim/dqn_experience.git@master
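For orientation, below is a minimal PyTorch sketch of the kind of Double DQN update that can be driven purely by stored trajectories, with no environment interaction. The names (QNetwork, trajectories_to_batch, double_dqn_loss), the network sizes, and the hyperparameters are illustrative assumptions for this sketch and do not reflect this repository's actual API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class QNetwork(nn.Module):
    """Small MLP Q-network (illustrative, not the repo's implementation)."""

    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, obs):
        return self.net(obs)


def trajectories_to_batch(trajectories):
    """Flatten a list of trajectories into one transition batch.

    Each trajectory is assumed to be a list of
    (obs, action, reward, next_obs, done) tuples.
    """
    obs, act, rew, nxt, done = zip(*[t for traj in trajectories for t in traj])
    return {
        "obs": torch.tensor(obs, dtype=torch.float32),
        "action": torch.tensor(act, dtype=torch.int64),
        "reward": torch.tensor(rew, dtype=torch.float32),
        "next_obs": torch.tensor(nxt, dtype=torch.float32),
        "done": torch.tensor(done, dtype=torch.float32),
    }


def double_dqn_loss(online, target, batch, gamma=0.99):
    """Double DQN loss: select the next action with the online net,
    evaluate it with the target net."""
    q = online(batch["obs"]).gather(1, batch["action"].unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_actions = online(batch["next_obs"]).argmax(dim=1, keepdim=True)
        next_q = target(batch["next_obs"]).gather(1, next_actions).squeeze(1)
        y = batch["reward"] + gamma * (1.0 - batch["done"]) * next_q
    return F.smooth_l1_loss(q, y)


# Illustrative offline training loop on a fixed batch of trajectories
# (dimensions and update schedule are placeholder assumptions).
online, target = QNetwork(4, 2), QNetwork(4, 2)
target.load_state_dict(online.state_dict())
optimizer = torch.optim.Adam(online.parameters(), lr=1e-3)

batch = trajectories_to_batch(my_trajectories)  # my_trajectories: user-provided list
for step in range(1000):
    loss = double_dqn_loss(online, target, batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        # Periodic hard target update, as in standard (Double) DQN.
        target.load_state_dict(online.state_dict())
```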