Post 2: Environment and algorithms
The goals covered in this post are the implementation of an environment able to support RL training processes, and the implementation of some RL algorithms to test it.
Environment
Gym Interface
An environment able to support RL processes must implement a set of methods, known as the OpenAI Gym interface: open(), close(), reset(), step() and render(). To implement these methods, I am going to use the ZMQ Remote API, available since CoppeliaSim v4.3. It gives access to all the data available in a simulation and also makes it possible to control the simulation itself (start, stop, reset, load a scene, change the speed, …).
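As a reference, below is a minimal sketch of what such a wrapper could look like. The class name, scene path and helper methods are placeholders, not the actual project code, and the exact import and stepping calls of the ZMQ client vary slightly between CoppeliaSim releases (newer clients expose sim.setStepping()/sim.step() instead of the client-level calls used here).

```python
# Minimal sketch of a Gym-style wrapper over CoppeliaSim's ZMQ Remote API.
# Assumptions: the package/class names below match the installed client version,
# and _apply_action/_get_observation/_compute_reward hold the robot-specific logic.
from coppeliasim_zmqremoteapi_client import RemoteAPIClient


class CoppeliaEnv:
    def __init__(self, scene_path):
        self.client = RemoteAPIClient()           # connect to CoppeliaSim's ZMQ server
        self.sim = self.client.getObject('sim')   # handle to the 'sim' namespace
        self.client.setStepping(True)             # advance the simulation manually
        self.sim.loadScene(scene_path)            # load the training scene

    def reset(self):
        # Restart the simulation and return the initial observation.
        self.sim.stopSimulation()
        self.sim.startSimulation()
        return self._get_observation()

    def step(self, action):
        self._apply_action(action)                # e.g. set joint target velocities
        self.client.step()                        # advance one simulation step
        obs = self._get_observation()
        reward, done = self._compute_reward(obs)
        return obs, reward, done, {}

    def close(self):
        self.sim.stopSimulation()
```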
Implementation
Apart from using the ZMQ API, it is necessary to keep the code clear, clean and easy to maintain. All the implemented code has been organized into different files according to its functionality.
Improvements
When I tested the environment it ran too slowly, so I decided to take some time measurements and find the bottleneck in the code. The issue was that the ZMQ API script calls were too slow, so I decided to gather them and reimplement the code to reduce the number of calls as much as possible, as illustrated below.
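The following sketch (illustrative only, not the actual project code) shows the kind of change this implies: object handles are resolved once at start-up and each remote value is fetched a single time per step instead of repeatedly. The '/robot' path is a placeholder.

```python
# Slow pattern: every access crosses the ZMQ connection, several times per step.
def get_state_slow(sim):
    robot = sim.getObject('/robot')                    # remote call, repeated every step
    x = sim.getObjectPosition(robot, -1)[0]            # remote call
    y = sim.getObjectPosition(robot, -1)[1]            # same data fetched again
    return x, y


# Gathered pattern: resolve the handle once, fetch the position once, reuse it locally.
class StateReader:
    def __init__(self, sim):
        self.sim = sim
        self.robot = sim.getObject('/robot')           # one remote call at start-up

    def get_state(self):
        pos = self.sim.getObjectPosition(self.robot, -1)  # single remote call per step
        return pos[0], pos[1]
```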
Algorithms
Once the environment is ready, it is time to develop some algorithms to test it and see some results. In this post, only an introduction will be given.
Q learning
It is an algorithm based on dynamic programming and the Bellman equation (Q stands for 'Quality'). The main idea of Q-learning is to have an m×n table (m: number of states, n: number of actions) whose entries indicate how good or bad each action is in a given state. The learning process consists of filling the Q-table with the appropriate values using the Bellman equation. This algorithm requires a discrete state space as well as a discrete action space, since both are needed as indices to access the Q-table. To test this algorithm, only 2 dimensions have been considered (x-axis and y-axis movements).
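As a reference, a minimal sketch of the tabular update is shown below; the table size, hyperparameters and epsilon-greedy exploration are illustrative choices, not the exact values used in the project.

```python
import numpy as np

# Minimal sketch of tabular Q-learning for m discrete states and n discrete actions.
m_states, n_actions = 100, 4
alpha, gamma, epsilon = 0.1, 0.99, 0.1     # learning rate, discount, exploration rate

Q = np.zeros((m_states, n_actions))        # the m x n Q-table


def update(state, action, reward, next_state):
    # Bellman-based update: move Q(s, a) towards r + gamma * max_a' Q(s', a')
    target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])


def choose_action(state):
    # Epsilon-greedy choice over the discrete action set
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))
```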
DQN
It can be seen as a continuous version of Q-learning. Instead of using the Q-table (which requires a discrete state space), it uses a neural network (NN), which offers the advantage of handling a continuous state space. The NN takes as input the state (continuous values) extracted from the environment, and its output is an estimated value for each action, from which the most suitable one is chosen. The NN plays the same role the Q-table did.
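A minimal sketch of such a network is shown below, written here with PyTorch as an assumption about the framework; the state size (x and y position), layer widths and number of actions are illustrative.

```python
import torch
import torch.nn as nn


# Sketch of the network that replaces the Q-table in DQN:
# continuous state in, one Q-value estimate per discrete action out.
class QNetwork(nn.Module):
    def __init__(self, state_dim=2, n_actions=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)


# Action selection works like reading a row of the Q-table and taking the argmax.
q_net = QNetwork()
state = torch.tensor([[0.3, -1.2]])            # example continuous (x, y) state
action = q_net(state).argmax(dim=1).item()
```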
Daniel Peix del Río