Deep Reinforcement Learning Algorithms
Approach
The aim of the project is also to implement various deep RL algorithms. A few deep RL algorithms were written. There were two architectures that were implemented. First is a normal MLP based network, wherein the underlying module in each network will be a simple multilayeredperceptron. The other architecture that we used is a transformerstyle network.

DQN
The very first network that was written was Deep QLearning network. This is the most simple offpolicy network that started the breakthrough in deep RL research. Two important features are the fixed targets, and the replay buffer. The replay buffer is used to store the trajectories observed, and states are sampled from this to update a neural network whose aim is to learn the Qvalue function. Fixed targets are used for stabilising the training, and is updated periodically.

Dueling DQN
This is a slight modification to the DQN architecture. In this there are two heads, in which one of the head estimates the statevalue function, and the other head tries to estimate the advantage function. Advantage function, as the name suggests, gives the difference between Q(s,a) and V(s) [how advatageous is it to take an action]. The rest of the training is similar the DQN network.

A2C
The above two methods are ways of estimating the Q value function, and generally an epsilongreedy policy is applied to learn the policy. A2C (Advantage Actor Critic) is a method that is used to directly learn the policy function, and finally converge to the optimal policy. This is a policy gradient approach, and an on policy algorithm. The explanation behind the algorithm involves a bit of math, but in short, a neural network is learnt that predicts the probabilites of taking actions given the state. Finally it should be able to learn such a policy that the maximum probability should be for the optimal action given the state.

PPO
Proximal Policy Optimisation (PPO) is another onpolicy algorithm, and is used to estimate the policy function. The problem with on policy methods is that we do not have much data sampled using the current policy, because once we update the network, the policy changes. So, we would want to have data, as well as work onpolicy. This is handled using a trick called importance sampling. The other important point regarding PPO, is that there is a clipping that is used instead of a trust region as proposed in an earlier paper (TRPO).
All these algorithms were written, and a few experiments were conducted. A glimpse of the dueling DQN on an environment with just the goal and no obstacles can be seen here:
https://userimages.githubusercontent.com/57453637/18598288570aed38c48554ed9a32e9cfdc5171a88.mp4