Understanding Reinforcement Learning in-depth – GeeksforGeeks

Interest in reinforcement learning has grown enormously in recent years, driven by the astonishing results on old Atari games, DeepMind’s victory with AlphaGo, stunning breakthroughs in robotic arm manipulation, and agents that even beat professional players at 1v1 Dota. Since the impressive breakthrough on the ImageNet classification challenge in 2012, supervised deep learning’s successes have continued to pile up, and people from all walks of life have begun to use deep neural nets to solve a variety of new problems, including how to learn intelligent behavior in complex dynamic environments.

How Is Supervised Learning Different from Reinforcement Learning?

Supervised learning is used in the majority of machine learning applications. This means that you provide an input to your neural network model and you know what output the model should produce. You can therefore compute gradients using backpropagation and train the network to produce the desired outputs. Say you want to teach a neural network to play Pong in a supervised setting. You’d get a good human player to play Pong for a couple of hours, and you’d create a data set that logs every frame the human sees on the screen, along with the action taken in response to that frame: pressing the up arrow or the down arrow. We can then feed those input frames into a very simple neural network with two possible outputs, the up action or the down action, and train that network with backpropagation on the data set of human games so that it duplicates the actions of the human player.
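
The supervised setup described above can be sketched in a few lines. This is only an illustration under loud assumptions: real frames are replaced by two made-up features (`ball_y`, `paddle_y`), the "human" is a hypothetical rule that always moves toward the ball, and the "very simple neural network" is reduced to a single logistic unit trained by gradient descent.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_demonstrations(n: int, rng: random.Random):
    """Hypothetical stand-in for logged human play.

    Each "frame" is reduced to two features (ball_y, paddle_y); the label is
    the action the pretend human took: 1 = up, 0 = down.
    """
    data = []
    for _ in range(n):
        ball_y, paddle_y = rng.random(), rng.random()
        action = 1 if ball_y > paddle_y else 0  # the "human" moves toward the ball
        data.append(((ball_y, paddle_y), action))
    return data


def train_clone(data, epochs: int = 300, lr: float = 1.0):
    """Fit p(up | frame) = sigmoid(w . x + b) by gradient descent on cross-entropy."""
    w, b = [0.0, 0.0], 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, y in data:
            p_up = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
            err = p_up - y  # derivative of the cross-entropy loss w.r.t. the logit
            grad_w[0] += err * x[0]
            grad_w[1] += err * x[1]
            grad_b += err
        w[0] -= lr * grad_w[0] / n
        w[1] -= lr * grad_w[1] / n
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, data) -> float:
    """Fraction of frames where the clone picks the same action as the human."""
    hits = 0
    for x, y in data:
        predicted = 1 if sigmoid(w[0] * x[0] + w[1] * x[1] + b) > 0.5 else 0
        hits += predicted == y
    return hits / len(data)


rng = random.Random(0)
w, b = train_clone(make_demonstrations(500, rng))
held_out = make_demonstrations(200, rng)
print(f"agreement with the human on held-out frames: {accuracy(w, b, held_out):.2f}")
```

Note that the score being measured is agreement with the human, not success at the game, so the clone can at best match its teacher.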

However, there are two significant drawbacks to this approach:

  • Supervised learning requires the creation of a data set to train on, which is not always an easy task.
  • If you train your neural network model to simply imitate the actions of the human player, your agent can never be better at playing Pong than that human. That is a problem if what you actually want is a neural network that plays better than the human it learned from.

Working of Reinforcement Learning:

Fortunately, there is a way to train an agent without a labelled data set, and it is known as reinforcement learning. The reinforcement learning framework is surprisingly similar to the supervised learning framework. We still have an input frame, we run it through a neural network model, and the network produces an output action, either up or down. But now we don’t know the target label: since we don’t have a data set to train on, we don’t know whether we should have gone up or down in any given situation. In reinforcement learning, the network that converts input frames to output actions is called the policy network, and one of the simplest ways to train a policy network is a method known as policy gradients. In policy gradients, you start with a completely random network. You feed a frame from the game engine to that network, and it produces a random output action, either up or down. You send that action back to the game engine, the game engine generates the next frame, and the loop continues. The network in this example might be a fully connected network.
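
The frame-in, action-out loop can be sketched as follows. Everything here is an assumption made for illustration: `ToyPong` is a hypothetical, drastically simplified stand-in for the game engine (a "frame" is just two numbers), and `RandomPolicyNetwork` is a tiny fully connected network with random, untrained weights. The network outputs a probability of going up and the action is sampled from it, which is the behaviour described in the next paragraph.

```python
import math
import random


class ToyPong:
    """Hypothetical, drastically simplified game engine (not real Pong).

    A "frame" is just [ball_y, paddle_y]. The paddle moves 0.1 up or down per
    step; after `horizon` steps the episode ends with reward +1 if the paddle
    is close to the ball and -1 otherwise.
    """

    def __init__(self, rng: random.Random, horizon: int = 8):
        self.rng, self.horizon = rng, horizon

    def reset(self):
        self.ball_y, self.paddle_y, self.t = self.rng.random(), 0.5, 0
        return [self.ball_y, self.paddle_y]

    def step(self, action: int):
        """action 1 = up, 0 = down. Returns (next_frame, reward, done)."""
        delta = 0.1 if action == 1 else -0.1
        self.paddle_y = min(1.0, max(0.0, self.paddle_y + delta))
        self.t += 1
        done = self.t >= self.horizon
        reward = 0.0
        if done:
            reward = 1.0 if abs(self.ball_y - self.paddle_y) < 0.15 else -1.0
        return [self.ball_y, self.paddle_y], reward, done


class RandomPolicyNetwork:
    """Tiny fully connected network with random, untrained weights."""

    def __init__(self, rng: random.Random, n_inputs: int = 2, n_hidden: int = 4):
        self.w1 = [[rng.gauss(0, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
        self.w2 = [rng.gauss(0, 1) for _ in range(n_hidden)]

    def prob_up(self, frame) -> float:
        hidden = [math.tanh(sum(w * x for w, x in zip(row, frame))) for row in self.w1]
        logit = sum(w * h for w, h in zip(self.w2, hidden))
        return 1.0 / (1.0 + math.exp(-logit))


def run_episode(env: ToyPong, policy: RandomPolicyNetwork, rng: random.Random):
    """Frame in, action out, action back to the engine, until the game ends."""
    frame, done, trajectory, reward = env.reset(), False, [], 0.0
    while not done:
        action = 1 if rng.random() < policy.prob_up(frame) else 0  # sample the action
        trajectory.append((frame, action))
        frame, reward, done = env.step(action)
    return trajectory, reward


rng = random.Random(0)
trajectory, final_reward = run_episode(ToyPong(rng), RandomPolicyNetwork(rng), rng)
print(len(trajectory), "steps, final reward", final_reward)
```

No learning happens in this sketch; it only shows the interaction loop that training is later built on.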

You could use convolutions in the network instead of fully connected layers. Either way, the network’s output now consists of two numbers: the probability of going up and the probability of going down. While training, you sample from this distribution so that you’re not always repeating the exact same actions. This lets your agent explore the environment somewhat randomly, hopefully discovering higher rewards and, more crucially, better behavior. Because we want our agent to learn entirely on its own, the only feedback we give it is the game’s scoreboard. Any time our agent scores a goal, it receives a +1 reward, and if the opponent scores a goal, our agent receives a -1 penalty. The agent’s goal is to optimize its policy to collect as much reward as possible.

To train our policy network, the first thing we do is gather a lot of experience. Simply run a series of game frames through your network, sample actions, and feed them back into the engine to generate a set of random Pong games. Because our agent hasn’t learned anything useful yet, it will lose most of those games, but every so often it will get lucky and select, at random, a whole sequence of actions that leads to a goal. In that case, our agent receives a reward. The important point is that for every episode, regardless of whether the reward was positive or negative, we can already compute the gradients that would make the actions our agent chose more likely in the future. For every episode with a positive reward, policy gradients use these gradients as they are to raise the probability of those actions in the future. When the reward is negative, we apply the same gradients multiplied by minus one, and this minus sign ensures that all of the actions we took in a particularly bad episode become less likely in the future.

As a result, when training our policy network, actions that lead to negative rewards are gradually filtered out, while actions that lead to positive rewards become increasingly likely. In a way, our agent is learning to play Pong.
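
A minimal sketch of this update rule, under stated assumptions: the game is a hypothetical toy (a paddle that moves 0.1 up or down for eight steps and wins if it ends near the ball), and the "policy network" is shrunk to a single weight `w`. The constants, the learning rate, and the number of episodes are all illustrative choices rather than values from any real Pong agent.

```python
import math
import random

HORIZON, STEP, WIN_MARGIN = 8, 0.1, 0.15  # hypothetical toy-game constants


def prob_up(w: float, ball_y: float, paddle_y: float) -> float:
    """One-weight 'policy network': p(up) = sigmoid(w * (ball_y - paddle_y))."""
    z = w * (ball_y - paddle_y)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def play_episode(w: float, rng: random.Random):
    """Play one game; return (final reward, d log pi / d w summed over all steps)."""
    ball_y, paddle_y, grad_log_prob = rng.random(), 0.5, 0.0
    for _ in range(HORIZON):
        x = ball_y - paddle_y
        p = prob_up(w, ball_y, paddle_y)
        action = 1 if rng.random() < p else 0  # sample, do not take the argmax
        grad_log_prob += (action - p) * x  # gradient of log pi(action | frame) w.r.t. w
        paddle_y = min(1.0, max(0.0, paddle_y + (STEP if action else -STEP)))
    reward = 1.0 if abs(ball_y - paddle_y) < WIN_MARGIN else -1.0
    return reward, grad_log_prob


def train(episodes: int, lr: float, rng: random.Random) -> float:
    w = 0.0  # p(up) = 0.5 everywhere, i.e. a completely random policy
    for _ in range(episodes):
        reward, grad_log_prob = play_episode(w, rng)
        # Positive reward: make every action taken in this episode more likely.
        # Negative reward: same gradient times -1, so those actions become less likely.
        w += lr * reward * grad_log_prob
    return w


def win_rate(w: float, games: int, rng: random.Random) -> float:
    return sum(play_episode(w, rng)[0] > 0 for _ in range(games)) / games


rng = random.Random(0)
before = win_rate(0.0, 1000, rng)
w_trained = train(episodes=2000, lr=0.5, rng=rng)
after = win_rate(w_trained, 1000, rng)
print(f"win rate before: {before:.2f}, after: {after:.2f}, learned w: {w_trained:.2f}")
```

The only learning signal is the final +1 or -1; the agent is never told that moving toward the ball is the right thing to do, yet a positive `w` encodes exactly that behaviour.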

The drawback of Policy Gradients:

So, we can use policy gradients to train a neural network to play Pong. But, as always, there are a few substantial drawbacks to this approach. Let’s return to Pong once more. Imagine that your agent has been practising for a while and is actually quite good at playing Pong, bouncing the ball back and forth, but then it makes a mistake at the end of the episode: it lets the ball pass and receives a penalty. The problem with policy gradients is that they assume that, because we lost that episode, all of the actions we took in it must have been bad, and so they reduce the likelihood of repeating any of those actions in the future.

Credit Assignment Dilemma:

But keep in mind that for most of that episode we were playing extremely well, so we don’t want to reduce the probability of those actions. This is known as the credit assignment dilemma in reinforcement learning (more commonly called the credit assignment problem): if you get a reward at the end of your episode, which particular actions led to that specific reward? This problem arises entirely from the fact that we have a sparse reward setting. Instead of receiving a reward for every single action, we only receive a reward after a complete episode, and our agent must figure out which part of its action sequence caused the reward it eventually receives. In the Pong example, our agent should learn that only the actions immediately preceding the ball’s impact are genuinely crucial; everything it does after the ball has flown away is irrelevant to the final payoff. As a result of this sparse reward setting, reinforcement learning algorithms are often sample inefficient, which means you have to give them a lot of training time before they learn anything useful.
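
To make the problem concrete, the sketch below compares two ways of turning one episode’s rewards into per-action weights. `episode_level_credit` mirrors what is described above: every action inherits the final outcome. `discounted_returns` is a common mitigation that is not part of the discussion above and is shown here only as an assumption; the discount factor `gamma=0.5` is an arbitrary illustrative value.

```python
def episode_level_credit(rewards):
    """Every action in the episode is weighted by the episode's total outcome."""
    outcome = sum(rewards)
    return [outcome for _ in rewards]


def discounted_returns(rewards, gamma):
    """Weight each action by the discounted sum of the rewards that follow it."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# Four good moves followed by the single mistake that loses the point.
rewards = [0.0, 0.0, 0.0, 0.0, -1.0]
print(episode_level_credit(rewards))           # [-1.0, -1.0, -1.0, -1.0, -1.0]
print(discounted_returns(rewards, gamma=0.5))  # [-0.0625, -0.125, -0.25, -0.5, -1.0]
```

With episode-level credit the four good moves are punished as hard as the mistake. Discounting softens this by blaming the actions closest to the penalty most, but it is only a heuristic and does not solve credit assignment.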

Montezuma’s Revenge and Reinforcement Learning Algorithms

When you compare the efficiency of reinforcement learning algorithms to human learning, it turns out that in some extreme cases learning from sparse rewards fails altogether. In the game Montezuma’s Revenge, the agent’s mission is to negotiate a series of ladders, jump over a skull, retrieve a key, and then travel to the door in order to progress to the next level. The issue is that by taking random actions, your agent will never see a single reward, because the sequence of actions required to obtain that reward is far too complicated. It will never get there with random actions, so your policy gradient never receives a single positive reward and has no idea what to do. The same is true in robotics, where you might want to train a robotic arm to pick up an object and stack it on top of something else. A typical robotic arm has roughly seven joints that can move, so it has a large action space. If you only give it a positive reward once it has successfully stacked a block, random exploration will never see any of that reward.

It’s also worth noting how this compares to the usual supervised deep learning successes that we see in areas like computer vision. The reason computer vision works so well is that each input frame has a target label, allowing for very efficient gradient descent using techniques such as backpropagation. In a reinforcement learning scenario, on the other hand, you’re dealing with the huge problem of sparse rewards. This is why something as simple as stacking one block on top of another turns out to be quite tough even for state-of-the-art deep learning. The usual approach to the problem of sparse rewards has been reward shaping.
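
A back-of-the-envelope calculation shows why random exploration fails here. The sketch assumes, as a deliberate simplification, that exactly one action sequence earns the reward and that the agent picks uniformly among its actions at every step. The numbers used (18 actions, 30 steps) are illustrative, not measurements from the game.

```python
def random_success_probability(num_actions: int, sequence_length: int) -> float:
    """Chance that uniformly random actions reproduce one specific sequence."""
    return (1.0 / num_actions) ** sequence_length


def expected_episodes_until_success(num_actions: int, sequence_length: int) -> float:
    """Mean number of independent attempts before the first success (1 / p)."""
    return float(num_actions) ** sequence_length


# Illustrative numbers only: 18 possible actions, 30 decisive steps in a row.
p = random_success_probability(18, 30)
n = expected_episodes_until_success(18, 30)
print(f"success probability per episode: {p:.3e}")
print(f"expected episodes before the first reward: {n:.3e}")
```

Under these assumptions the expected number of episodes is on the order of 10^37, so a policy gradient that depends on stumbling over the reward by chance effectively never receives a learning signal.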

Figure: Montezuma’s Revenge

Reward Shaping:

Reward shaping is the practice of manually designing a reward function that guides your policy toward some desired behavior. For example, in Montezuma’s Revenge, you might give your agent a reward every time it avoids the skull or reaches the key, and these extra rewards encourage your policy to behave in a certain way.
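
As a rough illustration, a shaped reward for this example might look like the sketch below. The event flags (`avoided_skull`, `reached_key`, `reached_door`) and the bonus values are hypothetical; a real game engine does not report such events directly, so detecting them would be part of the hand-engineering effort.

```python
from dataclasses import dataclass


@dataclass
class StepEvents:
    """Hypothetical per-step event flags extracted from the game state."""
    avoided_skull: bool = False
    reached_key: bool = False
    reached_door: bool = False


# Hand-picked constants; choosing these values is the "shaping".
SKULL_BONUS = 0.1
KEY_BONUS = 0.5
DOOR_REWARD = 1.0


def sparse_reward(events: StepEvents) -> float:
    """Reward only when the level is actually completed."""
    return DOOR_REWARD if events.reached_door else 0.0


def shaped_reward(events: StepEvents) -> float:
    """Sparse reward plus manually designed bonuses for intermediate progress."""
    reward = sparse_reward(events)
    if events.avoided_skull:
        reward += SKULL_BONUS
    if events.reached_key:
        reward += KEY_BONUS
    return reward


picked_up_key = StepEvents(reached_key=True)
print(sparse_reward(picked_up_key), shaped_reward(picked_up_key))  # 0.0 0.5
```

The shaped version gives the agent feedback long before it reaches the door, at the cost of encoding the designer’s idea of progress into the reward.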

While this obviously makes it easier for your policy to converge to intended behavior, reward shaping has a number of drawbacks:

  • For starters, reward shaping is a custom process that has to be redone for each new environment in which you want to train a policy. If you take the Atari benchmark as an example, you’d have to design a different reward function for each of those games, which is simply not scalable.
  • The second issue is that reward shaping suffers from what is known as the alignment problem. It turns out that shaping a reward function correctly is surprisingly difficult in many cases: your agent will find some ingenious way to collect a large amount of reward while not doing what you actually wanted. In a way, the policy is simply overfitting to the specific reward function that you designed, rather than generalizing to the intended behavior that you had in mind.

There are a lot of amusing examples of reward shaping gone wrong. For example, when an agent was trained to jump and the reward function was the distance between its feet and the ground, the agent learned to grow a very tall body and do a kind of backflip so that its feet ended up very far from the ground. To get a sense of how difficult reward shaping can be, look at the equations below, which define a shaped reward function for a robotic control task.

$$R_{\text{grasp}} = \begin{cases} b_{\text{lift}}, & \text{if } h > \epsilon_{h} \\ w_{\theta}, & \text{if } d_{\text{orient}} < \epsilon_{d} \text{ and } \theta_{\text{orient}} < \epsilon_{\theta} \\ 0, & \text{otherwise} \end{cases}$$

$$R_{\text{stack}} = \begin{cases} \gamma_{t} b_{\text{stack}}, & \text{if } d_{\text{stack}} < \epsilon_{d} \text{ and } \theta_{\text{stack}} < \epsilon_{\theta} \\ \gamma_{t}\left(w_{\theta} r_{\theta} + w_{d} r_{d}\right), & \text{otherwise} \end{cases}$$

$$R_{\text{full}} = \begin{cases} w_{\text{stack}}, & \text{if } d_{\text{stack}} < \epsilon_{d} \wedge \theta_{\text{stack}} < \epsilon_{\theta} \\ w_{\text{stages}}, & \text{if } d_{\text{stack}} < \epsilon_{d} \wedge \theta_{\text{stack}} < \epsilon_{\theta} \\ w_{\text{grasp}}, & \text{if } h > \epsilon_{h} \\ w_{\text{stage}_{1}}, & \text{if } d_{\text{stage}_{1}} < \epsilon_{d} \\ 0, & \text{otherwise} \end{cases}$$

One can only imagine how much effort researchers spent developing this exact reward mechanism in order to achieve the desired behavior.
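
To give a feel for how many hand-tuned quantities even the simplest of these pieces contains, here is a direct transcription of the grasp reward into code. The structure follows the first equation above, with cases checked from top to bottom; every numeric default (`b_lift`, `w_theta`, and the three thresholds) is a made-up placeholder, since the actual values are not given here.

```python
def grasp_reward(
    h: float,             # height of the object above the surface
    d_orient: float,      # distance error of the gripper relative to the object
    theta_orient: float,  # orientation error of the gripper
    *,
    b_lift: float = 1.0,      # placeholder bonus for lifting the object
    w_theta: float = 0.5,     # placeholder reward for a well-aligned gripper
    eps_h: float = 0.05,      # placeholder height threshold
    eps_d: float = 0.02,      # placeholder distance threshold
    eps_theta: float = 0.1,   # placeholder angle threshold
) -> float:
    """Piecewise grasp reward; the first matching case wins."""
    if h > eps_h:
        return b_lift
    if d_orient < eps_d and theta_orient < eps_theta:
        return w_theta
    return 0.0


print(grasp_reward(h=0.10, d_orient=1.0, theta_orient=1.0))   # lifted: 1.0
print(grasp_reward(h=0.00, d_orient=0.01, theta_orient=0.05))  # aligned only: 0.5
print(grasp_reward(h=0.00, d_orient=1.0, theta_orient=1.0))   # neither: 0.0
```

Five constants have to be chosen for this one term alone, and the stacking and full-task rewards add more on top.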

Finally, in some cases, such as AlphaGo, you don’t want to do any reward shaping at all, because it would constrain your policy to human behavior, which isn’t always optimal. So the dilemma we’re in right now is that we know it’s difficult to train in a sparse reward setting, but it’s also difficult to design a shaped reward function, and we don’t always want to shape the reward in the first place.

Many internet sources describe reinforcement learning as some kind of magical AI sauce that lets an agent learn from itself or improve upon its previous version, but the reality is that most of these advances are the result of hard work by some of the world’s greatest minds. There’s a lot of hard engineering going on behind the scenes, so I believe one of the most difficult aspects of navigating our digital landscape is separating fact from fiction in this sea of clickbait fuelled by the advertising business.

Boston Dynamics’ Atlas robot is a great example of what I’m talking about. If you walk out on the street and ask a thousand people who has the most advanced robots today, they’ll most likely point to Atlas from Boston Dynamics, because everyone has seen the video of it doing a backflip. However, if you look at Boston Dynamics’ research track record and previous papers, it’s highly unlikely that there’s a lot of deep learning going on there. Don’t get me wrong: they’re doing a lot of advanced robotics, but there’s not a lot of self-driven behavior or intelligent decision-making going on in those robots. Boston Dynamics is an excellent robotics company, but the media impression they’ve created may be confusing to the many people who are unaware of what’s going on behind the scenes.

Nonetheless, given the current state of research, we should not be dismissive of the possible dangers that these technologies may pose. It’s great that more people are becoming interested in AI safety research, because concerns like autonomous weapons and mass surveillance must be taken seriously. The only hope we have is that international law will be able to keep up with the tremendous technological advancements we are witnessing. On the other hand, I believe the media focuses far too much on the negative aspects of these technologies, simply because people are afraid of what they don’t understand, and fear sells more ads than utopias. Most, if not all, technological advancements are beneficial in the long term, as long as we can ensure that no monopolies can preserve or enforce their dominance through the malicious use of AI.


  • Gudimella, A., Story, R., Shaker, M., Kong, R., Brown, M., Shnayder, V., & Campos, M. (2017). Deep reinforcement learning for dexterous manipulation with concept networks. arXiv preprint arXiv:1709.06977.
