
Elements of a reinforcement learning setting

[Image: man in cold water, sub-zero temperatures. © Shutterstock]

Building on our previous exploration of reinforcement learning examples, we will now look more closely at the interaction between the agent and the environment. In this step, we describe these two elements in greater detail and introduce a number of new elements which are distinctive to reinforcement learning.


The agent

The agent is our decision-making entity. We need to be nuanced here, as the idea of an agent is a little abstract. An agent is not necessarily an entire organism or robot or some physically distinct artefact, because, for the purposes of formulating problems in a solvable way, some elements of the organism or robot or entity can actually be considered part of the environment. We will look at what we mean by this when we consider the “environment” element next.


The environment

It is easy to think of the environment as only that which exists outside the robot or organism or behaving entity, but that is rarely how it appears in a formulation of a reinforcement learning problem. For example, think of an autonomous taxi driving around a city and deciding whether to go to a particular part of the city in search of a fare or to return to a charging point. It might base its decision on current battery status, typical traffic conditions, journey times to the passenger pick-up point and the ease of finding available charging points. In this example, the environment is best modelled to include elements of the car itself, i.e. the battery. This is a point that beginners often miss and which causes confusion. More generally, the agent’s environment can often include state information about the object in which the agent resides, including things like memories, desires, needs and wants.

We are now going to introduce a number of new elements of a reinforcement learning problem which are common in any formal treatment of the topic. These are:

  • Policies
  • Value functions
  • Reward signals
  • Model of the environment


Policy

A policy defines the agent’s behaviour. It is a mapping from the agent’s perception of the environment, as represented by an environmental state, to the actions to take. The policy is the central element of reinforcement learning, as it is the policy that dictates how the agent behaves. The interaction between behaviour (policy) and the environment is crucial for optimising behaviour to maximise returns for the task at hand. If we say that an agent behaves intelligently, we are in effect saying that it has an intelligent policy. Real-world policies range in complexity from simple look-up tables to complex stochastic high-order models.
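The simplest kind of policy mentioned above, a look-up table, can be sketched in a few lines of Python. The states and actions below are invented for illustration and loosely echo the taxi example:

```python
# A minimal sketch of a policy as a look-up table.
# Every state name and action name here is illustrative, not from a real system.
policy = {
    "battery_low": "go_to_charger",
    "battery_ok_high_demand": "seek_fare",
    "battery_ok_low_demand": "wait",
}

def act(state):
    """Return the action this policy prescribes for a perceived state."""
    return policy[state]

print(act("battery_low"))  # go_to_charger
```

More sophisticated policies replace the fixed table with a learned, often stochastic, function, but the role is the same: state in, action out.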

Reward signal

The reward signal and the value function (discussed next) are so closely coupled that people often confuse the two. It is imperative that we keep them distinct from one another. At each trial of the reinforcement learning algorithm, i.e. each time step, the environment returns a reward: a single number. It encodes positive and negative outcomes with respect to an action in a given environmental state. These rewards are the experience of the agent: they define what is a good, bad or indifferent experience with respect to an action and environmental state. The agent can only influence the reward through its actions in the environment.

It is the reward signal which, in general, drives changes in policy. If an action leads to a significant reward from the environment, the policy may be encouraged to reproduce that action should the same environmental state be encountered in the future. It is worth pointing out that rewards are often stochastic functions of the environmental state, and estimating the likelihood of pay-out is all part of the challenge. This is something which is particularly apparent in our own research in computational psychology here at DCU.
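One common way this reward-driven policy change is realised is by nudging a preference for each action towards the rewards it actually produced. The sketch below is illustrative only (the actions and the learning rate are made up), but it shows the mechanism: a rewarding outcome raises the preference for the action that caused it.

```python
# Hedged sketch: incremental update of action preferences from rewards.
# Action names and the learning rate are invented for illustration.
preferences = {"seek_fare": 0.0, "go_to_charger": 0.0}
alpha = 0.1  # learning rate: how strongly one reward shifts the preference

def update(action, reward):
    """Move the preference for `action` a small step towards the observed reward."""
    preferences[action] += alpha * (reward - preferences[action])

# A positive reward makes this action more attractive next time
# the same situation arises.
update("seek_fare", 1.0)
```

Because rewards may be stochastic, the small step size averages over many noisy pay-outs rather than trusting any single one.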

Finally, in reinforcement learning the agent strives to maximise accumulated reward; this defines the goal of the reinforcement learning agent. I have emphasised the word “strives” because some global optimum may never be reached; it is the trying which is the important thing. Consequently, reinforcement learning plays a long game for most interesting problems: it is not the immediate reward that matters, but the overall long-term reward. This brings us towards the idea of a value function.

Value function

The value function is our long-sighted entity in reinforcement learning. The reward signal is the immediate outcome of an action in any given environmental state. The “value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state” (emphasis added). So let’s take an analogy most people can relate to: vigorous exercise! Sometimes when I go swimming, I may initially hate the experience of stepping into the water. There is the huge drop in temperature, it’s wet and it’s a shock to the system. In terms of a reward signal for me, this is a big negative score. However, after acclimatising, the experience of swimming becomes more enjoyable. Once I get out of the water, get dressed and leave the pool, I feel great. The new state I have entered is wonderful: I’m relaxed and nicely tired. It even relieves any sense of concern that I have not been doing enough exercise, and I know there are longer-term benefits too, such as being fitter and less stiff. To break this down, the initial state of jumping into the pool has a low immediate reward signal, yet the value of that state for me (the agent) is considerable. The value function takes into account the longer-term desirability of the states that come after that initial (poorly rewarding) state.
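The swimming analogy can be made concrete with a (discounted) return: the accumulated future reward from a state. The reward numbers below are invented purely to mirror the story of a painful entry followed by increasingly pleasant states.

```python
# Sketch: the discounted return from a state is the sum of future rewards,
# each weighted by gamma**k for the k-th step ahead.
def discounted_return(rewards, gamma=0.9):
    """Accumulate rewards backwards: G_t = r_t + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Made-up rewards: shock of entering the pool, then increasingly nice states.
swim = [-5.0, 2.0, 3.0, 4.0]
print(discounted_return(swim))  # positive overall, despite the painful start
```

Even though the first reward is strongly negative, the return from that starting state is positive, which is exactly why the *value* of jumping in can be high while its immediate *reward* is low.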

Values are incredibly important, and the estimation of values based on observation of experiences is at the very heart of reinforcement learning. Determining values is hard. Determining rewards is not a problem at all, as they are offered up by the environment in response to action. Determining how things will play out from a given state as a function of policy is the real challenge. So, remember this: value estimation (predicting the future, if you like!) is the most significant challenge in reinforcement learning.
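To give a flavour of how values can be estimated from experience, here is a hedged sketch of one-step temporal-difference updates (a standard technique, though not one this article names). The states and rewards are toy data built around the swimming analogy:

```python
# Sketch of estimating state values from repeated experience.
# Each update drags V(state) towards: reward + gamma * V(next_state).
# State names and rewards are invented for illustration.
values = {}
alpha, gamma = 0.1, 0.9

def td_update(state, reward, next_state):
    v = values.get(state, 0.0)
    v_next = values.get(next_state, 0.0)
    values[state] = v + alpha * (reward + gamma * v_next - v)

# Re-living the same experience many times refines the estimates.
for _ in range(100):
    td_update("pool_entry", -5.0, "swimming")   # unpleasant entry
    td_update("swimming", 3.0, "after_swim")    # enjoyable once acclimatised
    td_update("after_swim", 4.0, "done")        # feeling great afterwards
```

Notice that nothing here predicts the future directly; the estimates emerge from bootstrapping each state's value off the value of the state that follows it.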

Finally, to help in estimating values and planning, we need a model of the environment in order to figure out what-if scenarios.

Model of the environment

A model of the environment is something a reinforcement learning agent can use to simulate the environment as policies are developed. Models are used for planning. An obvious analogy is a map. Before you choose a route to drive somewhere, you use a map to look at routes, fuel stops along the way and driving conditions. You use your map to plan. Similarly, models of the environment are used for planning, e.g. predicting the next state and next reward for a given action in a current state. In short, models are used for predicting the future. Reinforcement learning which adopts such models is, unsurprisingly, called a model-based approach. In contrast, a purely trial-and-error approach (which is often how I navigate anywhere when driving) is considered model-free. Also, as you might have guessed, an increasingly common approach is to start model-free and, through trial and error, develop a model which is refined, improved and used for planning.
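A model in this sense can be as simple as a table predicting, for each state and action, the next state and reward. The sketch below uses a hand-written table and one-step lookahead to "plan"; every state, action and number in it is invented for illustration.

```python
# Illustrative model of the environment: (state, action) -> (next_state, reward).
# All entries are made up; a learned model would be filled in from experience.
model = {
    ("low_battery", "seek_fare"): ("stranded", -10.0),
    ("low_battery", "charge"): ("full_battery", -1.0),
}

def plan(state, actions):
    """One-step lookahead: pick the action whose predicted reward is highest."""
    return max(actions, key=lambda a: model[(state, a)][1])

print(plan("low_battery", ["seek_fare", "charge"]))  # charge
```

A model-free agent would have to discover the cost of being stranded by actually getting stranded; the model lets the agent reject that action without ever trying it.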

© Dublin City University
This article is from the free online course Reinforcement Learning, created by FutureLearn - Learning For Life.
