Rewards focus on the immediate context while value functions can focus on the long term. For instance, an action can have a low immediate reward while in long term can have a high value.

Reward vs. Value Function

Rewards are very important as they are used for estimating value. Estimating rewards are relatively easy compared to calculating values which can be quite challenging. The values are calculated after each time step and after each time step it is important to receive highest value and not the highest reward. 

"While return gives the expected discounted sum of rewards for one episode,
a value function gives the expected discounted sum of rewards from a certain state"

 Rewards, Returns and Value functions

For traditional Q-learning method, there might be no need to train a reward function. For these methods, a reward form is kept to record the reward for every (state, action) pair.  However, to deal with the curse of the dimensionality, a function is provided instead of the form. We try to use the targeted Q value used in the updating process of Q-learning as the tag to train a predict network for that. That's exactly why function approximation is needed. If there exists a well-formed policy to search through (action, state) space to get the best action or fairly small (action, state) space, there is no need to do a function approximation.

Why Function Approximation is Needed?

It's very important to understand how we define a basic reward function in reinforcement learning and its principia mathematica. The basic intuition of reward fucntion in reinforcement learning is the Bellman Equation, which describes the expected reward. And we want to maximize the expected reward. 
The Bellman Equation is:
$v(s) = E[R_{t+1}+\lambda v(S_{t+1})|S_t = s]$

Bellman Equation

The reward function formally describes the feedback an agent receives from the environment, often denoted as $$R$$. Specifically, $$r(s, a, s')$$ represents the reward for taking action $$a$$ in state $$s$$ and transitioning to the next state $$s'$$. For a sequence of state-action pairs, the reward at a specific time step $$t$$ is written as $$r_t = r(s_t, a_t, s_{t+1})$$. In deterministic decision-making processes, where the next state $$s_{t+1}$$ is entirely determined by the current state $$s_t$$ and action $$a_t$$, the notation simplifies to $$r(s_t, a_t)$$.

Reward Function in Reinforcement Learning

In many Natural Language Processing (NLP) applications, such as machine translation, rewards are often sparse. This means that the agent receives a non-zero reward signal only after completing an entire sequence, like generating a full sentence. For all intermediate steps (e.g., generating individual words), the reward is zero ($$r_t = 0$$ for $$t < T$$), which can make learning challenging.

Sparse Rewards in NLP

In the general framework of reinforcement learning, reward models hold a critical role as they establish the foundation upon which value functions are computed. The estimations from a reward model are essential for calculating the long-term value associated with particular states or actions.

Reward Models as the Basis for Value Functions

An autonomous agent is being trained to navigate a maze and reach a specific exit. The agent receives a small negative feedback signal (-0.1) for every step it takes and a large positive feedback signal (+100) only when it reaches the correct exit. The agent's goal is to maximize its total feedback score. Given this feedback structure, what is the most likely reason the agent might fail to learn to solve the maze, even after many attempts?

Two different feedback signal strategies are proposed for training a customer service chatbot whose goal is to resolve user issues efficiently. Evaluate these two strategies. Which one is more likely to result in a helpful and efficient chatbot? Justify your answer by explaining the potential unintended behaviors the less effective strategy might encourage in the agent.

Evaluating Reward Structures for a Chatbot

Imagine you are training a robot dog to fetch a ball and bring it back. The robot can perform actions like 'move forward', 'turn left', 'turn right', 'pick up ball', and 'drop ball'. Describe a simple reward system you could design to teach the robot this task. Specify at least one positive reward and one negative reward (or penalty) and explain the conditions under which the robot would receive them.

Designing a Reward System for a Robot Dog

In reinforcement learning, a reward is a signal sent from the environment to the agent that provides feedback on an action's success. This feedback, which can be positive or negative, guides the agent's learning process by indicating the desirability of the actions taken. The agent's primary objective is to modify its policy over time to maximize the cumulative reward it receives.

University of Michigan - Ann Arbor

Google

San Diego State University

The fundamental concepts that form the basis of reinforcement learning are:

 - Agent
 - Environment
 - Action
 - State
 - Reward

The agent learns through repeated interaction with the environment. To be successful, the agent needs to:
 - Learn the interaction between states, actions, and subsequent rewards.
 - Determine which action will provide the optimal outcome.

Fundamental Concepts for Reinforcement Learning

Here is a useful website for starters to understand reinforcement learning: https://missinglink.ai/guides/deep-learning-healthcare/tensorflow-reinforcement-learning-introduction-and-hands-on-tutorial/

Useful Website for Reinforcement learning

Reference of Foundations of Large Language Models Course

In reinforcement learning, the environment encompasses everything external to the agent with which it interacts. It processes the agent's action in its current state and responds by providing a reward and transitioning the agent to a new state. For example, in a physical system, the environment could be the laws of physics. From the agent's perspective, the environment often functions as a black box.

Environment in Reinforcement Learning

In reinforcement learning, a state (s) represents the current situation or configuration of the environment. It is a snapshot of all relevant information at a specific moment that the agent uses to make a decision. This information can include the agent's position, the status of other entities, and any other data that defines the current circumstances.

State in Reinforcement Learning

In reinforcement learning, the agent is the component that functions as the learner or decision-maker. It interacts with an environment by perceiving its state, performing actions, and learning from the resulting feedback. For instance, an agent could be a robot navigating a path or a trading algorithm making financial decisions. In the context of Large Language Models (LLMs), the LLM itself often serves as the agent.

Agent in Reinforcement Learning

An action is the set of all possible moves the agent can make. Agents choose from a list of possible actions. In video games, for example, the list might include: running right or left, jumping high or low, crouching or standing still.

Action in Reinforcement Learning

Reward in Reinforcement Learning

Here is a free pdf version of a book about Reinforcement Learning:
http://incompleteideas.net/book/RLbook2020.pdf

Useful Book for Reinforcement Learning

https://cims.nyu.edu/~donev/Teaching/WrittenOral/Projects/XintianHan-WrittenAndOral.pdf
https://www.davidsilver.uk/teaching/     
(David Silver's tutorial of reinforcement learning)
http://rail.eecs.berkeley.edu/deeprlcourse/
https://katefvision.github.io/
http://incompleteideas.net/rlai.cs.ualberta.ca/RLAI/RLAIcourse/RLAIcourse2006.html
http://web.stanford.edu/class/cs234/schedule.html

Useful Tutorials about Math behind Reinforcement Learning

Reinforcement learning is a general approach to solve reward-based problem. The building of it is based on mathematical foundations. We can also view it as a sequential decision making and control problem mathematically. For example, when driving a car, we need to make decisions about turning left/right or going straight after the previous decisions. And the key ingredient in reinforcement learning is Markov Decision Process.

Math Behind Reinforcement Learning

It is very challenging in the reinforcement learning to find a perfect balance between using previously used effective action(**exploitation**) and exploring new ones (**exploration**).

In order to solve this problem, simple $\epsilon$-Greedy algorithm can be used, in which we with some small probability $\epsilon$ we would select action randomly (**exploration**) and with $1 - \epsilon$ probability use previously selected effective action(**exploitation**).  

Exploration/Exploitation trade-off

Classification of Reinforcement Learning Methods

On-policy methods attempt to evaluate or improve the policy that is used to make decisions. In contrast, off-policy methods evaluate or improve a policy different from that used to generate the data. Classicla on-policy methods include Sarsa Algorithm. As a comparison, Q-learning method is a classical off-policy method.

On-policy vs Off-policy

Compared to value based methods, actor-critic methods focus on modeling the probability distribution of policies. For value-based methods, the logic is more like a greedy algorith, we want to maximize the value function for every step. As a result, we need to search through the action space to get the best action. But in actor-critic method, they are like using actor to perform actions based on the probability distribution of policies and using critics to judge the performance and adjust the probability distribution. They don't directly always choose the best action. 

Actor-Critic Methods

van Hasselt, H., Guez, A., & Silver, D. (2016). Deep Reinforcement Learning with Double Q-Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 30(1). Retrieved from https://paperswithcode.com/paper/deep-reinforcement-learning-with-double-q

Deep Reinforcement Learning with Double Q-learning

One of the most popular reinforcement learning algorithms, known to sometimes learn unrealistically high action values since it includes a maximization step that tends to prefer overestimated values.

Q-learning

Borges, Alexandre and Arlindo L. Oliveira. “Combining Off and On-Policy Training in Model-Based Reinforcement Learning.” ArXiv abs/2102.12194 (2021): n. pag.
https://arxiv.org/pdf/2102.12194.pdf

Combining Off and On-Policy Training in Model-Based
Reinforcement Learning

MuZero is a model-based reinforcement learning method which contains a representation function, a dynamics function, and a prediction function.The representation function encodes an observation sequence into a hidden state. The dynamics function evaluates the current hidden state and a potential action to determine the potential next hidden state and a reward value. Lastly, the prediction function is the policy generation function. They use similar parameters and are optimized according to a combined loss function.

A Monte Carlo Tree Search (MCTS) is often used when training to evaluate action trees from the current/ root state.

MuZero

In a reinforcement learning framework, the process of training a Large Language Model (LLM) iteratively evaluates and improves the model's policy. At each step $$t$$, the current state $$s_t$$ is defined by the initial input prompt $$\mathbf{x}$$ and the tokens generated so far $$\mathbf{y}_{<t}$$. The LLM acts as the policy, denoted by the predicted distribution $$\Pr(y_t|\mathbf{x},\mathbf{y}_{<t})$$, to choose an action $$a_t$$, which is the next token $$y_t$$. After $$y_t$$ is predicted, a reward model evaluates the sequence $$(\mathbf{x},\mathbf{y}_{<t}, y_t)$$ to determine how well it aligns with the desired textual outcome. This evaluation produces reward scores that are then used to compute the value functions $$V(s_t)$$ and $$Q(s_t,a_t)$$. Finally, these value functions provide the necessary feedback to guide the subsequent training and refinement of the LLM's policy.

Reinforcement Learning Process for LLMs

A robot is being trained to find the exit in a maze. At any given position, the robot can choose to move forward, turn left, or turn right. The system keeps track of the robot's current coordinates within the maze. The robot's objective is to reach the exit as efficiently as possible. It receives a large positive signal for reaching the exit, a large negative signal for hitting a wall, and a small negative signal for each step it takes. Based on this scenario, identify the five fundamental components of this learning system and briefly explain the specific role of each.

Analyzing a Learning System

A robot is being trained to navigate a maze to find a piece of cheese. Analyze this scenario by matching each element of the training process to its corresponding fundamental concept.

The core of reinforcement learning is the interaction between an agent and a dynamic environment, which is modeled as a sequential process. At every time step, the agent assesses the environment's current state and uses its policy to select an action. After executing the action, the environment provides feedback consisting of a reward and a new state. This cycle of observing, acting, and receiving feedback continues until the agent accomplishes its objective.

Agent-Environment Interaction Loop in Reinforcement Learning

A cat is learning to use a new automated feeder that dispenses food when a lever is pressed. Initially, the cat paws at the lever randomly. After several attempts, it presses the lever and food is dispensed. The cat begins to press the lever more frequently. Which of the following statements best analyzes the relationship between the core components in this learning scenario?

Learn Before

Related

Learn After