Reinforcement Learning

Understanding model-based RL, model-free RL, and policy gradient methods.

MDP Based RL

All reinforcement learning setups have an agent and an environment. The agent will observe the state of the environment and take actions. It will then receive some scalar reward and transition to the next state. This will happen over and over again until it reaches some terminal state. A simple example is chess. The current position of all the pieces is the state and when the agent plays a move it moves to a new state. Unlike chess, real world environments are complex and difficult to model. But to build an initial understanding of RL methods, we represent them as a simple Markov Decision Process.

The Markov assumption means that our current state is a sufficient statistic of the history of the environment. In the chess example, the agent does not need to know the previous states to decide what move to make in the current state. We represent such processes in a tuple form $(\mathcal{A}, \mathcal{S}, R, P, \gamma)$ where:

$\mathcal{A}$ is a set of all possible actions the agent can take.
$\mathcal{S}$ is a set of all possible states in the environment.
$R(s)$ is the reward function and determines the reward for being in a state $s.$
$P(s'|s, a)$ is the transition function and determines the probability of going to state $s'$ if you take action $a$ in state $s$ .
$\gamma$ is the discount factor. As we progress through time steps $\sum_{i=0}^{T}t_i$ we discount rewards from states reached in the future by $\gamma^{t_i}$ . This is important for some of our methods to converge and also accounts for the uncertainty of future states.

We define $\pi$ to be the policy of our agent. It is a function that maps from state space to action space. The policy can be deterministic where $\pi(s) = a$ gives us the action to take in that state. It can also be stochastic where $\pi(a|s)$ is a distribution over all actions that the agent takes in the state. We define the return $G_t$ as the reward a model gets across one episode or rollout.

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

Our goal with RL algorithms is to find a policy that maximizes this reward.

Policy Evaluation

Before we can optimize our policy, we need methods to evaluate it. The first method we use for this is the state value function. It is the expected reward for an episode where you start at $s$ and follow the policy $\pi$ . It is essentially the total future reward you expect to collect if you begin in that state and keep acting according to the policy. So a higher value means the state is generally more desirable because it leads to better long-term outcomes.

V^{\pi}(s) = \mathbb{E}_{\pi}[G_t | s_0 = s]

Similarly the state action value function is the expected reward for an episode where you start at $s$ , take action $a$ and then follow the policy $\pi$ . It essentially tells you how valuable is a specific action in the given state in terms of the future rewards it promises.

Q^{\pi}(s,a) = \mathbb{E}_{\pi}[G_t | s_0 = s, a_0 = a]

Combining the above functions, an advantage function is the difference between a state action value function and the state value function. It essentially tells you how much better is taking an action $a$ in state $s$ compared to the expected action based on policy $\pi$ .

A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s)

These three functions give us a good means of evaluating our policies. Richard Bellman's work on dynamic programming gave us a better way to express the above functions as a recursive expression. The value of any state is the immediate reward you get along with a discounted sum of rewards from future states under the current policy. This is called the Bellman Expectation.

V^{\pi}(s) = \sum_a \pi(a|s)[R(s) + \gamma \sum_{s'}P(s'|s,a) V^{\pi}(s')]

Q^{\pi}(s,a) = R(s) + \gamma \sum_{s'}P(s'|s,a) \sum_{a'}\pi(a'|s')Q^{\pi}(s',a')

Policy Improvement

Now that we can evaluate our policies, the next step is improving them. There are two popular methods here that converge to the same answer. Let's understand them!

Policy Iteration: Here we initialize a random policy and then set the value $V^{\pi}(s) = 0$ for all states in the MDP. We then have an outer loop with two steps within it. The first inner loop is policy evaluation where we use the bellman expectation to recursively update the value function of each state based on observed rewards. We do this until convergence. The second step is policy improvement where based on the new value functions, we change the policy to pick the action which transitions to the most desirable state. We then repeat the full outer loop by evaluating the new policy and improving until we reach convergence on the final policy.

V(s) \leftarrow 0

for all

s \in \mathcal{S}

\;\pi(s) \leftarrow

arbitrary

Repeat until

\pi

converges

Evaluate: repeat until

\Delta < \epsilon

V(s) \leftarrow \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)\big[R(s') + \gamma\, V(s')\big]

for all

s

Improve: for each

s \in \mathcal{S}

\pi(s) \leftarrow \arg\max_a \sum_{s'} P(s'|s,a)\big[R(s') + \gamma\, V(s')\big]

Value iteration: It is a consolidation of the evaluate and improve steps from policy iteration into one. We still initialize all $V^{\pi}(s) = 0$ and then we apply the bellman optimality update at each state. This update simply means picking the action that maximizes the value of the state. We do this until convergence such that our value functions stop changing and then write the optimal policy $\pi^*$ that just takes the action with maximum reward in each state. I'm keeping this section of the notes brief but you can find the proof of why repeatedly applying the bellman operator converges to optimal policy here.

V(s) \leftarrow 0

for all

s \in \mathcal{S}

Repeat until

\Delta < \epsilon

V(s) \leftarrow \max_a \sum_{s'} P(s'|s,a)\big[R(s') + \gamma\, V(s')\big]

for all

s

Extract policy

\pi^*(s) = \arg\max_a \sum_{s'} P(s'|s,a)\big[R(s') + \gamma\, V(s')\big]

Here is an example to understand how policy and value iteration work. This is a 8x8 grid world where we have a reward cell with +1 reward and a trap cell with -1 reward. All other states have no rewards. The number in the cell is its value function and the arrow indicates what action will the current policy take in that state. You can step through both the methods and see how things change at each iteration step. You can also change the discount factor or move the reward/trap cell to try variations of the problem.

MDP Grid World

PI · Evaluation sweep 0

Click the reward or trap cell to pick it up, then click any empty cell to move it.

Model Free RL

While model based RL is a good foundation for understanding concepts, it does not reflect the real world environments we encounter. It is hard to know the true reward or transition function for a robot navigating the world or an LLM generating text. We solve this gap with model free RL techniques. Let's first understand how we do model free evaluation.

Model Free Evaluation

In the model based environment we could use the Bellman expectation to backup our state value functions iteratively until they converged. Our term included an expectation over actions $a$ and transition states $s'$ which was simple to calculate with known model dynamics. In the model free environments, we try and estimate those expectations with two popular methods.

V^{\pi}(s) = \sum_a \pi(a|s)\left[R(s) + \gamma \sum_{s'}P(s'|s,a) V^{\pi}(s')\right]

Monte Carlo Estimation: Here we let the agent explore the environment, observe rewards and complete trajectories. We then observe the empirical reward from each trajectory and use that as an estimate of the value function for each state. Since we do this over many trajectories we write an update rule in gradient descent style.

V^{\pi}(s) \leftarrow V^{\pi}(s) + \left[G_t - V^{\pi}(s)\right]

$G_t$ here is the rewards collected from each state until terminal state. It is an unbiased estimation since you are collecting true values and over time it converges to the true value function. It is still a high variance estimate given the randomness that comes from stochastic policies and transition dynamics in the environment.

Temporal Difference Learning: Instead of sampling full trajectories and using true returns, TD estimation is inspired from the Bellman expression and writes the estimate as the sum of immediate reward the agent observes and estimated reward of the next state it visits.

V^{\pi}(s) \leftarrow V^{\pi}(s) + \left[R(s) + \gamma V(s') - V^{\pi}(s)\right]

This is a low variance estimate of the value function since we don't have the randomness from summing of rewards at each state, but is also high bias since we are using an estimate for the next state. These two methods sit on the extreme ends of the bias variance tradeoff curve for estimating state value functions.

Model Free Improvement

The above methods let us evaluate policies. The next step is improving them. The first way to do this in a model free setting is using value based methods. Here we learn the state action value function for each state and use that to determine the optimal policy.

Q learning is an example of value based policy improvement method. We simply initialize a table of the shape states x actions that holds all $Q^{\pi}(s,a)$ values. Our agent then takes actions according to an $\epsilon$ -greedy policy where it picks a random action with an $\epsilon$ probability and otherwise picks the action with maximum $Q^{\pi}(s,a)$ in that state. This policy then steps through the environment and collects rewards. We then use a version of the TD learning target we came up with above to update $Q(s,a)$ with a gradient style learning update.

Q(s,a) \leftarrow Q(s,a) + \alpha \left[R(s) + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]

Intuitively you can think of this as approximating the Bellman optimality update we saw in model based value iteration. Q learning is also off policy since we are not using the actual action $a'$ the agent took in the next step but a $\max_{a'}$ over all actions for that state instead. SARSA is a variant of Q learning that is on policy.

Q_table = np.zeros((num_states, num_actions), dtype=np.float32)
 
def pick_action(observation: int, q_values: np.ndarray, epsilon: float) -> int:
    if env.np_random.random() < epsilon:
        return int(env.np_random.integers(0, num_actions))
    return int(np.argmax(q_values[observation]))
 
def train():
 
    epsilon = INITIAL_EPSILON
    for _ in range(NUM_EPISODES):
        state, _ = env.reset()
        stop = False
 
        while not stop:
            action = pick_action(state, Q_table, epsilon)
            next_state, reward, terminated, truncated, _ = env.step(action)
            stop = terminated or truncated
            TD_target = float(reward) + DISCOUNT_FACTOR * (1 - terminated) * np.max(Q_table[next_state])
            Q_table[state, action] = Q_table[state, action] + LEARNING_RATE * (TD_target - Q_table[state, action])
            state = next_state
        epsilon = max(FINAL_EPSILON, epsilon * EPSILON_DECAY)
 
    return Q_table

I have implemented Q learning with a helper function that samples epsilon greedy actions. In the main training loop we run each episode until termination/truncation and do the update step. At the end of each episode we decay epsilon as the policy improves.

slippery=False

slippery=True

Q-learning policy trained on the Frozen Lake environment from Gymnasium. On the left is a deterministic environment and on the right is a stochastic environment.

One clear limitation with Q learning is the tabular method for learning state action values. Deep Q Network (DQN) solves this problem using function approximation. This is when we use a neural network (usually an MLP) to replace the table. It takes as input a tensor representing the state and outputs a tensor with value for each action in that state. We also maintain an exponential moving average copy of the $Q_\theta$ network we initialize as $Q_{\theta^-}$ to serve as a target network. The update step then looks similar to Q learning. We sample $(s,a,r,s')$ from the environment and then do a squared loss error between what our network predicts and what the target network predicts.

\mathcal{L}(\theta) = \left(R(s) + \gamma \max_{a'} Q(s',a';\theta^-) - Q(s,a;\theta)\right)^2

\theta \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}(\theta)

One issue that causes DQN to fail is if we sample $(s,a,r,s')$ tuples from the same agent trajectory then there is high correlation between each sample which breaks the IID assumption for our optimizers and causes the network to overfit to that specific region of the environment. We solve this by using a replay buffer where we store all our tuples as the agent explores the environment and we then randomly sample batches from it for the training.

def train_step(q_net, target_net, optimizer, buffer):
    states, actions, rewards, next_states, dones = buffer.sample(BATCH_SIZE)
 
    states = torch.tensor(states, dtype=torch.float32, device=device)
    actions = torch.tensor(actions, dtype=torch.int64, device=device).unsqueeze(1)
    rewards = torch.tensor(rewards, dtype=torch.float32, device=device).unsqueeze(1)
    next_states = torch.tensor(next_states, dtype=torch.float32, device=device)
    dones = torch.tensor(dones, dtype=torch.float32, device=device).unsqueeze(1)
 
    q_values = q_net(states).gather(1, actions)
 
    with torch.no_grad():
        max_next_q_values = target_net(next_states).max(1, keepdim=True)[0]
 
    target = rewards + DISCOUNT_FACTOR * (1 - dones) * max_next_q_values
    loss = nn.functional.mse_loss(q_values, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
 
def train():
 
    q_net = Q_network(observation_dim, action_dim, hidden_dim=128).to(device)
    target_net = Q_network(observation_dim, action_dim, hidden_dim=128).to(device)
    target_net.load_state_dict(q_net.state_dict())
    optimizer = torch.optim.Adam(q_net.parameters(), lr=LEARNING_RATE)
    buffer = ReplayBuffer(BUFFER_CAPACITY)
    epsilon = INITIAL_EPSILON
    global_step = 0
 
    for _ in range(NUM_EPISODES):
        state, _ = env.reset()
        stop = False
 
        while not stop:
            action = pick_action(state, q_net, epsilon)
            next_state, reward, terminated, truncated, _ = env.step(action)
            buffer.push(state, action, reward, next_state, terminated)
            state = next_state
            global_step += 1
            stop = terminated or truncated
 
            if len(buffer) > MIN_BUFFER_SIZE:
                train_step(q_net, target_net, optimizer, buffer)
 
            if global_step % 5000 == 0:
                target_net.load_state_dict(q_net.state_dict())
 
        epsilon = max(FINAL_EPSILON, EPSILON_DECAY * epsilon)
 
    return q_net

I've implemented DQN with helper functions for picking action similar to previous Q learning one. We define a buffer class we can push to and sample from and our critic and target critic network. For each epsiode we push tuples to the buffer and once we have minimum capacity we sample from it, calculate loss and do the upate step with an optimizer. Our target syncs with main critic network every 5000 steps.

While these value based methods do work you can imagine how hard it is to scale them to LLMs. An essential component is the $\arg\max_{a'}$ which is not feasible over a full vocabulary size. The off policy nature of these techniques also makes training high dimensional policies like LLMs harder. All of these limitations led to another set of model free RL improvement methods called Policy Gradients that have dominated frontier research in the field today. The next section dives deeper into them.

Policy Gradients

One approach to learning the optimal policy in a model free environment is to parameterize the policy itself and use the return it produces as an objective function to climb. This method is called policy gradients. Let our policy be parameterized by $\theta$ . We can then write our objective function over trajectories $\tau$ drawn from the policy $\pi_\theta$ as:

J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \sum_{\tau} P(\tau \mid \pi_\theta)\, R(\tau)

We can use the log trick here which gives us:

\nabla_\theta P(\tau \mid \pi_\theta) = P(\tau \mid \pi_\theta) \cdot \nabla_\theta \log P(\tau \mid \pi_\theta)

We can then use the above expression to rewrite our gradient of an expectation as an expectation over gradients. This gives us the simplest form of the policy gradient. Conceptually you can think of this as scaling the log probability of the policy taking an action $a$ in state $s$ based on the reward from that trajectory.

\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot R(\tau)\right]

Variations

While you cannot do much with the log probabilities observed from the policy, you can change the second reward term to get different variations of policy gradients. To express this better we can write the standard form of the policy gradient method where we can change the value of $\Psi_t$ to vary how we scale our log probabilities.

\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot \Psi_t\right]

The simplest form is Vanilla REINFORCE where $\Psi_t = R(\tau)$ and we scale the log probability of each action based on how much reward the entire trajectory gets. In practice we will initialize some random policy and collect trajectories. We then observe the reward and then do gradient ascent update based on the above expression.

\theta \leftarrow \theta + \alpha \,\mathbb{E}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot \Psi_t\right]

It is quite obvious that scaling each action's logprobs with the trajectory reward is not the best credit assignment. We can instead use return-to-go where $\Psi_t = G_t$ is the discounted return from time $t$ onward,

G_t = \sum_{k=t}^{T} \gamma^{k-t} r_{k+1}

such that we are scaling each action based on the reward we get in the trajectory after it. Both of these methods still have high variance in the scaling factor because we are essentially doing a Monte Carlo estimate with randomness involved.

We can solve the variance problem by subtracting a baseline from our reward scaling term. A baseline is some function of the state, and subtracting it leaves our gradient unbiased; you can see this in a proof here. The value function of a state is a natural baseline, which leads to the advantage function from model-based RL:

\Psi_t = A^{\pi}(s_t, a_t)

The advantage function is now a mainstay in policy gradient methods and all new methods are some new way of approximating the advantage for a given state and action.

class Policy(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim):
        super().__init__()
        self.weights = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(), 
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(), 
            nn.Linear(hidden_dim, action_dim)
        )
 
    def forward(self, input):
        return self.weights(input)
 
def pick_action(state, policy):
    state = torch.as_tensor(state, device=device, dtype=torch.float32).unsqueeze(0)
    logits = policy(state)
    distribution = Categorical(logits=logits)
    action = distribution.sample()
    log_prob = distribution.log_prob(action)
    return action.item(), log_prob
 
def discounted_returns(returns, gamma):
    result = []
    moving_return = 0.0
    for reward in reversed(returns):
        moving_return = reward + gamma * moving_return
        result.append(moving_return)
    result.reverse()
    return result 
 
def train():
    policy_network = Policy(observation_dim, action_dim, hidden_dim=HIDDEN_DIM).to(device)
    optimizer = torch.optim.Adam(policy_network.parameters(), lr=LEARNING_RATE)
    for _ in range(NUM_EPISODES):
        state, _ = env.reset()
        stop = False 
        rewards = []
        log_probs = []
        while not stop: 
            action, log_prob = pick_action(state, policy_network)
            next_state, reward, terminated, truncated, _ = env.step(action)
            stop = terminated or truncated 
            state = next_state
            rewards.append(reward)
            log_probs.append(log_prob)
        returns = discounted_returns(rewards, DISCOUNT_FACTOR)
        returns_t = torch.tensor(returns, device=device, dtype=torch.float32)
        returns_t = (returns_t - returns_t.mean()) / (returns_t.std(unbiased=False) + 1e-8)
        log_probs_t = torch.stack(log_probs).squeeze(-1)
        loss = -(log_probs_t * returns_t).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
 
    return policy_network

This is a simple REINFORCE implementation. We initialize a policy that outputs logits over action space. We use a helper function to sample an action and its logit. In the main train loop we then step through the environment and based on observed reward and action backpropogate using the policy gradient loss.

Vanilla REINFORCE

REINFORCE + reward-to-go

REINFORCE + baseline

Here is a comparison of the three variations discussed above on the CartPole environment from gymanisum. The goal for the agent is to keep the pole stable. We see that simple REINFORCE fails. Reward to go makes the policy perform error-free and using an advantage baseline makes it even smoother.

Actor–critic methods

The use of the advantage function led to actor–critic policy gradient methods. Here we maintain $\pi_\theta$ as an actor policy that produces actions and $V_\phi$ as a critic that learns the state value function. But how do we calculate the advantage? One option is to use the Monte Carlo estimate, which is a true sample of $Q^\pi(s,a)$ with no bias but high variance:

\Psi_t = G_t - V^{\pi}(s_t) = A^{\pi}(s_t, a_t)

Another option, given that we have a learnt state value function, is a TD learning estimate. This is low variance but can carry more bias:

\Psi_t = r_t + \gamma V^{\pi}(s_{t+1}) - V^{\pi}(s_t)

Generalized Advantage Estimation

Schulman et al. introduced Generalized Advantage Estimation (GAE) to better navigate this bias–variance tradeoff. The idea is to not pick between the two extremes and instead define a knob for how many steps of observed rewards you include in your advantage estimate before bootstrapping with the value function:

A^{t}(n) = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^{n} V_{\phi}(s_{t+n}) - V_{\phi}(s_t)

If $n = 1$ this is a simple TD estimate; if it runs until the end of the trajectory it is a Monte Carlo estimate. Instead of sticking to one value of $n$ , GAE uses an exponential moving average of all such $n$ -step advantages using the TD-error representation with a geometric weight decay you can tune. With $\lambda \approx 0$ you lean toward a TD estimate, and with $\lambda \approx 1$ toward Monte Carlo:

\hat{A}^{\text{GAE}}_t(\gamma, \lambda) = \sum_{l=0}^{\infty} (\gamma \lambda)^l \, \delta_{t+l}, \quad \delta_t = r_t + \gamma V_{\phi}(s_{t+1}) - V_{\phi}(s_t)

We can then use GAE to train actor–critic policy gradients. For the actor policy we use the standard policy gradient objective with $\Psi_t = \hat{A}_t^{\text{GAE}}$ . For the critic network we use a simple regression loss between the current estimate and a target:

L_{\text{critic}} = \bigl(V_\phi(s_t) - V_t^{\text{target}}\bigr)^2 \qquad V_t^{\text{target}} = \hat{A}_t^{\text{GAE}} + V_\phi(s_t)

In practice the actor and critic can be two different networks or the same network with a separate value head. You can also use an ensemble of critics for a better estimate of state values. There are tons of variations on these methods, but the next few sections in these notes will discuss some popular ones where I'll dive deeper into implementation. Let's start with PPO in the next section!

PPO

There are two problems with vanilla policy gradient algorithms from the previous section.

They are data inefficient. We do some rollouts with the current $\pi_\theta$ and then use them to update the policy and then generate new rollouts after each update step.
They have a stability issue. If we take a large update step such that the new policy is considerably different from the old one and ends up in a bad region it may get poor data and struggle to recover. Essentially a single bad update can send the policy off a cliff.

Several new methods and changes were proposed to deal with these issues with TRPO making important contributions. Proximal Policy Optimization (PPO) was motivated by the same problem of maximizing improvement per step while preventing performance collapse. It used first order methods with simple implementation with a few more tricks to make the first scalable method that would negotiate this tradeoff and was widely adopted across tasks.

How does PPO work?

The PPO algorithm makes two key changes to the base policy gradient algorithm to address the aforementioned limitations. Let's understand them better.

Surrogate Objective

If we collect rollouts from some policy $\pi_{\theta_{\text{old}}}$ we would normally do a gradient update with that minibatch to get an updated policy $\pi_{\theta_{\text{new}}}$ and then discard the data. We can't use the same minibatch to train the updated policy because our critic estimates have changed and the logprobs that the new policy assigns are different from what the old one did. We can solve this issue using importance sampling. If we want to estimate expectation over one distribution with samples from another distribution, we can use the ratio of the probabilities assigned to the same variable state by the two distributions to resolve the discrepancy.

\mathbb{E}_{x \sim p}[f(x)] = \mathbb{E}_{x \sim q}\left[\frac{p(x)}{q(x)} f(x)\right]

In this example, if we have samples from distribution $q$ we can still estimate expectation under $p$ by multiplying the probability ratio of $p$ to $q$ . Similarly, if we have samples from an old policy but want to train a new policy we can use this method to weight logprobs such that it is representative of how the new policy would produce them.

L(\theta) = \mathbb{E}_{(s,a) \sim \pi_{\theta_{\text{old}}}}\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \cdot A_t\right]

This gives us the surrogate objective where the ratio tells us how likely the new policy is to take action $a_t$ in state $s_t$ compared to the old policy. If both policies are the same the ratio is one and we get our simple policy gradient objective back.

Clipping

While the surrogate objective solves for the distribution mismatch in expectation we still need to deal with large gradient updates that can push policies off the cliff. The idea of trust region was introduced in TRPO that meant constraining the new policy within a certain KL divergence of the old policy. But the constrained optimization method in TRPO was complex and computationally expensive. PPO introduced a simple alternative to the KL constraint with a clipped objective function where $\rho_t(\theta)$ represents the importance sampling ratio and $\epsilon$ is a hyperparameter we can tune.

L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min\left( \rho_t(\theta)\, \hat{A}_t,\ \operatorname{clip}\left(\rho_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\right) \hat{A}_t \right) \right]

Based on this clipping condition, whenever the ratio is within the bounds of $1 - \epsilon$ and $1 + \epsilon$ for a sample, we let it contribute to the objective and gradients flow back during backpropagation to update the policy. But if the ratio exceeds that bound—which means the new policy deviates too far from the old one in either direction (increasing or decreasing logprobs)—we clip the sample's contribution to a constant and no gradients flow through it because it has already shifted enough in the new distribution.

We make two more additions to the final objective. First is the regression loss for the critic network that learns the value function, as seen in the policy gradients section. Second is a bonus term: the entropy of the current policy, $H[\pi_{\theta}]$ . This encourages the policy to be more spread out over actions, increasing stochasticity and exploration. That yields the full PPO objective below; the coefficients $c_1$ and $c_2$ weight how much each component contributes to the total objective:

L_{\text{PPO}} = L^{\text{CLIP}} - c_1 L_{\text{VF}} + c_2 \, H[\pi_{\theta}]

Implementation

class ActorCritic(nn.Module):
    def __init__(self, observation_dim, action_dim, hidden_dim):
        super().__init__()
        self.weights = nn.Sequential(
            nn.Linear(observation_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh()
        )
        self.actor_head = nn.Linear(hidden_dim, action_dim)
        self.value_head = nn.Linear(hidden_dim, 1)
 
    def forward(self, input):
        hidden = self.weights(input)
        action = self.actor_head(hidden)
        value = self.value_head(hidden)
        return action, value
 
    def get_action_value(self, observation, action=None):
        logits, value = self.forward(observation)
        distribution = torch.distributions.Categorical(logits=logits)
        entropy = distribution.entropy()
        if action is None:
            action = distribution.sample()
        log_probs = distribution.log_prob(action)
        value = value.squeeze(-1)
        return action, log_probs, entropy, value

This is our Actor Critic Policy with two heads that output action logits and value scalar.

class RolloutBuffer:
    def __init__(self, num_steps, num_envs, observation_dim, action_dim):
        self.observations = torch.zeros(
            (num_steps, num_envs, observation_dim), dtype=torch.float32, device=device
        )
        self.actions = torch.zeros((num_steps, num_envs), dtype=torch.long, device=device)
        self.log_probs = torch.zeros((num_steps, num_envs), dtype=torch.float32, device=device)
        self.rewards = torch.zeros((num_steps, num_envs), dtype=torch.float32, device=device)
        self.dones = torch.zeros((num_steps, num_envs), dtype=torch.float32, device=device)
        self.values = torch.zeros((num_steps, num_envs), dtype=torch.float32, device=device)
        self.advantages = torch.zeros((num_steps, num_envs), dtype=torch.float32, device=device)
        self.returns = torch.zeros((num_steps, num_envs), dtype=torch.float32, device=device)
        self.step = 0
 
    def push(self, observation, action, value, log_probs, rewards, done):
        t = self.step
        self.observations[t] = torch.as_tensor(observation, dtype=torch.float32, device=device)
        self.actions[t] = torch.as_tensor(action, dtype=torch.long, device=device)
        self.values[t] = value.detach()
        self.log_probs[t] = log_probs.detach()
        self.rewards[t] = torch.as_tensor(rewards, dtype=torch.float32, device=device)
        self.dones[t] = torch.as_tensor(done, dtype=torch.float32, device=device)
        self.step += 1
 
    def reset(self):
        self.step = 0
 
    @torch.no_grad()
    def compute_returns_and_advantages(self, last_value, last_done, gamma, gae_lambda):
        last_gae = 0.0
        num_steps = self.rewards.shape[0]
        for t in reversed(range(num_steps)):
            if t == num_steps - 1:
                next_values = last_value
                next_non_terminal = 1.0 - last_done
            else:
                next_values = self.values[t + 1]
                next_non_terminal = 1.0 - self.dones[t]
            delta = self.rewards[t] + gamma * next_values * next_non_terminal - self.values[t]
            last_gae = delta + gamma * gae_lambda * next_non_terminal * last_gae
            self.advantages[t] = last_gae
        self.returns = self.advantages + self.values
 
    def iter_minibatches(self, batch_size):
        n = self.observations.shape[0] * self.observations.shape[1]
        flat = {
            "observations": self.observations.reshape(n, -1),
            "actions": self.actions.reshape(n),
            "log_probs": self.log_probs.reshape(n),
            "advantages": self.advantages.reshape(n),
            "returns": self.returns.reshape(n),
            "values": self.values.reshape(n),
        }
        indices = torch.randperm(n, device=device)
        for start in range(0, n, batch_size):
            idx = indices[start : start + batch_size]
            yield {k: v[idx] for k, v in flat.items()}

This is the Rollout Buffer object where we store our samples across all vectorized environments and total steps. We use the GAE formula to recursively calculate advantage for each action.

def train_step(agent, optimizer, batch):
    _, new_log_prob, entropy, new_value = agent.get_action_value(
        batch["observations"], batch["actions"]
    )
    advantages = batch["advantages"]
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
 
    ratio = torch.exp((new_log_prob - batch["log_probs"]))
    policy_loss = -torch.min(
        ratio * advantages,
        torch.clamp(ratio, 1 - CLIP_EPSILON, 1 + CLIP_EPSILON) * advantages,
    ).mean()
    value_loss = ((new_value - batch["returns"]) ** 2).mean()
    loss = policy_loss + VALUE_LOSS_COEF * value_loss - ENTROPY_COEF * entropy.mean()
 
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(agent.parameters(), MAX_GRAD_NORM)
    optimizer.step()
 
    return loss

This is the training step implementation. We sample action from the new policy for an observation from the buffer and then calculate the importance sampling ratio. We then backporpogate the clipped loss and do an optimizer step.

RLHF

PPO was a significant algorithm improvement but it needs good reward models for optimizing policies. It is simple to design a reward model for games and robots but much harder for language models. One naive way to solve this is having humans label each response based on metrics like length and accuracy. This is not scalable. It is a lot easier to collect human preferences on pairs of responses. We then use this preference data for Reinforcement Learning with Human Feedback. Here is the essence:

We collect rollouts from our model.
For each pair of response $(x, y_1)$ and $(x, y_2)$ for some initial prompt $x$ and responses $y_1$ and $y_2$ we ask a human to rank their preference.
We then train a neural network (reward model) to fit this human preference data such that it can rank responses like humans.

The initial work on RL with human preferences was focused on robotics applications. InstructGPT from OpenAI was the first paper to popularize this method for LLMs where the authors post-trained GPT-3 on instruction following.

Reward Modeling

RLHF is done after the SFT stage in our language model training process. We take the SFT model backbone and swap the final projection layer with a linear layer that outputs scalar scores or rewards. Once we have a trained reward model we use it with RL methods like PPO to further train the main model. The standard reward model implementation is derived from the Bradly-Terry model of preference. It defines the probability of an item $i$ being preferred over $j$ is written as:

P(i > j) = \frac{p_i}{p_i + p_j}

The format of our dataset is usually a prompt $x$ with a preferred response $y_i$ and rejected response $y_j$ in JSONL. Let's assume that doing a forward pass on both gives us scalar output logits $r_i$ and $r_j$ which we can then exponentiate to give us our reparametrized Bradley Terry representation.

P(i > j) = \frac{e^{r_i}}{e^{r_i} + e^{r_j}} = \sigma(r_i - r_j)

You can think of this intuitively as applying the sigmoid to the difference between scores of responses. If the difference is higher our probability $P(i > j)$ is higher. When training the model we can then use the MLE objective which gives us our loss.

\mathcal{L}(\theta) = - \log (\sigma(r_\theta(y_i \mid x) - r_\theta(y_j \mid x)))

This gives us a reward model which is a strong approximation of the true reward model. There are other types of reward models like Outcome and Process Reward Models which are used for reasoning models. Nathan Lambert's RLHF guide is the perfect resource to dive deeper into this!

Direct Preference Optimization

While the above RLHF pipeline was adopted it had some clear issues. You would have to store the reward model in memory along with the main and reference policies. If the reward model was poorly trained it would affect training downstream. Direct Preference Optimization was a new method introduced to solve this problem and eliminate the need for an extra reward model. Let's understand how it works.

In the standard RLHF approach our objective conditioned on a prompt $x$ maximizes the reward (first term) and keeps the policy we train $\pi_\theta$ close to our reference SFT policy $\pi_{\text{ref}}$ with a KL divergence penalty (second term).

\pi^* = \arg\max_{\pi} \mathbb{E}_{y \sim \pi(\cdot \mid x)}[r(x,y)] - \beta\, D_{\mathrm{KL}}(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x))

We can expand the KL divergence term and write the above objective for a fixed prompt over all possible responses $y$ that the model can generate.

J(\pi) = \sum_y \pi(y\mid x)[r(x,y) - \beta \log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x)}]

We now want to find a policy that maximizes the above reward subject to the constraint $\sum_y \pi(y|x) = 1$ since it is a distribution over all possible responses. We can then find a closed form solution to the problem with the following Lagrangian.

L(\pi,\lambda) = \sum_{y} \pi(y\mid x)\left[r(x,y) - \beta\log\frac{\pi(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}\right] - \lambda\left(\sum_{y} \pi(y\mid x) - 1\right)

If we then take a derivative with respect to $\pi(y|x)$ for a fixed $y$ we get the following expression.

r(x,y) - \beta\log\frac{\pi^{*}(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)} - \beta - \lambda

If we then set the derivative to zero and solve we get the closed form solution for our objective. The optimal policy is to take the reference and multiply each response's probability by $\exp(r / \beta)$ and then renormalize.

\pi^*(y \mid x) = \frac{1}{Z(x)} \pi_{\mathrm{ref}}(y \mid x)\exp\left(\frac{1}{\beta} r(x,y)\right)

Z(x) = \sum_y \pi_{\mathrm{ref}}(y \mid x)\exp\left(\frac{1}{\beta} r(x,y)\right)

But this is not useful because to calculate $Z(x)$ we will have to sum over every possible response $y$ which is intractable. We solve this by taking log of both sides of the closed form solution and solving for the reward.

r(x,y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)

We can now use this new representation of the reward function in our Bradley Terry model formula and since we are subtracting the rewards the $\beta \log Z(x)$ term cancels out. We can then parameterize the resulting expression for $P(y_i \succ y_j)$ with an MLE objective to give us our DPO loss expression.

L_{\mathrm{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]

And that is our objective! It is just a supervised loss that depends on four numbers for each sample in our dataset which are the log probabilities of $y_i$ and $y_j$ under our reference policy $\pi_{\text{ref}}$ and the trained policy $\pi_\theta$ . Intuitively the first ratio tells us how much our trained policy favors the accepted response compared to the reference one and the second ratio tells us how much it favors the rejected one. We then try to maximize the difference between the logs of the two terms.

def get_log_probs(model, input_ids, attention_mask, completion_mask):
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    logits = outputs.logits
 
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    shift_mask = completion_mask[:, 1:]
 
    log_probs = F.log_softmax(shift_logits, dim=-1)
    token_log_probs = log_probs.gather(
        dim=-1, index=shift_labels.unsqueeze(-1)
    ).squeeze(-1)
 
    return (token_log_probs * shift_mask).sum(dim=-1)
 
 
def dpo_loss(policy_chosen_logps, policy_rejected_logps, ref_chosen_logps, ref_rejected_logps, beta):
    first_term = policy_chosen_logps - ref_chosen_logps
    second_term = policy_rejected_logps - ref_rejected_logps
    logits = beta * (first_term - second_term)
    return -F.logsigmoid(logits).mean()

These are the two main helper functions in the training loop for DPO. The first one calculates the log probs of a sequence from the input model and the second one implements the loss function. The training loop then samples prompt, accepted response and rejected response from the dataset. We tokenize it, run the forward passes and calculate the loss function.

GRPO

Another common problem with the PPO algorithm is the high memory requirement for an extra value network and its optimizer states. Deepseek Math was the first paper to introduce GRPO which is inspired by PPO but is more memory efficient. We then saw Deepseek R1 use GRPO in its post-training, giving us the first open source reasoning model. This method is now used widely in Reinforcement Learning with Verifiable Reward.

How does GRPO work?

Group Relative Policy Optimization makes two important changes to the PPO objective. Let's understand them better!

Group Baseline Advantage

The objective of an advantage estimate $A^\pi(s,a)$ is to tell us how much more reward can action $a$ get in a state $s$ compared to the average action from the policy. Instead of using a value function as the learned estimate of the "average" reward we can sample $G$ responses from the model in the current state, determine the reward for each of them and then use the group statistics as our baseline.

A^i = \frac{r_i - \mathrm{mean}(\{r_1, \dots, r_G\})}{\mathrm{std}(\{r_1, \dots, r_G\})}

This works well for language models because sampling is easier to scale than training an extra model. Intuitively we assume that the model, given $G$ rollout attempts will reach the correct solution in some of them. If it does then that is our reward signal and we reinforce that behaviour.

KL divergence term

We saw in RLHF-PPO how the reward blends the actual reward score and KL penalty for each token. For easier interpretation we want the GRPO advantage to focus only on the reward/score a specific rollout can achieve. Given this reason the GRPO objective has a separate KL divergence term that uses the k3 estimator.

\mathbb{D}_\text{KL}(\pi_\theta \parallel \pi_\text{ref}) = \frac{\pi_\theta(o_t)}{\pi_\text{ref}(o_t)} - \log \left(\frac{\pi_\theta(o_t)}{\pi_\text{ref}(o_t)}\right) - 1

Combining the two we get our final GRPO objective. We sample the loss across each token in a sample and then each sample in a batch. You'll also see that all tokens in one sample get the same advantage based on the reward from that trajectory.

{\scriptsize \mathcal{L}_{\text{GRPO}}(\theta) = \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left[ \min\!\left(\rho_{i,t} \hat{A}_i,\, \mathrm{clip}(\rho_{i,t}, 1{-}\varepsilon, 1{+}\varepsilon) \hat{A}_i\right) - \beta D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}}) \right]}

Dr GRPO

Deepseek reported that response length increases as RL training progressed and attributed that to more reasoning tokens. This paper did an analysis of the Deepseek R1-Zero training methods and found two biases in the GRPO objective, one of which might be causing the longer responses.

Response level length bias: The normalization we do with $|o_i|$ for each update results in the model preferring shorter correct response since that gives them a higher gradient push and longer incorrect responses since the normalization reduces the negative gradient signal.
Question level difficulty bias: Dividing the baseline with standard deviation of rewards from the $G$ rewards results in rollouts with lower standard deviations (where most rollouts get 0 or 1) getting a stronger gradient signal and causing a question difficulty bias in the updates.

Dr GRPO is a variant this paper proposed that removes both normalization terms from the main GRPO objective.

DAPO

Researchers at Alibaba tried to replicate the reasoning behaviour from Deepseek R1 and OpenAI o1 models and made some interesting observations in this paper. They realized that they weren't able to replicate the results at scale and open sourced a new RL training framework called verl that is widely used today. They also introduced a new variant of GRPO called DAPO which solves some key limitations they observed.

How does DAPO work?

DAPO makes four main modifications to the GRPO objective that improve both scalability and performance during training. Let's understand them!

Raise the Clip Ceiling

The authors observed an entropy mode collapse. As training progressed sampled responses became more and more identical. This causes limited exploration resulting in a deterministic policy. They attributed this to the symmetric clipping used in PPO and GRPO.

If the bounds for the clip $\varepsilon = 0.2$ and you have two tokens with positive advantage and log probabilities $0.01$ and $0.9$ respectively then the upper bound on how much probability can increase for these tokens is $0.012$ and $1.08$ respectively. You can think of the first token here as an exploration bet since it is low probability but can promise good rewards. We can clearly see that a symmetric clip ratio does not do enough to incentivize the exploratory bets.

DAPO therefore uses an asymmetric clipping ratio where $\varepsilon_\text{low}$ and $\varepsilon_\text{high}$ are two separate ratios and the upper ceiling is increased.

Dynamic Sampling

If you have prompts where all $G$ responses have full reward or no reward then there is no advantage and gradients in the update step. Prompts like this don't contribute to learning and lead to more noise in the batch level statistics. DAPO skips such prompts and moves on to the next one until the batch is filled. They observe that this requires more rollouts but also results in faster convergence since each batch has more learning signal.

Token Level Gradients

We know that GRPO objective lets each sample contribute equally to the batch loss. This does not scale well with longer CoT reasoning since the per token contribution to the loss decreases as response lengths increase. DAPO instead uses token level gradients where the loss is averaged across all tokens in a batch. This results in proportionate contribution from each sample. More importantly, if a token or group of tokens do improve model capability they will contribute to the gradient equally irrespective of their response length.

Overlong Filtering

When responses exceed the max token limit it is hard to reward the rollout and extract learning signal from it. One simple solution is overlong filtering where such rollouts are masked during loss calculation. The DAPO authors observed this makes the training more stable. They also propose soft overlong punishment where we define a soft max token limit after which each token gets a penalty that is added to final reward. This gives a smooth gradient signal for the model to limit its response length.

All of these changes combined give us the DAPO objective. One more change from the GRPO objective is the lack of a KL divergence term since the authors thought the restriction was not necessary.

{\scriptsize \mathcal{J}_{\text{DAPO}}(\theta) = \mathbb{E}_{(q,a)\sim\mathcal{D},\, \{o_i\}_{i=1}^{G}\sim\pi_{\theta_{\mathrm{old}}}(\cdot|q)} \left[ \frac{1}{\sum_{i=1}^{G}|o_i|} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|} \min\!\left(r_{i,t}(\theta)\hat{A}_{i,t},\, \mathrm{clip}\!\left(r_{i,t}(\theta), 1{-}\varepsilon_{\mathrm{low}}, 1{+}\varepsilon_{\mathrm{high}}\right)\hat{A}_{i,t}\right) \right]}