SAPO: Stable RL Policy Optimization for LLMs - Detailed Analysis & Overview

SAPO: Stable RL Policy Optimization for LLMs

In this AI Research Roundup episode, Alex discusses the paper: 'Soft Adaptive

SAPO: Stable RL for Large Language Models

This video explains Soft Adaptive

Proximal Policy Optimization (PPO) for LLMs Explained Intuitively

In this video, I break down Proximal

DeepSeek's GRPO (Group Relative Policy Optimization) | Reinforcement Learning for LLMs

In this video, I break down DeepSeek's Group Relative

BAPO: Stabilizing Off-Policy RL for LLMs

In this AI Research Roundup episode, Alex discusses the paper: 'BAPO: Stabilizing Off-

SimpleTIR: Stable RL for Tool-Using LLMs

In this AI Research Roundup episode, Alex discusses the paper: 'SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn ...

Soft Adaptive Policy Optimization

Soft Adaptive

Teaching LLMs with RL: From Scratch to GRPO and Beyond

This lecture is part of the GenML 2025 conference of the MDLI community. You can watch the rest of the lectures and the slides here: https://mdli.co.il/en25. Training ...

Direct Preference Optimization (DPO) - How to fine-tune LLMs directly without reinforcement learning

Direct Preference

Soft Adaptive Policy Optimization (Nov 2025)

Title: Soft Adaptive

Efficient Policy Optimization Techniques for LLMs

Additionally, I will present an approach that simplifies the

RLHF, PPO & GRPO Explained: A Top-Down Guide to LLM Policy Optimization

A top-down, self-contained guide to RLHF, PPO, and GRPO: how large language models are optimized with reinforcement ...

Simply Explaining Proximal Policy Optimization (PPO) | Deep Reinforcement Learning

Hands-on whiteboard session on every step of the PPO algorithm! ...

An introduction to Policy Gradient methods - Deep Reinforcement Learning

In this episode I introduce

GRPO 2.0? DAPO LLM Reinforcement Learning Explained

In this video, we break down DAPO: An Open-Source

RLOO: A Cost-Efficient Optimization for Learning from Human Feedback in LLMs

ChatGPT undoubtedly turned the AI industry upside-down, making AI technology mainstream. A key component behind ...

LLM Training & Reinforcement Learning from Google Engineer | SFT + RLHF | PPO vs GRPO vs DPO

As an ordinary SWE, I want to share the most typical

SPO: Off-Policy RL Revolutionizes Sequence Model Training!

Dive into Soft

Proximal Policy Optimization (PPO) - How to train Large Language Models

Reinforcement Learning from Human Feedback (RLHF) is a method used for training Large Language Models (