
BandPO: Probability-Aware Bounds for LLM RL - Detailed Analysis & Overview



BandPO: Probability-Aware Bounds for LLM RL
RLVS 2021 - Day 3 - Regret bounds of model-based reinforcement learning
Reinforcement learning is terrible – Andrej Karpathy
Best Multi-Armed Bandit Strategy? (feat: UCB Method)
Direct Preference Optimization (DPO) - How to fine-tune LLMs directly without reinforcement learning
PACOH-RL: Data-Efficient Task Generalization via Probabilistic Model-based Meta RL
DeepSeek's GRPO (Group Relative Policy Optimization) | Reinforcement Learning for LLMs
RL for POMDPs
Reinforcement Learning with Verifiable Rewards - Teaching LLMs to Solve Problems
MDPs and Reinforcement Learning for LLM Agents
Policy Error Bounds for Model-Based Reinforcement Learning with Factored Linear Models
Agentic Reinforcement Learning (RL) for Large Language Models (LLM). Markov Decision Processes (MDPs)

BandPO: Probability-Aware Bounds for LLM RL

In this AI Research Roundup episode, Alex discusses the paper: '

RLVS 2021 - Day 3 - Regret bounds of model-based reinforcement learning

Speaker: Mengdi Wang Chairman: Sébastien Gerchinovitz Abstract. We discuss some recent results on model-based methods for ...

Reinforcement learning is terrible – Andrej Karpathy

Full episode: https://www.youtube.com/watch?v=lXUZvyajciY Me on twitter: https://x.com/dwarkesh_sp Andrej Karpathy helped ...

Best Multi-Armed Bandit Strategy? (feat: UCB Method)

Which is the best strategy for multi-armed bandit? Also includes the Upper Confidence Bound (UCB Method) Link to intro ...

Direct Preference Optimization (DPO) - How to fine-tune LLMs directly without reinforcement learning

Direct Preference Optimization (DPO) is a method used for training Large Language Models (LLMs). DPO is a direct way to train ...

PACOH-RL: Data-Efficient Task Generalization via Probabilistic Model-based Meta RL

We introduce PACOH-

DeepSeek's GRPO (Group Relative Policy Optimization) | Reinforcement Learning for LLMs

In this video, I break down DeepSeek's Group Relative Policy Optimization (GRPO) from first principles, without assuming prior ...

RL for POMDPs

This video is part of the Udacity course "Reinforcement Learning". Watch the full course at https://www.udacity.com/course/ud600.

Reinforcement Learning with Verifiable Rewards - Teaching LLMs to Solve Problems

Strengthen your technical foundations with Brilliant! Visit https://brilliant.org/AdamLucek/ to start learning for free and save 20% off ...

MDPs and Reinforcement Learning for LLM Agents

Understand the mathematical framework behind Reinforcement Learning and Markov Decision Processes in the context of Large ...

Policy Error Bounds for Model-Based Reinforcement Learning with Factored Linear Models

Authors: Bernardo Avila Pires, Csaba Szepesvari.

Agentic Reinforcement Learning (RL) for Large Language Models (LLM). Markov Decision Processes (MDPs)

Agentic Reinforcement Learning (

Trajectory-based Probabilistic Policy Gradient for Learning Locomotion Behaviors

We propose a trajectory-based reinforcement learning method named deep latent policy gradient (DLPG) for learning locomotion ...

[Reinforcement Learning] Offline Lesson 2 - Probabilistic Reasoning Over Time

In this video, we 1) introduce probabilistic models for dynamic systems, 2) discuss their properties, and 3) analyze the kind of ...

Probability Is Not Proof. And LLMs Will Never Cross That Line

Between GPT-4 and the models shipping in 2026, the curve stopped behaving like a curve. The benchmarks still move.

Proximal Policy Optimization (PPO) for LLMs Explained Intuitively

In this video, I break down Proximal Policy Optimization (PPO) from first principles, without assuming prior knowledge of ...