Machine Learning and Data Science PhD Student Forum (Session 102) — From Reasoning RL to Agent RL: Paradigms and Engineering Challenges in Post-Training
Speaker: 杨潇博
Time: 2026-05-14, 16:00–17:00
Venue: Tencent Meeting, ID 928-6293-8217
Abstract:
Reinforcement learning has re-emerged as a central paradigm in LLM post-training. This talk traces its evolution through two stages, focusing on what each demands of the underlying training system.
We first examine reasoning RL and the "zero" experiments popularized by DeepSeek-R1-Zero, where reasoning capabilities are elicited directly from a base model without large-scale supervised data, making post-training resemble pre-training in its reliance on unsupervised signal at scale. This gives rise to thinking models whose intelligence scales with test-time compute, from sequential to parallel thinking.
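The step from sequential to parallel thinking can be sketched as best-of-n sampling: spend more test-time compute by drawing several independent reasoning chains and keeping the one a verifier scores highest. The sketch below is illustrative only; `generate_chain` and its random score are toy stand-ins (assumptions) for an LLM sampler and a learned verifier, not the talk's actual method.

```python
import random


def generate_chain(seed: int) -> tuple[str, float]:
    """Toy stand-in for sampling one reasoning chain from a model.

    Returns the chain and a verifier score; a real system would call
    an LLM and a learned verifier here (both are assumptions).
    """
    rng = random.Random(seed)
    return f"chain-{seed}", rng.random()


def best_of_n(n: int) -> str:
    """Parallel thinking as best-of-n: more samples = more test-time compute.

    The best achievable score is monotone in n, which is the sense in
    which capability scales with test-time compute.
    """
    chains = [generate_chain(seed) for seed in range(n)]
    best_chain, _ = max(chains, key=lambda c: c[1])
    return best_chain
```

Sequential thinking corresponds instead to making a single chain longer; best-of-n trades that depth for breadth across independent samples.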
We then turn to agent RL, where models interact with tools and environments over long horizons. Compared to reasoning RL, it introduces new challenges in multi-turn rollouts, environment isolation, sparse rewards, and long-trajectory credit assignment. We walk through a modern agent RL training system and discuss how it differs from prior pipelines.
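Two of the challenges above, sparse rewards and long-trajectory credit assignment, can be illustrated in a few lines: a multi-turn rollout where only the terminal step carries a reward, followed by discounted returns that propagate that reward back to earlier turns. This is a minimal sketch under toy assumptions; `tool_call_{t}` actions and the `success_at` parameter are hypothetical placeholders for real environment dynamics, and real systems use far richer credit-assignment schemes.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    action: str
    reward: float = 0.0  # sparse: only the final turn gets an environment reward


def rollout(max_turns: int, success_at: int) -> list[Turn]:
    """Toy multi-turn rollout: act until the episode ends or the budget runs out.

    `success_at` is a stand-in for environment dynamics (an assumption):
    the turn at which the task happens to succeed.
    """
    turns: list[Turn] = []
    for t in range(max_turns):
        turns.append(Turn(action=f"tool_call_{t}"))
        if t == success_at:
            turns[-1].reward = 1.0  # terminal reward only
            break
    return turns


def discounted_returns(turns: list[Turn], gamma: float = 0.99) -> list[float]:
    """Credit assignment over a long trajectory.

    Propagates the sparse terminal reward backward as discounted
    returns, so earlier turns receive exponentially decayed credit.
    """
    g, out = 0.0, []
    for turn in reversed(turns):
        g = turn.reward + gamma * g
        out.append(g)
    return list(reversed(out))
```

Note how the earliest tool calls receive almost no signal when trajectories are long and gamma is small, which is one concrete reason long-horizon agent RL is harder than single-turn reasoning RL.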
About the forum: This online forum is organized by Prof. 张志华's machine learning lab and is held every two weeks (except during public holidays). Each session invites a PhD student to give a systematic, in-depth introduction to a frontier topic; themes include, but are not limited to, machine learning, high-dimensional statistics, operations research and optimization, and theoretical computer science.