Machine Learning and Data Science PhD Student Forum (Session 102) — From Reasoning RL to Agent RL: Paradigms and Engineering Challenges in Post-Training
Speaker: 杨潇博
Time: 2026-05-14, 16:00–17:00
Venue: Tencent Meeting, ID 928-6293-8217
Abstract:
Reinforcement learning has re-emerged as a central paradigm in LLM post-training. This talk traces its evolution through two stages, focusing on what each demands of the underlying training system.
We first examine reasoning RL and the "zero" experiments popularized by DeepSeek-R1-Zero, where reasoning capabilities are elicited directly from a base model without large-scale supervised data, making post-training resemble pre-training in its reliance on unsupervised signal at scale. This gives rise to thinking models whose intelligence scales with test-time compute, from sequential to parallel thinking.
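The step from sequential to parallel thinking can be sketched as best-of-n sampling: spend more test-time compute by drawing several independent reasoning chains and keeping the one a verifier scores highest. The sketch below is illustrative only; `generate_chain` and its random score are toy stand-ins (assumptions) for an LLM sampler and a learned verifier, not the talk's actual method.

```python
import random


def generate_chain(seed: int) -> tuple[str, float]:
    """Toy stand-in for sampling one reasoning chain from a model.

    Returns the chain and a verifier score; a real system would call
    an LLM and a learned verifier here (both are assumptions).
    """
    rng = random.Random(seed)
    return f"chain-{seed}", rng.random()


def best_of_n(n: int) -> str:
    """Parallel thinking as best-of-n: more samples = more test-time compute.

    The best achievable score is monotone in n, which is the sense in
    which capability scales with test-time compute.
    """
    chains = [generate_chain(seed) for seed in range(n)]
    best_chain, _ = max(chains, key=lambda c: c[1])
    return best_chain
```

Sequential thinking corresponds instead to making a single chain longer; best-of-n trades that depth for breadth across independent samples.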
We then turn to agent RL, where models interact with tools and environments over long horizons. Compared to reasoning RL, it introduces new challenges in multi-turn rollouts, environment isolation, sparse rewards, and long-trajectory credit assignment. We walk through a modern agent RL training system and discuss how it differs from prior pipelines.
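Two of the challenges above, sparse rewards and long-trajectory credit assignment, can be illustrated in a few lines: a multi-turn rollout where only the terminal step carries a reward, followed by discounted returns that propagate that reward back to earlier turns. This is a minimal sketch under toy assumptions; `tool_call_{t}` actions and the `success_at` parameter are hypothetical placeholders for real environment dynamics, and real systems use far richer credit-assignment schemes.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    action: str
    reward: float = 0.0  # sparse: only the final turn gets an environment reward


def rollout(max_turns: int, success_at: int) -> list[Turn]:
    """Toy multi-turn rollout: act until the episode ends or the budget runs out.

    `success_at` is a stand-in for environment dynamics (an assumption):
    the turn at which the task happens to succeed.
    """
    turns: list[Turn] = []
    for t in range(max_turns):
        turns.append(Turn(action=f"tool_call_{t}"))
        if t == success_at:
            turns[-1].reward = 1.0  # terminal reward only
            break
    return turns


def discounted_returns(turns: list[Turn], gamma: float = 0.99) -> list[float]:
    """Credit assignment over a long trajectory.

    Propagates the sparse terminal reward backward as discounted
    returns, so earlier turns receive exponentially decayed credit.
    """
    g, out = 0.0, []
    for turn in reversed(turns):
        g = turn.reward + gamma * g
        out.append(g)
    return list(reversed(out))
```

Note how the earliest tool calls receive almost no signal when trajectories are long and gamma is small, which is one concrete reason long-horizon agent RL is harder than single-turn reasoning RL.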
About the forum: This online forum is organized by Prof. 张志华's machine learning lab and is held every two weeks (except during public holidays). Each session invites a PhD student to give a systematic, in-depth introduction to a frontier topic; themes include, but are not limited to, machine learning, high-dimensional statistics, operations research and optimization, and theoretical computer science.