Machine Learning and Data Science PhD Student Forum (Session 97) — Data Selection for LLM Pretraining
Speaker: 孙昱洋
Time: 2025-12-25, 16:00–17:00
Venue: Tencent Meeting, room 331-2528-5257
Abstract:
The quality and composition of training data are pivotal in shaping the capabilities, efficiency, and safety of large language models (LLMs). This lecture examines the methodologies and guiding principles behind data selection during the pretraining stage, where trillions of tokens are filtered, deduplicated, and combined to form the foundation of modern LLMs. We will systematically explore key techniques—including language filtering, heuristic cleaning, quality scoring, deduplication, toxicity removal, and data mixing—that influence both what a model learns and how well it generalizes, with an emphasis on current best practices.
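As a concrete illustration of the filtering and deduplication steps named above, here is a minimal, self-contained sketch of a document-level selection pipeline. The thresholds and rules are illustrative assumptions, not values from any real production system (real pipelines add language identification, model-based quality scoring, near-duplicate detection such as MinHash, and toxicity classifiers):

```python
import hashlib
import re

def heuristic_clean(doc: str) -> bool:
    """Cheap heuristic filters in the spirit of rule-based web-text cleaning:
    drop very short documents and documents dominated by non-alphabetic symbols.
    The thresholds (5 words, 0.6 alpha ratio) are illustrative, not tuned."""
    if len(doc.split()) < 5:
        return False
    alpha_ratio = sum(c.isalpha() for c in doc) / max(len(doc), 1)
    return alpha_ratio > 0.6

def exact_dedup(docs):
    """Exact deduplication by hashing whitespace-normalized, lowercased text.
    Production systems typically add near-duplicate detection on top of this."""
    seen, kept = set(), []
    for doc in docs:
        key = hashlib.md5(
            re.sub(r"\s+", " ", doc).strip().lower().encode()
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

def select(docs):
    """Toy selection pipeline: heuristic cleaning, then deduplication."""
    return exact_dedup([d for d in docs if heuristic_clean(d)])

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The  quick brown fox jumps over the lazy dog.",  # whitespace-only variant
    "!!! ### $$$",                                    # symbol junk, filtered out
    "Large language models are trained on filtered web text.",
]
print(select(corpus))  # two unique, clean documents survive
```

Real pretraining pipelines run these stages over trillions of tokens in a distributed fashion, but the per-document logic follows the same pattern: cheap rule-based filters first, then progressively more expensive scoring and deduplication.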
About the forum: This online forum is organized by Professor 张志华's machine learning lab and meets biweekly (except on public holidays). Each session invites a PhD student to give a systematic, in-depth introduction to a frontier topic; topics include, but are not limited to, machine learning, high-dimensional statistics, operations research and optimization, and theoretical computer science.