🚀 Shipping · score 112.9 · May 26, 2026 · 2605.27354 · cs.LG · cs.AI · cs.CL

Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders

Yi Jing, Zao Dai, Jinwu Hu, Zijun Yao, Lei Hou, Juanzi Li, Xiaozhi Wang

Abstract

Model internals encode rich information about how a large language model (LLM) processes its training data; however, post-training data engineering relies largely on external signals and ignores these intrinsic signals. We propose SAERL, a data engineering framework for LLM reinforcement learning (RL). It models three intrinsic data properties (diversity, difficulty, and quality) using model internals extracted with a Sparse Autoencoder (SAE), an advanced mechanistic interpretability tool. Each property grounds a concrete data engineering operation: SAE-space clustering with moderate batch mixing for batch diversity control, a difficulty proxy for easy-to-hard curriculum ordering, and a quality probe for data filtering. SAERL improves average accuracy by 3.00% over vanilla GRPO and reaches the target accuracy with 20% fewer training steps on Qwen2.5-Math-1.5B, with consistent gains across model scales and RL algorithms. Experiments show that the SAE transfers effectively across model families and scales, serving as a lightweight and reusable data engineering tool. These results demonstrate that model internals are a powerful and practical source of signals for post-training data engineering.
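
The abstract names three operations (quality filtering, easy-to-hard ordering, and cluster-based batch mixing) without giving their details. The following is a minimal sketch of how such a pipeline could be wired together, not the authors' implementation. It assumes that per-example SAE feature vectors, a difficulty score, and a quality score have already been computed; the function names `kmeans` and `plan_batches`, the threshold, and the round-robin mixing rule are all hypothetical choices made for illustration.

```python
import numpy as np


def kmeans(features, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster id per row of `features`."""
    features = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    # Fancy indexing copies, so updating centers leaves `features` untouched.
    centers = features[rng.choice(len(features), size=k, replace=False)]
    labels = np.zeros(len(features), dtype=int)
    for _ in range(iters):
        dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = features[labels == c]
            if len(members):
                centers[c] = members.mean(axis=0)
    return labels


def plan_batches(sae_features, difficulty, quality,
                 quality_threshold=0.5, n_clusters=3, batch_size=4, seed=0):
    """Hypothetical pipeline: filter, cluster, then interleave easy-to-hard."""
    sae_features = np.asarray(sae_features, dtype=float)
    difficulty = np.asarray(difficulty, dtype=float)
    quality = np.asarray(quality, dtype=float)

    # 1. Quality filtering: drop examples below the (assumed) threshold.
    keep = np.flatnonzero(quality >= quality_threshold)
    if len(keep) == 0:
        return []

    # 2. Diversity: cluster the surviving examples in SAE feature space.
    k = min(n_clusters, len(keep))
    labels = kmeans(sae_features[keep], k, seed=seed)

    # 3. Curriculum: within each cluster, queue examples from easy to hard.
    queues = []
    for c in range(k):
        members = keep[labels == c]
        if len(members):
            order = np.argsort(difficulty[members], kind="stable")
            queues.append([int(i) for i in members[order]])

    # 4. Batch mixing: round-robin over clusters so each batch draws from
    #    several clusters while difficulty rises roughly over time.
    ordered = []
    while any(queues):
        for q in queues:
            if q:
                ordered.append(q.pop(0))
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.random((12, 6))   # stand-in for SAE activations
    diff = rng.random(12)         # stand-in for a difficulty proxy
    qual = rng.random(12)         # stand-in for a quality-probe score
    for step, batch in enumerate(plan_batches(feats, diff, qual)):
        print(step, batch)
```

The round-robin step is only one way to trade off diversity against curriculum order; a stricter curriculum would sort globally by difficulty and accept less mixing per batch.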

Signal

Stars: 500
Repos: 38
Citations: 0
Velocity: 0.00/d

GitHub repos (20)

luohongk/Embodied-AI-Daily⭐ 259
“## LLM | **Title** | **Date** | **Comment** | | --- | --- | --- | | **[Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders](https://arxiv.org/abs/2605.27354v1)** | 2026-05-26 | | | **[FinHarness: An Inline Lifecycle Safety Harness for Finance”
AI-in-Transportation-Lab/awesome-mechanistic-interpretability⭐ 96
“- [Polymorphism Is Rotation: Operational Mechanistic Interpretability from a Two-Layer Transformer to Pythia-70m](https://arxiv.org/abs/2605.24577) - [Learning dynamical systems with biochemically informed neural ordinary differential equations](https://arxiv.org/abs/2605.24170) ”
CSQianDong/Awesome-arXiv-Daily-Reporter⭐ 48
“{'arxiv_id': 'arXiv:2605.27366', 'title': 'MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation', 'authors': 'Huawei Lin, Peng Li, Jie Song, Fuxin Jiang, Tieying Zhang', 'link': 'https://arxiv.org/abs/2605.27366', 'abstract': 'Large language”
pstAmbition/DailyArXiv_Multimodal⭐ 16
“## LLM | **Title** | **Date** | **Comment** | | --- | --- | --- | | **[Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders](https://arxiv.org/abs/2605.27354v1)** | 2026-05-26 | | | **[FinHarness: An Inline Lifecycle Safety Harness for Finance”
ZenAlexa/agi-brief-history⭐ 11
“- **Summary**: Reinforcement Learning from Human Feedback (RLHF) is the standard method to align Large Language Models (LLMs) with human preferences. In this work, we introduce alignment tampering, a potential vulnerability where the LLM undergoing alignment influences the prefer”
lonePatient/lonePatient.github.io⭐ 9
“{% hideToggle Click to view abstract %} {% note blue no-icon %} ID-9-Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders {% endnote %} **Link**: https://arxiv.org/abs/2605.27354 **Authors**: Yi Jing,Zao Dai,Jinwu Hu,Zijun Yao,Lei Hou,Juanzi Li,Xiaozhi Wang **Categories**: ”
jyyang621/DailyArXiv⭐ 9
“## LLM | **Title** | **Date** | **Comment** | | --- | --- | --- | | **[Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders](https://arxiv.org/abs/2605.27354v1)** | 2026-05-26 | | | **[FinHarness: An Inline Lifecycle Safety Harness for Finance”
Guesswhat-Studio/Linnet⭐ 8
“ "abstract": "This paper introduces SAERL, a framework that leverages model internals from Sparse Autoencoders (SAEs) to guide Large Language Model (LLM) post-training data engineering for reinforcement learning. SAERL models data diversity, difficulty, and quality using”
2shin0/arxiv-ai-mailing⭐ 7
“ ## 42. Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders - **Authors**: Yi Jing , Zao Dai , Jinwu Hu , Zijun Yao , Lei Hou , Juanzi Li , Xiaozhi Wang - **URL**: [https://arxiv.org/abs/2605.27354](https://arxiv.org/abs/2605.27354) - **Abstra”
sifted-network/sifted-awesome-ai-agents⭐ 7
“ arXiv:2605.27354v1 Announce Type: cross Abstract: Model internals encode rich information about how a large language model (LLM) processes its training data; however, post-training data engineering largely relies on external signals and ignores rich intrinsic signals lying in mo”
ttmens/ai-radar-wiki⭐ 6
“ </item> <item> <title>Guiding LLM Post-training Data Engineering with Model Internals from Sparse Auto</title> <link>http://arxiv.org/abs/2605.27354v1</link> <guid isPermaLink="false">guiding-llm-post-training-data-engineering-with-model-internals-from-s”
zachysun/DailyArXiv⭐ 6
“## LLM | **Title** | **Date** | **Cool Paper** | **Comment** | | --- | --- | --- | --- | | **[Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders](https://arxiv.org/abs/2605.27354v1)** | 2026-05-26 | [Go](https://papers.cool/arxiv/2605.27354v1”
Ponkux/DailyArXiv-cp⭐ 6
“| --- | --- | --- | | **[Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases](https://arxiv.org/abs/2605.27355v1)** | 2026-05-26 | <details><summary>Accep...</summary>Accepted at ICML 2026, Source code: https://alignme”
bailynlove/Rookie-s-Newsletters⭐ 3
“ P1 ★★★★ (Score: 7.5/10) Source: <a href="https://arxiv.org/abs/2605.27354">arXiv:2605.27354</a> Authors: Yi Jing et al.”
zeaoji/MyDailyArXiv⭐ 2
“## Sparse Autoencoder | **Title** | **Date** | **Comment** | | --- | --- | --- | | **[Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders](https://arxiv.org/abs/2605.27354v1)** | 2026-05-26 | | | **[Mechanistic Interpretability of Antibody La”
NeoCodeSmith/NeoSignal⭐ 1
“ { "id": "edba65efd4fc", "title": "Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders", "url": "https://arxiv.org/abs/2605.27354", "summary": "arXiv:2605.27354v1 Announce Type: cross Abstract: Model internals encode”
amor-mio-de-mi-vida/PaperDigest⭐ 1
“ ### 7. [Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders](https://arxiv.org/abs/2605.27354v1) ”
angrysky56/LLM-WIKI⭐ 1
“type: source summary: "SAERL: Sparse Autoencoder Reinforcement Learning — uses SAE features (diversity, difficulty, quality) as intrinsic signals for post-training data engineering in GRPO; achieves 3% improvement over vanilla GRPO on Qwen2.5-Math-1.5B." tags: [arxiv, rl, grpo, s”
aparasion/turingwire⭐ 1
“source_publisher: "arXiv cs.AI" source_url: "https://arxiv.org/abs/2605.27354v1" arxiv_id: "2605.27354"”
Kiraaa1/ArXic-AI-Paper-Digest-Agent⭐ 1
“ ### 8. Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders **Authors:** Yi Jing, Zao Dai, Jinwu Hu, Zijun Yao, Lei Hou, Juanzi Li, Xiaozhi Wang **Link:** https://arxiv.org/abs/2605.27354v1 **Summary:** The paper addresses the challenge of imp”