
All 53 · 🚀 Shipping 2 · 📈 Climbing 0 · 💤 Quiet 41 · Unscored 10

What do these badges mean?

🚀 Shipping: Code exists. Multiple GitHub repos already reference this paper, so people are building on it.
📈 Climbing: Citation velocity is rising. Researchers are starting to pick it up.
💤 Quiet: Published but no notable signal yet. Most papers live here and could become anything later.
🎭 Hype: Heavy social buzz but no shipping signal. This is the counter-signal; it is deferred until Twitter/X data is wired up.
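
The badge rules above can be read as a small decision procedure. The sketch below is one possible reading, not the site's actual scoring code: the field names, the "two or more repos" threshold, the positive-delta test for rising citations, and the Shipping-before-Climbing precedence are all assumptions made for illustration. Hype is left out because the legend says it is deferred.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PaperSignals:
    """Hypothetical per-paper inputs; names are illustrative only."""
    repo_count: Optional[int] = None                 # GitHub repos referencing the paper
    citation_velocity_delta: Optional[float] = None  # change in citation rate


def assign_badge(signals: PaperSignals) -> str:
    # No signal data collected yet: the paper stays unscored.
    if signals.repo_count is None and signals.citation_velocity_delta is None:
        return "Unscored"
    # Assumption: "multiple repos" means two or more.
    if (signals.repo_count or 0) >= 2:
        return "Shipping"
    # Assumption: "rising" means a strictly positive change in citation rate.
    if (signals.citation_velocity_delta or 0.0) > 0:
        return "Climbing"
    # Scored, but neither signal fired.
    return "Quiet"
```

Under these assumptions, a paper with 20 referencing repos and no citation growth would be labeled Shipping, and a paper with no collected data would remain Unscored.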

11 min read
🚀 Shipping · 2606.12384 · Jun 10, 2026 · cs.LG, cs.AI
APPO: Agentic Procedural Policy Optimization
Xucong Wang, Ziyu Ma, Yong Wang, Yuxiang Ji, +4
⭐ 1.8k stars / 20 repos · 📚 0 cites
ELI5: When training AI agents that use tools, this paper figures out which decisions actually matter and how to learn from them better. Instead of treating each tool call as a unit, it zooms in on individual tokens and uses a smart scoring system to pick which ones are worth exploring differently.
Problem solved: Current RL methods for tool-using agents struggle to pinpoint which intermediate decisions drive success, leading to wasted exploration on unimportant choices. This makes training inefficient and credit assignment unreliable, especially when good decisions are scattered throughout long sequences rather than at obvious tool-call boundaries.
13 min read
🚀 Shipping · 2606.12344 · Jun 10, 2026 · cs.LG, cs.CL
Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks
Mengyu Zheng, Kai Han, Boxun Li, Haiyang Xu, +12
⭐ 1.5k stars / 22 repos · 📚 0 cites
ELI5: A toolkit for fairly comparing different AI agent designs on real software engineering tasks across multiple programming languages. It standardizes how agents interact with code, extract patches, and get scored so you can see which agent setup actually works best.
Problem solved: Testing AI coding agents fairly is hard because each agent design needs different handling and there is no standard way to measure them against each other. This benchmark lets you compare agent harnesses (the glue code that connects models to tools) on an apples-to-apples basis, including cost.
