Quietscore 69.8 · May 15, 2026 · 2605.16116 · cs.AI

ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents

Chinmay Savadikar, Mingyu Zhao, Yuanzheng Zhu, Han Li, Shuang Xie, Alberto Castelo, Tianfu Wu, Lingyun Wang

Narrative

ShopGym provides a pipeline for converting live e-commerce storefronts into reproducible sandbox environments and generating grounded benchmark tasks on top of them. The framework has two layers: ShopArena, which anonymizes and reconstructs real storefronts as self-contained simulations, and ShopGuru, which synthesizes 224 tasks across seven skill categories tied to each shop's specific catalog and policies. Validation shows agent performance on synthetic shops correlates positively with performance on live storefronts, though the dataset covers only six shops.

No production traction yet. The GitHub repos referencing this paper are all automated arXiv aggregators with no substantive engagement, and there are zero citations on Semantic Scholar. The paper was posted in mid-May 2026 and is too new to assess adoption, but the absence of any dedicated code repository or integration with existing agent frameworks like WebArena or OSWorld limits immediate reproducibility.

Abstract

Developing and evaluating e-commerce web agents requires environments that preserve meaningful task structure while enabling controllable, reproducible, and scalable scientific comparison. Existing methodologies force a tradeoff: live storefronts provide realism but are non-stationary, difficult to inspect, and irreproducible, while hand-built sandbox benchmarks provide control but cover only a narrow range of layouts, catalogs, policies, and interaction patterns. We argue that the core bottleneck is methodological: the field lacks a scalable way to construct evaluation settings that are simultaneously realistic, diverse, controllable, inspectable, and reproducible. We introduce ShopGym, an integrated framework for realistic simulation and scalable benchmarking of e-commerce web agents. ShopGym is a framework for constructing e-commerce simulation environments and grounded benchmark tasks. Its simulation layer, ShopArena, converts live seed storefronts into self-contained sandbox shops through anonymized shop specifications and a staged, validated generation process. On top of these simulated storefronts, ShopGuru synthesizes benchmark tasks across seven skill categories, grounding each task in the shop's catalog, navigation structure, policies, and interaction affordances. Together, ShopArena and ShopGuru produce self-contained, resettable, inspectable, and stable evaluation artifacts that preserve structural properties and agent-evaluation signals relevant to shopping tasks. We validate the framework through graph-based structural analysis and agent-based behavioral evaluation with 224 generated tasks across six sandbox shops: three constructed with synthetic data and three with real data. Our results show that the synthetic shops preserve key structural properties of live storefronts, with agent performance on synthetic shops positively correlated with performance on live storefronts.
