Sign-Separated Finite-Time Error Analysis of Q-Learning
Donghwan Lee
Constant step-size Q-learning updates are known to overestimate action values due to the Bellman max operator, but the asymmetry between how positive and negative errors propagate has lacked a clean theoretical treatment. This work decomposes Q-learning error componentwise into negative and positive parts, showing the negative side can be bounded by a stable LTI system tied to the optimal policy โ which converges at least as fast as the positive-side bound governed by a linear switching system. The key finding is that positive errors get selected and amplified by the Bellman maximum while negative errors do not, giving a formal finite-time explanation for Q-learning's overestimation bias in both deterministic and stochastic settings.
No production traction yet. The GitHub repos referencing it are generic arxiv aggregators, not implementations or downstream research tools. Zero citations on Semantic Scholar. This is pure theoretical RL analysis, useful as a reference for researchers working on Q-learning convergence guarantees but not something builders are picking up.
This paper develops a sign-separated finite-time error analysis for constant step-size Q-learning. Starting from the switching-system representation, the error is decomposed into its componentwise negative and positive parts. The negative part is dominated by a lower comparison linear time-invariant (LTI) system associated with a fixed optimal policy, whereas the positive part is controlled by a linear switching system. The resulting bounds show that the negative-side LTI certificate is no slower than the positive-side switching certificate and may produce a faster exponential envelope. The analysis identifies a max-induced asymmetry in Q-learning error dynamics. This asymmetry is connected to overestimation: positive action-wise errors can be selected and propagated by the Bellman maximum, whereas negative errors admit an optimal-policy lower comparison. Finite-time bounds are provided for both deterministic and stochastic constant-step-size recursions.