Property-Guided LLM Program Synthesis for Planning
Augusto B. Corrêa, André G. Pereira, Jendrik Seipp
Using formal property verification instead of numeric scoring to guide LLM-based program synthesis, this work shows that counterexample-driven feedback — stopping evaluation early and telling the model exactly how a candidate failed — can cut program generation by 7x compared to score-based methods. The domain is AI planning: an LLM synthesizes heuristic functions for PDDL domains, and the target property is that every state reachable by strictly improving transitions has a strictly improving successor (making hill-climbing complete). Tested across ten planning domains on out-of-distribution tasks, the synthesized heuristics are effectively "direct" on nearly all test cases and require orders of magnitude less computation to evaluate than prior generation-based approaches.
No production traction yet. The GitHub references are all arxiv-tracking aggregators, not implementations or downstream users. Zero citations on Semantic Scholar. The ideas here are directly relevant to anyone building LLM-based code synthesis pipelines where formal specs exist — property-checking with counterexample feedback is a practical alternative to fuzzing or test-suite scoring — but nothing is shipping from this work yet.
LLMs have shown impressive success in program synthesis, discovering programs that surpass prior solutions. However, these approaches rely on simple numeric scores to signal program quality, such as the value of the solution or the number of passed tests. Because a score offers no guidance on why a program failed, the system must generate and evaluate many candidates hoping some succeed, increasing LLM inference and evaluation costs. We study a different approach: property-guided LLM program synthesis. Instead of scoring programs after evaluation, we check whether a candidate satisfies a formally defined property. When the property is violated, we stop the evaluation early and provide the LLM with a concrete counterexample showing exactly how the program failed. This feedback drastically reduces both the number of program generations and the evaluation cost, and can guide the LLM to generate stronger programs. We evaluate this approach on PDDL planning domains, asking the LLM to synthesize direct heuristic functions: every state reachable by strictly improving transitions has a strictly improving successor. A heuristic with this property leads hill-climbing algorithm directly to a goal state. A counterexample-guided repair loop generates one candidate program, checks the property over a training set, and returns the first case that violates the property. We evaluate our approach on ten planning domains with an out-of-distribution test set. The synthesized heuristics are effectively direct on virtually all test tasks, and compared to the best prior generation method our approach generates seven times fewer programs per domain on average, solves more tasks without using search, and requires several orders of magnitude less computation to evaluate candidates. Whenever a problem admits a verifiable property, property-guided LLM synthesis can reduce cost and improve program quality.