SWE-Bench++
A Scalable Framework for Generating Multilingual Software Engineering Benchmarks from Open-Source Repositories
11,133 execution-based tasks from 3,971 repositories across 11 languages — harvested from live GitHub pull requests covering both bug fixes and feature requests.
Why SWE-Bench++?
Addressing the fundamental limitations of static, manually curated benchmarks with an automated, multilingual, living evaluation framework.
Multilingual Coverage
Unlike Python-only benchmarks, SWE-Bench++ covers 11 programming languages from 3,971 repositories, capturing the structural and linguistic diversity of open-source projects.
Bug Fixes & Feature Requests
State-differential task classification distinguishes between regression fixes and new feature implementations — covering tasks that prior benchmarks had to discard.
Living Benchmark
Continuous ingestion of fresh pull requests enables temporally separated, contamination-aware evaluation. Instances are filtered by PR creation time relative to model training cutoffs.
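The temporal filter described above can be sketched in a few lines. This is a minimal illustration, assuming instances carry a timezone-aware `pr_created_at` field; the model names and cutoff dates are hypothetical placeholders, not values from SWE-Bench++.

```python
from datetime import datetime, timezone

# Hypothetical per-model training cutoffs (illustrative values only).
MODEL_CUTOFFS = {
    "model-a": datetime(2024, 6, 1, tzinfo=timezone.utc),
}

def contamination_safe(instances, model_name):
    """Keep only instances whose PR was created after the model's training cutoff."""
    cutoff = MODEL_CUTOFFS[model_name]
    return [inst for inst in instances if inst["pr_created_at"] > cutoff]
```

Because each instance records its PR creation time, the same pool can be re-filtered per model, giving every model its own contamination-safe evaluation slice.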
Training Trajectory Synthesis
Hint-guided trajectory synthesis converts instances that strong models fail on into fine-tuning data, yielding measurable improvements on the SWE-bench Multilingual benchmark.
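The synthesis loop can be sketched as follows. `run_agent` and `make_hint` are hypothetical callables; how hints are actually distilled from the gold patch is an assumption here, not the framework's exact recipe.

```python
def synthesize_trajectories(instances, run_agent, make_hint):
    """Hint-guided trajectory synthesis (sketch)."""
    training_data = []
    for inst in instances:
        # Re-run the agent on a previously failed instance, now with a hint
        # derived from the known solution; keep trajectories that resolve it.
        trajectory = run_agent(inst, hint=make_hint(inst))
        if trajectory["resolved"]:
            training_data.append(trajectory)
    return training_data
```

Only resolving trajectories are kept, so the resulting fine-tuning set contains end-to-end successful agent behavior on instances the model originally failed.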
Four-Stage Pipeline
SWE-Bench++ transforms GitHub pull requests into reproducible, execution-based software engineering tasks through a fully automated pipeline.
Programmatic Sourcing
Broad search across the GitHub firehose to identify candidate tasks from merged pull requests that resolve real issues. Filters for active maintenance, community adoption (>100 stars), testing frameworks, and codebases exceeding 10k LOC.
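A filter predicate combining these criteria might look like the sketch below. The star and LOC thresholds come from the criteria above; the field names are assumptions about the metadata schema, not the framework's actual API.

```python
def is_candidate(repo, pr):
    """Illustrative sourcing predicate over repo and PR metadata."""
    return (
        pr["merged"]                        # only merged PRs
        and pr["linked_issue"] is not None  # PR must resolve a real issue
        and repo["stars"] > 100             # community adoption
        and repo["loc"] > 10_000            # non-trivial codebase
        and repo["has_test_framework"]      # tests give an execution oracle
        and repo["recently_active"]         # active maintenance
    )
```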
Environment Synthesis
Template-guided Dockerfile synthesis with iterative build-and-test feedback loops. Combines LLM reasoning with static templates, achieving ~137% higher yield than baseline approaches on Python repositories.
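The build-and-test feedback loop can be sketched as below. `draft`, `repair`, and `build_and_test` are hypothetical callables standing in for the template-plus-LLM Dockerfile writer, the LLM repair step, and a `docker build` plus test run; the retry budget is likewise an assumption.

```python
def synthesize_environment(draft, repair, build_and_test, max_attempts=5):
    """Iterative Dockerfile synthesis loop (sketch)."""
    dockerfile = draft()                      # static template + LLM fill-in
    for _ in range(max_attempts):
        ok, log = build_and_test(dockerfile)  # does the image build and run tests?
        if ok:
            return dockerfile                 # reproducible environment found
        dockerfile = repair(dockerfile, log)  # feed the failure log back to the LLM
    return None                               # budget exhausted; discard the repo
```

Feeding the concrete build or test failure log back into each repair step is what lets the loop converge on repositories where a single-shot Dockerfile would fail.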
Test Oracle Extraction
State-differential oracle compares three repository states — Base, Before, and After — to classify tasks as bug fixes or feature requests. Adaptive parser synthesis handles heterogeneous log formats across build systems.
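A plausible reading of the three-state comparison is sketched below; the exact labeling rules are an assumption. A valid instance must be fail-to-pass under the gold patch, and the Base state tells bug fixes apart from feature requests.

```python
def classify_instance(status_before, status_after, test_exists_in_base):
    """State-differential classification (sketch).

    status_before / status_after: the test's result ('pass' or 'fail')
    before and after the gold patch is applied.
    test_exists_in_base: whether the test targets behavior present in Base.
    """
    if status_before != "fail" or status_after != "pass":
        return "invalid"          # test must fail pre-patch and pass post-patch
    if test_exists_in_base:
        return "bug_fix"          # behavior existed and regressed
    return "feature_request"      # test targets behavior new in this PR
```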
Quality Assurance
Multi-layer verification including deterministic checks, semantic alignment via LLM-Judge, and human review sampling. Ensures each instance is reproducible, unambiguous, and free from solution leakage.
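One deterministic layer, the solution-leakage check, could work roughly as follows. This is an illustrative sketch, not the framework's actual implementation: it rejects an instance if the issue text quotes lines that the gold patch adds.

```python
def no_solution_leakage(issue_text, gold_patch):
    """Illustrative leakage check against a unified-diff gold patch."""
    added = [
        line[1:].strip()
        for line in gold_patch.splitlines()
        if line.startswith("+") and not line.startswith("+++")  # skip file header
    ]
    # Fail the check if any non-empty added line appears verbatim in the issue.
    return not any(line and line in issue_text for line in added)
```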
Model Leaderboard
Performance of frontier LLM coding agents on a verified 1,782-instance cross-lingual subset (pass@10).
Results on a stratified subset of 1,782 instances (~100–280 per language). Models show stronger performance on Python and Java.
Supported Languages
Research Highlights
Higher Yield
Template-guided synthesis achieves ~137% higher yield on Python repositories compared to SetUpAgent baselines.
QA Pipeline
Multi-layer verification including deterministic checks, semantic alignment, and human review ensures each instance is reproducible and unambiguous.
Training-Ready Data
Hint-guided trajectory synthesis transforms failed instances into training data, with measurable gains on external multilingual benchmarks.
Frequently Asked Questions
Ready to Evaluate Your Coding Model?
Request benchmark access, explore evaluation data, or run a scoped SWE-Bench++ evaluation with our research team.
