Research Benchmark

SWE-Bench++

A Scalable Framework for Generating Multilingual Software Engineering Benchmarks from Open-Source Repositories

11,133 execution-based tasks from 3,971 repositories across 11 languages — harvested from live GitHub pull requests covering both bug fixes and feature requests.

11,133
Benchmark Instances
3,971
GitHub Repositories
11
Programming Languages
1,782
Verified Subset

Why SWE-Bench++?

Addressing the fundamental limitations of static, manually curated benchmarks with an automated, multilingual, living evaluation framework.

Multilingual Coverage

Unlike Python-only benchmarks, SWE-Bench++ covers 11 programming languages from 3,971 repositories, capturing the structural and linguistic diversity of open-source projects.

Bug Fixes & Feature Requests

State-differential task classification distinguishes between regression fixes and new feature implementations — covering tasks that prior benchmarks had to discard.

Living Benchmark

Continuous ingestion of fresh pull requests enables temporally separated, contamination-aware evaluation. Instances are filtered by PR creation time relative to model training cutoffs.
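As a sketch of that filtering step (the model name and cutoff date below are illustrative, not taken from the benchmark):

```python
from datetime import date

# Hypothetical cutoff for illustration; real values come from
# each model's published training-data cutoff.
TRAINING_CUTOFFS = {
    "model-a": date(2025, 1, 1),
}

def eligible_instances(instances, model):
    """Keep only instances whose PR was created strictly after the
    model's training cutoff (contamination-aware filtering)."""
    cutoff = TRAINING_CUTOFFS[model]
    return [i for i in instances if i["pr_created_at"] > cutoff]
```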

Training Trajectory Synthesis

Hint-guided trajectory synthesis converts instances that strong models fail on into fine-tuning data, yielding measurable improvements on the SWE-bench Multilingual benchmark.

Four-Stage Pipeline

SWE-Bench++ transforms GitHub pull requests into reproducible, execution-based software engineering tasks through a fully automated pipeline.

01

Programmatic Sourcing

Broad search across the GitHub firehose to identify candidate tasks from merged pull requests that resolve real issues. Filters for active maintenance, community adoption (>100 stars), testing frameworks, and codebases exceeding 10k LOC.
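The filters above can be pictured as a single predicate over repository and PR metadata; the field names here are illustrative, not the framework's actual schema:

```python
def is_candidate(repo: dict, pr: dict) -> bool:
    """Sourcing filter sketch: merged PRs that resolve a real issue,
    from actively maintained repos with >100 stars, a test framework,
    and a codebase exceeding 10k lines of code."""
    return (
        pr["merged"]
        and pr["linked_issue"] is not None
        and repo["stars"] > 100
        and repo["has_test_framework"]
        and repo["loc"] > 10_000
        and repo["actively_maintained"]
    )
```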

02

Environment Synthesis

Template-guided Dockerfile synthesis with iterative build-and-test feedback loops. Combines LLM reasoning with static templates, achieving ~137% higher yield than baseline approaches on Python repositories.
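One way to picture the feedback loop (all three callbacks are hypothetical stand-ins for the template engine, the container build, and the LLM reviser):

```python
def synthesize_environment(render, build, revise, max_iters=5):
    """Iterative build-and-test loop: render a candidate Dockerfile
    from a language template, attempt the build, and feed any error
    log back to a reviser until the build succeeds or the iteration
    budget runs out. Returns the working Dockerfile, or None."""
    dockerfile = render()
    for _ in range(max_iters):
        ok, log = build(dockerfile)
        if ok:
            return dockerfile
        dockerfile = revise(dockerfile, log)
    return None
```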

03

Test Oracle Extraction

State-differential oracle compares three repository states — Base, Before, and After — to classify tasks as bug fixes or feature requests. Adaptive parser synthesis handles heterogeneous log formats across build systems.
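A minimal sketch of that classification logic, assuming per-state test outcomes have already been collected (the outcome labels are illustrative, not the framework's actual encoding):

```python
def classify_task(before_outcome: str, after_passes: bool) -> str:
    """State-differential classification sketch.
    `before_outcome` is the result of running the PR's tests with only
    the test patch applied: "fail" (tests run and fail), "build_error"
    (tests reference code that does not exist yet), or "pass".
    `after_passes` is whether the tests pass with the full PR applied."""
    if not after_passes:
        return "discard"          # oracle must pass in the After state
    if before_outcome == "build_error":
        return "feature_request"  # missing symbols signal new functionality
    if before_outcome == "fail":
        return "bug_fix"          # regression test fails until the fix lands
    return "discard"              # already passing: no usable signal
```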

04

Quality Assurance

Multi-layer verification including deterministic checks, semantic alignment via LLM-Judge, and human review sampling. Ensures each instance is reproducible, unambiguous, and free from solution leakage.

Model Leaderboard

Performance of frontier LLM coding agents on a verified 1,782-instance cross-lingual subset (pass@10).

Rank  Model              Pass@10
1     claude-sonnet-4.5  36.20%
2     gpt-5-2025-08-07   34.57%
3     gemini-2.5-pro     24.92%
4     gpt-4o             16.89%


Results on a stratified subset of 1,782 instances (~100–280 per language). Models show stronger performance on Python and Java.
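The page does not spell out how pass@10 is computed; a common choice is the unbiased pass@k estimator over n sampled attempts with c successes, sketched below as an assumption about the metric:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one
    of k attempts succeeds, estimated from n total attempts with c
    successes: pass@k = 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k failures: every size-k draw contains a success
    return 1.0 - comb(n - c, k) / comb(n, k)
```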

Supported Languages

Python · Java · JavaScript · TypeScript · Go · Rust · C++ · C# · Ruby · PHP · Swift

Research Highlights

~137%

Higher Yield

Template-guided synthesis achieves ~137% higher yield on Python repositories compared to SetUpAgent baselines.

4-Layer

QA Pipeline

Multi-layer verification including deterministic checks, semantic alignment, and human review ensures each instance is reproducible and unambiguous.

Fine-Tune

Ready Data

Hint-guided trajectory synthesis transforms failed instances into training data, with measurable gains on external multilingual benchmarks.

Frequently Asked Questions

What is SWE-Bench++, and how does it differ from the original SWE-bench?

SWE-Bench++ is an automated framework that generates repository-level coding tasks from open-source GitHub projects. Unlike the original SWE-bench, which relies on manual curation and covers only 12 Python repositories, SWE-Bench++ automates the entire pipeline across 11 programming languages and 3,971 repositories, covering both bug fixes and feature requests.

Which programming languages are supported?

SWE-Bench++ supports 11 programming languages: Python, Java, JavaScript, TypeScript, Go, Rust, C++, C#, Ruby, PHP, and Swift. This multilingual coverage captures the real diversity of open-source software development.

How does SWE-Bench++ avoid data contamination?

SWE-Bench++ is designed as a 'living benchmark' that continuously ingests fresh pull requests from GitHub. Each model is evaluated only on instances whose PR timestamps fall after its published training cutoff date, minimizing the risk of memorization.

How does the State-Differential Oracle work?

The State-Differential Oracle compares three repository states: Base (original code), Before (test patch applied), and After (full PR applied). This allows SWE-Bench++ to verify both regression fixes and new feature implementations, treating specific build failures in the Before state as semantic signals for feature requests.

Can SWE-Bench++ produce training data?

Yes. SWE-Bench++ includes a hint-guided trajectory synthesis step that converts difficult instances into high-quality training trajectories. Fine-tuning experiments have demonstrated measurable improvements on external multilingual benchmarks.

How are execution environments built?

SWE-Bench++ uses template-guided Dockerfile synthesis with iterative build-and-test feedback loops. Language-specific templates enforce security best practices (multi-stage builds, minimal base images) while an LLM infers dynamic dependencies. This achieves approximately 137% higher yield compared to baseline approaches.

Ready to Evaluate Your Coding Model?

Request benchmark access, explore evaluation data, or run a scoped SWE-Bench++ evaluation with our research team.