This repository contains the predictions, execution logs, trajectories, and results for model inference evaluation runs on the SWE-bench task.
The repository is organized as follows:
experiment_data/
├── evaluation/
│   ├── lite/
│   └── test/
│       ├── <date>_<model>
│       │   ├── all_preds.jsonl
│       │   ├── metadata.yaml
│       │   ├── README.md
│       │   ├── logs/*.log (Execution Logs)
│       │   └── trajs/*.traj (Reasoning Traces)
│       └── ...
└── validation/
    ├── dev
    └── test
More about how the repository is organized
The `evaluation/` folder is organized such that the top-level directories are different splits of SWE-bench (lite, test).
Data for models that were run on a given split are included as subfolders of that split.
Each subfolder contains the predictions, results, execution logs, and trajectories (if applicable) for the model run on that split.
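To make the layout concrete, here is a minimal sketch that walks `evaluation/` and groups submission folders by split. The function name `list_submissions` is hypothetical and not part of this repository; it only assumes the directory structure described above.

```python
from pathlib import Path


def list_submissions(root: Path) -> dict[str, list[str]]:
    """Map each split under evaluation/ to the sorted names of its submission folders."""
    evaluation_dir = root / "evaluation"
    if not evaluation_dir.is_dir():
        # Nothing to report if the script is not pointed at the repository root.
        return {}
    submissions: dict[str, list[str]] = {}
    for split_dir in sorted(p for p in evaluation_dir.iterdir() if p.is_dir()):
        # Every directory directly under a split is treated as one model run.
        submissions[split_dir.name] = sorted(
            p.name for p in split_dir.iterdir() if p.is_dir()
        )
    return submissions
```

Calling `list_submissions(Path("."))` from the repository root would return a dictionary keyed by split name, such as `lite` and `test`.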
The `validation/` folder contains the validation logs for the dev and test splits of SWE-bench.
Each of these top-level folders consists of repo-level subfolders (e.g. `pallets/flask` is a test split repository, so there is a `flask/` folder under `validation/test/`).
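As a small illustration of that naming rule, the sketch below derives the validation folder for a repository. The helper `validation_log_dir` is hypothetical, and it assumes (based only on the `pallets/flask` example) that the folder is named after the part of the repository name that follows the owner.

```python
from pathlib import Path


def validation_log_dir(repo: str, split: str) -> Path:
    """Return the assumed validation folder, e.g. pallets/flask -> validation/test/flask."""
    # Assumption: only the repository name (without the owner) is used as the folder name.
    repo_name = repo.split("/")[-1]
    return Path("validation") / split / repo_name
```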
The `validation/test_202404` folder is a re-run of validation performed in April 2024 to ensure reproducibility of task instances' behavior since SWE-bench was created in September 2023 (you can read more about the re-run here).
These logs are publicly accessible and meant to enable greater reproducibility and transparency of the experiments conducted on the SWE-bench task.
If you are interested in submitting your model to the SWE-bench Leaderboard, please do the following:
- Fork this repository.
- Clone the repository. Due to this repository's large diff history, consider using `git clone --depth 1` if cloning takes too long.
- Under the split that you evaluate on (`evaluation/lite/` or `evaluation/test`), create a new folder with the submission date and the model name (e.g. `20240415_sweagent_gpt4`).
- Within the folder (`evaluation/<split>/<date>_<model>`), please include the following required assets:
  - `all_preds.jsonl`: Model predictions
  - `logs/`: SWE-bench evaluation artifacts dump
    - The evaluation artifacts consist of 300 (Lite) or 2294 (Test) folders. Each folder (e.g. `astropy__astropy-1234`) contains:
      - `eval.sh`: The evaluation script
      - `patch.diff`: The model's generated prediction
      - `report.json`: Summary of evaluation outcomes for this instance
      - `run_instance.log`: A log of SWE-bench evaluation steps
      - `test_output.txt`: The output of running `eval.sh` on `patch.diff`
    - NOTE: You shouldn't have to create any of these files. They should be generated automatically by SWE-bench evaluation.
  - `metadata.yaml`: Metadata for how the result is shown on the website. Please include the following fields:
    - `name`: The name of your leaderboard entry
    - `oss`: `true` if your system is open-source
    - `site`: URL/link to more information about your system
    - `verified`: `false` (See below for results verification)
  - `trajs/`: Reasoning trace reflecting how your system solved the problem
    - Submit one reasoning trace per task instance. The reasoning trace should show all of the steps your system took while solving the task. If your system outputs thoughts or comments during operation, they should be included as well.
    - The reasoning trace can be represented with any text-based file format (e.g. `md`, `json`, `yaml`).
    - Ensure the task instance ID is in the name of the corresponding reasoning trace file.
    - For an example, see SWE-agent GPT 4 Turbo Trajectories.
  - `README.md`: Include anything you'd like to share about your model here!
- Run `python -m analysis.get_results evaluation/<split>/<date>_<model>`.
- Create a pull request to this repository with the new folder.
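Before opening the pull request, it may help to sanity-check the folder against the asset list above. The following is a minimal sketch under stated assumptions: `check_submission` is a hypothetical helper, not part of this repository or of SWE-bench, and it does not replace `analysis.get_results`. It reads `metadata.yaml` with a naive line-based scan rather than a real YAML parser, so it only detects whether the expected top-level keys appear.

```python
from pathlib import Path

# Required assets and per-instance files, taken from the list above.
REQUIRED_ASSETS = ["all_preds.jsonl", "logs", "metadata.yaml", "trajs", "README.md"]
INSTANCE_FILES = ["eval.sh", "patch.diff", "report.json", "run_instance.log", "test_output.txt"]
METADATA_FIELDS = ["name", "oss", "site", "verified"]


def check_submission(folder: Path) -> list[str]:
    """Return a list of human-readable problems; an empty list means nothing was flagged."""
    problems = [f"missing {name}" for name in REQUIRED_ASSETS if not (folder / name).exists()]

    # Each instance folder under logs/ should hold the evaluation artifacts.
    logs_dir = folder / "logs"
    if logs_dir.is_dir():
        for instance_dir in sorted(p for p in logs_dir.iterdir() if p.is_dir()):
            for name in INSTANCE_FILES:
                if not (instance_dir / name).is_file():
                    problems.append(f"missing logs/{instance_dir.name}/{name}")

    # Naive scan for top-level keys, to avoid depending on a YAML parser.
    metadata = folder / "metadata.yaml"
    if metadata.is_file():
        lines = metadata.read_text(encoding="utf-8").splitlines()
        keys = {line.split(":", 1)[0].strip() for line in lines if ":" in line}
        problems += [f"metadata.yaml lacks '{field}'" for field in METADATA_FIELDS if field not in keys]
    return problems
```

For example, `check_submission(Path("evaluation/lite/20240415_sweagent_gpt4"))` returns a list of messages describing whatever is missing from that folder.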
You can refer to this tutorial for a quick overview of how to evaluate your model on SWE-bench.
If you are interested in receiving the "verified" checkmark ✅ on your submission, please do the following:
- Create an issue
- In the issue, provide us instructions on how to run your model on SWE-bench.
- We will run your model on a random subset of SWE-bench and verify the results.
(7/29/2024) We have updated the SWE-bench leaderboard submission criteria to require the inclusion of reasoning traces. The goal of this requirement is to provide the community with more insight into how cutting-edge methods work without requiring a code release (although the latter is still highly encouraged!).
What is a reasoning trace?
A reasoning trace is a text-based file that describes the steps your system took to solve a task instance. It should provide a detailed account of the reasoning process that your system used to arrive at its solution.
We purposely do not define a strict format for reasoning traces.
We do have some guidelines. The reasoning trace should be...
- Human-readable.
- Reflective of the intermediate steps your system took that led to the final solution.
- Generated during the inference process, not post hoc.
We do not require reasoning traces to...
- Be in a specific file format (e.g. `json`, `yaml`, `md`).
- Conform to a specific problem-solving style (e.g. agentic, procedural, etc.).
A simple solution to this? When running inference, simply log the intermediate output generated by your system. For an example, see SWE-agent GPT 4 Turbo Trajectories.
In short, our requirements for what a reasoning trace should look like are deliberately non-specific. We trust you to provide a detailed account of how your system solved the task instance.
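One way to log intermediate output while inference runs is sketched below. This is only an illustration under assumptions: the `TraceLogger` class, the Markdown layout, and the step labels are hypothetical choices, not a required format.

```python
from pathlib import Path


class TraceLogger:
    """Appends each intermediate step to <trajs_dir>/<instance_id>.md as it happens."""

    def __init__(self, trajs_dir: Path, instance_id: str) -> None:
        trajs_dir.mkdir(parents=True, exist_ok=True)
        # The task instance ID is part of the file name, as the guidelines ask.
        self.path = trajs_dir / f"{instance_id}.md"
        self.step = 0
        self.path.write_text(f"# Reasoning trace for {instance_id}\n", encoding="utf-8")

    def log(self, kind: str, content: str) -> None:
        """Record one step, e.g. kind='thought', 'action', or 'observation'."""
        self.step += 1
        # Appending immediately keeps the trace tied to inference, not written post hoc.
        with self.path.open("a", encoding="utf-8") as handle:
            handle.write(f"\n## Step {self.step}: {kind}\n\n{content}\n")
```

A system would create one `TraceLogger(Path("trajs"), instance_id)` per task instance and call `log(...)` whenever it produces a thought, runs a command, or receives an observation.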
Why are we requiring it?
We believe that reasoning traces can provide valuable insights into how cutting-edge methods work without requiring a code release.
As of this post (7/29/2024), we have received many submissions that have pushed the state of the art on SWE-bench, which is exciting to see!
However, we have also found that the top-performing submissions to SWE-bench typically have neither open-sourced their code nor been verified. We recognize that some leaderboard participants (1) would like to add an entry to SWE-bench but (2) do not want to release their code or proprietary system, which is completely understandable. On the other hand, given that open-source systems submitted to SWE-bench have propelled the development of closed-source participants, we would like to continue promoting development on SWE-bench as a community-level collaborative process.
Therefore, we believe that providing reasoning traces serves as a valuable compromise between these two groups.
What should I submit?
- Create a `trajs/` folder in your submission directory.
- Within this folder, upload a reasoning trace per task instance that your system generated a prediction for.
- Make sure the naming convention of the reasoning trace file reflects the SWE-bench task instance it corresponds to (e.g. `astropy__astropy-1234.md`).
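The checklist above can be verified with a short script. The sketch below is a hypothetical helper, not part of this repository; it assumes each line of `all_preds.jsonl` is a JSON object with an `instance_id` key, and that a trace counts as matching when its file name contains that ID.

```python
import json
import re
from pathlib import Path


def missing_traces(submission: Path) -> list[str]:
    """Return instance IDs from all_preds.jsonl with no matching file in trajs/."""
    trace_names = [p.name for p in (submission / "trajs").iterdir() if p.is_file()]
    missing = []
    with (submission / "all_preds.jsonl").open(encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # tolerate blank lines
            # Assumption: each prediction record carries an "instance_id" key.
            instance_id = json.loads(line)["instance_id"]
            # The lookahead stops "...-123" from matching a trace named "...-1234.md".
            pattern = re.compile(re.escape(instance_id) + r"(?!\d)")
            if not any(pattern.search(name) for name in trace_names):
                missing.append(instance_id)
    return missing
```

An empty result from `missing_traces(Path("evaluation/<split>/<date>_<model>"))` suggests that every prediction has a corresponding trace file.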
We will review the reasoning traces you submit. We plan to only accept submissions with reasoning traces for the SWE-bench leaderboard.
Questions? Please create an issue. Otherwise, you can also contact {carlosej, jy1682}@princeton.edu.