Devin, Claimed to Replace Junior Engineers

Four days ago, Cognition Labs released Devin, billing it as the first AI software engineer.

Devin sets a new state of the art on the SWE-bench coding benchmark, has passed practical engineering interviews at leading AI companies, and has even completed real jobs on Upwork. Devin is an autonomous agent that solves engineering tasks using its own shell, code editor, and web browser.

Devin's capabilities

Thanks to Cognition's progress in long-term reasoning and planning, Devin can plan and execute complex engineering tasks that require thousands of decisions. Devin can recall relevant context at every step, learn over time, and correct its own mistakes.

Cognition also equipped Devin with the common development tools a human needs to get work done, including a shell, code editor, and browser, all inside a sandboxed compute environment.

Finally, Devin can collaborate actively with users: it reports progress in real time, accepts feedback, and works through design choices together with the user when necessary.

Some examples of what it can do

  • Devin can learn how to use unfamiliar technologies. After reading a blog post, Devin ran ControlNet on Modal to generate images with hidden messages for Sara.
  • Devin can build and deploy applications end-to-end. Devin created an interactive website simulating Conway's Game of Life! It incrementally added features requested by users and then deployed the application to Netlify (the core update rule of such a simulation is sketched after this list).
  • Devin can independently find and fix bugs in code repositories. Devin helped Andrew maintain and debug his open-source competitive programming book.
  • Devin can train and fine-tune its own AI models. Given only a link to a research repository on GitHub, Devin set up fine-tuning for a large language model.
  • Devin can handle bugs and feature requests in open-source repositories. With just a link to a GitHub issue, Devin completed all necessary setup and context collection.
  • Devin can contribute to mature production repositories. This example is part of the SWE-bench benchmark. Devin solved an error in logarithmic calculations in the sympy Python algebra system. Devin set up the code environment, reproduced the bug, and independently wrote and tested a fix.
  • Devin can even take on real-world jobs on Upwork and perform well. Here, Devin wrote and debugged code to run computer vision models.
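
To make the Game of Life example concrete, the core of such a simulation is a single update rule applied to every cell. The sketch below is a generic illustration (with wrap-around edges), not Devin's code:

    def life_step(grid):
        """One generation of Conway's Game of Life on a 2-D list of 0/1 cells.
        A generic illustration of the update rule; not Devin's implementation."""
        rows, cols = len(grid), len(grid[0])

        def neighbors(r, c):
            # Count live neighbors, wrapping around the edges of the grid.
            return sum(grid[(r + dr) % rows][(c + dc) % cols]
                       for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                       if (dr, dc) != (0, 0))

        # A cell is alive next step if it has 3 neighbors, or 2 and is alive now.
        return [[1 if (n := neighbors(r, c)) == 3 or (n == 2 and grid[r][c]) else 0
                 for c in range(cols)]
                for r in range(rows)]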

Devin's performance

Cognition evaluated Devin on SWE-bench, a challenging benchmark that requires agents to resolve real-world GitHub issues from open-source projects such as Django and scikit-learn.

Devin correctly solved 13.86% of the problems end-to-end, far surpassing the previous best of 1.96%. Even when given the exact files to edit, the best previous model could only solve 4.80% of the problems.

Devin was evaluated on a random 25% subset of the dataset. Devin operated without assistance, while all other models were assisted (meaning they were explicitly told which files needed editing).

SWE-bench technical report

Devin was evaluated with SWE-bench, an automated benchmark for software engineering systems built from GitHub issues and pull requests. Unlike benchmarks such as HumanEval, which are limited to standalone functions, SWE-bench uses unit tests to deterministically evaluate a system's ability to solve real-world problems in full codebases.

SWE-bench is a dataset of 2,294 issues and pull requests scraped from popular open-source Python repositories on GitHub. Its goal is to test a system's ability to write real-world code. Each SWE-bench instance consists of a GitHub issue and the pull request that resolves it. The pull request must include a unit test that fails before the code changes and passes afterward (a "fail-to-pass" test). The diff is split into two parts, patch and test_patch, containing the code changes and the test changes, respectively. The system is asked to generate a diff given the GitHub issue description and the repository's state at the time of the issue. If all unit tests pass after the generated diff and the test_patch are applied, the example counts as solved.
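
In pseudocode terms, scoring a single instance amounts to checking out the base commit, applying both diffs, and running the tests. The sketch below is illustrative only (the argument names and test command are assumptions; the real SWE-bench harness is more involved):

    import subprocess

    def evaluate_instance(repo_dir, base_commit, model_diff, test_patch, test_cmd):
        """Apply the system's generated diff plus the held-out test_patch,
        then run the unit tests. Illustrative sketch, not the real harness."""
        def run(*cmd, stdin_text=None):
            return subprocess.run(cmd, cwd=repo_dir, input=stdin_text,
                                  text=True, capture_output=True)

        run("git", "checkout", base_commit)              # repository state at issue time
        run("git", "apply", "-", stdin_text=model_diff)  # system's proposed code changes
        run("git", "apply", "-", stdin_text=test_patch)  # ground-truth test changes
        result = run(*test_cmd)                          # e.g. ("pytest", "path/to/tests")
        return result.returncode == 0                    # solved only if every test passes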

In SWE-bench, LLMs are either given a set of correct files to edit ("assisted") or a separate system retrieves the files to edit based on the similarity of the issue text ("unassisted"). As an agent, Devin did not receive any file lists but navigated the files itself, making it more similar to an "unassisted" LLM. Solving SWE-bench examples correctly is challenging. More difficult PRs require changing dozens of files, maintaining backward compatibility, and/or performing extensive complex reasoning. Even with assistance, the best LLM achieved only a 4.80% success rate.
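
The retrieval-based setting described in the SWE-bench paper uses BM25 over the repository; the toy sketch below conveys the idea with plain token overlap instead (the file filtering and scoring are simplifications):

    import os, re
    from collections import Counter

    def retrieve_files(issue_text, repo_dir, k=5):
        """Rank repository files by lexical overlap with the issue text.
        A toy stand-in for the BM25 retriever used by SWE-bench."""
        issue_tokens = Counter(re.findall(r"\w+", issue_text.lower()))
        scores = {}
        for root, _, files in os.walk(repo_dir):
            for name in files:
                if not name.endswith(".py"):
                    continue
                path = os.path.join(root, name)
                with open(path, errors="ignore") as f:
                    file_tokens = Counter(re.findall(r"\w+", f.read().lower()))
                # Score = number of token occurrences shared with the issue.
                scores[path] = sum((issue_tokens & file_tokens).values())
        return sorted(scores, key=scores.get, reverse=True)[:k]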

Analysis

Multi-step planning

Devin performs multi-step planning and acts on feedback from the environment. 72% of the passing runs took more than 10 minutes to complete, and this ability to iterate is what let Devin succeed.
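
A minimal sketch of such a plan-act-observe loop is shown below; the model and tool interfaces are hypothetical, since Devin's actual architecture has not been published:

    def solve_task(task, model, tools, max_steps=50):
        """Iterative agent loop: choose an action, execute it, feed the
        observation back, and repeat. `model` and `tools` are hypothetical
        interfaces, not Devin's real components."""
        history = [f"Task: {task}"]
        for _ in range(max_steps):
            # The model picks the next action: shell command, file edit, browse, or finish.
            action = model.next_action(history)
            if action.kind == "finish":
                return action.summary
            # Execute the action and append the observation (test output, errors,
            # page contents) so later steps can correct earlier mistakes.
            observation = tools[action.kind].run(action.args)
            history.append(f"{action.kind}({action.args}) -> {observation}")
        return "stopped after max_steps without finishing"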

Qualitative examples

Example 1: ✅ scikit-learn__scikit-learn-10870

Devin was initially misled by the issue description and literally added self.lower_bound_ = max_lower_bound, then returned self. This was incorrect because max_lower_bound had not been defined at that point.
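
For context on why the bare assignment fails: a value like max_lower_bound only exists after the fit has looped over its random initializations and kept the best lower bound. The sketch below shows that general pattern (run_em_once is a hypothetical helper; this is not scikit-learn's actual code):

    def fit_best_of_n(run_em_once, X, n_init):
        """Track the best evidence lower bound across n_init EM initializations.
        run_em_once is a hypothetical helper returning (params, lower_bound)."""
        max_lower_bound = float("-inf")
        best_params = None
        for _ in range(n_init):
            params, lower_bound = run_em_once(X)
            if lower_bound > max_lower_bound:
                max_lower_bound, best_params = lower_bound, params
        # Only at this point is max_lower_bound well-defined, so an assignment
        # like self.lower_bound_ = max_lower_bound belongs after this loop.
        return best_params, max_lower_bound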

Following the test code provided in the problem description, Devin then updated the test file.

After running the tests and hitting an error, however, Devin corrected the file.

After making this correction, Devin reran the tests, confirmed that they passed, and exited successfully.

This example has several interesting points:

  • Devin strictly followed the original instructions in the issue, even though they were inaccurate. This suggests over-alignment with user preferences.

  • Devin was able to run tests in its environment and correct errors. The ability to iterate is crucial for software developers, and agents should be able to do the same.

Example 2: ✅ django__django-10973

Devin identified the correct file, django/db/backends/postgresql/client.py, and completed the full edit.

Here, Devin successfully modified a large amount of code. Many successful edits in SWE-bench consist of single-line diffs, but Devin was able to handle multiple lines simultaneously.
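
The upstream change in client.py replaces a temporary .pgpass file with a call to subprocess.run that passes the password through the PGPASSWORD environment variable. The sketch below shows that pattern in simplified form (it is not Devin's actual diff, and the real method builds its argument list from the connection settings):

    import os
    import signal
    import subprocess

    def run_psql(args, password=None):
        """Launch psql with the password supplied via PGPASSWORD rather than
        a temporary .pgpass file. Simplified illustration of the pattern."""
        env = os.environ.copy()
        if password:
            env["PGPASSWORD"] = password
        sigint_handler = signal.getsignal(signal.SIGINT)
        try:
            # Let psql handle Ctrl-C itself while the interactive shell runs.
            signal.signal(signal.SIGINT, signal.SIG_IGN)
            subprocess.run(args, check=True, env=env)
        finally:
            signal.signal(signal.SIGINT, sigint_handler)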

Example 3: ❌ sympy__sympy-17313

This was a difficult task: it involved modifying a computer algebra system so that comparison operators behave correctly on floor and ceiling objects whose arguments are known to be positive or negative. This required complex logical reasoning and multiple derivation steps.
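
Concretely, the fix has to encode inequalities of the following kind for all four comparison operators on both the floor and ceiling classes; the assertions below illustrate them with numeric floor/ceiling rather than sympy's symbolic objects:

    import math

    # For any x > 0: floor(x) >= 0 and ceiling(x) >= 1.
    # For any x < 0: floor(x) <= -1 and ceiling(x) <= 0.
    for x in (0.2, 1.0, 3.7):
        assert math.floor(x) >= 0 and math.ceil(x) >= 1
    for x in (-0.2, -1.0, -3.7):
        assert math.floor(x) <= -1 and math.ceil(x) <= 0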

Devin edited the wrong class, modifying the frac class instead of the floor and ceiling classes. Moreover, Devin edited only one comparison operator, __gt__, when __lt__, __le__, and __ge__ also needed modification. The edit was far from correct.

The correct diff can be found here: https://github.com/sympy/sympy/pull/17313/files. It is quite complex, involving extensive edge-case handling and unit tests, and it requires a deep understanding of the sympy codebase. (Note: to pass an SWE-bench instance, every test must pass.)

Example 4: ❌ scikit-learn__scikit-learn-10774

This task involved adding an extra return option to all of the dataset loaders in the repository. Devin successfully made this kind of edit for several datasets; a sketch of the pattern is shown below.
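
The kind of edit required follows scikit-learn's return_X_y convention: each loader gains a flag that makes it return the raw arrays instead of its usual container. The sketch below uses a hypothetical loader and helper, not Devin's actual diff:

    def fetch_example_dataset(return_X_y=False):
        """Hypothetical loader illustrating the return_X_y pattern the task
        required adding to each dataset module."""
        data, target = _load_arrays()  # hypothetical helper that reads the files
        if return_X_y:
            # New behaviour requested by the issue: return the arrays directly.
            return data, target
        # Original behaviour: return a container with named fields.
        return {"data": data, "target": target, "DESCR": "example dataset"}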

Devin made similar edits for the california_housing.py, covtype.py, kddcup99.py, and mldata.py loaders (which were actually excluded from the original PR). Unfortunately, Devin missed two loaders, lfw.py and rcv1.py, so the final tests failed. The Cognition team plans to improve Devin's ability to edit multiple files.

Test-driven experiments

Cognition conducted an additional experiment in which Devin was given both the problem statement and the final unit tests. In this "test-driven development" setting, the success rate rose to 23% on a sample of 100 instances. (Note: any changes Devin made to the tests themselves were discarded before evaluation.)
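
One straightforward way to discard an agent's test edits before scoring is to filter out diff hunks that touch test files; the sketch below shows that idea (the helper and its inputs are assumptions, since Cognition's exact procedure is not described):

    def strip_test_changes(model_diff, test_paths):
        """Drop the portions of a git diff that modify test files, so only
        production-code edits are evaluated. Simplified illustration."""
        kept, keep_current = [], True
        for line in model_diff.splitlines(keepends=True):
            if line.startswith("diff --git"):
                # Header format: diff --git a/<path> b/<path>
                path = line.split()[-1].removeprefix("b/")
                keep_current = path not in test_paths
            if keep_current:
                kept.append(line)
        return "".join(kept)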

Because the agent had access to the ground-truth test patch, this result is not directly comparable to the other SWE-bench results. Still, test-driven development is a common pattern in software engineering, so this setting is a natural extension of SWE-bench: giving an agent a target test to pass is a natural way for human engineers and agents to collaborate, and more test-driven agents can be expected in the future.

Examples solved in the test-driven setting

✅ django__django-13321: Devin solved this by adding a print statement before the relevant function, running the unit test, and then editing the file based on the printed output. Having the test cases available made debugging much easier for Devin.

✅ django__django-16983: The new unit test asserts an exact error message: "The value of 'filter_horizontal[0]' cannot include [...]". Without seeing the test patch, producing this exact wording is essentially impossible, which highlights a limitation of the benchmark and implies that a perfect score is unattainable without access to the test patches.