Rebuilding Devin on Sonnet 4.5: Performance, Planning, and Architecture

Rebuilding Devin around Claude Sonnet 4.5

Cognition rebuilt Devin to run on Sonnet 4.5 rather than simply swapping models. The new Devin is reported to be twice as fast and to score 12% higher on the team's Junior Developer Evals. It is available in Agent Preview while the previous Devin remains an option. The engineering effort revealed several behavioral shifts in Sonnet 4.5 that required architectural changes rather than a drop-in replacement.

What changed with Sonnet 4.5

The model shows substantially improved planning performance and greater reliability in multi-hour sessions, gains that compound across agent feedback loops. Alongside these gains, several new behaviors emerged that influence how agents should be structured:

Context awareness and “context anxiety”

Sonnet 4.5 is aware of its context window and adapts behavior as it approaches perceived limits. That includes proactively summarizing progress and making decisive moves to conclude work.
This awareness can degrade outcomes: the model sometimes takes shortcuts or leaves tasks incomplete when it believes the window is nearly exhausted. It also consistently underestimates how many tokens remain, yet states those wrong estimates with great precision.
Mitigations included aggressive reminders placed at both the start and the end of the prompt, plus an unusual but effective trick: enable the 1M token beta while capping usage at 200k. That combination appears to make the model behave as if it has sufficient runway, without the performance loss that comes from true long-context usage.
Architecturally, token-budget planning must now account for the model’s own sense of its context and for the point at which it will naturally want to summarize.
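The two mitigations above can be pictured as harness-side bookkeeping. The following is a minimal sketch under stated assumptions, not Cognition's implementation: the class name, the function names, the 20k reserve, and the reminder wording are all hypothetical. It shows a budget that enforces the 200k cap independently of what the model believes, and a prompt builder that repeats the same reminder at the start and at the end.

```python
from dataclasses import dataclass

# Assumed value taken from the text: the harness caps usage at 200k tokens
# even though the larger 1M window is enabled for the model.
HARD_CAP = 200_000

# Hypothetical reminder wording; the source does not publish its prompts.
REMINDER = (
    "You have ample context remaining. "
    "Do not rush, skip steps, or wrap up early to save context."
)


@dataclass
class ContextBudget:
    """Tracks tokens used in a session against a harness-enforced cap."""

    used: int = 0
    hard_cap: int = HARD_CAP

    def record(self, tokens: int) -> None:
        """Add the token cost of one turn (prompt plus completion)."""
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        self.used += tokens

    @property
    def remaining(self) -> int:
        return max(self.hard_cap - self.used, 0)

    def should_compact(self, reserve: int = 20_000) -> bool:
        """True when the harness, not the model, should trigger compaction."""
        return self.remaining <= reserve


def build_prompt(task: str) -> str:
    """Place the same reminder at both the start and the end of the prompt."""
    return f"{REMINDER}\n\n{task}\n\n{REMINDER}"
```

The design choice in this sketch is that the decision to compact or stop rests on the harness's measured count rather than on the model's own estimate, which the text describes as consistently too low.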

Externalizing state and note-taking

The model frequently writes files (e.g., CHANGELOG.md, SUMMARY.md) to externalize state without prompting, treating the file system as memory. This behavior increases when the model believes the context window is shrinking.
These notes are useful but incomplete: summaries can omit important details and are not a reliable replacement for compacted memory systems. Relying solely on the model’s notes led to performance gaps.
The model sometimes produces more summary tokens than it spent solving the problem, and its effort is uneven depending on context length. Both suggest that careful prompting and validation remain necessary.
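One way to validate model-written notes is to check them against facts the harness recorded independently. The sketch below is illustrative only: the function names, the fact labels, and the example paths are hypothetical, and a simple substring check stands in for whatever richer comparison a real system would use.

```python
def missing_from_notes(notes: str, required_facts: dict[str, str]) -> list[str]:
    """Return labels of harness-recorded facts that the notes never mention.

    required_facts maps a human-readable label to a literal string
    (for example a file path) that should appear somewhere in the notes.
    """
    lowered = notes.lower()
    return [
        label
        for label, needle in required_facts.items()
        if needle.lower() not in lowered
    ]


def summary_ratio(summary_tokens: int, work_tokens: int) -> float:
    """Ratio of tokens spent summarizing to tokens spent on the task itself."""
    if work_tokens <= 0:
        raise ValueError("work_tokens must be positive")
    return summary_tokens / work_tokens


# Hypothetical example: the harness knows two facts, the notes cover one.
notes = "Updated src/auth.py to refresh tokens; unit tests pass."
facts = {
    "edited auth module": "src/auth.py",
    "migration still pending": "migrations/0042",
}
gaps = missing_from_notes(notes, facts)
```

A ratio above 1.0 from `summary_ratio` would flag the case described above, where summarizing cost more tokens than the work itself.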

Proactive testing and feedback loops

Sonnet 4.5 shows improved willingness to write and execute short scripts and tests to validate changes, which helps long-running tasks. At times this leads to creative but convoluted workarounds instead of addressing root causes (for example, crafting a custom script rather than terminating a conflicting process).

Parallelism and tool usage

The model executes multiple tools and reads files in parallel more aggressively, increasing actions per context window. This leads to faster-feeling sessions but burns through context more quickly, reinforcing context-awareness behaviors.
The model appears to modulate parallelism based on its sense of remaining context, being more aggressive earlier and more cautious nearer the perceived limit.
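A harness can mirror this modulation on its own side by bounding how many tool calls run at once according to the remaining budget. This is a sketch under assumptions, not a description of how Devin schedules tools: the function names, the per-call token estimate, and the ceiling of 8 are hypothetical. It uses only the standard `asyncio` library.

```python
import asyncio
from typing import Awaitable, Callable


def max_parallel_calls(
    remaining_tokens: int, est_tokens_per_call: int, ceiling: int = 8
) -> int:
    """Number of concurrent tool calls the remaining budget can absorb.

    The result is clamped to the range [1, ceiling], so the agent is more
    parallel early in a session and more cautious near the limit.
    """
    if est_tokens_per_call <= 0:
        raise ValueError("est_tokens_per_call must be positive")
    return max(1, min(ceiling, remaining_tokens // est_tokens_per_call))


async def run_tools(
    calls: list[Callable[[], Awaitable[str]]], limit: int
) -> list[str]:
    """Run zero-argument async tool calls with at most `limit` in flight."""
    sem = asyncio.Semaphore(limit)

    async def guarded(call: Callable[[], Awaitable[str]]) -> str:
        async with sem:
            return await call()

    # gather returns results in the same order as the input calls.
    return await asyncio.gather(*(guarded(c) for c in calls))


async def fake_read(name: str) -> str:
    """Stand-in for a real file-read tool."""
    await asyncio.sleep(0)
    return f"contents of {name}"
```

For example, `asyncio.run(run_tools([lambda n=n: fake_read(n) for n in ["a.py", "b.py"]], limit=max_parallel_calls(12_000, 5_000)))` would read both files with at most two calls in flight.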

Directions under exploration

The team intends to continue experiments in several areas:

Subagents and context-aware tool calls: Sonnet 4.5’s judgment about when to externalize state and create feedback loops could make subagent delegation more practical, though careful state management is required (see guidance on multi-agent design).
Meta-agent prompting: Early experiments show promise for letting the model reason about verification and its own development workflow rather than only executing tasks.
Context-management models: The model’s emerging intuition about managing context raises the possibility of custom models focused on intelligent context compaction and routing.

These behaviors indicate a shift in how agents can be architected around planning, verification, and state management. The team plans to publish additional learnings as experimentation continues and as both the new Devin and Windsurf are tested further in Agent Preview.

Original source: https://cognition.ai/blog/devin-sonnet-4-5-lessons-and-challenges