Sonoma Alpha Sky & Dusk: 2M Context Windows, Real Coding Tasks, and Early Limits
Two new models with 2M-token context windows, Sonoma Sky Alpha and Sonoma Dusk Alpha, appeared on major gateways in early September 2025. Both arrived with free alpha access and fast inference, so Cline measured them across thousands of real coding edits to see how they perform beyond the headline specs.
The models and the test bed
Sky is positioned as the more capable reasoning model, while Dusk focuses on faster inference. Cline tracked thousands of diff edit operations between August 26 and September 9, 2025, with the Sonoma models first appearing on September 6.
Success rates on those real-world diff edits, by model:
- Claude 4 Sonnet — 96%
- GPT-5 — 92%
- Gemini 2.5 Pro — 90%
- Dusk — 87%
- Sky — 84%
These figures place the Sonoma Alphas behind the established models on accuracy, despite the 2M-token context window and fast inference: Dusk trails them by 3 to 9 percentage points and Sky by 6 to 12.
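To make the gaps concrete, the short Python sketch below recomputes each model's distance from the top performer using only the percentages listed above. The per-1,000 figure is illustrative arithmetic on those rates, not a count reported by Cline, and the function names are hypothetical.

```python
# Success rates copied from the list above (percent of diff edits that succeeded).
success_rates = {
    "Claude 4 Sonnet": 96,
    "GPT-5": 92,
    "Gemini 2.5 Pro": 90,
    "Dusk": 87,
    "Sky": 84,
}


def gaps_to_leader(rates):
    """Return each model's gap, in percentage points, to the best success rate."""
    best = max(rates.values())
    return {model: best - rate for model, rate in rates.items()}


def failures_per_thousand(rate_percent):
    """Illustrative only: expected failed edits per 1,000 attempts at a given rate."""
    return round((100 - rate_percent) * 10)


gaps = gaps_to_leader(success_rates)
# Print models from highest to lowest success rate.
for model, rate in sorted(success_rates.items(), key=lambda kv: -kv[1]):
    print(
        f"{model}: {rate}% success, {gaps[model]} points behind the leader, "
        f"about {failures_per_thousand(rate)} failed edits per 1,000"
    )
```

Read this way, a 12-point gap means roughly four times as many failed edits for Sky as for Claude 4 Sonnet at the same volume, which is why the difference matters more in automated workflows than the raw percentages might suggest.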
Observations from real usage
- The 2M context window is a significant capability for large codebases or multi-page contexts, but raw context size did not translate into top-tier reliability in the tested workflows.
- Both Sonoma models offered notable inference speed, aligning with Dusk’s intended design point.
- Community reports in Discord described mixed experiences, including hallucinations and tool-calling failures, which contributed to lower success rates than those of mature competitors.
Practical implications for teams
- The Sonoma Alpha models present an intriguing experiment in scaling context and responsiveness, but current reliability metrics suggest continued reliance on proven models for critical coding tasks.
- Free alpha access is available via Vercel AI Gateway and OpenRouter, making hands-on evaluation straightforward for non-critical experimentation.
- Given the measured success rates, a reasonable approach for engineering teams is to try the Sonoma Alphas in exploratory or non-production workflows while keeping established models for production automation and higher-stakes editing.
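One way a team might encode that split is a small routing rule, sketched below in Python. Everything here is an assumption for illustration: the `Candidate` type, the `pick_model` function, and the 90% threshold for critical work are hypothetical and not part of Cline, OpenRouter, or Vercel AI Gateway. Only the success rates come from the measurements above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    success_rate: float  # percent, from the measurements above
    free_alpha: bool     # True for the Sonoma alpha models


CANDIDATES = [
    Candidate("Claude 4 Sonnet", 96, False),
    Candidate("GPT-5", 92, False),
    Candidate("Gemini 2.5 Pro", 90, False),
    Candidate("Dusk", 87, True),
    Candidate("Sky", 84, True),
]


def pick_model(critical, candidates=CANDIDATES, min_rate_for_critical=90.0):
    """Hypothetical policy: established models for critical work, alpha models otherwise.

    The 90% threshold is an assumed value, not a recommendation from the source.
    """
    if critical:
        eligible = [
            c for c in candidates
            if not c.free_alpha and c.success_rate >= min_rate_for_critical
        ]
    else:
        eligible = [c for c in candidates if c.free_alpha]
    if not eligible:
        raise ValueError("no candidate model satisfies the routing policy")
    # Within the eligible group, prefer the highest measured success rate.
    return max(eligible, key=lambda c: c.success_rate)


print(pick_model(critical=True).name)
print(pick_model(critical=False).name)
```

Keeping the threshold as a parameter makes it easy to tighten or relax the policy as the alpha models mature and new measurements arrive.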
Results may vary with task complexity and integration patterns, but the early readout emphasizes that large context windows alone are not a substitute for established model reliability.
Original source: https://cline.bot/blog/sonoma-alpha-sky-dusk-models-cline