Cline: Sonoma Sky and Dusk Alpha: 2M Context Windows, Fast but Less Reliable

Sonoma Sky and Dusk Alpha bring 2M-token context windows and fast inference but lag on accuracy in real coding edits (Sky 84%, Dusk 87%). Use them for experimentation; stick with proven models for production.

TL;DR

  • Sonoma Sky Alpha & Sonoma Dusk Alpha: new models with 2M token context windows (appeared Sep 6, 2025)
  • Sky oriented toward reasoning; Dusk prioritized faster inference
  • Cline testbed: thousands of diff edit ops tracked from Aug 26–Sep 9, 2025
  • Measured success rates on real coding edits: Claude 4 Sonnet 96% · GPT-5 92% · Gemini 2.5 Pro 90% · Dusk 87% · Sky 84%
  • Large context helps with big codebases and multi-file contexts but did not translate into top-tier reliability in these workflows
  • Both models delivered notable inference speed, consistent with Dusk’s design goal
  • Community reports documented hallucinations and tool-calling failures that lowered success rates
  • Practical takeaway: appropriate for exploratory or non-production workflows; retain established models for production/high-stakes tasks
  • Free alpha access available via Vercel AI Gateway and OpenRouter
  • Original source: https://cline.bot/blog/sonoma-alpha-sky-dusk-models-cline

Sonoma Alpha Sky & Dusk: 2M Context Windows, Real Coding Tasks, and Early Limits

Two new models with 2M token context windows — Sonoma Sky Alpha and Sonoma Dusk Alpha — appeared on major gateways in early September 2025. Both showed up with free alpha access and rapid inference, prompting testing across thousands of real coding edits in Cline to evaluate practical performance beyond headline specs.

The models and the test bed

Sky is positioned as the more capable reasoning model, while Dusk focuses on faster inference. Cline tracked thousands of diff edit operations from August 26 – September 9, 2025, with the Sonoma models first appearing on September 6.

Performance measured as success rate on those real-world edits:

  • Claude 4 Sonnet: 96%
  • GPT-5: 92%
  • Gemini 2.5 Pro: 90%
  • Dusk: 87%
  • Sky: 84%

These figures place the Sonoma Alphas behind established models on accuracy, despite the notable context window and speed.
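As a concrete illustration of what a "success rate" metric like this involves, here is a minimal sketch that aggregates per-model outcomes from a log of edit attempts. The field names and log shape are hypothetical, not Cline's actual telemetry schema:

```python
# Hypothetical sketch: computing per-model success rates from logged
# diff-edit attempts. Each log entry is a (model, succeeded) pair;
# the schema is illustrative, not Cline's real one.
from collections import defaultdict

def success_rates(edit_log):
    """edit_log: iterable of (model, succeeded) pairs -> {model: rate}."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for model, succeeded in edit_log:
        totals[model] += 1
        wins[model] += bool(succeeded)
    return {m: wins[m] / totals[m] for m in totals}

log = [("sky", True), ("sky", False), ("dusk", True), ("dusk", True)]
print(success_rates(log))  # {'sky': 0.5, 'dusk': 1.0}
```

Over thousands of such operations, a few percentage points of difference (e.g. 84% vs. 96%) translates into hundreds of failed edits that someone has to catch and redo.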

Observations from real usage

  • The 2M context window is a significant capability for big codebases or multi-file contexts, but raw context size did not translate into top-tier reliability in the tested workflows.
  • Both Sonoma models offered notable inference speed, aligning with Dusk’s intended design point.
  • Community reports on Discord documented mixed experiences, including instances of hallucinations and tool-calling failures, which contributed to lower success rates relative to mature competitors.

Practical implications for teams

  • The Sonoma Alpha models present an intriguing experiment in scaling context and responsiveness, but current reliability metrics suggest continued reliance on proven models for critical coding tasks.
  • Free alpha access is available via Vercel AI Gateway and OpenRouter, making hands-on evaluation straightforward for non-critical experimentation.
  • Given measured success rates, a reasonable approach for engineering teams is to explore Sonoma Alphas for exploratory or non-production workflows while maintaining established models for production automation and higher-stakes editing.
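For teams who want to try the alphas hands-on, OpenRouter exposes an OpenAI-compatible chat completions endpoint. The sketch below builds such a request; the model slug `openrouter/sonoma-sky-alpha` is an assumption based on OpenRouter's usual naming, so verify it on the model page before use:

```python
# Hedged sketch: constructing a chat-completions request to a Sonoma alpha
# via OpenRouter's OpenAI-compatible API. The model slug is an assumption;
# check OpenRouter's model listing for the current identifier.
import json
import os
import urllib.request

API_URL = "https://openrouter.ai/api/v1/chat/completions"
MODEL = "openrouter/sonoma-sky-alpha"  # assumed slug; Dusk would be similar

def build_request(prompt: str) -> urllib.request.Request:
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ.get('OPENROUTER_API_KEY', '')}",
            "Content-Type": "application/json",
        },
    )

req = build_request("Refactor this function to remove the nested loop.")
# urllib.request.urlopen(req) would actually send it; omitted here so the
# sketch stays offline and key-free.
```

Because the endpoint follows the OpenAI wire format, existing OpenAI-compatible clients can also be pointed at it by swapping the base URL and model name.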

Results may vary with task complexity and integration patterns, but the early readout emphasizes that large context windows alone are not a substitute for established model reliability.

Original source: https://cline.bot/blog/sonoma-alpha-sky-dusk-models-cline
