Gemini 3 Flash: 90% on Coding Benchmarks at Flash Speed

Gemini 3 Flash scored 90% across three coding benchmarks, beating Gemini 3 Pro while running 3–4x faster and costing 6x less. Best used for high-volume implementation when paired with a planning model.


TL;DR

  • Gemini 3 Flash: 90% average across three coding tests; completed the suite in 2.5 minutes for $0.17 vs Gemini 3 Pro’s 84.7%, 9 minutes, $1.10; reached Kilo leaderboard top 20 within 24 hours
  • Test suite: Prompt Adherence Test, Code Refactoring Test, System Extension Test; all started from an empty project in Kilo Code
  • Python Rate Limiter: matched Gemini 3 Pro’s correctness while taking half the time and a quarter of the cost; minor signature mismatch (used Optional[int] vs exact int = None)
  • TypeScript API refactor: Flash scored 9 points higher by implementing rate limiting (express-rate-limit) and database transactions (BEGIN/COMMIT/ROLLBACK); both Gemini models hardcoded the JWT secret and omitted authorization checks
  • Notification system extension: Flash scored 7 points higher, produced a longer architectural analysis (71 vs 51 lines) and a mermaid flowchart; identified fragile channel detection, private-property access violations, missing handler-registration validation, and queue bottlenecks; CC/BCC unused and provider integrations left as console.log placeholders
  • Positioning: Flash is ~6x cheaper and 3–4x faster than Pro but trails GPT-5.2/Claude by 7–9 points on security and integrations; suited for high-volume, fast implementation when paired with a stronger planning model
  • Links: full benchmarks: https://blog.kilo.ai/p/gemini-3-flash-outperforms-gemini · model page: https://kilo.ai/models/google-gemini-3-flash-preview · Kilo Code: https://kilocode.ai/

Gemini 3 Flash delivers Pro-grade coding at Flash speed

Google’s newest model, Gemini 3 Flash, has emerged as a compelling option for code generation workflows that prioritize throughput and cost. In Kilo’s benchmark of three coding challenges, Gemini 3 Flash averaged 90%, outscoring Gemini 3 Pro (84.7%) while completing the suite in 2.5 minutes for $0.17 versus Pro’s 9 minutes for $1.10. Within 24 hours of release, the model reached the top 20 on the Kilo leaderboard.

Test methodology

Kilo ran the same three tests previously used in comparisons with GPT-5.1, GPT-5.2/Pro, and Claude Opus 4.5:

  • Prompt Adherence Test: implement a Python TokenBucketLimiter against 10 exact requirements, including time.monotonic() for timing, threading.Lock() for thread safety, and specific method signatures and error messages.
  • Code Refactoring Test: refactor a 365-line TypeScript API handler riddled with SQL injection, hardcoded secrets, and missing features.
  • System Extension Test: analyze a 400-line notification system, then add an EmailHandler matching existing patterns (Ask Mode then Code Mode).

All tests began from an empty project in Kilo Code.
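The article doesn't reproduce the 10 requirements, but the two it names (time.monotonic() and threading.Lock()) are enough to sketch the shape of the task. A minimal token-bucket limiter along those lines — the constructor parameters and method name below are assumptions for illustration, not the benchmark's exact spec:

```python
import threading
import time


class TokenBucketLimiter:
    """Minimal token-bucket rate limiter sketch (not the benchmark's exact spec)."""

    def __init__(self, capacity: int, refill_rate: float) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.refill_rate = refill_rate   # tokens added per second
        self._tokens = float(capacity)
        self._last = time.monotonic()    # monotonic clock: immune to wall-clock jumps
        self._lock = threading.Lock()    # guards shared state across threads

    def _refill(self) -> None:
        now = time.monotonic()
        self._tokens = min(self.capacity,
                           self._tokens + (now - self._last) * self.refill_rate)
        self._last = now

    def try_acquire(self, tokens: int = 1) -> bool:
        """Consume tokens if available; return False instead of blocking."""
        with self._lock:
            self._refill()
            if self._tokens >= tokens:
                self._tokens -= tokens
                return True
            return False
```

With capacity 5 and a refill rate of 1 token/second, the first five immediate calls to try_acquire() succeed and the sixth fails until tokens refill.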

Results by test

Python Rate Limiter

Gemini 3 Flash matched Gemini 3 Pro’s correctness while taking half the time and a quarter of the cost. Both Gemini models produced concise implementations and only lost points for a minor signature mismatch (using Optional[int] instead of the exact int = None signature). GPT-5.2 Pro was the only model to match that exact signature.
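The signature distinction is subtle enough to be worth spelling out. Both forms below behave identically at runtime; only the annotation differs, which is exactly what an exact-match check catches (the acquire_* names are illustrative, not the benchmark's method names):

```python
from typing import Optional


# The form the spec reportedly required: a plain `int` annotation with a None
# default. Runs fine, though static type checkers flag the mismatch.
def acquire_spec(timeout: int = None) -> bool:
    return timeout is None


# The form both Gemini models wrote: the type-checker-correct equivalent.
def acquire_gemini(timeout: Optional[int] = None) -> bool:
    return timeout is None
```

A grader comparing annotations (e.g. via `__annotations__` or `inspect.signature`) distinguishes the two even though every call site behaves the same.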

TypeScript API refactor

Here Gemini 3 Flash outpaced Pro by 9 points. The difference was practical: Gemini 3 Flash implemented rate limiting (using express-rate-limit) and database transactions (BEGIN/COMMIT/ROLLBACK), while Gemini 3 Pro skipped these requirements. Both Gemini models, however, hardcoded the JWT secret as 'hardcoded-secret-key-123' and omitted authorization checks on user-scoped endpoints—security gaps that GPT-5.2 and Claude Opus 4.5 corrected.
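The benchmark's refactor was in TypeScript against its own API handler, but the transaction pattern itself is language-agnostic. A minimal Python/sqlite3 sketch of the same BEGIN/COMMIT/ROLLBACK discipline — the accounts schema and transfer helper are illustrative, not taken from the benchmark code:

```python
import sqlite3


def transfer(conn: sqlite3.Connection, src: int, dst: int, amount: int) -> None:
    """Move `amount` between accounts atomically: both updates commit or neither does.

    Assumes the connection was opened with isolation_level=None so explicit
    BEGIN/COMMIT/ROLLBACK statements control the transaction.
    """
    try:
        conn.execute("BEGIN")
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                     (amount, src))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                     (amount, dst))
        balance = conn.execute("SELECT balance FROM accounts WHERE id = ?",
                               (src,)).fetchone()[0]
        if balance < 0:
            raise ValueError("insufficient funds")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")  # undo the partial debit on any failure
        raise
```

The parameterized queries also illustrate the fix for the SQL-injection issues the original handler contained: values travel as bound parameters, never via string concatenation.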

Notification system extension

Gemini 3 Flash scored 7 points higher than Pro, producing a more detailed architectural analysis (71 lines vs 51) and including a mermaid flowchart. Flash identified fragile channel detection, private-property access violations, missing handler registration validation, and queue bottlenecks, and suggested concrete fixes. Implementation gaps remained: CC/BCC fields were defined but unused, and provider integrations were left as console.log placeholders.
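The article doesn't show the notification system's code, so the sketch below is purely illustrative of one gap Flash flagged: validating a handler at registration time rather than letting a bad handler fail later at send time. All names are hypothetical, and the provider call is stubbed as a print, mirroring the console.log placeholders the run left behind:

```python
class EmailHandler:
    """Hypothetical channel handler; the benchmark's real system is TypeScript."""

    def send(self, recipient: str, message: str) -> bool:
        print(f"[email] to {recipient}: {message}")  # provider integration stubbed
        return True


class NotificationDispatcher:
    def __init__(self) -> None:
        self._handlers: dict[str, object] = {}

    def register(self, channel: str, handler: object) -> None:
        # Fail fast at registration instead of at dispatch time.
        if not callable(getattr(handler, "send", None)):
            raise TypeError(f"handler for {channel!r} must define send()")
        self._handlers[channel] = handler

    def dispatch(self, channel: str, recipient: str, message: str) -> bool:
        if channel not in self._handlers:
            raise KeyError(f"no handler registered for channel {channel!r}")
        return self._handlers[channel].send(recipient, message)
```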

Performance summary and positioning

  • Overall: Gemini 3 Flash achieved a 90% average across the three tests while running roughly 3–4x faster and costing 6x less than Gemini 3 Pro.
  • Frontier gap: Flash trails GPT-5.2 and Claude Opus 4.5 by 7–9 points, primarily due to incomplete security hygiene and stubbed provider integrations.
  • Practical tradeoff: For plan-then-implement workflows, a high-capability model (GPT-5.2 or Claude Opus 4.5) can handle architecture and security requirements, while Gemini 3 Flash can be used in Code Mode for fast, cost-effective implementation.

When Gemini 3 Flash makes sense

Gemini 3 Flash is attractive for high-volume code generation where iteration speed and cost are priorities and where a planning pass can supply missing security and integration specifics. It consistently implemented requirements that Pro missed in these benchmarks, demonstrating that a budget-priced model can still meet many production needs when paired with a stronger planning model.

For the complete write-up and original benchmarks, see the Kilo post: https://blog.kilo.ai/p/gemini-3-flash-outperforms-gemini

Related links from the tests:

  • Model page: https://kilo.ai/models/google-gemini-3-flash-preview
  • Kilo Code: https://kilocode.ai/
