Gemini 3 Flash: 90% on Coding Benchmarks at Flash Speed

Gemini 3 Flash scored 90% across three coding benchmarks, beating Gemini 3 Pro while running 3–4x faster and costing 6x less. Best used for high-volume implementation when paired with a planning model.


TL;DR

  • Gemini 3 Flash: 90% average across three coding tests; completed the suite in 2.5 minutes for $0.17 vs Gemini 3 Pro’s 84.7%, 9 minutes, $1.10; reached Kilo leaderboard top 20 within 24 hours
  • Test suite: Prompt Adherence Test, Code Refactoring Test, System Extension Test; all started from an empty project in Kilo Code
  • Python Rate Limiter: matched Gemini 3 Pro’s correctness while taking half the time and a quarter of the cost; minor signature mismatch (used Optional[int] vs exact int = None)
  • TypeScript API refactor: Flash scored 9 points higher by implementing rate limiting (express-rate-limit) and database transactions (BEGIN/COMMIT/ROLLBACK); both Gemini models hardcoded the JWT secret and omitted authorization checks
  • Notification system extension: Flash scored 7 points higher, produced a longer architectural analysis (71 vs 51 lines) and a mermaid flowchart; identified fragile channel detection, private-property access violations, missing handler-registration validation, and queue bottlenecks; CC/BCC unused and provider integrations left as console.log placeholders
  • Positioning: Flash is ~6x cheaper and 3–4x faster than Pro but trails GPT-5.2/Claude by 7–9 points on security and integrations; suited for high-volume, fast implementation when paired with a stronger planning model
  • Links: full benchmarks: https://blog.kilo.ai/p/gemini-3-flash-outperforms-gemini · model page: https://kilo.ai/models/google-gemini-3-flash-preview · Kilo Code: https://kilocode.ai/

Gemini 3 Flash delivers Pro-grade coding at Flash speed

Google’s newest model, Gemini 3 Flash, has emerged as a compelling option for code generation workflows that prioritize throughput and cost. In Kilo’s benchmark of three coding challenges, Gemini 3 Flash averaged 90%, outscoring Gemini 3 Pro (84.7%) while completing the suite in 2.5 minutes for $0.17 versus Pro’s 9 minutes for $1.10. Within 24 hours of release, the model reached the top 20 on the Kilo leaderboard.

Test methodology

Kilo ran the same three tests previously used in comparisons with GPT-5.1, GPT-5.2/Pro, and Claude Opus 4.5:

  • Prompt Adherence Test: implement a Python TokenBucketLimiter against 10 exact requirements, including time.monotonic() for timing, threading.Lock() for thread safety, and specific method signatures and error messages.
  • Code Refactoring Test: refactor a 365-line TypeScript API handler riddled with SQL injection, hardcoded secrets, and missing features.
  • System Extension Test: analyze a 400-line notification system, then add an EmailHandler matching existing patterns (Ask Mode then Code Mode).

All tests began from an empty project in Kilo Code.
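The article doesn't reproduce the 10 requirements, but the two it names (time.monotonic() and threading.Lock()) are enough to sketch the shape of the task. A minimal token-bucket limiter along those lines — the constructor parameters and method name below are assumptions for illustration, not the benchmark's exact spec:

```python
import threading
import time


class TokenBucketLimiter:
    """Minimal token-bucket rate limiter sketch (not the benchmark's exact spec)."""

    def __init__(self, capacity: int, refill_rate: float) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.refill_rate = refill_rate   # tokens added per second
        self._tokens = float(capacity)
        self._last = time.monotonic()    # monotonic clock: immune to wall-clock jumps
        self._lock = threading.Lock()    # guards shared state across threads

    def _refill(self) -> None:
        now = time.monotonic()
        self._tokens = min(self.capacity,
                           self._tokens + (now - self._last) * self.refill_rate)
        self._last = now

    def try_acquire(self, tokens: int = 1) -> bool:
        """Consume tokens if available; return False instead of blocking."""
        with self._lock:
            self._refill()
            if self._tokens >= tokens:
                self._tokens -= tokens
                return True
            return False
```

With capacity 5 and a refill rate of 1 token/second, the first five immediate calls to try_acquire() succeed and the sixth fails until tokens refill.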

Results by test

Python Rate Limiter

Gemini 3 Flash matched Gemini 3 Pro’s correctness while taking half the time and a quarter of the cost. Both Gemini models produced concise implementations and only lost points for a minor signature mismatch (using Optional[int] instead of the exact int = None signature). GPT-5.2 Pro was the only model to match that exact signature.
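The signature distinction is subtle enough to be worth spelling out. Both forms below behave identically at runtime; only the annotation differs, which is exactly what an exact-match check catches (the acquire_* names are illustrative, not the benchmark's method names):

```python
from typing import Optional


# The form the spec reportedly required: a plain `int` annotation with a None
# default. Runs fine, though static type checkers flag the mismatch.
def acquire_spec(timeout: int = None) -> bool:
    return timeout is None


# The form both Gemini models wrote: the type-checker-correct equivalent.
def acquire_gemini(timeout: Optional[int] = None) -> bool:
    return timeout is None
```

A grader comparing annotations (e.g. via `__annotations__` or `inspect.signature`) distinguishes the two even though every call site behaves the same.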

TypeScript API refactor

Here Gemini 3 Flash outpaced Pro by 9 points. The difference was practical: Gemini 3 Flash implemented rate limiting (using express-rate-limit) and database transactions (BEGIN/COMMIT/ROLLBACK), while Gemini 3 Pro skipped these requirements. Both Gemini models, however, hardcoded the JWT secret as 'hardcoded-secret-key-123' and omitted authorization checks on user-scoped endpoints—security gaps that GPT-5.2 and Claude Opus 4.5 corrected.
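The benchmark's refactor was in TypeScript against its own API handler, but the transaction pattern itself is language-agnostic. A minimal Python/sqlite3 sketch of the same BEGIN/COMMIT/ROLLBACK discipline — the accounts schema and transfer helper are illustrative, not taken from the benchmark code:

```python
import sqlite3


def transfer(conn: sqlite3.Connection, src: int, dst: int, amount: int) -> None:
    """Move `amount` between accounts atomically: both updates commit or neither does.

    Assumes the connection was opened with isolation_level=None so explicit
    BEGIN/COMMIT/ROLLBACK statements control the transaction.
    """
    try:
        conn.execute("BEGIN")
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                     (amount, src))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                     (amount, dst))
        balance = conn.execute("SELECT balance FROM accounts WHERE id = ?",
                               (src,)).fetchone()[0]
        if balance < 0:
            raise ValueError("insufficient funds")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")  # undo the partial debit on any failure
        raise
```

The parameterized queries also illustrate the fix for the SQL-injection issues the original handler contained: values travel as bound parameters, never via string concatenation.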

Notification system extension

Gemini 3 Flash scored 7 points higher than Pro, producing a more detailed architectural analysis (71 lines vs 51) and including a mermaid flowchart. Flash identified fragile channel detection, private-property access violations, missing handler registration validation, and queue bottlenecks, and suggested concrete fixes. Implementation gaps remained: CC/BCC fields were defined but unused, and provider integrations were left as console.log placeholders.
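The article doesn't show the notification system's code, so the sketch below is purely illustrative of one gap Flash flagged: validating a handler at registration time rather than letting a bad handler fail later at send time. All names are hypothetical, and the provider call is stubbed as a print, mirroring the console.log placeholders the run left behind:

```python
class EmailHandler:
    """Hypothetical channel handler; the benchmark's real system is TypeScript."""

    def send(self, recipient: str, message: str) -> bool:
        print(f"[email] to {recipient}: {message}")  # provider integration stubbed
        return True


class NotificationDispatcher:
    def __init__(self) -> None:
        self._handlers: dict[str, object] = {}

    def register(self, channel: str, handler: object) -> None:
        # Fail fast at registration instead of at dispatch time.
        if not callable(getattr(handler, "send", None)):
            raise TypeError(f"handler for {channel!r} must define send()")
        self._handlers[channel] = handler

    def dispatch(self, channel: str, recipient: str, message: str) -> bool:
        if channel not in self._handlers:
            raise KeyError(f"no handler registered for channel {channel!r}")
        return self._handlers[channel].send(recipient, message)
```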

Performance summary and positioning

  • Overall: Gemini 3 Flash achieved a 90% average across the three tests while running roughly 3–4x faster and costing 6x less than Gemini 3 Pro.
  • Frontier gap: Flash trails GPT-5.2 and Claude Opus 4.5 by 7–9 points, primarily due to incomplete security hygiene and stubbed provider integrations.
  • Practical tradeoff: For plan-then-implement workflows, a high-capability model (GPT-5.2 or Claude Opus 4.5) can handle architecture and security requirements, while Gemini 3 Flash can be used in Code Mode for fast, cost-effective implementation.

When Gemini 3 Flash makes sense

Gemini 3 Flash is attractive for high-volume code generation where iteration speed and cost are priorities and where a planning pass can supply missing security and integration specifics. It consistently implemented requirements that Pro missed in these benchmarks, demonstrating that a budget-priced model can still meet many production needs when paired with a stronger planning model.

For the complete write-up and original benchmarks, see the Kilo post: https://blog.kilo.ai/p/gemini-3-flash-outperforms-gemini

Related links from the tests:

  • Model page: https://kilo.ai/models/google-gemini-3-flash-preview
  • Kilo Code: https://kilocode.ai/
