Benchmarking GPT-5.1, Gemini 3.0, and Claude Opus 4.5 for Developers

Kilo Code benchmarks GPT-5.1, Gemini 3.0, and Claude Opus 4.5 on three dev tasks: rate limiter, TypeScript API refactor, and notification extension. Opus is most complete; GPT adds defensive checks; Gemini stays minimal.

TL;DR

  • Benchmark scope: three developer tasks (Python rate limiter, TypeScript API refactor, notification-system extension) evaluated for prompt adherence, code quality, completeness, and system understanding.
  • Gemini 3.0: literal, minimal implementations; highest prompt adherence; faster and lower-cost outputs but often skips deeper fixes and documentation.
  • GPT-5.1: defensive and verbose; adds validations, JSDoc, and backward-compatibility fixes; produces ~1.5–1.8x more code than Gemini and may change method contracts.
  • Claude Opus 4.5: most complete and production-oriented; implemented all refactor requirements, broad notification templates, and runtime features; overall fastest across tasks (7 minutes) but costlier.
  • Performance and cost: Opus fastest and most thorough; example cost comparison cited as $1.68 (Opus) vs $1.10 (Gemini) for similar tasks; Opus code volume sits between GPT and Gemini.
  • Workflow guidance: prefer Gemini for strict, low-cost minimal implementations; GPT for defensive, compatibility-focused changes; Opus for thorough first-pass production readiness; include explicit instructions to constrain or expand scope as needed.

Benchmarking three leading coding models: GPT-5.1 vs Gemini 3.0 vs Opus 4.5

A recent Kilo Code Blog benchmark compared OpenAI’s GPT-5.1, Google’s Gemini 3.0, and Anthropic’s Claude Opus 4.5 across three representative developer tasks: (1) a Python rate limiter with rigid requirements, (2) a large-scale TypeScript API refactor, and (3) understanding and extending a notification system. The evaluation focused on prompt adherence, code quality, completeness, and system understanding, producing consistent patterns in each model’s behaviour.

Testing methodology

Three tasks were chosen to surface different strengths and trade-offs:

  • Prompt Adherence Test: a Python TokenBucket limiter with ten strict rules (names, error messages, use of time.monotonic and threading.Lock).
  • Code Refactoring Test: a 365-line TypeScript API handler with security holes and legacy issues, requiring layered refactor, Zod validation, and secure practices.
  • System Extension Test: a 400-line notification system (Webhook + SMS) where the model had to explain the architecture and then add an EmailHandler that matched the project’s style.

The tests used Kilo Code’s Code Mode for implementation and Ask Mode for architectural analysis in the third task.

Test 1 — Python rate limiter

  • Gemini 3.0: followed the spec literally, producing minimal, clean code and scoring highest for strict adherence.
  • GPT-5.1: added defensive validations (constructor argument checks, checks that requested token counts are positive) that changed method behaviour beyond the request.
  • Claude Opus 4.5: balanced strictness with polish — clean implementation plus more detailed docstrings, but a minor naming inconsistency cost a point.

Key takeaway: Gemini excels at exact, minimal implementations; GPT-5.1 prefers defensive additions; Opus aims for clarity and documentation.
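The primitives the spec names (time.monotonic and threading.Lock) can be illustrated with a minimal sketch. The benchmark’s exact ten rules are not reproduced in this summary, so the class name, method names, and parameters below are assumptions, not the graded spec:

```python
import threading
import time


class TokenBucket:
    """Minimal token-bucket rate limiter (illustrative; names are assumed)."""

    def __init__(self, capacity: float, refill_rate: float) -> None:
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens added per second
        self._tokens = capacity
        self._last = time.monotonic()   # monotonic clock, immune to wall-clock jumps
        self._lock = threading.Lock()   # guards _tokens/_last across threads

    def acquire(self, tokens: float = 1.0) -> bool:
        """Try to take `tokens` from the bucket; return False if unavailable."""
        with self._lock:
            now = time.monotonic()
            # Refill based on elapsed time, capped at capacity.
            self._tokens = min(
                self.capacity,
                self._tokens + (now - self._last) * self.refill_rate,
            )
            self._last = now
            if tokens <= self._tokens:
                self._tokens -= tokens
                return True
            return False
```

With this shape, a Gemini-style answer would stop here; a GPT-style answer would additionally validate constructor arguments and reject non-positive token requests, which is exactly the contract drift the benchmark penalized.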

Test 2 — TypeScript API refactor

Opus 4.5 implemented all ten requirements and was the only model to include rate limiting and to use environment variables for secrets. GPT-5.1 identified and fixed important security issues — adding authorization checks and database transactions — and preserved backward compatibility for legacy field names. Gemini 3.0 produced cleaner code quickly but missed some deeper architectural fixes (e.g., incomplete transaction implementation and legacy compatibility).

Notable differences:

  • Opus 4.5: most complete refactor, environment-variable usage, rate limit headers and custom RateLimitError.
  • GPT-5.1: defensive hardening, transactional safety, backward compatibility support.
  • Gemini 3.0: faster, minimal changes, missed several deeper requirements.

Test 3 — Notification system extension

All three models could add email support, but approaches varied:

  • Opus 4.5: fastest and most thorough — produced templates for all seven event types, added runtime template management, and supported display names.
  • GPT-5.1: produced an extensive architectural audit (including diagrams and line-level evidence) and implemented full-featured email support with CC/BCC and attachments.
  • Gemini 3.0: implemented a basic EmailHandler covering core fields but omitted richer features and defensive logic.

Opus prioritized breadth of implementation; GPT emphasized deep analysis and feature-rich matching; Gemini prioritized the minimal working extension.
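The handler-extension pattern this task exercises can be sketched as follows. The benchmark project itself is not reproduced here, so the NotificationHandler interface, the Notification fields, and the EmailHandler constructor are all assumed for illustration:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Notification:
    # Assumed payload shape; the real project defines its own fields.
    event_type: str
    recipient: str
    subject: str
    body: str


class NotificationHandler(ABC):
    """Base interface the existing Webhook/SMS handlers would implement (assumed shape)."""

    @abstractmethod
    def send(self, notification: Notification) -> bool: ...


class EmailHandler(NotificationHandler):
    """New channel added in the same style as the existing handlers."""

    def __init__(self, smtp_host: str, sender: str) -> None:
        self.smtp_host = smtp_host
        self.sender = sender

    def send(self, notification: Notification) -> bool:
        # A real implementation would open an SMTP connection;
        # here we only format the message to show the integration point.
        message = (
            f"From: {self.sender}\n"
            f"To: {notification.recipient}\n"
            f"Subject: {notification.subject}\n\n"
            f"{notification.body}"
        )
        return len(message) > 0
```

The models diverged on how much to build on top of this skeleton: Gemini stopped at the core fields, GPT layered on CC/BCC and attachments, and Opus added per-event templates and runtime template management.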

Performance, cost, and style

  • Speed: Opus 4.5 was the fastest overall (7 minutes total across tasks) while producing the most thorough output.
  • Code volume: GPT-5.1 produced 1.5–1.8x more code than Gemini due to JSDoc, validation, and explicit types. Opus sat between the two.
  • Cost: Opus 4.5 was more expensive; an example comparison cited $1.68 (Opus) vs $1.10 (Gemini) for similar tasks.
  • Stylistic tendencies:
    • GPT-5.1: verbose, defensive, well-documented, and likely to add unrequested safeguards.
    • Gemini 3.0: minimal, efficient, and literal; tends to skip documentation and extra safety.
    • Claude Opus 4.5: organized, complete, and production-oriented (strict types, custom error classes, section headers).

Prompt adherence versus helpfulness

Strict specs favour Gemini 3.0, which tends to produce exactly what the prompt requests. For complex, completeness-oriented tasks, Opus 4.5 often delivers the most thorough first-pass implementation. GPT-5.1 sits between those poles: strong at defensive engineering and backward compatibility, but liable to introduce contract changes or extra validation when minimal output is required.

Practical guidance for workflows

  • Expect extra features and robust organization from Opus 4.5; confirm that added complexity aligns with project needs.
  • Review GPT-5.1 output for over-engineering and potential contract shifts caused by added validations.
  • Treat Gemini 3.0 as a precise, low-cost engine for minimal implementations; manually add safeguards and documentation when needed.
  • For each model, supply explicit instructions if a minimal or maximal implementation is required (for example, “do not add extra validation” or “include JSDoc and edge-case handling”).

Verdict

All three models handle complex coding tasks effectively but prioritize different trade-offs:

  • Claude Opus 4.5: completeness and production readiness.
  • GPT-5.1: defensive, well-documented code with backward-compatibility sensitivity.
  • Gemini 3.0: precise, minimal, and cost-efficient implementations.

Selection depends on whether completeness, defensiveness, or strict fidelity to the prompt matters most.

Original analysis and full results: https://blog.kilo.ai/p/benchmarking-gpt-51-vs-gemini-30-vs-opus-45
