Coding Benchmark: GPT-5.2 vs GPT-5.2 Pro, Opus 4.5, Gemini 3.0

Kilo benchmarked GPT-5.2, GPT-5.2 Pro, Claude Opus 4.5, and Gemini 3.0 on three coding tasks. GPT-5.2 Pro was the most accurate but slow and costly; Opus 4.5 was the fastest; GPT-5.2 balanced cost and correctness.

TL;DR

  • Test suite: three developer tasks — strict Python prompt adherence, TypeScript API refactor (security, Zod, layered architecture, rate limiting), and notification-system extension with handler integration.
  • GPT-5.2 Pro: perfect scores across tests; total ~82 minutes and $23.99; exact signature/naming compliance, stronger edge-case handling, enforced config safety (JWT_SECRET), and substantive architectural fixes in the notification system.
  • Claude Opus 4.5: fastest strong performer — 98.7% average in ~7 minutes for $1.68; often delivered more complete first-pass implementations, added extra rate-limit headers and cleanup, and supplied templates for all 7 notification events.
  • GPT-5.2: cleaner and faster than GPT-5.1, implemented rate limiting, avoided unsolicited defensive validation, produced a ~1,010-line refactor vs 1,307 previously, and was roughly 17% cheaper per run than Opus 4.5 while giving up some completeness.
  • Gemini 3.0: best on the simple, well-specified prompt-compliance task, but missed features in the more complex refactor and extension tasks.
  • Practical takeaways: prioritize Opus for speed and near-complete first passes, GPT-5.2 for cleaner general-purpose code without extra defensive logic, and GPT-5.2 Pro when extended reasoning, correctness, or architecture/security audits matter.

Kilo’s coding benchmark: GPT-5.2, GPT-5.2 Pro, Claude Opus 4.5, and Gemini 3.0 face three real-world tasks

Kilo compared the latest OpenAI models against Claude Opus 4.5 and Gemini 3.0 across three practical developer tasks: implementing a Python rate limiter, refactoring a TypeScript API handler, and extending a notification system. The tests reuse the same scenarios from a prior comparison to highlight how model behavior has changed with GPT-5.2 and GPT-5.2 Pro. Per the full methodology, every run began from an empty project inside Kilo Code and used the platform’s Code and Ask modes where appropriate.

How the tests were run

The suite used three tasks:

  1. Prompt Adherence Test — Implement a Python TokenBucketLimiter with 10 strict requirements (exact class name, prescribed method signatures, error message format, plus use of time.monotonic() and threading.Lock()).
  2. Code Refactoring Test — Clean and secure a 365-line TypeScript API handler: remove SQL injection risks, introduce Repository/Service/Controller layers, add Zod validation, remove hardcoded secrets, and meet 10 explicit requirements (including rate limiting).
  3. System Extension Test — Read a 400-line notification system, explain its architecture, and add an EmailHandler that matches existing patterns.

Both GPT-5.2 models ran with the highest available reasoning effort settings.

Test 1 — Python rate limiter

GPT-5.2 Pro was the only model to get a perfect score, though it took 7 minutes and cost $1.99 for that test. The Pro model matched the exact constructor signature (initial_tokens: int = None rather than Optional[int]), named its internal variable _current_tokens to satisfy get_stats() requirements, and used the requested primitives.

A notable improvement for GPT-5.2 over GPT-5.1 was avoiding unsolicited defensive validation. GPT-5.1 had added a ValueError for non-positive tokens; GPT-5.2 followed the spec more closely while still handling edge cases gracefully. GPT-5.2 Pro’s extra runtime produced three concrete wins: exact signature matching, internal naming consistency, and comprehensive handling of edge cases (for example returning math.inf when tokens > capacity or refill_rate <= 0). The trade-off is clear: Pro’s extended reasoning yields more exact output at significantly higher latency and cost.
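
To make that edge-case handling concrete, here is a minimal TypeScript analogue of the wait-time logic described above. The graded task was Python (TokenBucketLimiter with the exact signatures listed earlier), so the class and member names below are purely illustrative:

```typescript
// Illustrative token-bucket sketch; not the graded Python TokenBucketLimiter.
class TokenBucket {
  private currentTokens: number;
  private lastRefill: number;

  constructor(
    private capacity: number,
    private refillRate: number, // tokens replenished per second
    initialTokens?: number,
  ) {
    this.currentTokens = initialTokens ?? capacity;
    this.lastRefill = performance.now() / 1000; // monotonic clock, akin to time.monotonic()
  }

  // Seconds until `tokens` could be acquired; Infinity when the request can never
  // be satisfied, mirroring the math.inf edge case described above.
  timeUntilAvailable(tokens: number): number {
    if (tokens > this.capacity || this.refillRate <= 0) return Infinity;
    this.refill();
    const deficit = tokens - this.currentTokens;
    return deficit <= 0 ? 0 : deficit / this.refillRate;
  }

  private refill(): void {
    const now = performance.now() / 1000;
    this.currentTokens = Math.min(
      this.capacity,
      this.currentTokens + (now - this.lastRefill) * this.refillRate,
    );
    this.lastRefill = now;
  }
}
```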

Test 2 — TypeScript API refactor

Both GPT-5.2 and GPT-5.2 Pro implemented rate limiting, which GPT-5.1 had missed. GPT-5.2 included reasonable Retry-After calculations and a factory pattern for configuration; Claude Opus 4.5 added additional rate-limit headers (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset) and periodic cleanup of expired entries. GPT-5.2 Pro also enforced configuration safety: it required JWT_SECRET as an environment variable and failed fast if missing, while other models either hardcoded secrets or relied on insecure defaults.
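
As a rough illustration of the two behaviors singled out above, the sketch below pairs a fail-fast JWT_SECRET check with a fixed-window limiter that emits the Retry-After and X-RateLimit-* headers. It is a simplified stand-in rather than any model's actual output, and details such as the window size and limit are assumptions:

```typescript
// Hypothetical sketch of the behaviors described above, not the benchmark code itself.

// Fail fast when JWT_SECRET is missing instead of falling back to an insecure default.
export function loadConfig(): { jwtSecret: string } {
  const jwtSecret = process.env.JWT_SECRET;
  if (!jwtSecret) {
    throw new Error("JWT_SECRET environment variable is required");
  }
  return { jwtSecret };
}

interface RateLimitResult {
  allowed: boolean;
  headers: Record<string, string>;
}

// Minimal fixed-window limiter reporting the headers mentioned above.
export class RateLimiter {
  private hits = new Map<string, { count: number; resetAt: number }>();

  constructor(private limit = 100, private windowMs = 60_000) {}

  check(clientId: string, now = Date.now()): RateLimitResult {
    let entry = this.hits.get(clientId);
    if (!entry || now >= entry.resetAt) {
      entry = { count: 0, resetAt: now + this.windowMs };
      this.hits.set(clientId, entry);
    }
    entry.count += 1;

    const headers: Record<string, string> = {
      "X-RateLimit-Limit": String(this.limit),
      "X-RateLimit-Remaining": String(Math.max(0, this.limit - entry.count)),
      "X-RateLimit-Reset": String(Math.ceil(entry.resetAt / 1000)),
    };
    if (entry.count > this.limit) {
      headers["Retry-After"] = String(Math.ceil((entry.resetAt - now) / 1000));
      return { allowed: false, headers };
    }
    return { allowed: true, headers };
  }

  // Periodic cleanup of expired entries, the extra touch credited to Opus 4.5.
  cleanup(now = Date.now()): void {
    for (const [clientId, entry] of this.hits) {
      if (now >= entry.resetAt) this.hits.delete(clientId);
    }
  }
}
```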

GPT-5.2 made the codebase more concise than GPT-5.1 (about 1,010 lines vs 1,307 lines) while meeting more requirements. It matched Opus 4.5’s feature set in roughly 27% fewer lines.

Test 3 — Notification system extension

This task measures comprehension and stylistic integration. GPT-5.2 Pro achieved a perfect score, spending 59 minutes reasoning through architecture and producing multiple fixes that no other model attempted. Key architectural improvements included adding a getChannel() method on the base handler class (replacing fragile type-casting in registerHandler()), adding a getEventEmitter() accessor instead of directly indexing a private property, and performing validation during registration rather than deferring to send time.
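
The pattern behind those fixes looks roughly like the sketch below, assuming a handler base class and a central registry. getChannel(), getEventEmitter(), and registerHandler() come from the report; the remaining names and validation rules are illustrative:

```typescript
// Illustrative registration-time pattern; only the accessor names are taken from the report.
import { EventEmitter } from "node:events";

abstract class NotificationHandler {
  // Each handler declares its own channel, so the registry no longer needs type-casting.
  abstract getChannel(): string;
  abstract send(event: string, payload: unknown): Promise<void>;
}

class NotificationSystem {
  private emitter = new EventEmitter();
  private handlers = new Map<string, NotificationHandler>();

  // Accessor instead of letting callers index a private property directly.
  getEventEmitter(): EventEmitter {
    return this.emitter;
  }

  registerHandler(handler: NotificationHandler): void {
    const channel = handler.getChannel();
    // Validate at registration time rather than deferring failures to send time.
    if (!channel) {
      throw new Error("Handler must declare a non-empty channel");
    }
    if (this.handlers.has(channel)) {
      throw new Error(`A handler for channel "${channel}" is already registered`);
    }
    this.handlers.set(channel, handler);
    this.emitter.on(channel, (event: string, payload: unknown) => {
      void handler.send(event, payload);
    });
  }
}
```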

GPT-5.2 produced a full-featured EmailHandler with dynamic imports for nodemailer and @aws-sdk/client-ses, plus a normalizeEmailBodies() helper to generate text from HTML or vice versa. Template coverage varied: Claude Opus 4.5 provided templates for all 7 notification events; GPT-5.2 covered 4 events and GPT-5.2 Pro covered 3.
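
The report does not include the generated handler code, but the combination it describes, lazy transport loading plus body normalization, might look something like the hedged sketch below. The SMTP wiring, the environment variable, and all conversion details are assumptions; the real handler reportedly also loaded @aws-sdk/client-ses the same way:

```typescript
// Rough sketch of the pattern described above; internals are assumed, not reproduced.
interface EmailBodies {
  text: string;
  html: string;
}

// Derive whichever body is missing so every template yields both text and HTML.
function normalizeEmailBodies(text?: string, html?: string): EmailBodies {
  if (html && !text) {
    // Crude HTML-to-text fallback: strip tags and collapse whitespace.
    text = html.replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim();
  }
  if (text && !html) {
    html = `<p>${text.replace(/\n\n+/g, "</p><p>").replace(/\n/g, "<br>")}</p>`;
  }
  return { text: text ?? "", html: html ?? "" };
}

// Load nodemailer lazily (dynamic import, per the article) so the dependency is
// only pulled in when email notifications are actually sent.
async function sendViaSmtp(to: string, subject: string, bodies: EmailBodies): Promise<void> {
  const { createTransport } = await import("nodemailer");
  const transporter = createTransport({ host: process.env.SMTP_HOST }); // assumed config
  await transporter.sendMail({ to, subject, text: bodies.text, html: bodies.html });
}
```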

Performance and cost trade-offs

  • GPT-5.2 Pro: perfect scores across tests but at 82 minutes total and $23.99. Pro shines when extended reasoning and correctness matter.
  • Claude Opus 4.5: the fastest strong performer — 98.7% average in 7 minutes for $1.68 — and often delivers a more complete implementation on the first pass.
  • GPT-5.2: faster and cleaner than GPT-5.1, avoids unnecessary defensive code, and implements previously missed features like rate limiting. Against GPT-5.1 it represented a 17% cost increase ($0.20 total) that bought meaningful improvements. Compared to Opus 4.5, GPT-5.2 was about 17% cheaper per run, though Opus often produced more complete outputs.
  • Gemini 3.0: excelled on the simplest, well-specified task (winning Test 1 with compact, literal compliance), but missed features on the more complex refactor and extension tasks.

Practical guidance

The choice of model depends on priorities: speed and near-complete first-pass implementations favor Claude Opus 4.5; general-purpose coding and cleaner outputs without extraneous defensive logic favor GPT-5.2; critical architecture or security audits justify GPT-5.2 Pro’s higher latency and cost because extended reasoning produced substantive architectural fixes in these tests.

For the original report and full details, see the Kilo blog post: https://blog.kilo.ai/p/we-tested-gpt-52pro-vs-opus-45-vs

Continue the conversation on Slack

Did this article spark your interest? Join our community of experts and enthusiasts to dive deeper, ask questions, and share your ideas.
