Kilo’s coding benchmark: GPT-5.2, GPT-5.2 Pro, Claude Opus 4.5, and Gemini 3.0 face three real-world tasks
Kilo compared the latest OpenAI models against Claude Opus 4.5 and Gemini 3.0 across three practical developer tasks: implementing a Python rate limiter, refactoring a TypeScript API handler, and extending a notification system. The tests reuse the same scenarios from a prior comparison to highlight how model behavior has changed with GPT-5.2 and GPT-5.2 Pro. Every run began from an empty project inside Kilo Code and used the platform’s Code and Ask modes where appropriate; the full methodology and source artifacts are in Kilo’s original post.
How the tests were run
The suite used three tasks:
- Prompt Adherence Test — Implement a Python TokenBucketLimiter with 10 strict requirements (exact class name, prescribed method signatures, error message format, plus use of time.monotonic() and threading.Lock()).
- Code Refactoring Test — Clean and secure a 365-line TypeScript API handler: remove SQL injection risks, introduce Repository/Service/Controller layers, add Zod validation, remove hardcoded secrets, and meet 10 explicit requirements (including rate limiting).
- System Extension Test — Read a 400-line notification system, explain its architecture, and add an EmailHandler that matches existing patterns.
Both GPT-5.2 models ran with the highest available reasoning effort settings.
Test 1 — Python rate limiter
GPT-5.2 Pro was the only model to get a perfect score, though it took 7 minutes and cost $1.99 for that test. The Pro model matched the exact constructor signature (initial_tokens: int = None rather than Optional[int]), named its internal variable _current_tokens to satisfy get_stats() requirements, and used the requested primitives.
A notable improvement for GPT-5.2 over GPT-5.1 was avoiding unsolicited defensive validation. GPT-5.1 had added a ValueError for non-positive tokens; GPT-5.2 followed the spec more closely while still handling edge cases gracefully. GPT-5.2 Pro’s extra runtime produced three concrete wins: exact signature matching, internal naming consistency, and comprehensive handling of edge cases (for example returning math.inf when tokens > capacity or refill_rate <= 0). The trade-off is clear: Pro’s extended reasoning yields more exact output at significantly higher latency and cost.
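For a sense of what the spec demands, here is a minimal Python sketch of a class along these lines. Only initial_tokens: int = None, the _current_tokens name, time.monotonic(), threading.Lock(), get_stats(), and the math.inf edge case come from the report; the remaining parameter and method names and their exact semantics are assumptions made for illustration.

```python
import math
import threading
import time


class TokenBucketLimiter:
    """Minimal token-bucket sketch; details beyond the report are assumptions."""

    # Constructor default per the report: a plain `int = None`, not Optional[int].
    def __init__(self, capacity: int, refill_rate: float, initial_tokens: int = None):
        self._capacity = capacity
        self._refill_rate = refill_rate
        # Internal name chosen to line up with get_stats(), as noted in the report.
        self._current_tokens = capacity if initial_tokens is None else initial_tokens
        self._last_refill = time.monotonic()   # monotonic clock, per the spec
        self._lock = threading.Lock()          # thread safety, per the spec

    def _refill(self) -> None:
        now = time.monotonic()
        elapsed = now - self._last_refill
        self._current_tokens = min(
            self._capacity, self._current_tokens + elapsed * self._refill_rate
        )
        self._last_refill = now

    def try_acquire(self, tokens: int = 1) -> bool:
        """Take tokens if available; name and semantics are illustrative only."""
        with self._lock:
            self._refill()
            if self._current_tokens >= tokens:
                self._current_tokens -= tokens
                return True
            return False

    def time_until_available(self, tokens: int = 1) -> float:
        """Edge case from the report: an impossible request reports math.inf."""
        with self._lock:
            self._refill()
            if tokens > self._capacity or self._refill_rate <= 0:
                return math.inf
            missing = max(0.0, tokens - self._current_tokens)
            return missing / self._refill_rate

    def get_stats(self) -> dict:
        with self._lock:
            self._refill()
            return {
                "capacity": self._capacity,
                "current_tokens": self._current_tokens,
                "refill_rate": self._refill_rate,
            }
```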
Test 2 — TypeScript API refactor
Both GPT-5.2 and GPT-5.2 Pro implemented rate limiting, which GPT-5.1 had missed. GPT-5.2 included reasonable Retry-After calculations and a factory pattern for configuration; Claude Opus 4.5 added additional rate-limit headers (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset) and periodic cleanup of expired entries. GPT-5.2 Pro also enforced configuration safety: it required JWT_SECRET as an environment variable and failed fast if missing, while other models either hardcoded secrets or relied on insecure defaults.
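Two of the behaviors called out above are easy to picture in isolation: failing fast on a missing JWT_SECRET and computing Retry-After and X-RateLimit-* headers. The sketch below shows both patterns in Python for brevity; the actual task and model outputs were TypeScript, and the fixed-window limiter here is an assumption, since the report does not say which rate-limiting algorithm the models used.

```python
import os
import time


def load_jwt_secret() -> str:
    """Fail fast if the secret is not configured, rather than falling back to a default."""
    secret = os.environ.get("JWT_SECRET")
    if not secret:
        raise RuntimeError("JWT_SECRET environment variable is required")
    return secret


class FixedWindowRateLimiter:
    """Tiny fixed-window limiter, only to show the Retry-After / X-RateLimit-* headers."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self._windows: dict[str, tuple[float, int]] = {}  # client -> (window_start, count)

    def check(self, client_id: str) -> tuple[bool, dict]:
        now = time.time()
        start, count = self._windows.get(client_id, (now, 0))
        if now - start >= self.window:   # window expired: start a new one
            start, count = now, 0
        allowed = count < self.limit
        if allowed:
            count += 1
        self._windows[client_id] = (start, count)
        reset_at = start + self.window
        headers = {
            "X-RateLimit-Limit": str(self.limit),
            "X-RateLimit-Remaining": str(max(0, self.limit - count)),
            "X-RateLimit-Reset": str(int(reset_at)),
        }
        if not allowed:
            headers["Retry-After"] = str(max(1, int(reset_at - now)))
        return allowed, headers
```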
GPT-5.2 made the codebase more concise than GPT-5.1 (about 1,010 lines vs 1,307 lines) while meeting more requirements. It matched Opus 4.5’s feature set in roughly 27% fewer lines.
Test 3 — Notification system extension
This task measures comprehension and stylistic integration. GPT-5.2 Pro achieved a perfect score, spending 59 minutes reasoning through architecture and producing multiple fixes that no other model attempted. Key architectural improvements included adding a getChannel() method on the base handler class (replacing fragile type-casting in registerHandler()), adding a getEventEmitter() accessor instead of directly indexing a private property, and performing validation during registration rather than deferring to send time.
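A rough sketch of that accessor-plus-early-validation shape, written in Python purely for illustration (the project under test is TypeScript, and every name except the getChannel()/registerHandler() counterparts is an assumption):

```python
from abc import ABC, abstractmethod


class NotificationHandler(ABC):
    """Base handler exposing its channel through an accessor instead of requiring casts."""

    def __init__(self, channel: str):
        self._channel = channel

    def get_channel(self) -> str:
        # Counterpart of the getChannel() accessor described in the report.
        return self._channel

    @abstractmethod
    def send(self, event: str, payload: dict) -> None: ...


class NotificationService:
    def __init__(self, known_channels: set[str]):
        self._known_channels = known_channels
        self._handlers: dict[str, NotificationHandler] = {}

    def register_handler(self, handler: NotificationHandler) -> None:
        # Validate at registration time, not when the first notification is sent,
        # and read the channel via the accessor rather than type-casting the handler.
        channel = handler.get_channel()
        if channel not in self._known_channels:
            raise ValueError(f"unknown notification channel: {channel}")
        if channel in self._handlers:
            raise ValueError(f"handler already registered for: {channel}")
        self._handlers[channel] = handler
```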
GPT-5.2 produced a full-featured EmailHandler with dynamic imports for nodemailer and @aws-sdk/client-ses, plus a normalizeEmailBodies() helper to generate text from HTML or vice versa. Template coverage varied: Claude Opus 4.5 provided templates for all 7 notification events; GPT-5.2 covered 4 events and GPT-5.2 Pro covered 3.
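The body-normalization idea is simple: whichever of the HTML or plain-text body is missing gets derived from the other. A naive Python analogue follows; the report's helper is normalizeEmailBodies() inside the TypeScript EmailHandler, and the tag-stripping below is an assumption, not the model's actual logic.

```python
import re


def normalize_email_bodies(html: str | None, text: str | None) -> tuple[str, str]:
    """Given at least one of html/text, derive the other so both bodies are always sent."""
    if html is None and text is None:
        raise ValueError("at least one of html or text is required")
    if text is None:
        # Naive fallback: strip tags to produce a plain-text body.
        text = re.sub(r"<[^>]+>", "", html).strip()
    if html is None:
        # Wrap plain text in a minimal HTML shell.
        html = f"<p>{text}</p>"
    return html, text
```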
Performance and cost trade-offs
- GPT-5.2 Pro: perfect scores across tests but at 82 minutes total and $23.99. Pro shines when extended reasoning and correctness matter.
- Claude Opus 4.5: the fastest strong performer — 98.7% average in 7 minutes for $1.68 — and often delivers a more complete implementation on the first pass.
- GPT-5.2: faster and cleaner than GPT-5.1, avoids unnecessary defensive code, and implements previously missed features like rate limiting. Against GPT-5.1 it represented a 17% cost increase (about $0.20 more in total) that bought meaningful improvements. Compared to Opus 4.5, GPT-5.2 was about 17% cheaper per run, though Opus often produced more complete outputs.
- Gemini 3.0: excelled on the simplest, well-specified task (winning Test 1 with compact, literal compliance), but missed features on the more complex refactor and extension tasks.
Practical guidance
The choice of model depends on priorities: speed and near-complete first-pass implementations favor Claude Opus 4.5; general-purpose coding and cleaner outputs without extraneous defensive logic favor GPT-5.2; critical architecture or security audits justify GPT-5.2 Pro’s higher latency and cost because extended reasoning produced substantive architectural fixes in these tests.
For the original report and full details, see the Kilo blog post: https://blog.kilo.ai/p/we-tested-gpt-52pro-vs-opus-45-vs
