GPT‑5.1‑Codex‑Max Delivers Compaction, Multicontext Operation, and Long‑Horizon Coding Workflows

GPT‑5.1‑Codex‑Max adds compaction and token‑efficient reasoning to Codex, enabling coherent multi‑hour project coding across CLI, IDE, and cloud. It runs sandboxed by default with monitoring and safety controls.

TL;DR

  • GPT‑5.1‑Codex‑Max: coding‑focused model for extended, agentic workflows; available in Codex for ChatGPT Plus, Pro, Business, Edu, and Enterprise; API access planned. See Codex models: https://developers.openai.com/codex/models
  • Native multicontext operation via compaction: session history pruned while preserving salient context, enabling coherent work across multiple context windows; compaction runs automatically as Codex sessions fill.
  • Token‑efficiency and reasoning: medium effort recommended; about 30% fewer thinking tokens than GPT‑5.1‑Codex at equal effort; xhigh option for longer‑latency tasks. Benchmarks: SWE‑bench Verified 73.7% → 77.9% (high→xhigh), SWE‑Lancer IC SWE 66.3% → 79.9%, Terminal‑Bench 2.0 52.8% → 58.1%.
  • Long‑running task capability: sustained independent work for >24 hours in internal evaluations, iterating implementations, resolving failing tests, and delivering finished results; aimed at project‑scale refactors, extended debugging, and multi‑hour agent loops.
  • Safety and deployment posture: sandboxed by default (file writes confined to workspace; network disabled unless explicitly enabled); terminal logs, tool‑call citations, test results, cybersecurity monitoring, and automated vulnerability scanning included.
  • Operational notes and resources: replaces GPT‑5.1‑Codex as the default in Codex surfaces; internal adoption reported at 95% weekly usage among OpenAI engineers. See the GPT‑5.1‑Codex‑Max system card.

Introduction

OpenAI has introduced GPT‑5.1‑Codex‑Max, a coding-focused model designed for extended, agentic workflows within Codex. Built on an updated foundational reasoning model trained on agentic tasks across software engineering, research, and mathematics, the model targets long-running development tasks by operating coherently across multiple context windows.

GPT‑5.1‑Codex‑Max is available in Codex today for the CLI, IDE extension, cloud, and code review surfaces, with API access slated to arrive soon.

Frontier coding capabilities

The model was trained on real-world software engineering tasks such as PR creation, code review, frontend coding, and Q&A. It includes explicit training to operate in Windows environments and to collaborate more effectively within the Codex CLI. These design choices aim to improve performance on practical engineering workflows beyond benchmark gains.

Key technical points:

  • Native multicontext operation via compaction, enabling work that spans multiple context windows.
  • Training emphasis on agentic tasks to improve collaboration inside the Codex CLI and related tooling.

Speed, token efficiency, and reasoning effort

GPT‑5.1‑Codex‑Max emphasizes token efficiency through more effective internal reasoning. On SWE‑bench Verified, the model at medium reasoning effort outperforms GPT‑5.1‑Codex with the same effort while using about 30% fewer thinking tokens. For tasks that tolerate longer latency, a new Extra High (xhigh) reasoning effort is available; medium remains recommended for typical daily use.

A few evaluation snapshots:

  • SWE‑bench Verified: 73.7% → 77.9% (GPT‑5.1‑Codex high → Codex‑Max xhigh)
  • SWE‑Lancer IC SWE: 66.3% → 79.9%
  • Terminal‑Bench 2.0: 52.8% → 58.1%

Improvements in token efficiency are positioned to reduce real-world usage costs for extended coding tasks, including producing comparable frontend apps at lower token expense.
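To make the cost implication concrete, here is a back-of-the-envelope calculation applying the reported ~30% reduction in thinking tokens. The baseline token count and price are made-up placeholder numbers, not figures from the announcement:

```python
# Illustrative only: the ~30% reduction is from the announcement; the
# baseline token volume and price below are hypothetical assumptions.
baseline_thinking_tokens = 1_000_000   # hypothetical tokens for a long task
reduction = 0.30                       # ~30% fewer thinking tokens (reported)
price_per_million = 10.0               # hypothetical $ per 1M tokens

max_thinking_tokens = baseline_thinking_tokens * (1 - reduction)
savings = (baseline_thinking_tokens - max_thinking_tokens) / 1_000_000 * price_per_million
print(f"thinking tokens: {max_thinking_tokens:,.0f}, saved: ${savings:.2f}")
```

With these placeholder numbers, the same-effort run uses 700,000 thinking tokens instead of 1,000,000; the percentage saving scales linearly with whatever the real per-token price is.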

Long-running tasks and compaction

The defining operational feature is compaction, a process that prunes session history while preserving salient context so the model can continue coherent work across very long horizons. In Codex applications, sessions are compacted automatically as the context window fills, allowing repeated refreshes until the task completes.
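OpenAI does not describe how compaction is implemented. As a mental model only, the behavior resembles an agent loop that, whenever its history exceeds the window, collapses older entries into a summary and keeps the most recent ones. Everything below (function names, the summarization step, the keep-half heuristic) is a hypothetical sketch, not Codex's actual mechanism:

```python
from typing import Callable

def run_with_compaction(
    steps: list[str],
    window_limit: int,
    summarize: Callable[[list[str]], str],
) -> list[str]:
    """Toy agent history: when the context exceeds window_limit entries,
    collapse all but the most recent entries into a single summary."""
    history: list[str] = []
    for step in steps:
        history.append(step)
        if len(history) > window_limit:
            keep = window_limit // 2
            # Compact: older entries become one summary, recent ones survive.
            summary = summarize(history[:-keep])
            history = [summary] + history[-keep:]
    return history

# Usage with a trivial stand-in summarizer:
final = run_with_compaction(
    [f"step-{i}" for i in range(10)],
    window_limit=4,
    summarize=lambda h: f"<summary of {len(h)} steps>",
)
```

The point of the sketch is the invariant: the history never grows past the window, yet a condensed trace of everything pruned remains available, which is what lets work continue coherently across repeated refreshes.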

OpenAI reports internal evaluations where GPT‑5.1‑Codex‑Max sustained independent work for more than 24 hours, iterating on implementations, resolving failing tests, and delivering finished results. The capability is intended for project‑scale refactors, extended debugging, and multi‑hour agent loops that previously hit context limits.

Safety, cybersecurity, and deployment posture

GPT‑5.1‑Codex‑Max shows improved performance on long‑horizon reasoning challenges, including areas relevant to cybersecurity. Under OpenAI’s Preparedness Framework, the model does not reach High capability on Cybersecurity, though it is described as the most capable cybersecurity model deployed to date. To address dual‑use risks, OpenAI continues iterative deployments, monitoring, and targeted safeguards; examples include cybersecurity‑specific monitoring and interventions for suspicious activity.

Codex runs in a sandboxed environment by default: file writes are confined to the workspace and network access is disabled unless explicitly enabled, reducing exposure to untrusted content and prompt‑injection risks. Codex also surfaces terminal logs, tool call citations, and test results to aid human review. Defensive tools such as automated vulnerability scanning and remediation assistance remain part of the ecosystem.
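The announcement does not detail how the sandbox enforces the "file writes confined to the workspace" guarantee. One common pattern behind such guarantees is resolving a target path and checking containment before writing; the helper below is a simplified illustration of that idea (hypothetical code, not Codex's implementation):

```python
from pathlib import Path

class SandboxViolation(Exception):
    """Raised when a write would land outside the workspace."""

def safe_write(workspace: Path, relative_path: str, data: str) -> Path:
    """Write only inside the workspace; reject '../' escapes and absolute paths."""
    workspace = workspace.resolve()
    # Joining an absolute relative_path replaces the base, and '..' segments
    # can climb out, so resolve first and then check containment.
    target = (workspace / relative_path).resolve()
    if not target.is_relative_to(workspace):  # Python 3.9+
        raise SandboxViolation(f"write outside workspace refused: {target}")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(data)
    return target
```

For example, `safe_write(ws, "src/main.py", …)` succeeds, while `safe_write(ws, "../escape.txt", …)` or an absolute path like `/etc/passwd` raises `SandboxViolation` before any write occurs.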

Availability

GPT‑5.1‑Codex‑Max is available in Codex for ChatGPT Plus, Pro, Business, Edu, and Enterprise plans. For details on plan usage limits, see the Codex models documentation: https://developers.openai.com/codex/models. The model will replace GPT‑5.1‑Codex as the default in Codex surfaces; API availability for Codex CLI users with API keys is planned for the near future.

Closing notes

The release highlights two practical shifts: sustained multicontext operation via compaction and measurable token‑efficiency gains during reasoning. Those shifts aim to make longer, project‑scale agent workflows feasible within Codex tooling while keeping deployments under tightened sandbox and monitoring controls. OpenAI also reports internal adoption metrics—95% weekly usage among OpenAI engineers and increased pull request throughput—framing Codex‑Max as an incremental toolchain enhancement within existing developer workflows.

Original source: https://openai.com/index/gpt-5-1-codex-max/
