Prompt Engineering

SWE Quiz – Master System Design & ML Interviews

“`json
{
“title”: “Agent Loop Mastery: OpenAI Codex System Design and Prompt Caching Strategies”,
“content”: “

Agent Loop Mastery: OpenAI Codex System Design and Prompt Caching Strategies

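Building an AI agent that serves several client interfaces raises concrete system design problems: how to share logic across clients, how to keep inference cheap as conversations grow, and how to keep clients and servers compatible. OpenAI’s Codex Harness shows one set of patterns for solving them. This article examines the Agent Loop, Prompt Caching strategies, and the JSON-RPC protocol that connects client interfaces to a shared codebase.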

Overview: The Codex Architecture

OpenAI’s Codex is an agent system with CLI, Web, VS Code, and macOS app interfaces. The “Codex Harness” is a shared Rust library. It provides the agent loop, thread lifecycle, tool execution, and auth logic for all interfaces. This design lets features be built once and deployed everywhere. The system uses an app-server protocol based on JSON-RPC over stdio, which works across languages and platforms.

Prompt Analysis: The Agent Loop in Action

The Prompt

{
  \"instructions\": \"gpt-5.2-codex_prompt.md\",
  \"tools\": [
    {
      \"name\": \"shell_tool\",
      \"description\": \"Execute shell commands in sandbox\",
      \"parameters\": {...}
    },
    {
      \"name\": \"update_plan\",
      \"description\": \"Update the current execution plan\",
      \"parameters\": {...}
    }
  ],
  \"input\": [
    {
      \"role\": \"developer\",
      \"content\": \"Sandbox permissions: read/write access to /home/user/project. Folders: /src, /docs\"
    },
    {
      \"role\": \"developer\", 
      \"content\": \"User config from ~/.codex/config.toml\"
    },
    {
      \"role\": \"user\",
      \"content\": \"AGENTS.md content from git root to cwd\"
    },
    {
      \"role\": \"user\",
      \"content\": \"Environment: cwd=/home/user/project, shell=bash\"
    },
    {
      \"role\": \"user\",
      \"content\": \"Actual user query: Create a new React component\"
    }
  ]
}

Components

The Codex prompt has a structured hierarchy with four parts:

1. Instructions (System/Developer Message): Contains model-specific instructions and developer guidelines. This static content is placed at the prompt’s start to help caching. The “developer” role has higher priority than “user,” affecting how the model weights content.

2. Tools Definition: A list of available tools with descriptions and parameters. Each tool is a JSON object with a name, a description, and a parameter schema. The execution logic is not part of the prompt; it lives in the harness, which runs the tool when the model requests it. Tool order must be deterministic to avoid cache misses.

3. Input Hierarchy: Inputs follow an order: sandbox permissions first, then user configuration, AGENTS.md contents, environment context, and finally the user query. This layering by stability improves cache efficiency.

4. Role-Based Weighting: The system uses a role hierarchy (system > developer > user > assistant). This determines how the model weights different messages. Developer messages influence model behavior more than user inputs.
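The layering described in the list above can be sketched as a small builder that always emits inputs in the same order, most stable first. This is a minimal illustration, not Codex code: the names LAYER_ORDER, LAYER_ROLE, and build_prompt are hypothetical, and the roles simply mirror the example prompt shown earlier.

```python
# Hypothetical layer names, ordered from most stable to least stable.
LAYER_ORDER = ['sandbox', 'config', 'agents_md', 'environment', 'query']

# Roles copied from the example prompt; an assumption, not a documented rule.
LAYER_ROLE = {
    'sandbox': 'developer',
    'config': 'developer',
    'agents_md': 'user',
    'environment': 'user',
    'query': 'user',
}


def build_prompt(instructions, tools, layers):
    """Assemble a request body with static parts first and the query last."""
    missing = [name for name in LAYER_ORDER if name not in layers]
    if missing:
        raise ValueError(f'missing layers: {missing}')
    return {
        'instructions': instructions,
        'tools': list(tools),
        # Iterate over LAYER_ORDER, not over the dict, so order never varies.
        'input': [
            {'role': LAYER_ROLE[name], 'content': layers[name]}
            for name in LAYER_ORDER
        ],
    }
```

Because the order comes from a fixed list and not from the caller’s dictionary, two calls with the same content produce the same message sequence regardless of how the caller built its arguments.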

The Agent Loop: Inference and Tool Calls

The Codex system’s core is the Agent Loop. This cyclical process connects model inference with tool execution. The loop starts with prompt construction, sends it to the Responses API, and processes server-sent events. When the model requests a tool call, like “run ls,” it is executed and the result is added to the prompt. The model is then queried again with the updated prompt. This cycle repeats until the model outputs a final assistant message, ending a “turn.”
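The cycle can be sketched in a few lines. The sketch below is a simplification under stated assumptions: the model is a plain Python callable standing in for the Responses API, there is no streaming, and the reply and history item shapes (the type, name, arguments, and output keys) are invented for illustration.

```python
def run_turn(model, tools, prompt, max_steps=10):
    """Run one turn: query the model until it returns a final message."""
    history = list(prompt)  # copy so the caller's list is left untouched
    for _ in range(max_steps):
        reply = model(history)
        if reply['type'] == 'message':
            # A final assistant message ends the turn.
            history.append({'role': 'assistant', 'content': reply['content']})
            return reply['content'], history
        # Otherwise the model requested a tool call: record it, run it,
        # and append the output so the next query sees the result.
        history.append({'type': 'tool_call', 'name': reply['name'],
                        'arguments': reply['arguments']})
        output = tools[reply['name']](reply['arguments'])
        history.append({'type': 'tool_output', 'name': reply['name'],
                        'output': output})
    raise RuntimeError('turn did not finish within max_steps')


def scripted_model(history):
    """Stand-in for the model: list files once, then answer."""
    if not any(item.get('type') == 'tool_output' for item in history):
        return {'type': 'tool_call', 'name': 'shell_tool', 'arguments': 'ls'}
    return {'type': 'message', 'content': 'Found 2 entries.'}
```

Calling run_turn(scripted_model, {'shell_tool': lambda cmd: 'src docs'}, [{'role': 'user', 'content': 'list files'}]) performs one tool call and then returns the final message. Note that the history only ever grows by appending, which matters for the caching discussion that follows.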

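Each iteration resends the entire conversation history, so the total number of bytes sent over n iterations grows as O(n²): the first request carries one item, the second carries two, and so on. This makes caching and compaction important for scaling.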
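A quick calculation makes the quadratic growth concrete. The function below is an illustrative model, not a measurement: it assumes every history item has the same size and that each call resends all items so far.

```python
def total_bytes_sent(item_size, iterations):
    """Cumulative request bytes when every call resends the full history."""
    # Call k carries k items, so the total is item_size * (1 + 2 + ... + n),
    # which equals item_size * n * (n + 1) / 2.
    return sum(item_size * k for k in range(1, iterations + 1))
```

With 1000-byte items, 10 iterations send 55000 bytes in total and 20 iterations send 210000 bytes. Doubling the number of iterations roughly quadruples the traffic.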

Prompt Caching: Efficiency through Intelligent Structuring

Prompt caching is key for Codex efficiency. Since LLM inference processes every input token on each call, prefix caching reuses already computed results. Codex does this through:

1. Static Prefix Organization: All static content (instructions, tool definitions) is at the prompt’s beginning. Since this doesn’t change, it can be loaded from cache each turn.

2. Append-Only History: Conversation history is only appended, never inserted or modified. This preserves the common prefix between turns.

3. Deterministic Tool Ordering: Tools must be defined in a consistent order. Non-deterministic ordering causes cache misses, as the team found with MCP tools.

4. Context-Aware Appending: When context changes, new messages are appended at the end, not inserted in the middle. This preserves the cache prefix.
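The effect of appending versus inserting can be checked by measuring how much of the serialized prompt two consecutive requests share. This is a rough sketch: real prefix caches operate on tokens on the provider side, while the hypothetical helpers below compare characters of a JSON serialization.

```python
import json


def serialize(items):
    """Serialize history items with a stable key order inside each item."""
    return json.dumps(items, sort_keys=True)


def shared_prefix_len(a, b):
    """Number of leading characters two serialized prompts have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


base = [
    {'role': 'developer', 'content': 'sandbox rules'},
    {'role': 'user', 'content': 'first query'},
]
update = {'role': 'user', 'content': 'cwd changed'}

appended = base + [update]            # new context goes at the end
inserted = [base[0], update, base[1]]  # new context goes in the middle

keep_append = shared_prefix_len(serialize(base), serialize(appended))
keep_insert = shared_prefix_len(serialize(base), serialize(inserted))
```

Here keep_append covers everything in the old prompt except its closing bracket, while keep_insert stops at the point where the new message was inserted, so everything after that point would have to be recomputed.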

Context Window Management: Compaction Strategies

In long agent sessions, conversation history can exceed the model’s context limit. Codex handles this with three mechanisms:

1. Automatic Compaction: When the token count exceeds auto_compact_limit, Codex calls /responses/compact. This returns a compressed representation of the conversation.

2. Latent State Preservation: Instead of simple text summaries, Codex uses encrypted_content blobs. These encode the model’s latent understanding, which is richer than a text summary.

3. Privacy-Preserving Design: For Zero Data Retention customers, OpenAI only keeps the decryption key, not the data itself. This combines privacy with state preservation.
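The trigger logic for automatic compaction can be sketched as a threshold check. Everything here except the name auto_compact_limit is an assumption: count_tokens and compact are caller-supplied stand-ins (a real client would use a tokenizer and call the /responses/compact endpoint), and the shape of the compacted item is invented for illustration.

```python
def maybe_compact(history, count_tokens, compact, auto_compact_limit):
    """Return the history unchanged, or a compacted form if it is too long."""
    used = sum(count_tokens(item) for item in history)
    if used <= auto_compact_limit:
        return history
    # Over the limit: hand the whole history to the compaction step.
    return compact(history)


def word_count(item):
    """Crude token estimate for the demo: one token per word."""
    return len(item['content'].split())


def fake_compact(history):
    """Stand-in for the compaction endpoint; returns one opaque item."""
    return [{'type': 'compaction', 'encrypted_content': 'opaque'}]
```

With a history of five words in total, maybe_compact(history, word_count, fake_compact, 10) returns the history as-is, while a limit of 4 replaces it with the single opaque item. The client never inspects the blob; it only passes it back on the next request.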

App Server Protocol: JSON-RPC for Stable Clients

The app-server protocol defines three primitive abstractions:

Item: The atomic unit of input or output, with an explicit lifecycle (started → delta events → completed). Because content arrives as deltas, a UI can render it immediately without waiting for the item to complete.

Turn: A unit of agent work. It begins with client input and ends when the agent has produced all outputs. It contains a sequence of items for the intermediate steps.

Thread: A persistent container for sessions. It holds multiple turns, persists event history, and enables resuming after a disconnect.
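The relationship between the three primitives can be sketched with plain data classes. The field and method names below (item_id, apply_delta, is_done, and so on) are hypothetical and chosen for readability; they are not taken from the protocol definition.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    item_id: str
    text: str = ''
    status: str = 'started'

    def apply_delta(self, chunk):
        """Append streamed content; a UI could re-render after each call."""
        if self.status == 'completed':
            raise ValueError('item already completed')
        self.text += chunk

    def complete(self):
        self.status = 'completed'


@dataclass
class Turn:
    items: list = field(default_factory=list)

    def is_done(self):
        """A turn is done once it has items and all of them are completed."""
        return bool(self.items) and all(
            item.status == 'completed' for item in self.items
        )


@dataclass
class Thread:
    thread_id: str
    turns: list = field(default_factory=list)  # kept so a client can resume
```

An item that receives the deltas 'Hel' and 'lo' holds the text 'Hello' while still in the started state, which is what allows partial rendering before completion.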

Frequently Asked Questions

Why is the order of tool definitions so important for caching?

Prompt caching only works with exact prefix matches. If tool definitions are generated in non-deterministic order, the prompt prefix changes each turn, causing cache misses. Codex solves this with deterministic sorting and order management.
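One way to get a deterministic order is to sort tool definitions by name before serializing them. This is a sketch of the general idea; the source only says that Codex keeps the order deterministic, not that it sorts by name, and serialize_tools is a hypothetical helper.

```python
import json


def serialize_tools(tools):
    """Serialize tool definitions in a stable order with stable key order."""
    ordered = sorted(tools, key=lambda tool: tool['name'])
    # sort_keys fixes the key order inside each tool object as well.
    return json.dumps(ordered, sort_keys=True, separators=(',', ':'))


shell = {'name': 'shell_tool', 'description': 'Execute shell commands in sandbox'}
plan = {'description': 'Update the current execution plan', 'name': 'update_plan'}

first = serialize_tools([shell, plan])
second = serialize_tools([plan, shell])
```

Both calls produce the same string even though the tools arrive in a different order and with different key order, so the prompt prefix stays byte-identical between turns.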

How does Codex compaction differ from traditional text summarization?

Traditional summaries lose context and nuance. Codex uses encrypted_content blobs that encode the model’s latent understanding. This is a richer representation that preserves more semantic information. For Zero Data Retention customers, OpenAI keeps only the decryption key, not the data.

Why does Codex use JSON-RPC over stdio instead of REST or GraphQL?

JSON-RPC over stdio offers bidirectional streaming, language-independent compatibility, and easy integration into various client environments. The protocol enables backward compatibility. This lets older clients communicate with newer server binaries, which is important for independent release cycles.
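To make the transport concrete, here is a minimal sketch of exchanging JSON-RPC messages over a text stream. Several details are assumptions: the envelope follows the generic JSON-RPC 2.0 specification, the framing is one JSON object per line, and the method name example/startTurn is a placeholder, not a method from the Codex protocol.

```python
import io
import json


def make_request(request_id, method, params):
    """Build a JSON-RPC 2.0 request envelope."""
    return {'jsonrpc': '2.0', 'id': request_id, 'method': method, 'params': params}


def write_message(stream, message):
    """Write one message per line (newline-delimited framing is assumed)."""
    stream.write(json.dumps(message) + '\n')
    stream.flush()


def read_message(stream):
    """Read the next message, or return None at end of stream."""
    line = stream.readline()
    return json.loads(line) if line else None


# An in-memory stream stands in for the server's stdin in this demo.
pipe = io.StringIO()
write_message(pipe, make_request(1, 'example/startTurn', {'text': 'hello'}))
pipe.seek(0)
received = read_message(pipe)
```

Because the transport is only lines of text on stdin and stdout, any language that can spawn a process and read its output can act as a client, and either side can send a message at any time.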

How does Codex prevent O(n²) costs in the Agent Loop?

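Through combined strategies: prompt caching lets the model skip recomputation for an unchanged prefix, append-only history keeps that prefix intact from one request to the next, and compaction shrinks the history when it approaches the context window limit. The full history is still resent on every call, so bytes on the wire continue to grow quadratically, but with consistent cache hits the inference cost scales near-linearly instead of quadratically.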

What are the biggest challenges in prompt structuring for agents?

The main challenges are: balancing static content for caching with dynamic context, managing role hierarchy for correct model weighting, handling context limits without information loss, and ensuring deterministic prompt generation for reliable caching.

Source

Based on this article.

“,
“excerpt”: “OpenAI’s Codex Harness shows AI agent architecture with Agent Loop, Prompt Caching, and JSON-RPC protocol. Learn how system design problems are solved in production.”,
“tags”: [“Agent Loop”, “System Design”, “Prompt Caching”, “Tool Calls”, “JSON-RPC Protocol”, “OpenAI Codex”, “LLM Architecture”, “AI Agents”]
}
“`