
Agents Are Stateless: How to Solve It in Production

AI agents forget everything between runs. Here's how to implement checkpoints, persistent storage, and recovery for production-grade agent systems.

By QuantumFabrics
Tags: ai-agents, production, state-management, langgraph, persistence, aws, azure, gcp, google, langchain

Every AI agent invocation starts fresh. No memory of previous runs. No access to files created last session. No awareness that a task was 90% complete before the connection dropped.

This is the dirty secret of agent development: the statelessness that makes agents simple to reason about also makes them fragile in production.

The Problem

Consider a typical agent workflow:

  1. User requests a research report
  2. Agent executes 14 tool calls (web search, file writes, analysis)
  3. Connection drops on step 13
  4. Agent has no idea what happened

Without persistent state, the only options are:

  • Start over from step 1 (wasted compute, frustrated users)
  • Fail silently (broken workflows)
  • Hope it doesn't happen (not a strategy)

The root cause: agent state lives in memory. Memory doesn't survive restarts.

Solution 1: PostgreSQL Checkpointing

The first line of defense is checkpointing. After each graph step, persist the agent's complete state to a database.

LangGraph's PostgresSaver handles this automatically:

import { Pool } from "pg";
import { PostgresSaver } from "@langchain/langgraph-checkpoint-postgres";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

const checkpointer = new PostgresSaver(pool, {
  schema: "agent_checkpoints"
});

await checkpointer.setup(); // Creates the checkpoint tables

// Use with agent
const agent = createDeepAgent({
  model,
  checkpointer, // State saved after each step
});

// Each invocation uses thread_id for continuity
await agent.stream({ messages }, {
  configurable: { thread_id: chatId }
});

What gets persisted:

  • All messages (human, AI, tool)
  • Tool call history with inputs and outputs
  • Intermediate state between nodes
  • Parent checkpoint references for history traversal
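Those parent references form a simple linked chain, which is what makes history traversal possible. A minimal sketch of walking that chain, using plain objects as stand-ins for LangGraph's checkpoint tuples (all field names here are hypothetical):

```typescript
// Simplified stand-in for a stored checkpoint row.
interface CheckpointRecord {
  id: string;
  parentId: string | null; // reference to the previous checkpoint
  step: number;
}

// Walk the parent chain from the newest checkpoint back to the root,
// yielding the execution history newest-first.
function traverseHistory(
  latest: CheckpointRecord,
  byId: Map<string, CheckpointRecord>
): CheckpointRecord[] {
  const history: CheckpointRecord[] = [];
  let current: CheckpointRecord | undefined = latest;
  while (current) {
    history.push(current);
    current = current.parentId ? byId.get(current.parentId) : undefined;
  }
  return history;
}

// Example: three checkpoints from one thread
const records = new Map<string, CheckpointRecord>([
  ["c1", { id: "c1", parentId: null, step: 1 }],
  ["c2", { id: "c2", parentId: "c1", step: 2 }],
  ["c3", { id: "c3", parentId: "c2", step: 3 }],
]);
const history = traverseHistory(records.get("c3")!, records);
```

In LangGraph itself, `agent.getStateHistory(config)` performs this traversal for you; the sketch only shows why a parent reference per checkpoint is enough.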

When failures occur, resume from the last checkpoint:

// Resume stuck session
const state = await agent.getState({
  configurable: { thread_id: chatId }
});

// Continue execution from checkpoint
await agent.stream(null, {
  configurable: { thread_id: chatId }
});

The agent picks up exactly where it left off. No repeated work.

Solution 2: Hybrid Storage Architecture

Checkpoints handle agent state. But agents also create files, process documents, and store results. These need persistent storage too.

The challenge: different data has different requirements.

Data Type             Need                    Solution
Work in progress      Fast read/write         NFS mount
Shareable documents   Public URLs             Blob storage
Temporary files       Speed, not durability   In-memory

We solved this with a composite backend that routes by path:

const backend = new CompositeBackend(
  new StateBackend(config), // Default: ephemeral

  {
    "/projects/": new NFSBackend("projects"),   // NFS (Azure Files, EFS, Filestore)
    "/documents/": new NFSBackend("documents"), // NFS
    "/uploads/": new BlobBackend("uploads"),    // Blob (S3, Cloud Storage)
  }
);

When the agent writes to /projects/acme-corp/report.md, it goes to NFS. Fast, persistent, available across restarts.

When it writes to /uploads/report.pdf, it goes to blob storage. Slower, but generates a shareable URL.

Files at the root (/scratch.txt) stay in memory. Fast, but gone after the session.
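The routing rule itself reduces to longest-prefix matching on the path. A minimal sketch of that logic, with string labels standing in for the real backend instances (the labels are hypothetical):

```typescript
// Prefix → backend label. Longest matching prefix wins.
const routes: Record<string, string> = {
  "/projects/": "nfs",
  "/documents/": "nfs",
  "/uploads/": "blob",
};

function resolveBackend(path: string): string {
  let best = "memory"; // default: in-memory, gone after the session
  let bestLen = 0;
  for (const [prefix, backend] of Object.entries(routes)) {
    if (path.startsWith(prefix) && prefix.length > bestLen) {
      best = backend;
      bestLen = prefix.length;
    }
  }
  return best;
}
```

Longest-prefix matching means a more specific route like `/projects/archive/` could later be added to shadow `/projects/` without touching the other rules.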

Performance Matters

NFS vs blob storage isn't just about persistence. It's about speed:

Operation   Blob Storage   NFS Mount
Read        150-200ms      5-10ms
Write       200-300ms      10-20ms
Edit        400-600ms      10-50ms

A 14-step workflow with file operations can save 3-5 seconds per run just from storage choice. At scale, that's hours of compute time.
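The back-of-envelope math, taking the midpoints of the write latencies in the table:

```typescript
// Midpoint write latencies (milliseconds).
const blobWriteMs = 250; // midpoint of 200-300ms
const nfsWriteMs = 15;   // midpoint of 10-20ms

// Assume one file write per step of a 14-step workflow.
const fileOps = 14;
const savedMs = fileOps * (blobWriteMs - nfsWriteMs);
console.log(`~${(savedMs / 1000).toFixed(1)}s saved per run`); // ≈ 3.3s
```

Heavier edit operations push that figure toward the upper end of the 3-5 second range.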

Solution 3: Recovery Service

Checkpoints and persistent storage are the foundation. But you also need operational tooling to detect and recover stuck sessions.

Our recovery service provides three capabilities:

1. Stuck Detection

const stuckCheck = await checkStuckState(chatId);
// Returns:
// - isStuck: boolean
// - lastNode: where execution stopped
// - pendingToolCalls: incomplete tool calls
// - canRecover: whether resume is possible
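One way such a check can work (a sketch, not the service's actual implementation): inspect the checkpointed messages and look for tool calls that were requested but never answered. An AI message with pending tool calls and no matching tool results means execution stopped mid-step:

```typescript
// Simplified message shape (stand-in for LangChain message objects;
// field names here are hypothetical).
interface Msg {
  role: "human" | "ai" | "tool";
  toolCallIds?: string[]; // set on AI messages that request tool calls
  respondsTo?: string;    // set on tool messages; the call being answered
}

interface StuckCheck {
  isStuck: boolean;
  pendingToolCalls: string[];
}

function checkStuck(messages: Msg[]): StuckCheck {
  const requested = new Set<string>();
  const answered = new Set<string>();
  for (const msg of messages) {
    for (const id of msg.toolCallIds ?? []) requested.add(id);
    if (msg.respondsTo) answered.add(msg.respondsTo);
  }
  // Any requested call without a matching result is pending.
  const pending = [...requested].filter((id) => !answered.has(id));
  return { isStuck: pending.length > 0, pendingToolCalls: pending };
}

// Example: the AI requested two tool calls, only one completed.
const stuck = checkStuck([
  { role: "human" },
  { role: "ai", toolCallIds: ["t1", "t2"] },
  { role: "tool", respondsTo: "t1" },
]);
```

Here `stuck.isStuck` is true with `t2` pending, which is exactly the situation a resume-from-checkpoint call can repair.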

2. Automatic Recovery

if (stuckCheck.canRecover) {
  const result = await recoverStuckSession(chatId, agent);
  // Agent resumes from checkpoint
}

3. Health Monitoring

const stats = await getCheckpointStats();
// Total checkpoints, unique threads, oldest/newest timestamps

This enables:

  • Alerts when sessions get stuck
  • Automatic recovery attempts
  • Cleanup of old checkpoints
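Cleanup reduces to a retention cutoff over checkpoint timestamps. A sketch of the selection logic using plain objects (in practice this would be a DELETE against the checkpoint tables; the field names are hypothetical):

```typescript
interface CheckpointMeta {
  threadId: string;
  createdAt: Date;
}

// Select checkpoints older than the retention window for deletion,
// but always keep each thread's newest checkpoint so stuck sessions
// remain resumable.
function selectExpired(
  checkpoints: CheckpointMeta[],
  retentionDays: number,
  now: Date
): CheckpointMeta[] {
  const cutoff = now.getTime() - retentionDays * 24 * 60 * 60 * 1000;
  const newestPerThread = new Map<string, number>();
  for (const cp of checkpoints) {
    const t = cp.createdAt.getTime();
    const prev = newestPerThread.get(cp.threadId);
    if (prev === undefined || t > prev) newestPerThread.set(cp.threadId, t);
  }
  return checkpoints.filter(
    (cp) =>
      cp.createdAt.getTime() < cutoff &&
      cp.createdAt.getTime() !== newestPerThread.get(cp.threadId)
  );
}
```

Keeping the newest checkpoint per thread is the important design choice: a cutoff alone would delete the very checkpoint an old stuck session needs for recovery.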

Putting It Together

Here's how these patterns combine in a production agent:

export class ProductionAgent {
  private checkpointer: PostgresSaver;
  private agent: ReturnType<typeof createDeepAgent>;
  private chatId: string;

  static async create(config: AgentConfig) {
    const agent = new ProductionAgent();

    // 1. Initialize checkpointer
    agent.checkpointer = await getCheckpointer({
      schema: "agent_checkpoints"
    });

    // 2. Create agent with composite backend
    agent.agent = createDeepAgent({
      model,
      backend: createCompositeBackend, // Hybrid storage
      checkpointer: agent.checkpointer, // State persistence
      middleware: [
        // 3. Error recovery middleware
        createErrorRecoveryMiddleware()
      ]
    });

    return agent;
  }

  async *stream(message: string) {
    try {
      yield* this.agent.stream(
        { messages: [new HumanMessage(message)] },
        { configurable: { thread_id: this.chatId } }
      );
    } catch (error) {
      // 4. Attempt recovery on failure before surfacing the error
      if (this.checkpointer) {
        yield* this.attemptRecovery();
        return;
      }
      throw error;
    }
  }
}

The result: an agent that survives restarts, handles network failures, and maintains context across sessions.

The Bigger Picture

State management is what separates demos from production systems.

A demo can restart when things go wrong. A production system runs 24/7, handles thousands of concurrent sessions, and can't afford to lose work.

The patterns here - checkpointing, persistent storage, recovery services - aren't optional features. They're the foundation that everything else builds on.

Without them, you have a chatbot that occasionally succeeds at complex tasks.

With them, you have a production system that reliably completes work over hours or days.

Implementation Checklist

  1. Enable checkpointing - PostgreSQL for production, SQLite for development
  2. Route storage by durability needs - Fast NFS for work in progress, blob for shareable URLs
  3. Build recovery tooling - Detection, automatic recovery, monitoring
  4. Test failure scenarios - Kill the process mid-task, verify recovery works
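Checklist item 4 can be exercised without real infrastructure: simulate a crash mid-workflow and verify that the resume path does no repeated work. A minimal sketch (all names hypothetical):

```typescript
// A toy workflow that persists progress after every step, mirroring
// the checkpoint-per-step pattern. Returns how many steps it executed.
function runWorkflow(
  totalSteps: number,
  checkpoint: { step: number },
  crashAt?: number
): number {
  let executed = 0;
  for (let step = checkpoint.step + 1; step <= totalSteps; step++) {
    if (step === crashAt) throw new Error("connection dropped");
    executed++;             // do the step's work
    checkpoint.step = step; // persist progress after each step
  }
  return executed;
}

const checkpoint = { step: 0 };
// First run dies at step 13...
try {
  runWorkflow(14, checkpoint, 13);
} catch {
  // ...leaving the checkpoint at step 12.
}
// The resumed run executes only steps 13 and 14.
const resumed = runWorkflow(14, checkpoint);
```

The same shape works against a real agent: kill the process, call `stream(null, ...)` with the same thread_id, and assert that no tool call before the crash point runs twice.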

State is not an afterthought. It's the first thing you should design.

