
Agents Are Stateless: How to Solve It in Production

AI agents forget everything between runs. Here's how to implement checkpoints, persistent storage, and recovery for production-grade agent systems.

By QuantumFabrics
Tags: ai-agents, production, state-management, langgraph, persistence, aws, azure, gcp, google, langchain

Every AI agent invocation starts fresh. No memory of previous runs. No access to files created last session. No awareness that a task was 90% complete before the connection dropped.

This is the dirty secret of agent development: the statelessness that makes agents simple to reason about also makes them fragile in production.

The Problem

Consider a typical agent workflow:

  1. User requests a research report
  2. Agent executes 14 tool calls (web search, file writes, analysis)
  3. Connection drops on step 13
  4. Agent has no idea what happened

Without persistent state, the only options are:

  • Start over from step 1 (wasted compute, frustrated users)
  • Fail silently (broken workflows)
  • Hope it doesn't happen (not a strategy)

The root cause: agent state lives in memory. Memory doesn't survive restarts.

Solution 1: PostgreSQL Checkpointing

The first line of defense is checkpointing. After each graph step, persist the agent's complete state to a database.

LangGraph's PostgresSaver handles this automatically:

import { Pool } from "pg";
import { PostgresSaver } from "@langchain/langgraph-checkpoint-postgres";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

const checkpointer = new PostgresSaver(pool, {
  schema: "agent_checkpoints"
});

await checkpointer.setup(); // Creates the checkpoint tables

// Use with agent
const agent = createDeepAgent({
  model,
  checkpointer, // State saved after each step
});

// Each invocation uses thread_id for continuity
await agent.stream({ messages }, {
  configurable: { thread_id: chatId }
});

What gets persisted:

  • All messages (human, AI, tool)
  • Tool call history with inputs and outputs
  • Intermediate state between nodes
  • Parent checkpoint references for history traversal
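Those parent references form a simple linked chain, which is what makes history traversal possible. A minimal sketch of walking that chain, using plain objects as stand-ins for LangGraph's checkpoint tuples (all field names here are hypothetical):

```typescript
// Simplified stand-in for a stored checkpoint row.
interface CheckpointRecord {
  id: string;
  parentId: string | null; // reference to the previous checkpoint
  step: number;
}

// Walk the parent chain from the newest checkpoint back to the root,
// yielding the execution history newest-first.
function traverseHistory(
  latest: CheckpointRecord,
  byId: Map<string, CheckpointRecord>
): CheckpointRecord[] {
  const history: CheckpointRecord[] = [];
  let current: CheckpointRecord | undefined = latest;
  while (current) {
    history.push(current);
    current = current.parentId ? byId.get(current.parentId) : undefined;
  }
  return history;
}

// Example: three checkpoints from one thread
const records = new Map<string, CheckpointRecord>([
  ["c1", { id: "c1", parentId: null, step: 1 }],
  ["c2", { id: "c2", parentId: "c1", step: 2 }],
  ["c3", { id: "c3", parentId: "c2", step: 3 }],
]);
const history = traverseHistory(records.get("c3")!, records);
```

In LangGraph itself, `agent.getStateHistory(config)` performs this traversal for you; the sketch only shows why a parent reference per checkpoint is enough.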

When failures occur, resume from the last checkpoint:

// Resume stuck session
const state = await agent.getState({
  configurable: { thread_id: chatId }
});

// Continue execution from checkpoint
await agent.stream(null, {
  configurable: { thread_id: chatId }
});

The agent picks up exactly where it left off. No repeated work.

Solution 2: Hybrid Storage Architecture

Checkpoints handle agent state. But agents also create files, process documents, and store results. These need persistent storage too.

The challenge: different data has different requirements.

Data Type             Need                    Solution
Work in progress      Fast read/write         NFS mount
Shareable documents   Public URLs             Blob storage
Temporary files       Speed, not durability   In-memory

We solved this with a composite backend that routes by path:

const backend = new CompositeBackend(
  new StateBackend(config), // Default: ephemeral

  {
    "/projects/": new NFSBackend("projects"),   // NFS (Azure Files, EFS, Filestore)
    "/documents/": new NFSBackend("documents"), // NFS
    "/uploads/": new BlobBackend("uploads"),    // Blob (S3, Cloud Storage)
  }
);

When the agent writes to /projects/acme-corp/report.md, it goes to NFS. Fast, persistent, available across restarts.

When it writes to /uploads/report.pdf, it goes to blob storage. Slower, but generates a shareable URL.

Files at the root (/scratch.txt) stay in memory. Fast, but gone after the session.
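The routing rule itself reduces to longest-prefix matching on the path. A minimal sketch of that logic, with string labels standing in for the real backend instances (the labels are hypothetical):

```typescript
// Prefix → backend label. Longest matching prefix wins.
const routes: Record<string, string> = {
  "/projects/": "nfs",
  "/documents/": "nfs",
  "/uploads/": "blob",
};

function resolveBackend(path: string): string {
  let best = "memory"; // default: in-memory, gone after the session
  let bestLen = 0;
  for (const [prefix, backend] of Object.entries(routes)) {
    if (path.startsWith(prefix) && prefix.length > bestLen) {
      best = backend;
      bestLen = prefix.length;
    }
  }
  return best;
}
```

Longest-prefix matching means a more specific route like `/projects/archive/` could later be added to shadow `/projects/` without touching the other rules.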

Performance Matters

NFS vs blob storage isn't just about persistence. It's about speed:

Operation   Blob Storage   NFS Mount
Read        150-200ms      5-10ms
Write       200-300ms      10-20ms
Edit        400-600ms      10-50ms

A 14-step workflow with file operations can save 3-5 seconds per run just from storage choice. At scale, that's hours of compute time.
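The back-of-envelope math, taking the midpoints of the write latencies in the table:

```typescript
// Midpoint write latencies (milliseconds).
const blobWriteMs = 250; // midpoint of 200-300ms
const nfsWriteMs = 15;   // midpoint of 10-20ms

// Assume one file write per step of a 14-step workflow.
const fileOps = 14;
const savedMs = fileOps * (blobWriteMs - nfsWriteMs);
console.log(`~${(savedMs / 1000).toFixed(1)}s saved per run`); // ≈ 3.3s
```

Heavier edit operations push that figure toward the upper end of the 3-5 second range.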

Solution 3: Recovery Service

Checkpoints and persistent storage are the foundation. But you also need operational tooling to detect and recover stuck sessions.

Our recovery service provides three capabilities:

1. Stuck Detection

const stuckCheck = await checkStuckState(chatId);
// Returns:
// - isStuck: boolean
// - lastNode: where execution stopped
// - pendingToolCalls: incomplete tool calls
// - canRecover: whether resume is possible
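One way such a check can work (a sketch, not the service's actual implementation): inspect the checkpointed messages and look for tool calls that were requested but never answered. An AI message with pending tool calls and no matching tool results means execution stopped mid-step:

```typescript
// Simplified message shape (stand-in for LangChain message objects;
// field names here are hypothetical).
interface Msg {
  role: "human" | "ai" | "tool";
  toolCallIds?: string[]; // set on AI messages that request tool calls
  respondsTo?: string;    // set on tool messages; the call being answered
}

interface StuckCheck {
  isStuck: boolean;
  pendingToolCalls: string[];
}

function checkStuck(messages: Msg[]): StuckCheck {
  const requested = new Set<string>();
  const answered = new Set<string>();
  for (const msg of messages) {
    for (const id of msg.toolCallIds ?? []) requested.add(id);
    if (msg.respondsTo) answered.add(msg.respondsTo);
  }
  // Any requested call without a matching result is pending.
  const pending = [...requested].filter((id) => !answered.has(id));
  return { isStuck: pending.length > 0, pendingToolCalls: pending };
}

// Example: the AI requested two tool calls, only one completed.
const stuck = checkStuck([
  { role: "human" },
  { role: "ai", toolCallIds: ["t1", "t2"] },
  { role: "tool", respondsTo: "t1" },
]);
```

Here `stuck.isStuck` is true with `t2` pending, which is exactly the situation a resume-from-checkpoint call can repair.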

2. Automatic Recovery

if (stuckCheck.canRecover) {
  const result = await recoverStuckSession(chatId, agent);
  // Agent resumes from checkpoint
}

3. Health Monitoring

const stats = await getCheckpointStats();
// Total checkpoints, unique threads, oldest/newest timestamps

This enables:

  • Alerts when sessions get stuck
  • Automatic recovery attempts
  • Cleanup of old checkpoints
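Cleanup reduces to a retention cutoff over checkpoint timestamps. A sketch of the selection logic using plain objects (in practice this would be a DELETE against the checkpoint tables; the field names are hypothetical):

```typescript
interface CheckpointMeta {
  threadId: string;
  createdAt: Date;
}

// Select checkpoints older than the retention window for deletion,
// but always keep each thread's newest checkpoint so stuck sessions
// remain resumable.
function selectExpired(
  checkpoints: CheckpointMeta[],
  retentionDays: number,
  now: Date
): CheckpointMeta[] {
  const cutoff = now.getTime() - retentionDays * 24 * 60 * 60 * 1000;
  const newestPerThread = new Map<string, number>();
  for (const cp of checkpoints) {
    const t = cp.createdAt.getTime();
    const prev = newestPerThread.get(cp.threadId);
    if (prev === undefined || t > prev) newestPerThread.set(cp.threadId, t);
  }
  return checkpoints.filter(
    (cp) =>
      cp.createdAt.getTime() < cutoff &&
      cp.createdAt.getTime() !== newestPerThread.get(cp.threadId)
  );
}
```

Keeping the newest checkpoint per thread is the important design choice: a cutoff alone would delete the very checkpoint an old stuck session needs for recovery.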

Putting It Together

Here's how these patterns combine in a production agent:

export class ProductionAgent {
  private checkpointer: PostgresSaver;
  private agent: ReturnType<typeof createDeepAgent>;
  private chatId: string;

  static async create(config: AgentConfig) {
    const agent = new ProductionAgent();

    // 1. Initialize checkpointer
    agent.checkpointer = await getCheckpointer({
      schema: "agent_checkpoints"
    });

    // 2. Create agent with composite backend
    agent.agent = createDeepAgent({
      model,
      backend: createCompositeBackend, // Hybrid storage
      checkpointer: agent.checkpointer, // State persistence
      middleware: [
        // 3. Error recovery middleware
        createErrorRecoveryMiddleware()
      ]
    });

    return agent;
  }

  async *stream(message: string) {
    try {
      yield* this.agent.stream(
        { messages: [new HumanMessage(message)] },
        { configurable: { thread_id: this.chatId } }
      );
    } catch (error) {
      // 4. Attempt recovery on failure before surfacing the error
      if (this.checkpointer) {
        yield* this.attemptRecovery();
        return;
      }
      throw error;
    }
  }
}

The result: an agent that survives restarts, handles network failures, and maintains context across sessions.

The Bigger Picture

State management is what separates demos from production systems.

A demo can restart when things go wrong. A production system runs 24/7, handles thousands of concurrent sessions, and can't afford to lose work.

The patterns here - checkpointing, persistent storage, recovery services - aren't optional features. They're the foundation that everything else builds on.

Without them, you have a chatbot that occasionally succeeds at complex tasks.

With them, you have a production system that reliably completes work over hours or days.

Implementation Checklist

  1. Enable checkpointing - PostgreSQL for production, SQLite for development
  2. Route storage by durability needs - Fast NFS for work in progress, blob for shareable URLs
  3. Build recovery tooling - Detection, automatic recovery, monitoring
  4. Test failure scenarios - Kill the process mid-task, verify recovery works
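Checklist item 4 can be exercised without real infrastructure: simulate a crash mid-workflow and verify that the resume path does no repeated work. A minimal sketch (all names hypothetical):

```typescript
// A toy workflow that persists progress after every step, mirroring
// the checkpoint-per-step pattern. Returns how many steps it executed.
function runWorkflow(
  totalSteps: number,
  checkpoint: { step: number },
  crashAt?: number
): number {
  let executed = 0;
  for (let step = checkpoint.step + 1; step <= totalSteps; step++) {
    if (step === crashAt) throw new Error("connection dropped");
    executed++;             // do the step's work
    checkpoint.step = step; // persist progress after each step
  }
  return executed;
}

const checkpoint = { step: 0 };
// First run dies at step 13...
try {
  runWorkflow(14, checkpoint, 13);
} catch {
  // ...leaving the checkpoint at step 12.
}
// The resumed run executes only steps 13 and 14.
const resumed = runWorkflow(14, checkpoint);
```

The same shape works against a real agent: kill the process, call `stream(null, ...)` with the same thread_id, and assert that no tool call before the crash point runs twice.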

State is not an afterthought. It's the first thing you should design.

