# Agents Are Stateless: How to Solve It in Production
AI agents forget everything between runs. Here's how to implement checkpoints, persistent storage, and recovery for production-grade agent systems.

Every AI agent invocation starts fresh. No memory of previous runs. No access to files created last session. No awareness that a task was 90% complete before the connection dropped.
This is the dirty secret of agent development: the statelessness that makes agents simple to reason about also makes them fragile in production.
## The Problem
Consider a typical agent workflow:
- User requests a research report
- Agent executes 14 tool calls (web search, file writes, analysis)
- Connection drops on step 13
- Agent has no idea what happened
Without persistent state, the only options are:
- Start over from step 1 (wasted compute, frustrated users)
- Fail silently (broken workflows)
- Hope it doesn't happen (not a strategy)
The root cause: agent state lives in memory. Memory doesn't survive restarts.
## Solution 1: PostgreSQL Checkpointing
The first line of defense is checkpointing. After each graph step, persist the agent's complete state to a database.
LangGraph's PostgresSaver handles this automatically:
```typescript
import { PostgresSaver } from "@langchain/langgraph-checkpoint-postgres";

const checkpointer = new PostgresSaver(pool, {
  schema: "agent_checkpoints"
});
await checkpointer.setup(); // Creates the checkpoint tables

// Use with the agent
const agent = createDeepAgent({
  model,
  checkpointer, // State saved after each step
});

// Each invocation uses thread_id for continuity
await agent.stream({ messages }, {
  configurable: { thread_id: chatId }
});
```
What gets persisted:
- All messages (human, AI, tool)
- Tool call history with inputs and outputs
- Intermediate state between nodes
- Parent checkpoint references for history traversal
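The parent references are what make history traversal possible: each checkpoint points back at the one before it, forming a chain per thread. A minimal self-contained sketch of that idea (illustrative types, not LangGraph's actual schema):

```typescript
// Illustrative checkpoint record: each checkpoint points at its parent,
// forming a linked list per thread that can be walked backwards.
interface CheckpointRecord {
  id: string;
  threadId: string;
  parentId: string | null; // null for the first checkpoint in a thread
  state: unknown;          // serialized agent state at this step
}

// Walk from the latest checkpoint back to the start of the thread.
function history(
  latest: CheckpointRecord,
  byId: Map<string, CheckpointRecord>
): CheckpointRecord[] {
  const chain: CheckpointRecord[] = [];
  let current: CheckpointRecord | undefined = latest;
  while (current) {
    chain.push(current);
    current = current.parentId ? byId.get(current.parentId) : undefined;
  }
  return chain; // newest first
}
```

This is why "resume from the last checkpoint" is cheap: recovery only needs the newest record, while debugging can walk the whole chain.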
When failures occur, resume from the last checkpoint:
```typescript
// Inspect the stuck session's last checkpoint
const state = await agent.getState({
  configurable: { thread_id: chatId }
});

// Passing null as input resumes execution from the checkpoint
await agent.stream(null, {
  configurable: { thread_id: chatId }
});
```
The agent picks up exactly where it left off. No repeated work.
## Solution 2: Hybrid Storage Architecture
Checkpoints handle agent state. But agents also create files, process documents, and store results. These need persistent storage too.
The challenge: different data has different requirements.
| Data Type | Need | Solution |
|---|---|---|
| Work in progress | Fast read/write | NFS mount |
| Shareable documents | Public URLs | Blob storage |
| Temporary files | Speed, not durability | In-memory |
We solved this with a composite backend that routes by path:
```typescript
const backend = new CompositeBackend(
  new StateBackend(config), // Default: ephemeral, in-memory
  {
    "/projects/": new NFSBackend("projects"),   // NFS (Azure Files, EFS, Filestore)
    "/documents/": new NFSBackend("documents"), // NFS
    "/uploads/": new BlobBackend("uploads"),    // Blob (S3, Cloud Storage)
  }
);
```
When the agent writes to `/projects/acme-corp/report.md`, it goes to NFS. Fast, persistent, available across restarts.
When it writes to `/uploads/report.pdf`, it goes to blob storage. Slower, but it generates a shareable URL.
Files at the root (`/scratch.txt`) stay in memory. Fast, but gone after the session.
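Under the hood, this kind of routing is just longest-prefix matching on the path. A minimal self-contained sketch (the backend names here are stand-ins, not the real backend classes):

```typescript
// Minimal path-prefix router: pick the backend whose mount prefix
// is the longest match for the path; fall back to the default.
type BackendName = "memory" | "nfs" | "blob";

function routeBackend(
  path: string,
  mounts: Record<string, BackendName>,
  fallback: BackendName = "memory"
): BackendName {
  let best = "";
  for (const prefix of Object.keys(mounts)) {
    if (path.startsWith(prefix) && prefix.length > best.length) {
      best = prefix;
    }
  }
  return best ? mounts[best] : fallback;
}

const mounts: Record<string, BackendName> = {
  "/projects/": "nfs",
  "/documents/": "nfs",
  "/uploads/": "blob",
};
```

Longest-prefix matching matters once mounts nest (say `/projects/` and `/projects/archive/`): the more specific mount should win.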
### Performance Matters
NFS vs blob storage isn't just about persistence. It's about speed:
| Operation | Blob Storage | NFS Mount |
|---|---|---|
| Read | 150-200ms | 5-10ms |
| Write | 200-300ms | 10-20ms |
| Edit | 400-600ms | 10-50ms |
A 14-step workflow with file operations can save 3-5 seconds per run just from storage choice. At scale, that's hours of compute time.
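As a back-of-the-envelope check on that range, take the midpoints from the table and an assumed operation mix (the 6/5/3 split below is hypothetical, not measured):

```typescript
// Midpoint latencies (ms) from the table above.
const blob = { read: 175, write: 250, edit: 500 };
const nfs  = { read: 7.5, write: 15,  edit: 30 };

// Hypothetical 14-step workflow: 6 reads, 5 writes, 3 edits.
const ops = { read: 6, write: 5, edit: 3 };

function totalMs(latency: { read: number; write: number; edit: number }): number {
  return ops.read * latency.read + ops.write * latency.write + ops.edit * latency.edit;
}

const savedMs = totalMs(blob) - totalMs(nfs);
// Roughly 3.6 seconds saved per run, inside the 3-5 second range.
```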
## Solution 3: Recovery Service
Checkpoints and persistent storage are the foundation. But you also need operational tooling to detect and recover stuck sessions.
Our recovery service provides three capabilities:
### 1. Stuck Detection
```typescript
const stuckCheck = await checkStuckState(chatId);
// Returns:
// - isStuck: boolean
// - lastNode: where execution stopped
// - pendingToolCalls: incomplete tool calls
// - canRecover: whether resume is possible
```
### 2. Automatic Recovery
```typescript
if (stuckCheck.canRecover) {
  const result = await recoverStuckSession(chatId, agent);
  // Agent resumes from its last checkpoint
}
```
### 3. Health Monitoring
```typescript
const stats = await getCheckpointStats();
// Total checkpoints, unique threads, oldest/newest timestamps
```
This enables:
- Alerts when sessions get stuck
- Automatic recovery attempts
- Cleanup of old checkpoints
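Wired together, the three capabilities form a simple monitoring sweep. A hedged sketch with stand-in functions (a real service would call the checkpoint APIs above instead):

```typescript
// Stand-in shapes matching the recovery API sketched above.
interface StuckCheck {
  isStuck: boolean;
  canRecover: boolean;
  lastNode?: string;
}

type Checker = (chatId: string) => StuckCheck;
type Recoverer = (chatId: string) => boolean; // true if the session resumed

// One sweep over active sessions: recover what we can, alert on the rest.
function sweep(
  chatIds: string[],
  check: Checker,
  recover: Recoverer
): { recovered: string[]; alerts: string[] } {
  const recovered: string[] = [];
  const alerts: string[] = [];
  for (const id of chatIds) {
    const status = check(id);
    if (!status.isStuck) continue;
    if (status.canRecover && recover(id)) {
      recovered.push(id);
    } else {
      alerts.push(id); // needs human attention
    }
  }
  return { recovered, alerts };
}
```

Run on a schedule, this gives you the alerting, automatic-recovery, and cleanup loop in one place.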
## Putting It Together
Here's how these patterns combine in a production agent:
```typescript
import { HumanMessage } from "@langchain/core/messages";
import { PostgresSaver } from "@langchain/langgraph-checkpoint-postgres";

export class ProductionAgent {
  private checkpointer: PostgresSaver;
  private agent: ReturnType<typeof createDeepAgent>;
  private chatId: string;

  static async create(config: AgentConfig) {
    const agent = new ProductionAgent();

    // 1. Initialize the checkpointer
    agent.checkpointer = await getCheckpointer({
      schema: "agent_checkpoints"
    });

    // 2. Create the agent with a composite backend
    agent.agent = createDeepAgent({
      model,
      backend: createCompositeBackend,  // Hybrid storage
      checkpointer: agent.checkpointer, // State persistence
      middleware: [
        // 3. Error recovery middleware
        createErrorRecoveryMiddleware()
      ]
    });
    return agent;
  }

  async *stream(message: string) {
    try {
      yield* this.agent.stream(
        { messages: [new HumanMessage(message)] },
        { configurable: { thread_id: this.chatId } }
      );
    } catch (error) {
      // 4. Attempt recovery on failure before surfacing the error
      if (this.checkpointer) {
        yield* this.attemptRecovery();
      }
      throw error;
    }
  }
}
```
The result: an agent that survives restarts, handles network failures, and maintains context across sessions.
## The Bigger Picture
State management is what separates demos from production systems.
A demo can restart when things go wrong. A production system runs 24/7, handles thousands of concurrent sessions, and can't afford to lose work.
The patterns here - checkpointing, persistent storage, recovery services - aren't optional features. They're the foundation that everything else builds on.
Without them, you have a chatbot that occasionally succeeds at complex tasks.
With them, you have a production system that reliably completes work over hours or days.
## Implementation Checklist
- **Enable checkpointing:** PostgreSQL for production, SQLite for development
- **Route storage by durability needs:** fast NFS for work in progress, blob storage for shareable URLs
- **Build recovery tooling:** detection, automatic recovery, monitoring
- **Test failure scenarios:** kill the process mid-task and verify recovery works
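That last item can be exercised without real infrastructure. A minimal sketch of the idea: run a multi-step task, simulate a crash partway through, resume from the saved step index, and confirm every step ran exactly once (the in-memory store is a stand-in for a real checkpointer):

```typescript
// In-memory "checkpoint": just the index of the last completed step.
function runWithCheckpoint(
  steps: Array<() => void>,
  store: { lastDone: number },
  crashAfter?: number // simulate a crash after this step index
): void {
  for (let i = store.lastDone + 1; i < steps.length; i++) {
    steps[i]();
    store.lastDone = i; // persist progress before moving on
    if (crashAfter !== undefined && i === crashAfter) {
      throw new Error("simulated crash");
    }
  }
}

const executed: number[] = [];
const steps = [0, 1, 2, 3].map(i => () => executed.push(i));
const store = { lastDone: -1 };

try {
  runWithCheckpoint(steps, store, 1); // crash after step 1
} catch {
  // process "restarts" here; only `store` survives
}
runWithCheckpoint(steps, store); // resume: runs only steps 2 and 3
```

The assertion to make in a real test is the same: no step repeated, no step skipped, and the store holds the final index.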
State is not an afterthought. It's the first thing you should design.