| Building AI agents with LangGraph, I noticed graph invocations were slow even before hitting the LLM. Dug into the Pregel execution engine to find out why. THE PROBLEM Profiled my LangGraph agents. 50-100ms per invocation, most of it not the LLM. Found two culprits: 1. ThreadPoolExecutor created fresh every invoke() — 20ms overhead 2. Checkpointing uses deepcopy() — 52ms for 35KB state, 206ms for 250KB THE FIX Rewrote hot paths in Rust via PyO3: Checkpoint serialization (serde vs deepcopy): 35KB state: 0.29ms vs 52ms = 178x faster 250KB state: 0.28ms vs 206ms = 737x faster E2E with checkpointing: 2-3x faster Drop-in usage: export FAST_LANGGRAPH_AUTO_PATCH=1 # or explicit
from fast_langgraph import RustSQLiteCheckpointer checkpointer = RustSQLiteCheckpointer("state.db") KEY INSIGHT PyO3 boundary costs ~1-2μs per call. Rust only wins when you: - Avoid intermediate Python objects (checkpoint serialization) - Batch operations (channel updates) - Handle large data (state > 10KB) For simple dict ops, Python's C-dict still wins. Architecture: Python orchestration (compatibility) + Rust hot paths (performance). Runs regular compatibility checks! MIT licensed. Feedback welcome. |