Hacker News new | ask | show | jobs
Show HN: Self-hosted RAG with MCP support for OpenClaw (github.com)
4 points by 2dogsanerd 138 days ago
I've been using OpenClaw to control my home server via WhatsApp, but it couldn't access my documents. Instead of uploading my private contracts to OpenAI, I built ClawRAG – a self-hosted RAG engine that connects to OpenClaw via MCP (Model Context Protocol). Now I can ask "What did the contract say about liability?" and get cited answers, not hallucinations.

Most RAG systems are either too complex for a solo dev's home setup or they rely on cloud-hosted vector stores. I needed something that runs in a single Docker container, understands messy PDFs (tables!), and integrates natively as a "tool" for agents rather than just another REST endpoint.

## Technical Deep Dive

### Why MCP instead of REST? I chose the Model Context Protocol (MCP) because it provides structured schemas that LLMs understand natively. The MCP server exposes `query_knowledge` as a tool, allowing the agent to decide exactly when to pull from the knowledge base vs. when to use its built-in memory. It prevents "tool-drift" and ensures type-safe responses.

### The Stack - *Parsing*: Docling 2.13.0 (The first parser I've found that doesn't choke on nested tables in legacy PDFs). - *Storage*: ChromaDB (Lightweight, file-based, no Postgres/pgvector overhead needed for personal knowledge bases). - *Search*: Hybrid (Vector similarity + BM25 keyword search) fused using Reciprocal Rank Fusion (RRF) for better retrieval on specific legal jargon. - *Footprint*: Optimized to run under 2GB RAM (excluding the local LLM).

### The tricky part: Citation Preservation Getting citations to work reliably over a WhatsApp round-trip was the biggest challenge. I had to ensure chunk IDs and source metadata survive the transformation from ChromaDB → LlamaIndex → LLM → OpenClaw → WhatsApp without getting "summarized away" or sanitized by the LLM's output formatting.

## Use Case Last week my landlord claimed I signed a clause about garden/snow maintenance. I pulled up my phone, wrote to my OpenClaw bot: "Search my lease for gardening obligations". It found the relevant paragraph in 3 seconds, cited the page/section, and provided the exact quote. Argument closed.

## Quick Start The repo includes a `docker-compose.yml` that spins up everything including the vector store:

```bash # 1. Start ClawRAG docker compose up -d

# 2. Add your documents curl -X POST http://localhost:8080/api/v1/rag/documents/upload \ -F "files=@my_lease.pdf" \ -F "collection_name=personal"

# 3. Connect to your agent openclaw mcp add --transport stdio clawrag npx -y @clawrag/mcp-server ```

## Community & Feedback Code is MIT licensed. I'd love feedback on the MCP implementation – specifically if you see better ways to handle tool schemas for multi-collection search.

*Ask me anything about the architecture or how I handled the citation logic!*

---

### Hidden Technical Details - *Privacy*: Zero external data leaks. Everything stays on your metal. - *LLM Agnostic*: Tested with Ollama (Llama 3.2) and Claude 3.5 via API. - *Context Management*: Explicit context window limiting to prevent GPU crashes on 8GB VRAM cards.

1 comments

This is a brilliant use of the Model Context Protocol (MCP). Using query_knowledge as a tool rather than a generic REST endpoint is definitely the right move for reducing hallucinations in legal/contractual contexts. The citation preservation over WhatsApp is a particularly nice touch—that's usually where these workflows fall apart.

My only concern with the self-hosted Docker + Docling + ChromaDB stack is the 'maintenance tax.' It’s great for a solo dev, but for a production-grade personal assistant that needs to stay 'always-on' without me babying the container, I've been looking at PAIO (Personal AI Operator).

They seem to be aiming for this exact 'Private RAG' sweet spot but as a managed, one-click service. Their BYOK architecture is what sold me; it keeps the security risk low because it’s using your own keys, but you get that fortress-level privacy that’s hard to replicate in a home-server setup without a lot of manual hardening.

Are you planning to add support for other 'operators' like PAIO, or is the goal to keep ClawRAG strictly as a standalone self-hosted primitive?

Thanks! The Community Edition is intentionally a 'primitive' – single container, zero config However, I disagree on the 'maintenance tax'. ClawRAG isn't a weekend project; it's the retrieval engine extracted directly from our Enterprise RAG Core V4 system. It keeps the same connection pooling and health checks, so it is built to be 'always-on' without babying The full V4 system adds Governance (Solomon Consensus/Multi-lane validation), not basic stability. You don't need the Enterprise layer just to keep the lights on

Re: PAIO – if they implement an MCP Client, ClawRAG can serve them. But I'd argue: if you already run a host, adding a container gives you provable privacy vs. 'trust us' managed services. I prefer owning the keys AND the lock ;-)