| @silentsvn - thank you for reading carefully enough to ask this. You're correct that the core scoring is tag‑based and deterministic, which makes it lexical, not "semantic" in the modern embedding sense. The terminology is worth unpacking. We call it "semantic" in the broader sense of meaning‑bearing structure: the graph encodes relationships between concepts, and retrieval walks those relationships. At query time, though, it matches on tags, not vector similarity.
Why not embeddings?
We made a deliberate trade‑off: determinism and explainability over fuzziness. With vector search, you get a black‑box similarity score, and it is hard to debug why something was retrieved. With tag‑based traversal, you can trace the exact path: "This result matched because it shares tags X, Y, Z and is within 2 hops of your query." That matters for agentic workflows where auditability is critical.
Tag extraction is where we do the work to bridge the lexical gap. The atomization pipeline uses:
- Wink NLP for entity recognition and part‑of‑speech filtering (so "authentication" and "JWT" both get tagged with relevant concepts if they appear in context).
- Co‑occurrence windows to infer relationships (e.g., if "JWT" and "authentication" repeatedly appear near each other, they get linked in the graph).
- Synonym expansion (via Standard 111) so queries for "authentication" can surface nodes tagged with "JWT" if the system has learned that relationship from your corpus.
It's not magic - if you never mention "JWT" in the same context as "authentication," the graph won't connect them. But that's a feature, not a bug: the system reflects your actual usage, not a statistical average of the internet.
The trade‑off is real: you give up the fuzzy "close enough" retrieval of vectors in exchange for perfect traceability and no embedding drift. For many use cases (project memory, execution traces, personal knowledge bases), that's the right call.
I'd love to hear more about what you're building in this space. Always good to find others thinking about these trade‑offs. |
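For readers following along, here is a minimal sketch of the two ideas in the reply above: linking tags by co-occurrence, then walking the graph within a hop limit while recording the exact path. It is not the project's actual pipeline. The function names, the `window` and `min_count` parameters, and the sample data are all hypothetical, and the input is assumed to be already reduced to tags (the step the reply attributes to Wink NLP).

```python
from collections import defaultdict, deque


def build_cooccurrence_graph(docs, window=2, min_count=2):
    """Link two tags once they co-occur within `window` positions
    at least `min_count` times across the corpus (hypothetical rule)."""
    counts = defaultdict(int)
    for tags in docs:
        for i, a in enumerate(tags):
            # Pair each tag with the next `window` tags in the same document.
            for b in tags[i + 1 : i + 1 + window]:
                if a != b:
                    counts[frozenset((a, b))] += 1

    graph = defaultdict(set)
    for pair, n in counts.items():
        if n >= min_count:
            a, b = tuple(pair)
            graph[a].add(b)
            graph[b].add(a)
    return graph


def trace_paths(graph, query_tag, max_hops=2):
    """Breadth-first walk that keeps the tag path to every tag
    reachable within `max_hops`, so each hit can be explained."""
    paths = {query_tag: [query_tag]}
    queue = deque([query_tag])
    while queue:
        tag = queue.popleft()
        if len(paths[tag]) - 1 >= max_hops:
            continue  # already at the hop limit, do not expand further
        for neighbor in sorted(graph.get(tag, ())):
            if neighbor not in paths:
                paths[neighbor] = paths[tag] + [neighbor]
                queue.append(neighbor)
    return paths


# Hypothetical corpus: each inner list is the tag sequence of one document.
docs = [
    ["jwt", "authentication", "middleware"],
    ["authentication", "jwt", "token"],
    ["token", "refresh", "jwt"],
    ["token", "jwt", "expiry"],
    ["css", "layout"],
]
graph = build_cooccurrence_graph(docs)
paths = trace_paths(graph, "authentication")
print(paths["token"])  # ['authentication', 'jwt', 'token']
```

Because the same corpus always produces the same graph and the same paths, a result can be explained by printing its path. Tags that never co-occur often enough (here `css`) are simply unreachable, which matches the "it reflects your actual usage" point.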
The determinism trade-off is genuinely interesting — auditability over fuzziness is a real design philosophy, not just a limitation.
We've been building something that tries to avoid forcing that choice. Engram uses three strategies in parallel: vector embeddings (nomic-embed-text via Ollama, local-first), BM25 keyword search, and temporal recency, merged with Reciprocal Rank Fusion. Each result comes back with an explicit similarity score and the tier it came from (working memory / long-term / archived), so the retrieval path is still traceable even when it's fuzzy.
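To make the merge step concrete, here is a minimal sketch of Reciprocal Rank Fusion over three ranked lists. It is an illustration, not Engram's actual code. The function name, the memory ids, and the choice to return per-strategy ranks are assumptions, and `k=60` is just the constant commonly used with RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked id lists: each list adds 1 / (k + rank) to an id's score.

    `rankings` maps a strategy name to ids ordered best-first.
    Returns (id, fused_score, [(strategy, rank), ...]) tuples, best first.
    """
    scores = defaultdict(float)
    sources = defaultdict(list)
    for strategy, ranked_ids in rankings.items():
        for rank, item_id in enumerate(ranked_ids, start=1):
            scores[item_id] += 1.0 / (k + rank)
            # Remember where each contribution came from, for traceability.
            sources[item_id].append((strategy, rank))
    # Sort by score descending; break ties by id so output is deterministic.
    ordered = sorted(scores, key=lambda item_id: (-scores[item_id], item_id))
    return [(item_id, scores[item_id], sources[item_id]) for item_id in ordered]


# Hypothetical result lists from the three strategies.
rankings = {
    "vector": ["m3", "m1", "m2"],
    "bm25": ["m1", "m4", "m3"],
    "recency": ["m2", "m1", "m5"],
}
for item_id, score, why in reciprocal_rank_fusion(rankings):
    print(item_id, round(score, 4), why)
```

RRF uses only rank positions, so the raw scores from embeddings, BM25, and recency never need to be normalized against each other. Keeping the `(strategy, rank)` pairs is one way to keep a fused result explainable.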
We also layer on a graph component similar to yours: entity-relationship extraction that augments top results with connected context. The difference is that our graph is additive on top of embedding retrieval rather than the primary mechanism.
The place your approach wins clearly is corpus-specific precision. If the graph is built from your actual usage (your JWT/authentication example), tag traversal will reliably surface relationships that vectors would miss or dilute with internet priors. That's a real advantage for execution traces and project memory.
Still working through the right defaults for consolidation (when to summarize old working memories vs keep them granular). Curious whether you've thought about memory aging in your model.
Repo if curious: http://github.com/Cartisien/engram