by pjsousa79 110 days ago
One thing that seems to be missing in most discussions about "context" is infrastructure.

The dream system for AI agents is probably something like a curated data hub: a place where datasets are continuously ingested, cleaned, structured and documented, so agents can query it to obtain reliable context.

Right now most agents spend a lot of effort stitching context together from random APIs, web scraping, PDFs, etc. The result is brittle and inconsistent.

If models become interchangeable, the real leverage might come from shared context layers that many agents can query.

2 comments

I’m currently working on building this layer. It’s an interesting problem even when you remove AI agents from the picture: I feel a context layer can be equally useful for humans and deterministic programs. I view it as a data structure sitting on top of your entire domain, and this data structure’s query interface plus some basic tools should be enough to bootstrap nontrivial agents, imo. I think the data structure best suited to this problem is a graph, with the different types of data represented as graphs.

Stitching API calls together is analogous to representing relationships between entities, and that’s ultimately why I think graph databases have a chance in this space. As any domain grows, the relationships usually grow at a higher rate than the nodes, so you want a query language that is optimized for traversing relationships between things. This is where the pattern-matching approach of ISO GQL, which was inspired by Cypher, is more token efficient than SQL. The problem is that our foundation models have seen way, way more SQL, so there is a training gap, but I would bet that if the training data were equally abundant we’d see better performance on Cypher than on SQL.

I know there is GraphRAG and hybrid approaches involving vector embeddings and graph embeddings, but maybe we also need to reduce API calls down to semantic graph queries on their respective domains so we just have one giant graph we can scavenge for context.

This resonates strongly. We've been working on exactly this problem with ArcadeDB — a multi-model database that natively supports graphs, documents, key-value, time-series, and vector search in a single engine. (https://arcadedb.com)

The insight about relationships growing faster than nodes is spot on, and it's why we think the graph model is the natural fit for context layers. But in practice, you also need documents, vectors, and sometimes time-series data alongside the graph. Forcing everything into a single model (or stitching together multiple databases) creates friction that kills agent workflows.

On the GQL/Cypher vs SQL point — agreed on token efficiency. We support both SQL (extended with graph capabilities) and Cypher-style syntax, and the difference in prompt size for traversal queries is dramatic. An N-hop relationship query that takes 5+ lines of SQL JOINs is a single readable line in a graph query language. For LLM-generated queries, that's not just an aesthetic win — it directly reduces error rates and token costs.
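To make the size difference concrete, here is a rough sketch of a 3-hop "who does X reach" query. The `person`/`knows` schema and the names are made up for illustration. The SQL runs against an in-memory SQLite database; the Cypher-style string is only shown for comparison and is not executed here, and character count is a crude stand-in for token count.

```python
import sqlite3

# Hypothetical toy schema: people and a directed "knows" relationship.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE knows (src INTEGER, dst INTEGER);
INSERT INTO person VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy'), (4, 'Dee');
INSERT INTO knows VALUES (1, 2), (2, 3), (3, 4);
""")

# Three hops in SQL: one JOIN per hop, plus one to resolve the final name.
SQL_3_HOP = """
SELECT p4.name
FROM person p1
JOIN knows k1 ON k1.src = p1.id
JOIN knows k2 ON k2.src = k1.dst
JOIN knows k3 ON k3.src = k2.dst
JOIN person p4 ON p4.id = k3.dst
WHERE p1.name = ?
"""

# The same intent as a Cypher-style pattern (illustrative only, not run here).
CYPHER_3_HOP = "MATCH (a:Person {name: $name})-[:KNOWS*3]->(b:Person) RETURN b.name"


def three_hops_sql(name):
    """Return the names reachable from `name` in exactly three hops."""
    return [row[0] for row in conn.execute(SQL_3_HOP, (name,))]


print(three_hops_sql("Ann"))
print(len(SQL_3_HOP), "chars of SQL vs", len(CYPHER_3_HOP), "chars of Cypher")
```

Each extra hop adds another JOIN line to the SQL, while the pattern only changes the hop count. On graphs with cycles the two can return different rows, since Cypher does not reuse a relationship within one matched path.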

Re: GraphRAG — we've seen the same convergence. Vector similarity to find the right neighborhood, then graph traversal for structured context. Having both in one engine (ArcadeDB supports vector indexing natively) means you avoid the API orchestration overhead you mention. One query, one database, full context.
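The "vector similarity first, then graph traversal" pattern can be sketched in plain Python. This is not ArcadeDB's API, just an in-memory stand-in: the node names, embeddings and edges below are invented, and a real engine would use a vector index instead of a linear scan.

```python
import math
from collections import deque

# Hypothetical toy data: one embedding per node, plus outgoing edges.
EMBEDDINGS = {
    "invoice": [0.9, 0.1, 0.0],
    "customer": [0.1, 0.9, 0.0],
    "shipment": [0.0, 0.2, 0.9],
}
EDGES = {
    "invoice": ["customer"],
    "customer": ["shipment"],
    "shipment": [],
}


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def nearest_node(query_vec):
    """Step 1: pick the node whose embedding is most similar to the query."""
    return max(EMBEDDINGS, key=lambda n: cosine(query_vec, EMBEDDINGS[n]))


def neighborhood(start, max_hops):
    """Step 2: breadth-first traversal, collecting nodes within max_hops."""
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand past the hop limit
        for nxt in EDGES[node]:
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append((nxt, depth + 1))
    return order


def context_for(query_vec, max_hops=1):
    """Vector lookup to find the seed, then graph traversal for context."""
    return neighborhood(nearest_node(query_vec), max_hops)


print(context_for([1.0, 0.0, 0.0], max_hops=2))
```

With both steps in one engine, the seed lookup and the traversal would be a single query rather than two round trips.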

The training gap for graph query languages is real but closing fast. As more agent frameworks adopt graph-based context, the flywheel will kick in.

Data should not be ingested. Data should originate from the same environment that you want to activate it in. That means you need to build a system from the ground up for your searches, your document creation, etc., so that this data is native to your system and easily referenced in your commands to the LLM interface.

The best example of this is probably CrewAI and Alibaba CoPaw. CoPaw has a demo up.