
by danenania 798 days ago

This all seems pretty sensible. Another area that would be nice to see addressed is strategies for balancing latency/cost/performance when data is frequently updated. I'm building a terminal-based AI coding tool[1] and have been thinking about how to bring RAG into the picture, as it clearly could add value, but the tradeoffs are tricky to get right.

The options, as far as I can tell, are:

- Re-embed lazily as needed at prompt-time. This should be the cheapest as it minimizes the number of embedding calls, but it's the most expensive in terms of latency.

- Re-embed eagerly after updates (perhaps with some delay and throttling to avoid rapid-fire embed calls). Great for latency, but can get very expensive.

- Some combination of the above two options. This seems to be what many IDE-based AI tools like GH Copilot are doing. An issue with this approach is that it's hard to ever know for sure what's updated in the RAG index and what's stale, and what exactly is getting added to context at any given time.
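For the second option, the "delay and throttling" part could look roughly like the sketch below: only re-embed a file once it has been quiet for a couple of seconds, so a burst of saves turns into one embed call. Everything here is hypothetical (`DebouncedReembedder`, `embed_fn`, the 2-second default); `embed_fn` stands in for whatever embedding API you'd actually call, and the caller passes in the clock so the logic stays easy to test.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DebouncedReembedder:
    """Hypothetical helper: re-embed a file only after `delay` quiet seconds."""

    embed_fn: Callable[[str], List[float]]  # assumption: stand-in for a real embedding call
    delay: float = 2.0                      # assumption: arbitrary quiet period
    pending: Dict[str, float] = field(default_factory=dict)   # path -> time of last edit
    contents: Dict[str, str] = field(default_factory=dict)    # path -> latest text
    vectors: Dict[str, List[float]] = field(default_factory=dict)

    def on_update(self, path: str, text: str, now: float) -> None:
        # Keep only the newest text and restart the quiet-period timer.
        self.contents[path] = text
        self.pending[path] = now

    def flush(self, now: float) -> List[str]:
        # Embed every file whose last edit is at least `delay` seconds old.
        ready = [p for p, t in self.pending.items() if now - t >= self.delay]
        for path in ready:
            self.vectors[path] = self.embed_fn(self.contents[path])
            del self.pending[path]
        return ready
```

You'd call `on_update` from a file watcher and `flush` on a timer. Cost still scales with how often files settle, which is why this can get expensive on a busy repo.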

I'm leaning toward the first option (lazy on-demand embedding) and letting the user decide whether the latency cost is worth it for their task vs. just manually selecting the exact context they want to load.
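To make the lazy option concrete, here's a minimal sketch of what I have in mind, not how Plandex actually works: keep a content hash next to each vector, and at prompt time re-embed only the files whose hash changed. `LazyEmbeddingIndex` and `embed_fn` are made-up names, and `embed_fn` is a placeholder for a real embedding call.

```python
import hashlib
from typing import Callable, Dict, List, Tuple


class LazyEmbeddingIndex:
    """Hypothetical index that re-embeds stale files only when asked."""

    def __init__(self, embed_fn: Callable[[str], List[float]]) -> None:
        self.embed_fn = embed_fn  # assumption: stand-in for a real embedding call
        # path -> (sha256 of content, embedding vector)
        self._entries: Dict[str, Tuple[str, List[float]]] = {}

    def refresh(self, files: Dict[str, str]) -> List[str]:
        """Re-embed files whose content changed; return the paths re-embedded."""
        reembedded = []
        for path, text in files.items():
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            cached = self._entries.get(path)
            if cached is None or cached[0] != digest:
                self._entries[path] = (digest, self.embed_fn(text))
                reembedded.append(path)
        # Drop entries for files that no longer exist.
        for path in set(self._entries) - set(files):
            del self._entries[path]
        return reembedded

    def vector(self, path: str) -> List[float]:
        return self._entries[path][1]
```

The nice property is that `refresh` returns exactly which files were stale, so the tool can show the user what got re-embedded before anything is added to context. The latency hit is proportional to how much changed since the last prompt.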

1 - https://github.com/plandex-ai/plandex

1 comment

3abiton 798 days ago

Any benchmarks on performance for on-demand embeddings?
