by asteroidz 1178 days ago
The problem with your suggested approach is the resulting lack of holistic context. The problem with OP's approach (direct parsing) is cost and context-window limits. There has to be a better way.

It's searching by semantic meaning, so it should be able to find all relevant pieces. Using overlap during chunking should help too.
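
For anyone wondering what "overlap during chunking" might look like in practice, here is a minimal sketch. It splits on raw characters rather than tokens or code structure, and the function name, `chunk_size`, and `overlap` are all hypothetical choices for illustration, not anything from a particular library.

```python
def chunk_with_overlap(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that share `overlap` characters.

    Hypothetical helper: a real pipeline would more likely split on tokens,
    lines, or function boundaries before embedding each chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


if __name__ == "__main__":
    for piece in chunk_with_overlap("abcdefghij", chunk_size=4, overlap=2):
        print(piece)
```

The point of the overlap is that something cut off at the end of one chunk shows up whole at the start of the next, so a search hit is less likely to land on half a definition.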

Using the "give it everything" method will cause it to forget most of what you're feeding it if you have a large repo anyway, right?

I don't think it will forget anything as long as everything fits in the context window, but I could totally be wrong. That's the big problem with the "give it everything" approach: if your codebase doesn't fit then it's game over. I've had success limiting what I give it to the relevant files.
Right, "forgetting" assumes a rolling context window capped at the model's max token count. As for "if it doesn't fit then it's game over": assuming 8k tokens with each token being ~4 characters, that's roughly 32,000 characters, which is a pretty small repo. And that's a motivator behind the similarity search approach.
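
To make that back-of-the-envelope math concrete, here is a rough sketch. It treats "8k" as a round 8,000 tokens and uses the ~4 characters per token figure from above. Both are approximations (real tokenizers vary by model and by content), and the function names are hypothetical.

```python
import math


def estimate_tokens(num_chars: int, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from a character count (assumes ~4 chars per token)."""
    return math.ceil(num_chars / chars_per_token)


def max_chars(max_tokens: int = 8_000, chars_per_token: float = 4.0) -> int:
    """Approximate number of characters that fit in a context window."""
    return int(max_tokens * chars_per_token)


def fits_in_context(num_chars: int, max_tokens: int = 8_000,
                    chars_per_token: float = 4.0) -> bool:
    """True if the estimated token count stays within the window."""
    return estimate_tokens(num_chars, chars_per_token) <= max_tokens


if __name__ == "__main__":
    print(max_chars())               # approximate character budget for 8,000 tokens
    print(fits_in_context(32_000))   # right at the limit
    print(fits_in_context(150_000))  # a modest repo already overflows
```

This ignores the room needed for the prompt itself and for the model's reply, so the usable budget is smaller still.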