by zibzob 1199 days ago
Does anyone know if there's a way to use this technology to help understand a large codebase? I want a way to ask questions about how a big unfamiliar codebase works. It seems like ChatGPT isn't trained on open source code, so it can't answer questions there. When I asked it how something works in the Krita source, it just hallucinated some gibberish. But if there's a way to train this AI on a specific codebase, maybe it could be really useful. Or is that not possible with this type of AI?
12 comments

ChatGPT does not understand your code, does not have the same mental model of your code as you do, and, from my experiments, cannot connect related but spatially disconnected concepts across even small codebases, which will cause it to introduce bugs.

Asking it about these things sounds like it would produce questionable responses at best.

Saying a machine cannot understand the way humans understand is like saying airplanes cannot fly the way birds fly.
Which is correct, and a big reason why early flight machines had no chance at all of working.

Of course, that doesn't tell you whether the machine understanding will be useful or not

I see, that's what I was worried about. It would be really helpful if it could answer high-level questions about a big confusing codebase, but maybe it's not just a matter of showing it the code and having it work.
ChatGPT has a published context window of 4096 tokens. Although, I saw someone on Twitter saying the real figure, based on experiments, was closer to 8192 tokens. [0] Still, that’s an obvious roadblock to “understanding” large code bases - large code bases are too big to fit in its “short-term memory”, and at runtime its “long-term memory” is effectively read-only. Some possible approaches:

(A) wait for future models that are planned to have much longer contexts

(B) fine tune a model on this specific code base, so the code base is part of the training data not the prompt

(C) Break the problem up into multiple invocations of the model. Feed each source file in separately and ask it to give a brief plain text summary of each. Then concatenate those summaries and ask it questions about it. Still probably not going to perform that well, but likely better than just giving it a large code base directly
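For what it's worth, approach (C) can be sketched in a few lines. This is only a rough illustration: `ask_model` is a hypothetical callable standing in for whatever model API you use (it takes a prompt string and returns the reply), and the prompt wording is made up.

```python
from typing import Callable, Dict


def summarize_codebase(files: Dict[str, str], ask_model: Callable[[str], str]) -> str:
    """Ask the model for a short summary of each file, one call per file."""
    summaries = []
    for path, source in sorted(files.items()):
        prompt = f"Give a brief plain-text summary of this file ({path}):\n\n{source}"
        summaries.append(f"{path}: {ask_model(prompt)}")
    # The concatenated summaries are much smaller than the code itself.
    return "\n".join(summaries)


def ask_about_codebase(
    files: Dict[str, str], question: str, ask_model: Callable[[str], str]
) -> str:
    """Answer a question using only the per-file summaries as context."""
    overview = summarize_codebase(files, ask_model)
    prompt = f"File summaries:\n{overview}\n\nQuestion: {question}"
    return ask_model(prompt)
```

Each file still has to fit in the context window on its own, and for a big project the concatenated summaries may need a second round of summarizing before they fit.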

Another issue is that, even the best of us make mistakes sometimes, but then we try the answer and see it doesn’t work (compilation error, we remembered the name of the class wrong because there is no class by that name in the source code, etc). OOTB, ChatGPT has no access to compilers/etc so it can’t validate its answers. If one gave it access to an external system for doing that, it would likely perform better.
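As a rough idea of what such an external check could look like, here is a small sketch for Python sources only. The function names are my own invention, and a real setup would run an actual compiler or test suite; this just checks that a snippet parses and that the class names an answer mentions exist in the code.

```python
import ast
from typing import Iterable, List, Set


def compiles(code: str) -> bool:
    """Return True if the snippet is at least syntactically valid Python."""
    try:
        compile(code, "<model-answer>", "exec")
        return True
    except SyntaxError:
        return False


def defined_classes(sources: Iterable[str]) -> Set[str]:
    """Collect every class name defined in the given source files."""
    names: Set[str] = set()
    for src in sources:
        for node in ast.walk(ast.parse(src)):
            if isinstance(node, ast.ClassDef):
                names.add(node.name)
    return names


def unknown_classes(claimed: Iterable[str], sources: Iterable[str]) -> List[str]:
    """Return the claimed class names that do not exist in the sources."""
    known = defined_classes(sources)
    return sorted(name for name in claimed if name not in known)
```

Anything returned by `unknown_classes` could be fed back to the model as an error message, the same way a person would react to a failed build.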

[0] https://mobile.twitter.com/goodside/status/15988746742046187...

Have you checked out Copilot Labs, the experimental version of Copilot? It's bundled with ability to explain and document source code, among other things.

https://githubnext.com/projects/copilot-labs/

That looks promising! But I think it only works on small snippets of code and doesn't have an overview of the whole codebase...still, maybe it's coming down the line as they improve it.
OpenAI Codex understands code. Though its primary use case is code completion, it might be able to do Q&A well given a prompt with context.

https://platform.openai.com/docs/guides/code

If you’re interested in trying the very cheap models behind ChatGPT, you may want to have a look at langchain and langchain-chat for an example of how to build a chatbot that uses vectorized source code to build context-aware prompts.
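To give a feel for the idea without depending on any library, here is a toy sketch. It is not how langchain does it: word counts stand in for real embeddings, and all the function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def vectorize(text: str) -> Counter:
    """Toy 'embedding': a bag of lowercase words (real systems use learned vectors)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_chunks(question: str, chunks: Dict[str, str], k: int = 2) -> List[str]:
    """Return the names of the k code chunks most similar to the question."""
    q = vectorize(question)
    ranked = sorted(chunks, key=lambda name: cosine(q, vectorize(chunks[name])), reverse=True)
    return ranked[:k]


def build_prompt(question: str, chunks: Dict[str, str], k: int = 2) -> str:
    """Put only the most relevant chunks into the prompt, then the question."""
    context = "\n\n".join(f"# {name}\n{chunks[name]}" for name in top_chunks(question, chunks, k))
    return f"Relevant code:\n{context}\n\nQuestion: {question}"
```

The point is that only the few chunks closest to the question go into the prompt, so the size of the whole codebase stops mattering for the context limit.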

Thanks for the links, I'll take a look at this and see if it's something I could reasonably achieve.
ChatGPT also has some understanding of code. For example, you can ask it to translate from one programming language to another.
This is what we've designed LlamaIndex for! https://github.com/jerryjliu/gpt_index. Designed to help you "index" over a large doc corpus in different ways for use with LLM prompts.
In this case you can feed it bits of code you're interested in and ask it to explain them. The API has a limit of 4096 tokens (which is a good chunk of text).
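If a file is bigger than that, one option is to split it on line boundaries first. A minimal sketch, assuming the rough rule of thumb of about 4 characters per token (real tokenizers differ) and a made-up `reserve` parameter that leaves room for the question and the reply:

```python
from typing import List


def rough_token_count(text: str) -> int:
    """Crude estimate: about 4 characters per token. An assumption, not a real tokenizer."""
    return max(1, len(text) // 4)


def split_for_context(source: str, max_tokens: int = 4096, reserve: int = 1024) -> List[str]:
    """Split source into line-aligned chunks that each fit the token budget."""
    budget = max_tokens - reserve
    chunks: List[str] = []
    current: List[str] = []
    used = 0
    for line in source.splitlines(keepends=True):
        cost = rough_token_count(line)
        # Start a new chunk when adding this line would exceed the budget.
        if current and used + cost > budget:
            chunks.append("".join(current))
            current, used = [], 0
        # A single line larger than the budget still becomes its own chunk.
        current.append(line)
        used += cost
    if current:
        chunks.append("".join(current))
    return chunks
```

Joining the chunks back together gives the original text, so nothing is lost; splitting on function boundaries instead of raw lines would probably give the model better context.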

I actually built a slack bot for work and daily ask it to refactor code or "write jsdocs for this function"

Yeah, and this is pretty useful for small bits of code, but what I want is a way to ask questions about large projects. It would be nice to ask something like "which classes are responsible for doing X", or "describe on a high level how Y works in this code". But I'm not sure if that is actually possible with the current technology.
It’s possible to do this either by fine-tuning an existing model or by using an existing chat model with prompts enriched by a vector search for relevant code. See my comment elsewhere.
We've invested a lot into helping LLMs reason about and explain large codebases. We use a hybrid approach of local models for semantic search and a mix of OpenAI and Anthropic's models for language output and summarisation.

We're two years in but everything still feels super early given how quickly the fundamentals are improving. Would love your feedback - https://bloop.ai

There's good work happening in this area, e.g. Sourcegraph is working on "Cody" to understand and search your code base https://twitter.com/beyang/status/1614895568949764096
It “knows” the AWS API, CloudFormation and, from what others have told me, the CDK pretty well. I’ve asked it to write plenty of 20-30 line utility scripts and, with the proper prompts, it gets me 95% of the way there.

I assume it would “understand” more popular open source frameworks.

No, it can't understand anything large. But if you are unfamiliar with specific language features, or some code is confusing, it can help you figure things out. It is not good for any large corpus of text, though, and you can't give it new stuff and teach it anything.
On which data would you train it?

For code completion for example, you can just train it with a whole bunch of code.

But to explain large code bases, you need to train it with both large codebases and explanations. As far as I know, there are no such explanations available.

That's true, but it works with smaller pieces of code already. You can paste a function into ChatGPT and it will attempt to explain how the code works. Maybe there are enough existing explanations of high-level concepts on the internet for it to work, if it just has the larger codebase in its training data as well. At this point I am wary of making predictions about what this type of AI is able to do. :)
You are describing https://www.buildt.ai/ (which uses ChatGPT among other technologies, and was also featured on HN a while back.)
I'm actually building this very thing—shoot me an email at govind <dot> gnanakumar <at> outlook <dot> com if you'd like to be a beta tester.