HN Mirror



	by zibzob 1199 days ago
	Does anyone know if there's a way to use this technology to help understand a large codebase? I want a way to ask questions about how a big unfamiliar codebase works. It seems like ChatGPT isn't trained on open source code, so it can't answer questions there. When I asked it how something works in the Krita source, it just hallucinated some gibberish. But if there's a way to train this AI on a specific codebase, maybe it could be really useful. Or is that not possible with this type of AI?

12 comments

bcrosby95 1199 days ago

ChatGPT does not understand your code, does not have the same mental model as you do of your code, and from my experiments does not have the ability to connect related but spatially disconnected concepts across even small codebases which will cause it to introduce bugs.

Asking it about these things sounds like it would result in questionable, at best, responses.

panarky 1199 days ago

Saying a machine cannot understand the way humans understand is like saying airplanes cannot fly the way birds fly.

Dudeman112 1198 days ago

Which is correct and a big reason for why early flight machines had no chance at all of working

Of course, that doesn't tell you whether the machine understanding will be useful or not

zibzob 1199 days ago

I see, that's what I was worried about. It would be really helpful if it could answer high-level questions about a big confusing codebase, but maybe it's not just a matter of showing it the code and having it work.

skissane 1199 days ago

ChatGPT has a published context window of 4096 tokens. Although, I saw someone on Twitter saying the real figure, based on experiments, was closer to 8192 tokens. [0] Still, that’s an obvious roadblock to “understanding” large code bases - large code bases are too big to fit in its “short-term memory”, and at runtime its “long-term memory” is effectively read-only. Some possible approaches:

(A) wait for future models that are planned to have much longer contexts

(B) fine tune a model on this specific code base, so the code base is part of the training data not the prompt

(C) Break the problem up into multiple invocations of the model. Feed each source file in separately and ask it to give a brief plain text summary of each. Then concatenate those summaries and ask it questions about it. Still probably not going to perform that well, but likely better than just giving it a large code base directly
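A minimal sketch of approach (C), assuming the model is reachable through some function that takes a prompt and returns text. The `llm` callable, the function names, and the prompt wording are all hypothetical placeholders, not any real API; large files would still need splitting to fit the context window.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable

# Hypothetical stand-in for whatever model call you use: prompt in, text out.
LLM = Callable[[str], str]


def summarize_files(paths: Iterable[Path], llm: LLM) -> Dict[str, str]:
    """Step 1: ask for a brief plain-text summary of each source file separately."""
    summaries = {}
    for path in paths:
        source = path.read_text(encoding="utf-8", errors="replace")
        prompt = (
            "Summarize this file in a few plain sentences.\n\n"
            f"File: {path.name}\n\n{source}"
        )
        summaries[path.name] = llm(prompt).strip()
    return summaries


def build_question_prompt(summaries: Dict[str, str], question: str) -> str:
    """Step 2: concatenate the summaries and append the question."""
    joined = "\n".join(f"{name}: {text}" for name, text in sorted(summaries.items()))
    return f"File summaries:\n{joined}\n\nQuestion: {question}"


def ask_about_codebase(paths: Iterable[Path], llm: LLM, question: str) -> str:
    """Run both steps: one model call per file, then one call for the question."""
    return llm(build_question_prompt(summarize_files(paths, llm), question))
```

The answer can only be as good as the summaries, so details that a summary drops are invisible to the final question.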

Another issue: even the best of us make mistakes sometimes, but we then try the answer and see that it doesn’t work (a compilation error, or a misremembered class name that doesn’t exist in the source code, etc.). Out of the box, ChatGPT has no access to compilers or similar tools, so it can’t validate its answers. If one gave it access to an external system for doing that, it would likely perform better.
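One very small example of such an external check, assuming the code base is Python and that class names are CamelCase: parse the source with the standard `ast` module and flag any class-like name in the model's answer that is not actually defined. The function names and the regex heuristic are illustrative assumptions; a real setup would more likely run a compiler or test suite.

```python
import ast
import re


def defined_classes(source: str) -> set:
    """Collect the names of all classes defined in a Python source string."""
    tree = ast.parse(source)
    return {node.name for node in ast.walk(tree) if isinstance(node, ast.ClassDef)}


def unknown_class_names(answer: str, source: str) -> set:
    """Return CamelCase names mentioned in the answer that the source never defines."""
    # Heuristic assumption: a class name is two or more capitalized word parts.
    mentioned = set(re.findall(r"\b[A-Z][a-z0-9]+(?:[A-Z][a-z0-9]+)+\b", answer))
    return mentioned - defined_classes(source)
```

A non-empty result is a hint that the answer refers to something that does not exist, which could be fed back to the model as a correction.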

[0] https://mobile.twitter.com/goodside/status/15988746742046187...

Jiocus 1199 days ago

Have you checked out Copilot Labs, the experimental version of Copilot? It's bundled with ability to explain and document source code, among other things.

https://githubnext.com/projects/copilot-labs/

zibzob 1199 days ago

That looks promising! But I think it only works on small snippets of code and doesn't have an overview of the whole codebase...still, maybe it's coming down the line as they improve it.

roseway4 1199 days ago

OpenAI Codex understands code. Though its primary use case is code completion, it might be able to do Q&A well given a prompt with context.

https://platform.openai.com/docs/guides/code

If you’re interested in trying the very cheap models behind ChatGPT, you may want to have a look at langchain and langchain-chat for an example of how to build a chatbot that uses vectorized source code to build context-aware prompts.
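To make the "vectorized source code" idea concrete, here is a toy sketch using only the standard library. It is not langchain's API: word-count vectors stand in for real embeddings, and all function names are hypothetical. The shape is the same, though: score each snippet against the question, keep the top few, and paste them into the prompt.

```python
import math
import re
from collections import Counter


def vectorize(text: str) -> Counter:
    """Toy stand-in for an embedding: counts of lowercase identifier-like words."""
    return Counter(re.findall(r"[a-z_][a-z0-9_]*", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_snippets(question: str, snippets: list, k: int = 2) -> list:
    """Return the k snippets most similar to the question."""
    q = vectorize(question)
    return sorted(snippets, key=lambda s: cosine(q, vectorize(s)), reverse=True)[:k]


def build_prompt(question: str, snippets: list, k: int = 2) -> str:
    """Build a prompt that carries only the most relevant code as context."""
    context = "\n\n".join(top_snippets(question, snippets, k))
    return f"Relevant code:\n{context}\n\nQuestion: {question}"
```

Swapping `vectorize` for a real embedding model is where most of the quality would come from; word overlap misses code that is relevant but uses different names.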

zibzob 1199 days ago

Thanks for the links, I'll take a look at this and see if it's something I could reasonably achieve.

pfdietz 1198 days ago

ChatGPT also has some understanding of code. For example, you can ask it to translate from one programming language to another.

freezed88 1198 days ago

This is what we've designed LlamaIndex for! https://github.com/jerryjliu/gpt_index. Designed to help you "index" over a large doc corpus in different ways for use with LLM prompts.

vorticalbox 1199 days ago

In this case you can feed it the bits of code you're interested in and ask it to explain them. The API has a limit of 4096 tokens (which is a good chunk of text).

I actually built a slack bot for work and daily ask it to refactor code or "write jsdocs for this function"
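A rough sketch of cutting a file into pieces that fit under a token limit like that one. The "about 4 characters per token" estimate is a crude assumption rather than the model's real tokenizer, and the default budget of 3000 is an arbitrary choice that leaves room for the question and the reply.

```python
def estimate_tokens(text: str) -> int:
    """Crude assumption: roughly 4 characters per token, rounded up."""
    return (len(text) + 3) // 4


def chunk_source(source: str, max_tokens: int = 3000) -> list:
    """Split source on line boundaries so each chunk's estimate stays within max_tokens.

    A single line that is larger than the budget still becomes its own
    (oversized) chunk, since lines are never split.
    """
    chunks, current, used = [], [], 0
    for line in source.splitlines(keepends=True):
        cost = estimate_tokens(line)
        if current and used + cost > max_tokens:
            chunks.append("".join(current))
            current, used = [], 0
        current.append(line)
        used += cost
    if current:
        chunks.append("".join(current))
    return chunks
```

Splitting on function or class boundaries instead of raw lines would likely give the model more coherent pieces to explain.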

zibzob 1199 days ago

Yeah, and this is pretty useful for small bits of code, but what I want is a way to ask questions about large projects. It would be nice to ask something like "which classes are responsible for doing X", or "describe on a high level how Y works in this code". But I'm not sure if that is actually possible with the current technology.

roseway4 1199 days ago

It’s possible to do this either by fine-tuning an existing model or by using an existing chat model with prompts enriched by a vector search for relevant code. See my comment elsewhere.

louiskw 1198 days ago

We've invested a lot into helping LLMs reason about and explain large codebases. We use a hybrid approach of local models for semantic search and a mix of OpenAI and Anthropic's models for language output and summarisation.

We're two years in but everything still feels super early given how quickly the fundamentals are improving. Would love your feedback - https://bloop.ai

rocauc 1199 days ago

There's good work happening in this area, e.g. Sourcegraph is working on "Cody" to understand and search your code base https://twitter.com/beyang/status/1614895568949764096

scarface74 1199 days ago

It “knows” the AWS API, CloudFormation and from what others have told me the CDK pretty well. I’ve asked it to write plenty of 20-30 line utility scripts and with the proper prompts, it gets me 95% of the way there.

I assume it would “understand” more popular open source frameworks.

roflyear 1199 days ago

No, it doesn't have any large-scale understanding. But if you are unfamiliar with specific language features, or there is confusing code, it can help you figure things out. It is not good for any large corpus of text, though, and you can't give it new material and teach it anything.

koonsolo 1198 days ago

On which data would you train it?

For code completion for example, you can just train it with a whole bunch of code.

But to explain large code bases, you need to train it with both large codebases and explanations. As far as I know, there are no such explanations available.

zibzob 1198 days ago

That's true, but it works with smaller pieces of code already. You can paste a function into ChatGPT and it will attempt to explain how the code works. Maybe there are enough existing explanations of high-level concepts on the internet for it to work, if it just has the larger codebase in its training data as well. At this point I am wary of making predictions about what this type of AI is able to do. :)

darkvertex 1198 days ago

You are describing https://www.buildt.ai/ (which uses ChatGPT among other technologies, and was also featured on HN a while back.)

sandkoan 1199 days ago

I'm actually building this very thing—shoot me an email at govind <dot> gnanakumar <at> outlook <dot> com if you'd like to be a beta tester.