

by benvolio 1031 days ago

>The Code Llama models provide stable generations with up to 100,000 tokens of context.

Not a bad context window, but makes me wonder how embedded code models would pick that context when dealing with a codebase larger than 100K tokens.

This makes me further wonder whether, when coding with such a tool (or at least knowing that these tools are becoming more widely used and leaned on), there are new considerations we should be applying (or at least starting to think about) when programming. Perhaps more or fewer comments, perhaps terser and less readable code that would consume fewer tokens, perhaps different file structures, or even more deliberate naming conventions (like Hungarian notation but for code models) to facilitate searching or token pattern matching of some kind. Ultimately, in what ways could (or should) we adapt to make the most of these tools?

9 comments

wokwokwok 1031 days ago

That seems daft.

You can, I suppose, contract your code so that it’s context free and uses fewer tokens, but that makes it more confusing for humans and language models.

Taken to the extreme, with one-letter functions and variables like i, j, k, the model obviously can infer literally nothing and will thus produce arbitrary nonsense.

Clearly the solution is to do what we already do to manage complexity, which is to decompose large tasks into smaller black-box modules with an API, where the implementation (a large number of tokens) is hidden and neither known nor relevant to using it.

If you give an LLM a function signature and good description, maybe some usage examples, it doesn’t need the implementation to use it.

Terseness decreases the ability of LLMs to process code; it doesn’t solve context length, and even at best it doesn’t scale.

100k tokens is plenty.

You don’t need to do anything like that.

sicariusnoctis 1030 days ago

64k tokens ought to be enough for anybody.

roguas 1028 days ago

I see what you did there, Mr. Gates.

emporas 1030 days ago

Decomposing the task into smaller steps and generating each step independently seems to be the correct way in my experience too. It works very well with GPT (ChatGPT or GPT-4).

>100k tokens is plenty.

The context window can be really helpful when a new library is released and the user wants to generate code targeting its API. If the training data stops at August 2023, any library released after that date is unknown to the engine.

My general opinion regarding the context window is that even a 1-trillion-token context window may still not be enough for all use cases.

ttul 1031 days ago

Your developer tool already maps out the entire code base in useful ways, such as knowing all the symbols available in the current context and the structure of classes. This information can be distilled for presentation to the LLM. For instance, if you want to generate a method implementation inside a C++ class, the LLM can be given a condensed version of the header files that the compiler would have access to when compiling that specific class. Removing whitespace and comments and boiling macros down saves a lot of tokens.

You can also probably skip including standard library headers since those will be well known to the LLM through its fine tuning.

Either way, consider that a typical preprocessed C++ file would push against the 100K limit even with some optimizations. You will definitely want to have some middleware doing additional refinement before presenting that file to the LLM.
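As a rough illustration of that kind of middleware, here is a sketch in Python that condenses C++ header text before it goes into a prompt. It rests on two loud assumptions: every angle-bracket `#include` is treated as a standard or system header the model already knows, and comments are removed with regular expressions that do not understand string literals (a `//` inside a string would be cut). The name `condense_header` is hypothetical.

```python
import re

_BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
_LINE_COMMENT = re.compile(r"//[^\n]*")
_SYSTEM_INCLUDE = re.compile(r"^\s*#\s*include\s*<[^>]+>\s*$")


def condense_header(text: str) -> str:
    """Strip comments, system includes, blank lines and redundant whitespace."""
    text = _BLOCK_COMMENT.sub(" ", text)
    text = _LINE_COMMENT.sub("", text)
    kept = []
    for line in text.splitlines():
        if _SYSTEM_INCLUDE.match(line):
            continue  # assumption: angle-bracket includes are well known to the model
        squeezed = " ".join(line.split())  # collapse runs of whitespace
        if squeezed:
            kept.append(squeezed)
    return "\n".join(kept)
```

Macro expansion is deliberately left out; a real tool would more likely lean on the compiler's own preprocessor for that step.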

roughly 1030 days ago

I’ve found the utility of the coding LLMs gets a lot higher when you’ve got code comments and descriptive variable and function names: the LLM makes better inferences and suggestions. We’ve seen something similar with data, where properly tagged data and descriptive field names help the LLM produce much more useful responses. I’m secretly hoping the spread of these tools will finally lead my fellow developers to comment their code and stop using three-character variable names.

GreedClarifies 1030 days ago

Commenting the code in this manner sounds like a job for an LLM, maybe with human assistance in the short run.

bbor 1030 days ago

This is my ultimate (short term) AI fear - letting it get into a feedback loop with itself, leading to perverse and incorrect results.

To state my position more clearly: I don’t think an AI could comment code from scratch very well - how would it know all the decisions made, business logic considerations, historical conventions, micro-industry standards, etc?

A good benchmark I was once told is “if a human expert couldn’t do it, an AI probably can’t either”. And commenting code I didn’t write would certainly test the bounds of my abilities.

gonzan 1030 days ago

I built a VS Code extension a while back that I still use that wraps GPT-4 and writes code directly in my editor.

The method I used to choose which files to feed GPT-4 was embeddings-based. I got an embedding for each file and an embedding for the instruction, then did some simple processing to pick the files most likely to be relevant. It isn't perfect, but it's good enough most of the time in medium-sized codebases (not very large ones).
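The ranking step can be sketched roughly as follows. This is not the extension's actual code: `embed` is a toy stand-in that counts lowercase words, where a real setup would call an embedding model, and `rank_files` is a hypothetical name. Only the cosine-similarity ranking is the point.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_files(instruction: str, files: dict[str, str], top_k: int = 2) -> list[str]:
    """Return the paths of the top_k files most similar to the instruction."""
    query = embed(instruction)
    ranked = sorted(
        files, key=lambda path: cosine(query, embed(files[path])), reverse=True
    )
    return ranked[:top_k]
```

With a real model, the file embeddings would normally be computed once and cached, so only the instruction needs embedding per request.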

The one thing I started doing because of how I implemented this is making files shorter and moving stuff into different files. Having a 1k+ LOC file is prohibitive because it eats up the whole context window (although with a 100k context window maybe less so). I think it's a good idea to keep files short anyways.

There are other, smarter things that can be done (like embedding and passing individual functions/classes instead of entire files), so I have no doubt someone will build something smarter soon. You'll likely not have to change your coding patterns at all to make use of AI.
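For Python sources, the function/class-level variant could start from something like this sketch, which uses the standard `ast` module to cut a file into one chunk per top-level definition. The name `split_into_chunks` is hypothetical, and decorators are not included in the chunks because `ast` reports the `def`/`class` line as the start of the node.

```python
import ast


def split_into_chunks(source: str) -> dict[str, str]:
    """Map each top-level function or class name to its source text."""
    chunks = {}
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment slices the original text using the node's positions.
            chunks[node.name] = ast.get_source_segment(source, node)
    return chunks
```

Each chunk can then be embedded and retrieved on its own, so a long file no longer has to be passed whole.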

brucethemoose2 1031 days ago

This sounds like a job for middleware. Condensing split code into a single huge file, shortening comments, removing whitespace and such can be done by a preprocessor for the LLM.
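A bare-bones sketch of the "single huge file" step, assuming the files are already loaded into a dict and ordered by priority. The token count is a crude assumption of about four characters per token, not a real tokenizer, and `pack_files` is a hypothetical name.

```python
def estimate_tokens(text: str) -> int:
    """Crude assumption: roughly four characters per token."""
    return max(1, len(text) // 4)


def pack_files(files: dict[str, str], budget: int) -> tuple[str, list[str]]:
    """Concatenate files in order, skipping any that would exceed the budget.

    Returns the packed prompt text and the list of skipped paths.
    """
    parts, skipped, used = [], [], 0
    for path, text in files.items():
        block = f"### {path}\n{text.strip()}\n"  # path header keeps files distinguishable
        cost = estimate_tokens(block)
        if used + cost > budget:
            skipped.append(path)
            continue
        parts.append(block)
        used += cost
    return "\n".join(parts), skipped
```

Returning the skipped paths makes it visible when the budget forced something out, which is where a smarter selection or summarization step would plug in.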

gabereiser 1031 days ago

So now we need an llmpack like we did webpack? Could it be smart enough to truncate comments, white space, etc?

brucethemoose2 1031 days ago

You don't even need an LLM for trimming whitespace, just a smart parser with language rules like IDE code checkers already use. Existing LLMs are fine at summarizing comments, especially with language-specific grammar constraints.

gabereiser 1030 days ago

My point. We don’t need the middleware.

visarga 1030 days ago

A good practice is to have a prompt file where you keep the information you want the model to have at its disposal. Then you put it at the start of your conversations with GPT-4. It's also good documentation for people.

You start a project by defining the task. Then, as you iterate, you can add new information to the prompt. It can also be partially automated: the model can have a view of the file structure, classes, routes, assets and latest errors.
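The automated part might look something like this sketch, which combines hand-written notes with a generated view of the file structure and, optionally, the latest error. The section headings and the names `render_tree` and `build_prompt` are illustrative assumptions, not an established format.

```python
def render_tree(paths: list[str]) -> str:
    """Render slash-separated paths as an indented tree."""
    lines, seen = [], set()
    for path in sorted(paths):
        parts = path.split("/")
        for depth, name in enumerate(parts):
            key = tuple(parts[: depth + 1])
            if key not in seen:  # print each directory only once
                seen.add(key)
                suffix = "/" if depth < len(parts) - 1 else ""
                lines.append("  " * depth + name + suffix)
    return "\n".join(lines)


def build_prompt(notes: str, paths: list[str], latest_error: str = "") -> str:
    """Assemble the prompt file contents from notes, file tree and latest error."""
    sections = ["# Task and notes", notes.strip(), "# File structure", render_tree(paths)]
    if latest_error:
        sections += ["# Latest error", latest_error.strip()]
    return "\n\n".join(sections)
```

The notes stay human-maintained, while the tree and error sections can be regenerated on every run.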

I was really hoping that the one-year update of Codex would be that: an LLM that can see deep into the project, not just code, but runtime execution, debugging, inspecting and monitoring. Something that can iterate like AutoGPT. Unfortunately it didn't improve much and has weird conflicts with the native code completion in VS Code: you get freezes or doubled brackets.

archibaldJ 1029 days ago

I’m working on a project related to that: https://github.com/0a-io/Arch-GPT

I think the hypergraph is an overlooked concept in programming language theory.

adamgordonbell 1031 days ago

Solutions exist that feed LLMs ctags, and they seem to work well. The function signatures and symbol names for a code base are much smaller than the actual code.

sean_flanigan 1030 days ago

I know about https://github.com/paul-gauthier/aider. Have you got a link to any others?

smcleod 1030 days ago

I'm using this right now, but it's noted that "ctags only work with GPT4", so I've yet to get them working with llama locally.

adamgordonbell 1030 days ago

Aider was exactly what I was thinking of!

rawrawrawrr 1030 days ago

> Not a bad context

A little understated; this is state of the art. GPT-4 only offers 32k.