Hacker News
by Geniuzz 39 days ago
> Algorithms written in "pseudo-code", aka a higher level language without type information, are far more readable to a human, and thus likely an LLM too.

What’s the basis of this claim? LLMs are trained on many, many more lines of real code than of pseudo-code.

Also, I agree: anecdotally, self-correction is a key benefit of static types. If there is a mistake, it is caught at compile time rather than at runtime.


It seems clear to me from first principles.

Humans are trained on human language. LLMs are trained on human language.

Thus something that is easier for a human to understand is likely easier for an LLM to understand.

That higher level language with well named variables reads more comprehensibly than code:VERB with:PREPOSITION types:NOUN, intermixed:ADJECTIVE, stems:VERB from:PREPOSITION first:ADJECTIVE principles:NOUN too:ADVERB

For models as complex as these I'm not confident we can apply arguments from first principles; we could just as easily argue that type information is helpful, from first principles. What is much more useful is empirical evidence, and AutoCodeBench [1] found that LLMs are most proficient in Elixir (dynamic) followed by Kotlin (static), with Rust and PHP at the bottom. So it would seem like, as of publication, typing style doesn't really matter!

[1] https://autocodebench.github.io/

As far as the AI is concerned, it's more like

Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo

versus

Buffalo:PN buffalo:N Buffalo:PN buffalo:N buffalo:V buffalo:V Buffalo:PN buffalo:N

I think the second one makes much more sense.

In the rare case that all your concepts use the exact same descriptive word, you are probably right!

The majority of the time you can infer the type from reading well-written code (to the extent that the shape of the type matters in the context of that piece of code).

If the type can be inferred by the reader it should be inferred by the type system and at least be available to the LLM as a query. But we're also talking about dynamic languages in which type cannot be inferred until runtime. What's the type of x?

x = y + z

Well that depends on the types of y and z, which themselves may depend on the types of other operands, which themselves may not be known until the program actually runs. All that inference takes a lot of thinking, which takes tokens, which cost money. Why not just write the types down? Although we call these things "inference engines" they're really pattern matching explicit tokens, so it's better to actually write down the types so they can be pattern matched than to figure them out at inference time.
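To make that concrete, here is a minimal Python sketch. The function names `add` and `add_ints` are made up for illustration, and this is only one way to show the point, not a claim about how any particular model or type checker works. The same `x = y + z` line produces a different type for every call, while the annotated version states the type up front.

```python
def add(y, z):
    # Unannotated: the type of x is only known once y and z hold values.
    x = y + z
    return x


def add_ints(y: int, z: int) -> int:
    # Annotated: a reader or a static checker can see the type of x
    # without running the program. Python does not enforce this at runtime.
    x: int = y + z
    return x


# The same expression yields four different result types.
results = [add(1, 2), add(1, 2.5), add("a", "b"), add([1], [2])]
print([type(r).__name__ for r in results])

# The annotations are plain data that tooling can read back.
print(add_ints.__annotations__)
```

In the first function the answer to "what's the type of x?" has to be worked out from the call sites. In the second it is written down as explicit tokens next to the code.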

You are basically rehashing the false beliefs of the codeless programming camp. Human language that is 99% correct earns a speechwriter a standing ovation; software that is 99% correct leaves its maker paying a cyber ransom.