Hacker News
by twoodfin 82 days ago
Wildly speculating here, but if you buy that human brains have innate / evolved syntactic knowledge, and that this knowledge projects itself as the common syntactic forms across the bulk of human languages, then it’s no surprise that LLMs don’t have particularly deep grooves for s-expressions, regardless of the programming language distribution of the training set.
3 comments

OK, I'll bite. I want to know more of the reasoning behind this, because I think it implies that S-expressions are alien to the innate/evolved syntactic knowledge in human languages. A lot of American linguistics, like Chomsky's gropings for how to construct universal grammar and deep syntax trees, or the lambda calculus of semantic functions, looks like S-expressions, and I think that's because there was some coordination between human linguists and computer science (Chomsky was, after all, at MIT). At the same time, I've had a gut instinct that these theories described some languages (like English) better than others (like ancient Greek), requiring more explanation of changes between deep structure and surface structure for languages that were less like English. If models trained on actual language handle s-expressions poorly, that could imply that s-expressions were not a good model for the deep structure of human language, or that the deep-structure vs surface-structure model did not really work. I'd be very happy to learn more about this.
S-expressions are just lists and trees. That’s it. If a language has groups of words and any hierarchy, you can use s-expressions to represent it. Sure, some human languages might be more or less flat and the groups might represent different things, but I don’t see how that prevents s-expressions from being suitable. Greek doesn’t rely on word order nearly as much as English (it does more with suffixes to indicate subject and object, for instance), but all of that can still be represented in s-expressions.
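To make that concrete, here is a minimal Python sketch of the idea, not a real linguistic analysis. The category labels (`S`, `NP`, `VP`, `nominative`, `accusative`) and the helpers `to_sexpr` and `roles` are made up for illustration. It shows one tree where position carries the grammatical roles, and one where explicit case tags carry them, so the children can be reordered freely.

```python
def to_sexpr(node):
    """Render a nested list as an s-expression string."""
    if isinstance(node, (list, tuple)):
        return "(" + " ".join(to_sexpr(child) for child in node) + ")"
    return str(node)


def roles(clause):
    """Map each tagged child of a clause to its word, ignoring child order."""
    return {label: word for label, word in clause[1:]}


# English-like analysis: position inside the tree tells you who does what.
english = ["S", ["NP", "the", "dog"], ["VP", "sees", ["NP", "the", "cat"]]]

# Case-marking analysis (hypothetical labels): the roles are tagged explicitly,
# so the order of the children carries no grammatical information.
case_marked = ["clause", ["verb", "sees"], ["nominative", "dog"], ["accusative", "cat"]]
reordered = ["clause", ["accusative", "cat"], ["verb", "sees"], ["nominative", "dog"]]

print(to_sexpr(english))      # (S (NP the dog) (VP sees (NP the cat)))
print(to_sexpr(reordered))    # (clause (accusative cat) (verb sees) (nominative dog))
print(roles(case_marked) == roles(reordered))  # True
```

Both styles fit the same list-and-tree container. The only difference is whether the information lives in the position or in the tag.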
Sure, no argument that s-expressions are wonderfully simple & expressive.

But most human languages—or at least the dominant ones that compose the vast bulk of the LLM training set—use more complex structuring rules for whatever evolutionary linguistic reasons. Easier error correction? Auditory disambiguation?

You could tell similar “just so” stories about computer language syntax, & why s-expressions didn’t win out over (say) XML-style tagging. And it turns out pseudo-XML is a great way to talk to LLMs.

EDIT: To be clear, by “s-expressions” I mean their typical use in Lisp programming of a function expression followed by a series of parameter expressions. The “grammar” is just eval/apply.
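For anyone who hasn't seen it spelled out, this is roughly what I mean by "the grammar is just eval/apply". It is a toy sketch in Python rather than Lisp, with nested lists standing in for s-expressions. The names `ENV`, `evaluate`, and `apply` are my own, and it leaves out special forms, variables, and everything else a real Lisp has.

```python
import operator

# Hypothetical global environment: symbol name -> Python function.
ENV = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def apply(fn, args):
    """Apply an already-evaluated function to already-evaluated arguments."""
    return fn(*args)


def evaluate(expr, env=ENV):
    """Evaluate a nested-list expression of the form [function, arg, arg, ...]."""
    if isinstance(expr, str):        # a symbol: look it up
        return env[expr]
    if not isinstance(expr, list):   # a literal such as a number
        return expr
    fn = evaluate(expr[0], env)      # the head is a function expression
    args = [evaluate(arg, env) for arg in expr[1:]]  # the rest are parameters
    return apply(fn, args)


# (+ 1 (* 2 3))
print(evaluate(["+", 1, ["*", 2, 3]]))  # 7
```

Every form has the same shape, head first and arguments after, which is the uniformity I'm contrasting with the more varied structuring rules of natural languages.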

There is interesting ongoing research at https://dnhkng.github.io/posts/sapir-whorf/ showing that LLMs think in a language-agnostic way. (It will probably get posted to HN once it is finished.)
I would expect that. But I’d also expect the pattern of their thoughts to be more varied in structure, like C or German, and less like totally uniform s-expressions.
Is Java or Haskell any closer to human language?