Hacker News new | ask | show | jobs
by ISV_Damocles 1054 days ago
I don't believe that I can change your mind on this, so I didn't intend to respond, but as this is the top comment, I do want to provide a rebuttal on why we do think this is actually a programming language, that the code we have written is actually a compiler, and why Marsha is a useful exploration of the programming language design space.

First, a programming language is just a syntax to describe functionality that could be turned into an actual program. Lisp[1] was defined in 1958 but didn't have a full compiler until 1962. Was it not a programming language in the intervening 4 years? Marsha does not fall into this, since it can already generate working code, but the bar for what is a programming language, I believe, is lower than most would immediately think.

Second, a programming language does not need to be imperative to be a programming language, or languages like Lean[2] that have you write proofs that the compiler then figures out how to generate the code to fulfill would not be programming languages. Lean, Coq, and other such languages are much more technically impressive than Marsha, true, but they share the property you describe the properties a function should have and then the compiler generates the program that fulfills those properties.

Marsha differs from these Proof-based languages in that poor specificity still produces some sort of program instead of a compilation error, which makes it sort of like Javascript that will do something with the code you write as long as it is syntactically valid. This is not a desirable property of Marsha, but it is a trade-off that in practice makes it more immediately usable to a larger number of people than Lean or Coq, because the skill level required is lower.

This is also, as you allude to, the current state of the world in most software development -- project managers come up with high-level requirements for new features, technical leads on engineering teams convert this into tasks and requirements for individual contributors who then write the code and tests which are then peer reviewed by the team as a sanity check and then committed. This process may or may not cover all situations and the specifications at all levels are likely not as rigorous as what Lean would require of you.

Marsha mimics this process, starting from the tech lead level and bleeding into the individual contributor level. The type and function descriptions are analogous to the tech lead requirements and the examples are analogous to the test suite the individual contributor would write. Just like in real world development, if these are not well specified, the resulting code will likely have logic bugs that would need to be addressed with a stricter definition and improved test cases.

The compiler consumes this definition into an AST[3], walks the tree to generate intermediate forms, and generates an output in a format that can be executed by a computer. Some use "transpiler" for a compiler that targets another language, but that is a subset of compilers, not a separate kind of tool, in my opinion, or the Java compiler would be a "transpiler" for the JVM bytecode format that is also not directly executable by a computer.

We are still in the very early stages with Marsha and agree that more syntax could be helpful -- we already have 4 different syntactic components to Marsha versus the fully open-ended text entry behavior of Github Copilot or ChatGPT. But what makes Marsha interesting (to me) is that it makes it possible to explore a totally new dimension in programming language design: the formalization of the syntax to define a program itself. In many papers on new algorithms, the logic is often described in a human-readable list of steps without the hard specificity of programming languages, improving the ability of the reader to understand the core of the algorithm, rather than getting bogged down in the implementation details of this or that programming language. There is still a formalism, but it differs from that of traditional programming languages, and Marsha lets you work with your computer in a similar way.

Are there cases where this is a bad idea? Absolutely. Just like there are cases where writing your code in Python is a bad idea versus writing it in Rust. There is no perfect programming language useful for all scenarios, and probably never will exist. But there will be a subset of situations where the trade-offs Marsha provides makes sense. By being more forgiving than even the most forgiving interpreted languages out there, Marsha is in a good position to fill that niche if the primary barrier is difficulty.

[1]: https://en.wikipedia.org/wiki/Lisp_(programming_language)#Hi... [2]: https://en.wikipedia.org/wiki/Lean_(proof_assistant) [3]: https://github.com/alantech/marsha/blob/main/marsha/parse.py...

3 comments

If I understand this correctly, the source code of a Marsha program does not fully determine the running code. And we aren’t talking about immaterial optimizations, the LLM could do vastly different things with the same Marsha source.

A programmer is a human who connects the world of humans with the world of machines. To do this, he is required to sufficiently understand both worlds. On the human side this requires social competence and professional accountability, which machines don’t have. On the computing side, it requires at least that machines behave in predictable and comprehensible ways. Marsha appears to fall short on both counts.

Using an LLM for programming is inherently irresponsible. The people arguing in favor of doing so have not subjected LLMs to any kind of rigorous testing. They simply have unshakeable faith.

I am in the midst of a careful review and surgical takedown of a 9000 word demonstration of ChatGPT’s supposed ability to help testers test. It took maybe 20 minutes for some drooling consultant fan-boy to produce the demo. It has so far been about 30 hours of work to carefully pore through each sentence and show how it is wrong. I am doing the testing and critical thinking that the original consultant failed to do.

The Marsha site has a brief line about how it produces “tested” Python code. The one thing you can bank on with LLMs is none of you big eyed enthusiasts have a serious attitude about testing. It’s all simplistic demonstration.

I’m frustrated by this culture of fawning adoration of unproven and unprovable tools. I hope this trend peaks and become a generally acknowledged joke soon! Then we can resume with craftsmanship and responsible engineering.

I would say that you do not quite understand it. Part of the process of generating the code that does work is that it also generates a test suite using the examples you provide as the test cases and it actually executes the test suite against the code that was generated and iterates with the LLM until the test suite passes.

This is where the claim that it's tested code comes from, because it is literally tested.

One of the examples we added is a simple tool to get headlines from CNN.com[1]. We don't commit the generated python to the repository because we're treating it as a compiler artifact, but here's a gist[2] of one of the runs, including the test suite it created to validate proper behavior. It's not just relying purely on the LLM's ability to string tokens together, but goes through a validation phase to make sure what it built is real.

[1]: https://github.com/alantech/marsha/blob/main/examples/web/cn... [2]: https://gist.github.com/dfellis/a758a7321b4f62f820ddbad57aac...

> First, a programming language is just a syntax to describe functionality that could be turned into an actual program. Lisp[1] was defined in 1958 but didn't have a full compiler until 1962. Was it not a programming language in the intervening 4 years?

The claim that you make here is not true, and the example that you give does not support your claim. A programming language is more than just syntax - it is the combination of both syntax and semantics together to give a computational meaning to the strings in a language. This is not controversial, this is emphasized in the introduction to any textbook on compilers / language theory so I'll just give you one easy to google reference for this claim:

* https://www.cs.mcgill.ca/~rwest/wikispeedia/wpcd/wp/p/Progra...

A programming language is more than a well defined set of strings. Each of those strings defines a particular computation. This is not true of natural language, where any definition of semantics relies on the semantics of the natural language.

For your specific example of lisp, here is the original 1958 letter:

* https://dl.acm.org/doi/10.1145/368405.1773349

As you can see the description is more than just the syntax of expressions - it describes the evaluation process and how to perform it. This is different from a modern description of semantics as it predates the introduction of operational and denotational styles by a couple of decades.

From the same era here is ALGOL, again it is more than the syntax as a description of the semantics is required to defined which computation is being written down in the language:

* https://www.softwarepreservation.org/projects/ALGOL/report/B...

One of the pillars that you are building your argument upon is very faulty, and I think it would be good to take a moment and consider what that means. Marsha is clearly a program synthesis tool. It is clearly automated in the production of programs. It looks useful in the overall process of programming. But describing it as a programming language is not helpful or useful. Watering down language and definitions does not help to explain what Marsha is or can do, and when you have made something new there is no particular need to try to fit it into an old label that means something else.

> Marsha does not fall into this, since it can already generate working code, but the bar for what is a programming language, I believe, is lower than most would immediately think.

Well, we can go back and forth about the technical definition of individual words all day, but 'is it a programming language?' is such a vague question, the argument is basically meaningless.

Do you want to put that label on it? Ok. Someone else disagrees? Huh. Someone called something else a programming language? Someone disagreed with that?

eh...

Since it's purely opinion based, who cares? There's no answer which is 'right'.

I would argue that regardless of semantic details about terminology, there is a fundamental difference between what you're doing here and most common programming languages:

You can have:

1) A series of instructions to do a task, which can be unambiguously mapped into a series of instructions in another format.

or

2) A series of instructions to do a task, which is mapped non-deterministically into a series of instructions in another format.

Just like you have functions (deterministic) and probability functions (non-deterministic), there is a difference here between those two things.

...

In this case, you're basically generating non-deterministic imperative logic; that's obviously and unambiguously distinct from a deterministic sequence of imperative logic.

It is novel; it is interesting. ...but I don't think it's worth the argument about 'is it a programming language'; it's clearly very different from existing languages.

> improving the ability of the reader to understand the core of the algorithm, rather than getting bogged down in the implementation details of this or that programming language. There is still a formalism, but it differs from that of traditional programming languages, and Marsha lets you work with your computer in a similar way.

I applaud this intent, but I'm skeptical.

Once again, you are non-deterministically mapping the 'core logic' of the algorithmic into a sequence of deterministic steps that may or may not match the request. That's the point; it's non-deterministic.

It could do anything; the P value of it doing something crazy might drop, but it's not zero; and fundamentally, how can you rely on a system where the instructions you give may or may not map to the machine code output?

You add tests? Sure... but, those are generated too right?

You have to dance through a series of tighter and tighter hoops to try to reduce the P value of "crazy hallucination and chaos", but I see no meaningful insight here about how you plan to mitigate that problem completely?

...and if you don't mitigate it completely, unlike a constraint solver, the non-deterministic output you get cannot be validated to be correct...

It's not about specifying the syntax in a different more readable form; it's about confidence that the output matches the constraints of the input; and I don't see that here.

Given the context length (and nature of large contexts in general) in LLMs, I also ponder whether it's even possible to do this beyond the trivial form, because it seems like as the constraint set scales, the capability of any LLM to address those constraints (and to be confident that it has) seems like a difficult problem to solve.

However, I would like to say that I see this domain as an interesting area of research; and most certainly neither a) a solved problem, or b) a dead end. There's definitely stuff here worth playing with and exploring.

...regardless of if people think of it 'as a programming language', or not.

I'm going to bed soon, so I need to be more brief with my responses. This is not meant to be snarky, so I apologize if any of the short sentences seem that way.

> 1) A series of instructions to do a task, which can be unambiguously mapped into a series of instructions in another format.

> or

> 2) A series of instructions to do a task, which is mapped non-deterministically into a series of instructions in another format.

You're a bit too black-and-white on this situation. Floating point calculations often suffer subtle differences in behavior based on optimization flags[1] or CPU architecture[2]. I would not consider C non-deterministic, but this is a situation where differences show up without changes to the code being compiled.

> It could do anything; the P value of it doing something crazy might drop, but it's not zero; and fundamentally, how can you rely on a system where the instructions you give may or may not map to the machine code output?

>

> You add tests? Sure... but, those are generated too right?

>

> You have to dance through a series of tighter and tighter hoops to try to reduce the P value of "crazy hallucination and chaos", but I see no meaningful insight here about how you plan to mitigate that problem completely?

1. Marsha in the here-and-now definitely does have a small probability of actually generating junk tests that it can also somehow generate working code for.

2. Different tools for different scenarios, so if that is a huge problem, don't use Marsha as it currently is.

3. Since LLMs are trained on human generated code explained by humans, the code it generates is human readable, so you can always review the generated output before you rely on it, right now. Human still in the loop, but the amount of work significantly reduced.

4. Trivially, you can get determinism in output by setting temperature to 0, though that also means if it fails to generate an output it will always fail to generate an output.

5. A fully predictable output requires a formalism essentially equal to existing programming languages. The purpose of Marsha is to explore relaxing that for development velocity and simplicity, but it is intended to be a gradient you can choose from so I agree it should be possible. Nothing solid figured out now, but simply dropping into your target language of choice would be an "easy" patch, though it defeats the purpose of the language. Something like Rust/Haskell pattern matching or Lean/Coq constraints informing you of missing definitions would be better, but honestly unsure how to get there.

> Given the context length (and nature of large contexts in general) in LLMs, I also ponder whether it's even possible to do this beyond the trivial form, because it seems like as the constraint set scales, the capability of any LLM to address those constraints (and to be confident that it has) seems like a difficult problem to solve.

This one doesn't seem as hard to me, because of the divide-and-conquer nature of programming. Each individual function gets its own context, and if that function is too big, break it into chunks and generate those independently. Definitely more of a Lisp-y style instead of a big blob of old school PHP.

May also become effectively irrelevant if useful context size exceeds the length of something massive like a novel.

> However, I would like to say that I see this domain as an interesting area of research; and most certainly neither a) a solved problem, or b) a dead end. There's definitely stuff here worth playing with and exploring.

Fully agree. Marsha of today doesn't "solve" it (and depending on your acceptance level of C floating point changes, may never do so) but I say pretty confidently that it is further along than Copilot, and I don't see why it won't improve in the future.

[1]: https://stackoverflow.com/questions/7517588/different-floati... [2]: https://stackoverflow.com/questions/64036879/differing-float...

> You're a bit too black-and-white on this situation.

While I agree with your other points, I feel this argument doesn't really hold water.

The output of the c compiler is deterministic.

I struggle very hard to believe that the floating point rounding errors when you compile C will cause it to occasionally emit a binary that is not byte-identical multiple sequential runs in a row.

What any program does at runtime is essentially non-deterministic, and that's 100% not what we're talking about here.

If you consider https://github.com/alantech/marsha/blob/main/examples/web/we... ...

The generated output of this file is a probability distribution with a sweet spot where the code does what you want; there are multiple outputs of code that sit in the sweet spot. You want one of these.

The actual output of this file is a probability distribution that includes the examples, but may or may not overlap the sweet spot of 'actually does the right thing'.

...in fact, and there's no specific reason to expect that, regardless of the number of examples you provide, the distribution that includes those examples also includes the sweet spot.

For common examples it will, but I'd argue that it's actually provable that there are times (eg. where the output length of a valid solution would be > the possible out of the model), that regardless of the examples / tests, it's not actually possible to generate a valid solution from. Just like how constraint solvers will sometimes tell you there's no solution that matches all the constraints.

So, that would be like a compiler error. "You've asked for something impossible".

...but I imagine it would be very very difficult to tell the difference between inputs that overlap the sweet spot and those that don't; the ones that don't will have solutions that look right, but actually only cover the examples; and there's literally no way of telling the difference between that and a correct solution without HFRL.

It seem like an intractable problem to me.

> Different tools for different scenarios, so if that is a huge problem, don't use Marsha as it currently is.

As you say~