Hacker News new | ask | show | jobs
by perching_aix 38 days ago
Really unsure why this is getting downvoted, to my understanding this is a massive, unsettled concern.

It wasn't even a disasm/pseudocode to formal spec flow, and then a separate human implementation. The same human has been in the loop throughout, and large parts of it were generated directly.

It's basically guaranteed tainted.

Edit: I should have skimmed a bit more patiently, there was in fact no "disasm/pseudocode + the human getting tainted" part to this apparently.

3 comments

I read the post you're replying to as saying "this is copyright-encumbered and nonfree because it's a derivative work of everything in Claude's and GPT-5.5's training corpus", which is an argument I find fairly tiresome. (Realistically, if courts actually rule that this is the case, this tiny little project will be the least of anyone's concerns.)

"This is copyright-encumbered and nonfree because it's a derivative work of the legacy RAR binaries" is a different argument (and seems like it depends on details of the setup that were somewhat glossed over in the post).

The point is, excepting current legal standards which are already very murky, how can _you_ claim copyright, if you don't _know_ it isn't encumbered?

You can get these LLMs to generate copyrighted outputs both intentionally and accidentally. This is a known fact; therefore, if you're not checking the output to see if this has occurred then you're potentially generating legal risks for yourself and anyone who uses your code.

To not only ignore this for your own use case but to then release the code under a proclaimed license seems legally problematic if not ethically concerning.

If you did get sued for infringement I can't imagine that your defense would be that you find the argument tiresome? Honestly, do you think this would never happen, or how would you go about defending your actions here?

What do you mean by "checking the output"? Is there some kind of check the author says he didn't do that you think he should have? Or is your claim that using an LLM for coding is always copyright infringement? If so, I think the risk that I'll personally be the test case that resolves whatever ambiguities exist in the law is basically zero, and I don't think derailing the thread to be about that topic enlightens anyone.
> What do you mean by "checking the output"?

At the very least you could see if it's already been open sourced under a different license. If you take GPL code and just slap MIT on it do you not consider that a violation?

> Or is your claim that using an LLM for coding is always copyright infringement?

I'm claiming you cannot really know.

> I'll personally be

It may be someone who uses or redistributes your code in any fashion.

> derailing the thread

I've made two posts. One with an idea and the second clarifying it. This is not "derailing the thread" under any sane definition. This is simply a complicated and relatively unexplored topic that clearly draws a lot of interest and resulting conversation from the crowd here.

I think using this type of bullying rhetoric damages that conversation and harms the reputation of Hacker News in general and I always regret it when I see it.

I didn't actually read any code. I generated spec documents using Claude, then later on used Codex to generate from the spec docs. Are the specs tainted? If someone else independently develops from my spec, is that also tainted? What if they hear it second hand? It's an interesting legal situation for sure.
I also am skeptical of the "LLM output is derivative of everything in the training corpus" argument in general, but in this specific case I think it may have more merit. If the model was trained on unrar source code, and obtained specific information about the RAR format from that code which it then used in the code generation step, then the output is arguably tainted because of that.
Does the source-available UnRAR do anything that the existing FOSS implementations can't do? IIUC the interesting part of this particular project is that it supports really old versions of the file format that were never publicly documented anywhere.
The human wasn't looking at the copyrighted code and was giving high level steering instructions. If you look at the spec generated it doesn't look like a derivative work of the copyrighted material. The program was generated from the spec. It seems mostly fine from my perspective.
If I use a decompiler on existing binaries, then some machine translation utility to turn that into a different language, that still feels like a derivative work, even if no human were reviewing the specifics.
The idea is to make it so that the parts of the output that are derived from the existing binary are not themselves eligible for copyright protection. I.e., factual descriptions of the file format, without any implementation details from the binary.
> Really unsure why this is getting downvoted

Because it’s a boring argument that we’re not going to make progress on until it is actually tested in court.

Also, if/when this is is tested, the court’s options seem to be (a) say yeah this is fine, or (b) cause unending havoc that if followed through on would destroy the economy (a precedent that any org who’s proprietary code made it into ai training data could sue any org that was using code generated by that model? Do the math on how many suits that is.)