by ameliaquining 38 days ago
I read the post you're replying to as saying "this is copyright-encumbered and nonfree because it's a derivative work of everything in Claude's and GPT-5.5's training corpus", which is an argument I find fairly tiresome. (Realistically, if courts actually rule that this is the case, this tiny little project will be the least of anyone's concerns.)

"This is copyright-encumbered and nonfree because it's a derivative work of the legacy RAR binaries" is a different argument (and seems like it depends on details of the setup that were somewhat glossed over in the post).

The point is, setting aside current legal standards, which are already very murky, how can _you_ claim copyright if you don't _know_ the code isn't encumbered?

You can get these LLMs to generate copyrighted outputs both intentionally and accidentally. This is a known fact; therefore, if you're not checking the output to see whether this has occurred, you're potentially creating legal risk for yourself and anyone who uses your code.

To not only ignore this for your own use case but to then release the code under a proclaimed license seems legally problematic if not ethically concerning.

If you did get sued for infringement, I can't imagine that your defense would be that you find the argument tiresome. Honestly, do you think this would never happen? If not, how would you go about defending your actions here?

What do you mean by "checking the output"? Is there some kind of check the author says he didn't do that you think he should have? Or is your claim that using an LLM for coding is always copyright infringement? If so, I think the risk that I'll personally be the test case that resolves whatever ambiguities exist in the law is basically zero, and I don't think derailing the thread to be about that topic enlightens anyone.

> What do you mean by "checking the output"?

At the very least you could see if it's already been open sourced under a different license. If you take GPL code and just slap MIT on it, do you not consider that a violation?

> Or is your claim that using an LLM for coding is always copyright infringement?

I'm claiming you cannot really know.

> I'll personally be

It may be someone who uses or redistributes your code in any fashion.

> derailing the thread

I've made two posts. One with an idea and the second clarifying it. This is not "derailing the thread" under any sane definition. This is simply a complicated and relatively unexplored topic that clearly draws a lot of interest and resulting conversation from the crowd here.

I think using this type of bullying rhetoric damages that conversation and harms the reputation of Hacker News in general, and I always regret it when I see it.

I didn't actually read any code. I generated spec documents using Claude, then later on used Codex to generate from the spec docs. Are the specs tainted? If someone else independently develops from my spec, is that also tainted? What if they hear it second hand? It's an interesting legal situation for sure.

I also am skeptical of the "LLM output is derivative of everything in the training corpus" argument in general, but in this specific case I think it may have more merit. If the model was trained on unrar source code, and obtained specific information about the RAR format from that code which it then used in the code generation step, then the output is arguably tainted because of that.

Does the source-available UnRAR do anything that the existing FOSS implementations can't do? IIUC the interesting part of this particular project is that it supports really old versions of the file format that were never publicly documented anywhere.