by jacob019 390 days ago

Well that didn't take long, available from 7 providers through openrouter.

https://openrouter.ai/deepseek/deepseek-r1-0528/providers

May 28th update to the original DeepSeek R1. Performance on par with OpenAI o1, but open-sourced and with fully open reasoning tokens. It's 671B parameters in size, with 37B active in an inference pass.

Fully open-source model.
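
To put "37B active" out of 671B in perspective, the arithmetic can be sketched as below. The two counts come from the description above; treating them as exact round numbers is an assumption.

```python
# Parameter counts quoted in the description above (rounded, in billions).
TOTAL_PARAMS_B = 671
ACTIVE_PARAMS_B = 37


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of all parameters that take part in a single inference pass."""
    return active_b / total_b


fraction = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
print(f"{fraction:.1%} of the weights are active per pass")  # about 5.5%
```

So all 671B parameters still have to be stored, but only roughly one in eighteen is used for any given pass.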

4 comments

jazzyjackson 390 days ago

No sign of what source material it was trained on though right? So open weight rather than reproducible from source.

I remember there's a project "Open R1" that last I checked was working on gathering their own list of training material, looks active but not sure how far along they've gotten:

https://github.com/huggingface/open-r1

pradn 390 days ago

Isn't it basically not possible for the input data set list to be listed? It's an open secret all these labs are using immense amounts of copyrighted material.

There's a few efforts at full open data / open weight / open code models, but none of them have gotten to leading-edge performance.

ratamacue 389 days ago

My brain was largely trained using immense amounts of copyrighted material as well. Some of it I can even regurgitate almost exactly. I could list the names of many of the copyrighted works I have read/watched/listened to. I suppose my brain isn't open source, although I don't think it would currently be illegal to take a snapshot of my brain and publish it if the technology existed and open-source that. Granted, this would only be "reproducible" from source if you define the "source" as "my brain" rather than all of the material I consumed to make that snapshot.

overfeed 389 days ago

> Some of it I can even regurgitate almost exactly

If you (or any human) violate copyright law, legal redress can be sought. The amount of damage you can do is limited because there's only one of you vs the marginal cost of duplicating AI instances.

There are many other differences between humans and AI in terms of capabilities and motivations to f the legal persons making decisions.

ljosifov 389 days ago

You may be right about the damage (will not dispute it even if I personally doubt it) - but what about the amount of good that it can do too? When deciding "what is to be done now" under uncertainty, we typically look at both sides of the ledger, the upsides in addition to the downsides.

Assume for a moment, that the current AI is teaching us that compute transforming data → information → knowledge → intelligence → agency → ... → AGI → ASI, is all there is to Intelligence-on-Tap? And imagine an AI path opens to AGI now and ASI later, where previously we didn't see any. Seems a bad deal to me, to frustrate, slow down, or even forego the 2050-s Intelligence Revolution that may multiply total human wealth by a factor of 10 to 20 in value, the way the Industrial Revolution did in the 1800-s. And we are to forego this, for what - so that we provide UBI to Disney shareholders? Every one of us is richer, better off now, than any king of old. Not too long ago, even the most powerful person in the lands could not prevent their 17 miscarriages/stillbirths/child_deaths failing to produce an heir to ascend the throne (a top priority that was, for sure for a king+queen). So in our imagined utopia, even the Disney shareholders are better off than they would be otherwise.

overfeed 388 days ago

> Seems a bad deal to me, to frustrate, slow down, or even forego the 2050-s Intelligence Revolution that may multiply total human wealth by a factor of 10 to 20 in value...

Why do you assume the emergence of a super intelligence would result in human wealth increasing instead of decreasing? Looking at how humans with superior technology used it to exploit fellow humans throughout history should give you pause. Humans don't care about the aggregate "dog wealth" - let alone that of ants.

CamperBob2 389 days ago

> The amount of damage you can do is limited because there's only one of you vs the marginal cost of duplicating AI instances

But enough about whether it should be legal to own a Xerox machine. It's what you do with the machine that matters.

overfeed 389 days ago

> It's what you do with the machine that matters.

The capabilities of a machine matter a lot under law. See current US gun legislation[1], or laws banning export of dual-use technology for examples of laws that have inherent capabilities - not just the use of the thing - as core considerations.

1. It's illegal to possess a new, automatic weapon with some grandfathering prior to 1986

ljosifov 389 days ago

:-) I like the symmetry of this. If I want to keep my creations outside the hands of others, I can keep them private. I don’t have to publish these words or broadcast them to the world. I could write this on my laptop, save it in a file, and keep it to myself. Fine.

However, once these words are broadcast—once they’re read, and the ideas expressed here enter someone else’s mind—I believe it’s only fair that the person on the receiving end has the right to use, replicate, or create something from them. After all, they lent me their brain—ideas that originated in my mind now live in theirs.

This uses up their mental "meat space," their blood sugar, and their oxygen—resources they provide. So, they have rights too: the right to do as they please with those ideas, including creating any and all data derived from them. Denying them that right feels churlish, as if it isn’t the most natural thing in the world.

(Before people jump on me:- Yes, creators need to be compensated—they deserve to make a living from their work. But this doesn’t extend to their grandchildren. Copyright laws should incentivize creation, not provide luxury for the descendants of the original creator a century later.)

MrSkelter 383 days ago

This is a fundamental misunderstanding of copyright.

Copyright isn’t violated when someone consumes a copyrighted work.

Copyright is violated when a copyrighted work is used by someone who isn’t the author to generate profit without prior permission.

You can read a copyrighted book and remember it. You cannot copy it and sell copies. If you want to excerpt it you must give credit and there are limits to what’s considered “fair use”.

3abiton 390 days ago

The only way this would work is with "leaks". But even then, as we saw with everything on the internet, it just added another guardrail on content. Now I can't watch YouTube videos without logging in, and on nearly every website I need to solve some weird ash captchas. It's becoming easier to interact with these chatbots rather than search for a solution online. And I wonder, with Veo 4 copy cats, it might be even easier to prompt for a video rather than search for one.

prmoustache 390 days ago

That doesn't mean it isn't possible.

bee_rider 390 days ago

“Not possible” = “a business-destroying level of honesty”?

rcxdude 390 days ago

Even if training on the copyrighted material is OK, just providing a data dump of it almost certainly is not.

alpaca128 390 days ago

No need for a data dump, just list all URLs or whatever else of their training data sources. Afaik that's how the LAION training dataset was published.

anonymoushn 390 days ago

providing a large list of bitrotted URLs and titles of books which the user should OCR themselves before attempting to reproduce the model doesn't seem very useful.
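
The URL-list idea and the bitrot objection above can be made concrete with a small sketch: a manifest that stores each URL together with a content hash, so a later reader can at least detect that a page has changed. The `ManifestEntry` type, its field names, and the example URL are hypothetical, and nothing here is claimed about how any published dataset is actually laid out.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ManifestEntry:
    """One training-source record (hypothetical layout, for illustration only)."""
    url: str
    sha256: str  # hex digest of the document as it was when collected


def make_entry(url: str, content: bytes) -> ManifestEntry:
    """Record a URL together with the hash of the bytes fetched from it."""
    return ManifestEntry(url=url, sha256=hashlib.sha256(content).hexdigest())


def is_unchanged(entry: ManifestEntry, fetched: bytes) -> bool:
    """True if freshly fetched bytes still match the recorded hash."""
    return hashlib.sha256(fetched).hexdigest() == entry.sha256


entry = make_entry("https://example.com/page", b"original text")
print(is_unchanged(entry, b"original text"))  # True
print(is_unchanged(entry, b"edited text"))    # False
```

A hash only detects bitrot. It does not recover the lost content, which is the objection raised above.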

tokioyoyo 390 days ago

There is a "keep doing what you're doing, as we would want one of our companies to be on top of the AI race" signal from the governments. It could've been stopped, maybe, 5 years ago. But now we're way past it, so nobody cares about these sort of arguments.

behnamoh 390 days ago

> No sign of what source material it was trained on though right?

out of curiosity, does anyone do anything "useful" with that knowledge? it's not like people can just randomly train models..

marci 390 days ago

When you're truly open source, you can make things like this:

Today we introduce OLMoTrace, a one-of-a-kind feature in the Ai2 Playground that lets you trace the outputs of language models back to their full, multi-trillion-token training data in real time. OLMoTrace is a manifestation of Ai2’s commitment to an open ecosystem – open models, open data, and beyond.

https://allenai.org/blog/olmotrace

kreijstal 390 days ago

you can do the same, except you would need to be a pirate website. It would even be better. except illegal. but it would be better.

marci 390 days ago

That is why the others can't provide stuff like this. RAG/Hallucination check. I just wish Allen.AI models had bigger context, 4k is too small nowadays.

ToValueFunfetti 390 days ago

Would be useful for answering "is this novel or was it in the training data", but that's not typically what the point of open source is

anonymoushn 390 days ago

If labs provided the corpus and source code for training their tokenizers, it would be a lot easier to produce results about tokenizers. As it is, they provide neither, so it is impossible to compare different algorithms running on the same data if you also want to include the vocabs that are commonly used.

m00x 390 days ago

Many are speculating it was trained by o1/o3 for some of the initial reasoning.

fulafel 390 days ago

Are there any widely used models that publish this? If not, then no I guess.

DANmode 390 days ago

Depending on how you use "randomly", they absolutely can..?

chrsw 390 days ago

Based on commit history, Open R1 is still active and they're still making progress. Long may it continue, it's an ambitious project.

therealpygon 390 days ago

This was simply a mad scramble to prove/disprove the claims OpenAI was peddling that the model wasn’t actually performing as well as advertised and that they were lying about the training/compute resources. Open-R1 has since applied the training to a similar 7B model and got similar results. At the end of the day, no one really cares what the data was that it was trained on and most AI providers don’t always share this either when releasing open source models, and certainly not available for closed source models.

make3 390 days ago

I don't think people make the distinction like that. The open source vs non open source distinction boils down to, usually, can you use it for commercial use.

what you're saying is just that it's non reproducible, which is a completely valid but separate issue

alpaca128 390 days ago

There's already established terms and licenses for non-commercial use. Like "open weights".

Open source has the word "source" in it for a reason, and those models ain't open source and have nothing to do with it.

ben_w 389 days ago

Took me until this thread to remember that in the 90s we had "freeware".

piperswe 390 days ago

But where's the source? I just see a binary blob, what makes it open source?

jacob019 390 days ago

The weights are the source. It isn't as though something was compiled into weights. They're trained directly. But I know what you mean, it would be more open to have the training pipeline and source dataset available.

timschmidt 390 days ago

The weights seem much more like a binary to me, the training pipeline the compiler, and the training dataset the source.

jumski 390 days ago

Came here to write this - perfect analogy!

otabdeveloper4 390 days ago

You can fine-tune their weights and release your own take.

E.g. see all the specialized third-party models out there based on Qwen.

"Open-source" is the wrong word here, what they mean is "you can modify and redistribute these weights".

yetihehe 390 days ago

You can also reverse engineer and modify closed source programs (see mods for games). Weights are like a compiled version of source data.

otabdeveloper4 390 days ago

Finetuning isn't reverse engineering. Finetuning is a standard supported workflow for these models.

Also, the "redistribute" part is key here.

macrolime 390 days ago

Not legally. That's the difference.

microtonal 390 days ago

There is work to try to reproduce (the original) R1: https://huggingface.co/open-r1

1una 390 days ago

I won't call it "binary blob". Safetensors is just a simple format for storing tensors safely: https://huggingface.co/docs/safetensors/index
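
For a sense of how simple the format is, here is a minimal sketch of the layout described in the linked documentation: an 8-byte little-endian header length, a JSON header giving each tensor's dtype, shape, and byte offsets, then the raw tensor bytes. The function names are hypothetical, only `F32` tensors are handled, and this is an illustration of the layout rather than a replacement for the real library.

```python
import json
import struct


def encode_safetensors(name: str, shape: list[int], values: list[float]) -> bytes:
    """Pack one float32 tensor: header length, JSON header, raw data."""
    data = struct.pack(f"<{len(values)}f", *values)
    header = {name: {"dtype": "F32", "shape": shape, "data_offsets": [0, len(data)]}}
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + data


def decode_safetensors(blob: bytes) -> dict[str, list[float]]:
    """Read the header, then slice each F32 tensor out of the data section."""
    (header_len,) = struct.unpack_from("<Q", blob, 0)
    header = json.loads(blob[8:8 + header_len])
    payload = blob[8 + header_len:]
    tensors = {}
    for tensor_name, info in header.items():
        if tensor_name == "__metadata__":  # optional free-form metadata entry
            continue
        begin, end = info["data_offsets"]
        count = (end - begin) // 4  # 4 bytes per float32 (assumes dtype F32)
        tensors[tensor_name] = list(struct.unpack(f"<{count}f", payload[begin:end]))
    return tensors


blob = encode_safetensors("layer.weight", [2, 2], [1.0, 2.0, 3.0, 4.0])
print(decode_safetensors(blob))
```

The header is plain JSON, so the structure of a file can be inspected without any ML tooling. The numbers themselves are still opaque without the training data and pipeline.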

JKCalhoun 390 days ago

Is there a downloadable model? (Not familiar with openrouter and not seeing the model on ollama.)

zargon 390 days ago

This HN submission goes directly to the downloadable model.

angst 389 days ago

DeepSeek-R1-0528-Qwen3-8B

> ollama run deepseek-r1

from https://ollama.com/library/deepseek-r1

fragmede 390 days ago

It's. not. open. source!

https://www.downloadableisnotopensource.org/

echelon 390 days ago

Open source is a crazy new beast in the AI/ML world.

We have numerous artifacts to reason about:

- The model code

- The training code

- The fine tuning code

- The inference code

- The raw training data

- The processed training data (which might vary across various stages of pre-training and potentially fine-tuning!)

- The resultant weights

- The inference outputs (which also need a license)

- The research papers (hopefully it's described in literature!)

- The patents (or lack thereof)

The term "open source" is wholly inadequate here. We need a 10-star grading system for this.

This is not your mamma's C library.

AFAICT, DeepSeek scores 7/10, which is better than OpenAI's 0/10 (they don't even let you train on the outputs).

This is more than enough to distill new models from.

Everybody is laundering training data, and it's rife with copyrighted data, PII, and pilfered outputs from other commercial AI systems. Because of that, I don't expect we'll see much legally open training data for some time to come. In fact, the first fully open training data of adequate size (not something like LJSpeech) is likely to be 100% synthetic or robotically-captured.
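
The ten-artifact checklist in the comment above can be sketched as a simple scoring structure. The artifact names follow that list; the `hypothetical_release` set is a made-up placeholder and not a claim about which artifacts any real lab has published.

```python
# The ten artifacts listed in the comment above.
ARTIFACTS = (
    "model code",
    "training code",
    "fine-tuning code",
    "inference code",
    "raw training data",
    "processed training data",
    "weights",
    "inference outputs",
    "research papers",
    "patents",
)


def openness_score(released: set[str]) -> int:
    """Count how many of the ten artifacts are openly released."""
    unknown = released - set(ARTIFACTS)
    if unknown:
        raise ValueError(f"unknown artifacts: {sorted(unknown)}")
    return len(released)


# Placeholder input for illustration only.
hypothetical_release = {"model code", "inference code", "weights", "research papers"}
print(f"{openness_score(hypothetical_release)}/{len(ARTIFACTS)}")  # 4/10
```

A flat count treats every artifact as equally important, which is itself a design choice; a weighted score would be one alternative.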

reedciccio 390 days ago

https://opensource.org/ai ... Lots of reasoning has been done on those artifacts

Tepix 390 days ago

I think you’re trying to make it look more complex than it is. Put the amount of data next to every entry in that list of yours.

echelon 390 days ago

Most of those items map to a job description.

If you think the data story isn't a complicated beast, then consider:

If you wanted an "open" dataset, would you want it before or after it was processed? There are a lot of cleaning, categorizing, feature extraction steps. The data typically undergoes a lot of analysis, extra annotation, bucketing, and transformation.

If the pre-train was done in stages, and the training process was complicated, how much hand-holding do you need to replicate that process?

Do you need all of the scripts to assist with these processes? All of the infra and MLOps pieces? There's a lot of infrastructure to just move the data around and poke it.

Where are you going to host those terabytes or petabytes of data? Who is going to download it? How often? Do you expect it to be downloaded as frequently as the Linux kernel sources?

Did you scrub it of PII? Are you sure?

And to clarify, we're not even talking about trained models at this point.

xnickb 390 days ago

I'd argue we don't need a 10 star system. The single bit we have now is enough. And the question is also pretty clear: did $company steal other peoples work?

The answer is also known. So the reason one would want an open source model (read reproducible model), would be that of ethics

selfhoster11 390 days ago

We use pop-cultural references to communicate all the time these days. Those don't necessarily come from only the most commonly known sections of these works, so the AI would necessarily need the full work (or a functional transformation of the work) to be able to hit the theoretical maximum of the ability to decode and reason using such references. To exclude copyrighted works from the training set is to expect it to decode from the outside what amounts to humanity's own in-group jokes.

That's my formal argument. The less formal one is that copyright protection is something that smaller artists deserve more than rich conglomerates, and even then, durations shouldn't be "eternity and a day". A huge chunk of what is being "stolen" should be in the commons anyway.

yencabulator 389 days ago

"Your honor, if I hadn't robbed that bank I wouldn't have gotten all that money!"

echelon 390 days ago

I truthfully cannot think of a single model that satisfies your criteria.

And if we wait for the internet to be wholly eaten by AI, if we accept perfect as the enemy of good, then we'll have nothing left to cling to.

> And the question is also pretty clear: did $company steal other peoples work?

Who the hell cares? By the time this is settled - and I'd argue you won't get a definitive agreement - the internet will be won by the hyperscalers.

Accept corporate gifts of AI, and keep pushing them forward. Commoditize. Let there be no moat.

There will be infinite synthetic data available to us in the future anyway. And none of this bickering will have even mattered.

cavisne 390 days ago

"knowing why a model refuses to answer something matters"

The companies that create these models can't answer that question! Models get jailbroken all the time to ignore alignment instructions. The robust refusal logic normally sits on top of the model, i.e. looking at the responses and flagging anything that they don't want to show to users.

The best tool we have for understanding if a model is refusing to answer a problem or actually doesn't know is mechanistic interp, which you only need the weights for.

This whole debate is weird, even with traditional open source code you can't tell the intent of a programmer, what sources they used to write that code etc.

behnamoh 390 days ago

it's got more 'source' than whatever OpenAI provides for their models.

numpad0 390 days ago

less alcoholic beverages are fully alcoholic beverages

subscribed 389 days ago

0.5% or 0.03% satisfy my "nonalcoholic" criteria.

> Studies have found ethanol levels in commercial apple juice ranging from 0.06 to 0.66 grams per liter, with an average around 0.26 grams per liter[1]

Even apple juice is an alcoholic drink if you push your criteria to absurdity.

[1] https://pmc.ncbi.nlm.nih.gov/articles/PMC5421578/

fragmede 390 days ago

but they're not bleach, and no amount of adding or removing alcohol can transmute the alcohol into something else.

stavros 390 days ago

No it doesn't, it has exactly the same source, zero. It has more downloadable binary.

Aeolun 390 days ago

That’s the ‘source’ for what the model spits out though, if not the source for what spits out the model.

prmoustache 390 days ago

It is just freeware, not open source.

stavros 389 days ago

The "source" for something is all the stuff that makes you able to build and change that something. The source for a model is all the stuff that makes you able to train and change the model.

Just because the model produces stuff doesn't mean that's the model's source, just like the binary for a compiler isn't the compiler's source.

quarters 390 days ago

Ok

https://huggingface.co/deepseek-ai/DeepSeek-R1-0528/blob/mai...

stavros 390 days ago

Slapping an MIT license on a compiled binary doesn't make it open source.

quarters 390 days ago

They're keeping some stuff to themselves which is fine. I don't expect anyone to have to fully release everything they've got especially considering the vast costs associated with researching and developing these models.

What they have released has been distilled into many new models that others have been using for commercial benefit and I appreciate the contributions that they have made.

alpaca128 390 days ago

> I don't expect anyone to have to fully release everything they've got

I also don't expect Microsoft to release their full Windows 11 source code, but that also means it's not open source. And that's okay, because Microsoft doesn't call it open source.

aldanor 389 days ago

Open weights.