| HN Mirror



	by byefruit 472 days ago
	> Both of our models are trained on top of DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Qwen-32B.

	Not to take away from their work, but this shouldn't be buried at the bottom of the page. There's a gulf between completely new models and fine-tuning.

4 comments

israrkhan 472 days ago

Agreed. Also, their name makes it seem like it is a totally new model.

If they needed to assign their own name to it, they could at least have included the parent (and grandparent) model names in the name.

Just like the name DeepSeek-R1-Distill-Qwen-7B clearly says that it is a distilled Qwen model.


qeternity 472 days ago

DeepSeek probably would have done this anyway, but they did release a Llama 8B distillation and the Meta terms of use require any derivative works to have Llama in the name. So it also might have just made sense to do for all of them.

Otoh, there aren't many frontier labs that have actually done finetunes.


diggan 472 days ago

> the Meta terms of use require any derivative works to have Llama in the name

Technically it requires the derivatives to begin with "llama". So "DeepSeek-R1-Distill-Llama-8B" isn't OK by the license, while "Llama-3_1-Nemotron-Ultra-253B-v1" would be OK.

> [...] If you use the Llama Materials or any outputs or results of the Llama Materials to create, train, fine tune, or otherwise improve an AI model, which is distributed or made available, you shall also include “Llama” at the beginning of any such AI model name.
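
To make the distinction concrete, here is a minimal Python sketch of the prefix check described by that clause. The function name `name_begins_with_llama` is made up for illustration, and treating the match as case-insensitive is an assumption; the license text only says “Llama”. It checks the string rule as quoted, nothing more, and is not a legal test.

```python
def name_begins_with_llama(model_name: str) -> bool:
    """Return True if the model name starts with "Llama".

    Assumption: the comparison ignores case and surrounding whitespace.
    The quoted license clause does not spell out either detail.
    """
    return model_name.strip().lower().startswith("llama")


# The two names mentioned in this comment.
for name in ("DeepSeek-R1-Distill-Llama-8B", "Llama-3_1-Nemotron-Ultra-253B-v1"):
    print(name, name_begins_with_llama(name))
```

Having "Llama" somewhere in the middle of the name does not pass this check, which is the point being made about the DeepSeek distill.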

I've previously written a summary that includes all parts of the license that I think others are likely to have missed: https://notes.victor.earth/youre-probably-breaking-the-llama...


lumost 472 days ago

I suspect that we'll see a lot of variations on this. With the open models catching up to SOTA, and the foundation models being relatively static, there will be many new SOTAs built off of existing foundation models.

How many of the latest databases are postgres forks?


adamkochanowicz 472 days ago

Also, am I reading that right? They trained it not only on top of another model, and not only on one that is itself already distilled from another model, but on one that has far fewer parameters (7B)?


rahimnathwani 472 days ago

They took the best available models for the architecture they chose (in two sizes), and fine-tuned those models with additional training data. They don't say where they got that training data, or what combination of SFT and/or RLHF they used. It's likely that the training data was generated by larger models.


GodelNumbering 472 days ago

This has been happening a lot on r/LocalLLaMA for the past few months. Big headline claims, followed by "oh yeah, it's a finetune".
