by thor-rodrigues 664 days ago
I think that focusing primarily on the discussion of what is or isn't open source software makes us miss an interesting point here: Llama lets users get performance similar to frontier models on their own systems, without having to send data to third parties.

My company is building an application for a university client that examines research data written in "human language" (mostly notes and docs).

Due to the high confidentiality of the subjects, which often involve not-yet-patented information, we couldn't risk using frontier models, as doing so could break the novelty of the invention, which would mean losing patentability.

Now with Llama 3.1, we can simply run these models locally, on systems that are not even connected to the internet. LLMs are mostly good at examining massive amounts of research papers and information, at least for the application we are aiming at, saving thousands of hours of tiresome (and very boring) human labour.

I am not trying to endorse Meta or Zuckerberg or anything like that, but at least in this respect, I think Llama being "open-source" is a very good thing.


To me it's fairly interesting how relatively little money it takes Meta to pose a risk to other model makers' businesses, which depend on running the model after they have created it (because that is how they make money), while Meta does not even have to bear the cost of providing inference infrastructure to pose that risk.
That's a funny definition of "little money"
They did say "relatively little money" which is arguably true.
How so? It’s little money relative neither to Meta’s income nor to anyone else’s.
Can you imagine how incredible an open source model would be for research / humanity beyond the business needs right in front of us?

Open-Knowledge source with an Open-Intelligence that can guide you through the entire massive digital library of its own brain. Semantic data light-years beyond a Search Engine.

If you have the model weights you have roughly the same opportunities as the company that trained the model. The code you need to run inference on the Llama weights is very much open source. The only thing you're missing out on is the training code, which is prohibitively expensive to run for most anyways. Open source training isn't going to give you any unique insights into the "digital brain library" of your model.

Also just to be clear, if you want to set up a RAG with an open weight model and a large dataset there's nothing stopping you. Download Red Pajama and Llama and give it a try.

https://github.com/togethercomputer/RedPajama-Data
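
A minimal sketch of the retrieval half of such a RAG setup, under loud assumptions: the three-sentence corpus, the function names `retrieve` and `build_prompt`, and the bag-of-words cosine scorer are all made up for illustration. A real setup would likely swap in an embedding index over chunks of the dataset, and the final prompt would be handed to a locally hosted model, which is left out here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, documents, k=2):
    """Return up to k documents that share words with the query, best first."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(d))), d) for d in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for score, d in scored[:k] if score > 0]


def build_prompt(query, documents, k=2):
    """Prepend the retrieved passages to the question as context."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Hypothetical stand-in corpus; a real setup would index a large dataset.
corpus = [
    "Potato salad is made from boiled potatoes, mayonnaise and mustard.",
    "Transformers are pre-trained on large text corpora.",
    "Patent novelty can be lost through public disclosure.",
]

prompt = build_prompt("How do I make potato salad?", corpus, k=1)
print(prompt)  # this string would then be sent to the local model
```

Because the retrieved passages are plain strings placed in the prompt, the model's answer can point back at them, and nothing in this loop needs a network connection.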

> Open-Knowledge source with an Open-Intelligence that can guide you through the entire massive digital library of its own brain. Semantic data light-years beyond a Search Engine

This sounds like the usual AI marketing with the word "open" thrown in. It's not articulating something you can only do with an open source LLM (and doesn't define what that means).

I'm personally not thrilled with how locked down LLMs are. But we'll need to do a better job at articulating (a) a definition and (b) the benefits of adhering to it versus the "you can run it on your own metal" definition Facebook is promulgating. Because a model meeting Facebook's definition has obvious benefits over proprietary models run on someone else's servers.

You can't imagine it :( Open data :(

I believe our world fights to destroy ideas like this because our economy drives our entire life.

> Can you imagine

>> No

>>> You can't imagine it

You haven’t articulated the idea you claim the “world fights to destroy”. (Just throwing around the word open without elaboration isn’t an idea.)

I’m not sure what they’re talking about, but I’ll throw my hat into the ring. Copyright and other such systems are destroying any chance that we, as humanity, have of letting LLMs progress in an open and transparent manner. We have to hide the training data and make the weights a black box because of antiquated notions such as copyright. While I am willing to permit some level of exclusivity with creative works, 100+ years is unreasonable and stagnates human creativity even outside of ML tasks. In the 19th century, I could take a book I was raised on and write my own fanfiction, and because that book would have been public domain by the time I was an adult I could add onto the work and the other fans of the previous work could build upon it with me. We see this with Sherlock Holmes for instance. If I wanted to publish a book set in the world of Harry Potter I’d need to wait for JK Rowling to croak, and then wait another 70 YEARS.

We need dramatic reforms on copyright, as we’ve really let corporate interests crowd out our rights to human culture and ideas. While I alone cannot decide what we as a country should find reasonable, I can say that I find 20 years plus a 5-year extension perfectly reasonable and that corporations should never have been able to pay off politicians to get what they wanted. Let alone Sonny Bono, that bastard, signing in bills that specifically benefited him.

So, to reiterate, the idea I feel that corporations want to destroy is the idea that we, as a people, have rights to the works that form our popular culture and that no one man, let alone a faceless corporation, should be able to profit from a singular work for hundreds of years.

Data that is accessible. Knowledge. Truth. With an AI trained on it that can explain it in expert or layman terms, in any human language.
You’re undermining the case for an open source LLM by stating things fully-proprietary models do.
You're not really asking for an open source model though, you're asking for open source training data set(s), which isn't something that Meta can give you. There are open source web scrapes such as The Pile, but much of the more specialized data needs to be licensed.
I'm asking for an "Open Source AI" and Meta and everyone supporting them is convinced it's impossible in our lifetimes :( We are living in the Dark Ages where Information = $$$. I pray to AI we one day grow out of this pointless destructive economic spiral towards the heat death of the Earth and collect and share open knowledge across all human cultures and history.
Well, as long as by "AI" you are referring to pre-trained transformers, then what you are effectively asking for is the data used to pre-train them.

OTOH why you want the data is not clear. You don't need it to run Meta's models for free, or to fine-tune them for your own needs. The only thing the data would allow you to do is pre-train from scratch, in other words to obtain the exact same set of weights that Meta is giving you for free.

All of that data is already available, just look into “shadow libraries”. Now, I do wish Meta and other companies would publish their data sets so that we, as humanity, could improve upon them and empower even better LLMs, but the unfortunate reality is that copyright is holding us back. Most of what you say is essentially gibberish, but there is truth in the idea that LLMs would be better if they could not only use their weights but also reference and search their training data (that is collectively owned by humanity, by the way) and answer with that and not just what they “think”.
No, I really can't imagine it. Extrapolating from our free commercially-licensed offerings it would seem most people would ignore it or share stories on Reddit about how FreeGPT poisoned their family when generating a potato salad recipe.
An open source model would be able to give you the sources of its potato salad recipe inspiration. It would be the best of both worlds. AI Knowledge + Real Open Human Knowledge.
Just because you have the dataset doesn't mean you can generate a reference. Let's say I hand you a potato salad recipe and a copy of the entire internet. Say you somehow extract all potato salad recipes from the dataset (non-trivial, btw) and none of them are an exact match for the recipe the model generated. Now what?
> open source model would be able to give you the source of its potato salad recipe

Kagi’s LLM can already do that. I believe so can Perplexity’s. Citing sources isn’t something only open models can do.

I'm pretty sure Kagi is like a normal search engine with AI integration like Google. Not an AI designed to be open source with an open dataset of knowledge it was trained on.
> pretty sure Kagi is like a normal search engine with AI integration like Google

Sure. The point is the thing you said only an open-source model can do, it can do. Plenty of proprietary LLMs can cite sources.

The plain truth is most of the benefits of open models are not on the consumer side. (Or at least, I haven't seen any articulated.) They're on the producers'. Open models are better for those of us training models. That's partly why the open data debate is academic--very few people are training large foundation models because the compute and electricity costs are prohibitive.

Can you expand on the risk of breaking novelty?

Is the concern that prompts could be re-used for training by the provider and such knowledge become part of the model?