Hacker News
by koolala 664 days ago
Can you imagine how incredible an open source model would be for research / humanity beyond the business needs right in front of us?

Open-Knowledge source with an Open-Intelligence that can guide you through the entire massive digital library of its own brain. Semantic data light-years beyond a Search Engine.

5 comments

If you have the model weights you have roughly the same opportunities as the company that trained the model. The code you need to run inference on the Llama weights is very much open source. The only thing you're missing out on is the training code, which is prohibitively expensive to run for most anyways. Open source training isn't going to give you any unique insights into the "digital brain library" of your model.

Also just to be clear, if you want to set up a RAG with an open weight model and a large dataset there's nothing stopping you. Download Red Pajama and Llama and give it a try.

https://github.com/togethercomputer/RedPajama-Data
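
To make the RAG suggestion concrete, here is a minimal sketch of the retrieval half, assuming a toy in-memory corpus and a plain TF-IDF ranking. The three documents, the function names, and the prompt wording are all hypothetical stand-ins; a real setup would index a dataset like RedPajama with a proper embedding or search index and pass the prompt to a locally run open-weight model.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus standing in for a large open dataset.
DOCUMENTS = [
    "Llama is a family of open-weight language models released by Meta.",
    "RedPajama is an open dataset for training large language models.",
    "Potato salad is usually made from boiled potatoes and a dressing.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Return one TF-IDF vector (a dict) per document, plus the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    n = len(docs)
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    vectors = [
        {t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized
    ]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, docs, vectors, idf, k=2):
    """Return the k documents most similar to the query."""
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(query)).items()}
    ranked = sorted(
        range(len(docs)), key=lambda i: cosine(q, vectors[i]), reverse=True
    )
    return [docs[i] for i in ranked[:k]]


def build_prompt(query, passages):
    """Assemble a prompt that asks the model to answer from numbered sources."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the numbered sources below and cite them.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


vectors, idf = build_index(DOCUMENTS)
question = "What is RedPajama?"
passages = retrieve(question, DOCUMENTS, vectors, idf, k=2)
prompt = build_prompt(question, passages)
print(prompt)  # this string is what you would send to the model
```

Nothing here depends on the model being open source; the retrieval step only needs the documents, which is the point being made above.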

> Open-Knowledge source with an Open-Intelligence that can guide you through the entire massive digital library of its own brain. Semantic data light-years beyond a Search Engine

This sounds like the usual AI marketing with the word "open" thrown in. It's not articulating something you can only do with an open source LLM (and doesn't define what that means).

I'm personally not thrilled with how locked down LLMs are. But we'll need to do a better job at articulating (a) a definition and (b) the benefits of adhering to it versus the "you can run it on your own metal" definition Facebook is promulgating. Because a model meeting Facebook's definition has obvious benefits over proprietary models run on someone else's servers.

You can't imagine it :( Open data :(

I believe our world fights to destroy ideas like this because our economy drives our entire life.

> Can you imagine

>> No

>>> You can't imagine it

You haven’t articulated the idea you claim the “world fights to destroy”. (Just throwing around the word open without elaboration isn’t an idea.)

I’m not sure what they’re talking about, but I’ll throw my hat into the ring. Copyright and other such systems are destroying any chance that we, as humanity, have of letting LLMs progress in an open and transparent manner. We have to hide the training data and make the weights a black box because of antiquated notions such as copyright. While I am willing to permit some level of exclusivity with creative works, 100+ years is unreasonable and stagnates human creativity even outside of ML tasks. In the 19th century, I could take a book I was raised on and write my own fanfiction, and because that book would have been public domain by the time I was an adult I could add onto the work and the other fans of the previous work could build upon it with me. We see this with Sherlock Holmes for instance. If I wanted to publish a book set in the world of Harry Potter I’d need to wait for JK Rowling to croak, and then wait another 70 YEARS.

We need dramatic reforms on copyright, as we’ve really let corporate interests crowd out our rights to human culture and ideas. While I alone cannot decide what we as a country should find reasonable, I can say I find 20 years plus a 5-year extension perfectly reasonable, and that corporations should never have been able to pay off politicians to get what they wanted. Let alone Sonny Bono, that bastard, signing in bills that specifically benefited him.

So, to reiterate, the idea I feel that corporations want to destroy is the idea that we, as a people, have rights to the works that form our popular culture and that no one man, let alone a faceless corporation, should be able to profit from a singular work for hundreds of years.

Data that is accessible. Knowledge. Truth. With an AI trained on it that can expose it in any expert / layman terms into any human language.
You’re undermining the case for an open source LLM by stating things fully-proprietary models do.
They don't make the source data accessible :(
You're not really asking for an open source model though, you're asking for open source training data set(s), which isn't something that Meta can give you. There are open source web scrapes such as The Pile, but much of the more specialized data needs to be licensed.
I'm asking for an "Open Source AI" and Meta and everyone supporting them is convinced it's impossible in our lifetimes :( We are living in the Dark Ages where Information = $$$. I pray to AI we one day grow out of this pointless destructive economic spiral towards the heat death of the Earth and collect and share open knowledge across all human cultures and history.
Well, as long as by "AI" you are referring to pre-trained transformers, then what you are effectively asking for is the data used to pre-train them.

OTOH why you want the data is not clear. You don't need it to run Meta's models for free, or to fine-tune them for your own needs. The only thing the data would allow you is to pre-train from scratch, in other words to obtain the exact same set of weights that Meta is giving you for free.

All of that data is already available, just look into “shadow libraries”. Now, I do wish Meta and other companies would publish their data sets and we, as humanity, could improve upon them and empower even better LLMs, but the unfortunate reality is copyright is holding us back. Most of what you say is essentially gibberish, but there is truth that LLMs would be better if they could not only utilize their weights, but also reference and search their training data (that is collectively owned by humanity, by the way) and answer with that and not just what they “think”.
No, I really can't imagine it. Extrapolating from our free commercially-licensed offerings it would seem most people would ignore it or share stories on Reddit about how FreeGPT poisoned their family when generating a potato salad recipe.
An open source model would be able to give you the sources of its potato salad recipe inspiration. It would be the best of both worlds. AI Knowledge + Real Open Human Knowledge.
Just because you have the dataset doesn't mean you can generate a reference. Let's say I hand you a potato salad recipe and a copy of the entire internet. Say you somehow extract all potato salad recipes from the dataset (nontrivial, btw) and none of them are an exact match for the recipe the model generated. Now what?
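
A small sketch of that problem, assuming a made-up three-recipe corpus and word-trigram Jaccard overlap as the similarity measure (both are illustrative choices, not how any real model or attribution tool works). The generated recipe blends two sources, so the search returns two partial matches and no exact one.

```python
import re


def shingles(text, n=3):
    """Return the set of word n-grams (as tuples) in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if either is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def closest_sources(generated, corpus, n=3, k=2):
    """Rank corpus documents by n-gram overlap with the generated text."""
    g = shingles(generated, n)
    scored = [(jaccard(g, shingles(doc, n)), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical dataset: two potato salad recipes and one unrelated recipe.
CORPUS = [
    "Boil the potatoes, then mix with mayonnaise, mustard and chopped celery.",
    "Boil the potatoes, then toss with olive oil, vinegar and fresh dill.",
    "Whisk eggs with sugar and fold in the flour.",
]

# Hypothetical model output that borrows from both potato salad recipes.
GENERATED = "Boil the potatoes, then mix with olive oil, mustard and fresh dill."

results = closest_sources(GENERATED, CORPUS)
for score, doc in results:
    print(f"{score:.2f}  {doc}")
```

Both potato salad recipes score well below 1.0 and close to each other, so even with the full dataset in hand there is no single document you can point to as "the source".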
> open source model would be able to give you the source of its potato salad recipe

Kagi’s LLM can already do that. I believe so can Perplexity’s. Citing sources isn’t something only open models can do.

I'm pretty sure Kagi is like a normal search engine with AI integration like Google. Not an AI designed to be open source with an open dataset of knowledge it was trained on.
> pretty sure Kagi is like a normal search engine with AI integration like Google

Sure. The point is the thing you said only an open-source model can do, it can do. Plenty of proprietary LLMs can cite sources.

The plain truth is most of the benefits of open models are not on the consumer side. (Or at least, I haven't seen any articulated.) They're on the producers'. Open models are better for those of us training models. That's partly why the open data debate is academic--very few people are training large foundation models because the compute and electricity costs are prohibitive.

I'm kinda hoping World Governments will use their Public Library infrastructure to train AI. Japan is my #1 hope with how they are opening public science knowledge. Super-computers have been prohibitive for a long time but national science institutions could be a great place for open source & open weight AI.