Hacker News
Google brings Stack Overflow's knowledge base to Gemini for Google cloud (techcrunch.com)
53 points by onatm 843 days ago
9 comments

One comment to add here - regardless of where you stand on this particular LLM provider:

Do we want knowledge communities like Stack Overflow or Reddit to continue to exist? Should big AI providers that train on their data share some of the value back to the community? Is there an ethical way for web communities to license data to AI providers?

I hope the answer is yes and that there is a path to a productive partnership, one that allows public communities where knowledge is shared freely to thrive, while also bringing more grounded and vetted content to AI systems that are often closed and require a subscription to access.

> Should big AI providers that train on their data share some of the value back to the community?

How would they do that? So far, the LLMs can't be trusted to produce accurate answers. The AI companies can pay money to the data sources, but they can't really offer back anything useful (yet, imho).

Integrate the answers back into Stack Overflow for free. There are tons of questions that never get answered. This also provides a public forum where that response can be corrected and given feedback. Symbiosis.
By definition, aren't those difficult questions to answer? Is there any reason to think the LLMs would succeed where humans have failed? I mean, I'm sure they would produce some output...but is a misleadingly-incorrect answer better than no answer to a thorny obscure question?
Well, the least they could do, imo, would be to post or comment that a question is a duplicate of another, or link to the top-voted answer. Similar to how users on HN post links to "ongoing discussion threads" for duped posts. It's grunt work that these bots should at least be able to regurgitate or find easily.

Also, there's a chance these LLMs have access to other tech forums in addition to Stack Overflow and could possibly provide a solution. For example, GitHub has actually been the better source for me when debugging issues. Usually you can go to the repo, search the issues, and read comments with solutions or workarounds.

But aside from that, I agree with you that these bots will struggle to provide new, non-regurgitated answers and could potentially cause more harm than good.

Reddit literally licenses its data to AI training [1]. If doing so kills its own product that would be hilarious.

[1]: https://arstechnica.com/ai/2024/02/reddit-has-already-booked...

Reddit and Stack Overflow had heavily degraded over the years, long before ChatGPT existed; we should remember that.

They offered a convenience by burning money, and the mismanagement and pre-IPO shenanigans certainly are not helping.

They don't own the content, the communities, or the user base, which can move from IRC to AOL to Discord. What did they learn from dead communities of the past? Do they sell traction and convenience that they own, or are they claiming they sell content which they don't own? Users curated the content, and most effective mods have left. The content in large parts of their sites had become stale or degraded long before OpenAI existed. Graveyard communities cannot curate nor pay for server costs.

AI is convenient curation and people are paying for the convenience. AI sites are also losing money per click, with server costs that are out of the galaxy compared to the cached HTML and Elasticsearch serving crowd.

We have seen the ridiculousness of the AI sites' attempts to introduce management features, short of putting penguins in the desert for animal diversity. But the great teachers of bad management features were Reddit and Stack Overflow, who also actively killed community-developed management modules.

They are failing because they lack a basic understanding of the teachings of centuries of civil society, and they make up what is right, wrong, or politically correct ad hoc based on marketing. Just trying to avoid bad publicity that could scare off the potential IPO crowd only introduces community debt and grievance. That is what has been killing them.

Wikipedia has not been crying foul; it has been curating the highest-quality content for AI, and on a low-cost setup for its size. I just think it's better to donate content and money there.

>> Do we want knowledge communities like Stack Overflow or Reddit to continue to exist?

No, we want better knowledge communities to exist and for Stack Overflow and Reddit to cease to exist.

My experience of Stack Overflow is that the question in the title is too often not answered directly. The specific issue is tangentially related to the title and the answer can amount to a typo or a bad assumption.

There are often clues in the comments that are more helpful than the “answer” and often outdated answers have the most votes.

All this to ask: how on earth is something like an LLM expected to reconcile those issues?

I'm curious about the longevity of these sorts of collaborations. In the future, who is contributing to these knowledge bases? I imagine the communities that surround these places will fade away if new users are unaware of them and content creation begins to halt.

Though I suppose that is a short-sighted concern in itself, given that the way we work will begin to evolve quickly as AI becomes more powerful.

In the end, I guess they end up being a positive press story for Google / OpenAI / etc.?

It really feels analogous to overfishing or deforestation, where it's very profitable at the moment for whoever is doing the harvesting but then ruins the resource for anyone in the future.
Yeah, does the AI then go and upvote or downvote the sources of its responses based on end-user feedback?
I mean we already have that with social media posts getting phantom likes and bot comments to give the appearance of "engagement". You know the same playbook will unfold when the CEOs want the numbers or perceived value to go up.
Stackoverflow's knowledge?

Isn't it the knowledge of its users?

Don’t anthropomorphize computers—they hate it.
Do we think it's going to meaningfully enrich the data?
This question is likely to be answered with opinions rather than facts and citations. This question has already been asked, and answered. As currently written, the question lacks enough detail or clarity to be answered. Your question is too broad or has multiple parts and needs to be distilled into one. I'm voting to close this question because it has no effort to solve the problem.

Expect a lot of enriched answers like these coming soon from Gemini/bard :-p

How come they get to talk to the customer that way and not me lol

These are literally questions I've given to project managers to help create better requirements, but ultimately as a dev you have to come up with "something" regardless and redo the work once the customer complains. Stupid GPTs cutting the line!

The only SO data not yet incorporated into these models is what was created after the model was trained. It appears more like a "licensing" deal to give something back for scraping all "their" data like everyone else.
I think it's probably also about continuing to have access to it for updating information.
So much for deep and thoughtful adoption of AI. We're just gonna go full-speed ahead and damn the consequences.
If it doesn't work, then Stack Overflow got some free money. If it does work, new knowledge production / acquisition will suffer. Or will Gemini pay for answers to questions it does not know? Then some script kiddie using OpenAI will answer them, and then our internet hive mind will take the shape of a Habsburg Jaw.
I don't understand these instances of using ML to return search results. DDG returns results; ML returns results, possibly with hallucination. Even without the hallucination, what's the point? I find the results I need from a search engine. Solution looking for a problem?
I don't know what Google is going to do with it, but Bing is absolutely ruining their search engine with AI. You search for something, AI starts typing out an answer in a typewriter effect. Since AI, I can't trust anymore that the snippets shown at the top are verbatim quotes from a human-written article or something the AI came up with (not considering they could be quoting an AI-generated article!). At the bottom of the search page, where the "next page" would normally be the rightmost button, it's now the "chat" GPT button. I misclick it every time I want to see the next page. I bet that is driving up some metrics and making some engineers really confused about why people keep clicking the chat button and then not chatting.

Overall this all feels so unimaginative. With all the resources these companies have the only solution they can come up with for the search problem is "just throw AI at it." I could come up with that. It's not clever.

Curious why you ever use Bing to begin with?
They sponsor my browser of choice, Vivaldi.

Unlike Google, I can click the second tab every time and it goes to image search. Wait, actually they put "copilot" there and image search is the third tab now. Either way point stands: no shuffling of tabs.

Image search is actually better than Google. I can search for exact image sizes. Google used to offer this! I can just type my screen width and height and find the perfect wallpaper. Wait, it says "at least" here, not "exactly," so I guess it just stores the total amount of pixels of an image and then multiplies the width and height you inputted...

Can you believe this? It's 2024 and I can't even find an image by size on the Internet. I can't even trust the second tab is going to be the images tab. And some people think AI is going to fix software. It's ridiculous. It's just laughable. And so depressing.

GPT4 is much faster for searching through results than I am typically and can pull out exactly what I need.

I've basically completely replaced Google in my day-to-day unless I need to look up a specific location of something in the physical world or something that recently happened.

> I've basically completely replaced Google in my day-to-day (for GPT4)

That's ...not good.

GPTx gets a lot of surface topics right, but when you delve into gritty specific details it will just start rambling like a straitjacket lunatic with the confidence of a used car salesman. The rubber meets the road when I try to compile code that uses libraries or functions that don't exist, or it leads me to hallucinated imaginary GitHub repos. I worry that this use of GPTx would be like getting water from lead pipes: it would seem fine on the day-to-day while my mind is slowly poisoned with nonsense and insanity.

Google has certainly taken a nosedive in result quality over the last few years, but Kagi has been amazing for me lately.

Your usage of 'GPTx' makes me think you are conflating ChatGPT and GPT4. I find ChatGPT useless, so hopefully that's not what you're talking about.

None of the things you are describing happen to me, especially if you do basic trust+verify, which you should be doing for Google anyway.

Could you explain the difference and how you use GPT4? If all you're doing is hitting the GPT4 API, I don't see how it's different.

And, of course, you wouldn't know that your mind is being poisoned with hallucinated half-truths. Maybe you can pick some out because of prior knowledge, but what about the ones you can't? What about the little things you learn that you don't deem important enough to verify, but then remember later without remembering that they snuck in through an untrusted source? That's precisely the danger: you can't accurately tell truth from fiction, and the stuff you already know isn't the stuff you're asking about (otherwise you wouldn't be asking).

> Could you explain the difference and how you use GPT4? If all you're doing is hitting the GPT4 api, I don't see how it's different.

The difference is that the models are completely different? I don't really find that GPT-4 hallucinates all that frequently (only rarely, and in very nitty-gritty details).

> And, of course, you wouldn't know that your mind is being poisoned with hallucinated half-truths

Okay, so it appears you have some non-falsifiable theory of mind that somehow renders Google better because my mind is being poisoned. Not sure what sort of appeal to objectivity I could use to demonstrate otherwise.

> the stuff you already know isn't the stuff you're asking about (otherwise you wouldn't be asking)

True for Google, less true for GPT4, which I ask for practice problems and worked solutions on various things I already know about.

... kinda already tainted by Google Gemini importing the likes of Reddit, DemocratUnderground or Daily Kos, no?