
by yamihere 438 days ago

Nice! Thanks for responding.

>> Outputs undergo rigorous validation steps, including cross-checking with advanced auditing models such as OpenAI’s o1-pro, which has proven especially proficient at performing high-quality random audits.

>> there was a lot of manual effort involved in coming to that conclusion with my own effort to review the LLM's reasoning

So, the randomly audited entries seemed reasonable to you – not even the data itself, just the reasoning about the generated data. Did the manual reviews stop once things started looking good enough? Are the audits ongoing, to fill out the rest of the dataset? Would those be manually double-checked as well?

>> I became interested in exploring how recent advances in generative AI could enable entirely new kinds of consumer products—ones whose core innovations leveraged AI but didn’t explicitly market themselves as “AI products.”

Once again: Why not market this as an AI product? This is LLMs all the way down.

People are already interested in using this dataset. I was. Now, LLM-generated “usually close enough to not be actively harmful” data is being distributed as a source for any and all to use. I think your disclaimer is excellent. Does your license require that those using this data provide an equivalent disclaimer?


joshdickson 438 days ago

> not even the data itself, just the reasoning about the generated data

Poor phrasing on my end -- yes, absolutely the end data as well as the reasoning, as the reasoning tends to include the final answer.

Maybe I should! Appreciate the feedback.


yamihere 438 days ago

Thanks again. Mine was an uncharitable interpretation, apologies for that. I appreciate your engagement with critical comments without coming off as defensive or snarky.

It looks like a lot of work and goodwill were poured into this, and I can see how it can be useful to a fitness-focused audience.

You control the messaging on the site and in your apps, and you make it clear that this is not authoritative data. Everything built on top of this needs to have the same messaging, but it has probably been ingested into multiple LLMs already.

I think some sort of licensing requirement that the LLM source of this data be prominently disclosed will not keep this from becoming a source of truth for other datasets, products, and services, but it is still worth the effort. All you can do is all you can do, right?


joshdickson 438 days ago

Including that requirement in the license is a good idea, and one I had not considered, but I will. Frankly, my motivations have been more on the citation side of things, such that the need for quality disclaimers is not as great. Thank you for the suggestion.
