Hacker News
by madaxe_again 631 days ago
Man, I can’t tell you how much labour modern LLMs would have saved me at my business, 10-15 years ago.

An awful lot of what we ended up dealing with was awful data - the worst example I can think of was a big old heap of textual recipes that the client wanted normalised, so they could be scaled up/down, have nutritional information, etc. - about 180,000 of them, all UGC.

This required mountains of regexes for pre-processing, and then toolchains for a small army of interns to work through every. single. one. and normalise it - we did what we could, trying to pull out quantities and measures and ingredients and steps, but it was all such slop it took thousands of man-hours, and then many more to fix the messes the interns made.
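For a flavour of what that regex pre-processing looks like, here is a minimal sketch in Python. It is not the original toolchain: the unit list, `parse_line` and `scale` are all hypothetical names made up for illustration, and real user-submitted recipes would break it constantly, which is rather the point.

```python
import re
from fractions import Fraction

# Assumed, deliberately tiny unit list; real UGC needs far more variants.
UNITS = r"cups?|tbsp|tsp|kg|g|ml|l|oz|lb"

# Quantity ("1 1/2", "1/2", "1.5", "2"), optional unit, then the ingredient name.
LINE_RE = re.compile(
    rf"^\s*(?P<qty>\d+\s+\d+/\d+|\d+/\d+|\d+(?:\.\d+)?)\s*"
    rf"(?:(?P<unit>{UNITS})\b\.?\s+)?(?P<name>.+?)\s*$",
    re.IGNORECASE,
)

def parse_line(line):
    """Return {'qty', 'unit', 'name'} or None if the line does not fit the pattern."""
    m = LINE_RE.match(line)
    if m is None:
        return None  # falls through to manual (or LLM) review
    # "1 1/2" becomes Fraction(1) + Fraction(1, 2)
    qty = sum(Fraction(part) for part in m.group("qty").split())
    unit = m.group("unit").lower() if m.group("unit") else None
    return {"qty": qty, "unit": unit, "name": m.group("name")}

def scale(item, factor):
    """Scale a parsed ingredient by an integer or Fraction factor."""
    return {**item, "qty": item["qty"] * Fraction(factor)}
```

Lines such as "a pinch of salt" return `None` here, and every such miss is a line somebody has to handle by hand.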

With an LLM, it could have been done… more or less instantly.

And this is just one example of so, so many times that we found ourselves having to turn a heap of utter garbage into usable data, where an LLM would have been able to just do it.

Anyway. I at least managed to assuage my past torment by seeing the writing on the wall and stocking up on NVDA at about the time I was wrestling with this stuff.

2 comments

This gets to an essential point about LLMs - they are the ultimate intern. Anything you wouldn't ask an intern to do, you probably don't want to ask the LLM to do either. And you certainly want to at least spot check the results. But for army-of-interns problems like this one, they are revolutionary.
With the exception that an intern is (hopefully) going to learn from their mistakes and improve.
If you have a reviewed output dataset from an LLM, you could use it for RLHF.
The metadata from the music industry is crazy unstable: "Africa" by Toto is known to have an absurd number of unique listings, each with different metadata.

Music streaming providers need to sort that shit out and make sure you don't show the user duplicates. The music labels don't give a damn about normalizing the metadata.

LLMs can help classify this stuff a lot more easily, with minimal human review.
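As a rough sketch of the cheap first pass you would probably run before handing anything to an LLM or a human, the Python below buckets listings by a normalised artist/title key. The `norm` and `group_listings` helpers and the `artist`/`title` field names are assumptions for illustration, not any streaming provider's actual schema.

```python
import re
import unicodedata
from collections import defaultdict

def norm(text):
    """Lowercase, strip accents, drop bracketed suffixes, collapse punctuation."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    text = re.sub(r"[\(\[].*?[\)\]]", " ", text.lower())  # e.g. "(Remastered)"
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return text.strip()

def group_listings(listings):
    """Bucket listings that share a normalised (artist, title) key."""
    groups = defaultdict(list)
    for item in listings:
        key = (norm(item["artist"]), norm(item["title"]))
        groups[key].append(item)
    return dict(groups)

# Made-up example rows, not real catalogue data.
listings = [
    {"artist": "Toto", "title": "Africa"},
    {"artist": "TOTO", "title": "Africa (Remastered)"},
    {"artist": "Toto", "title": "Rosanna"},
]
groups = group_listings(listings)
```

Exact-key grouping like this only catches the easy duplicates; things like artist aliases are where the classification and human review would still be needed.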

If the streaming platforms cared strongly about this problem they could have addressed it already, so I'm not confident they'll use LLMs effectively to do it without making the problem (or at least edge cases) even worse somehow. I think it would take a different business goal driving their algorithms to, for example, stop playing MF DOOM for 8 songs in a row under different aliases.