by moffkalast 641 days ago
https://huggingface.co/datasets/HuggingFaceFW/fineweb

The #1 problem is absolutely compute. People barely get funding for fine-tunes, and even if you buy the GPUs outright, you still pay for the power they draw.
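To put a rough number on the power bill, here's a back-of-the-envelope sketch. Every figure in it (GPU count, wattage, run length, electricity price) is a made-up placeholder, not a measurement, and `power_cost` is just an illustrative helper; plug in your own values.

```python
def power_cost(num_gpus, watts_per_gpu, hours, price_per_kwh, overhead=1.0):
    """Estimate the electricity cost of running GPUs at a steady draw.

    overhead is a multiplier for cooling and the rest of the machine
    (1.0 means GPUs only).
    """
    kwh = num_gpus * watts_per_gpu * overhead * hours / 1000.0
    return kwh * price_per_kwh


# Hypothetical inputs: 8 GPUs at 400 W each, running for 30 days,
# at an assumed price of 0.15 per kWh.
cost = power_cost(num_gpus=8, watts_per_gpu=400, hours=24 * 30, price_per_kwh=0.15)
print(f"{cost:.2f}")
```

That covers electricity only, and it ignores the hardware itself, cooling (unless you raise `overhead`), and the fact that real draw varies with load.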

That said, good data is definitely the #2 problem. But nowadays you can get good synthetic datasets by calling closed model APIs, or use existing local LLMs to sift through the trash. That'll cost you too, whether in API fees or in compute.
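For the "sift through trash" part, the shape of it is roughly this: score every document, keep what clears a threshold. This is only a sketch. `filter_by_quality` and `toy_score` are hypothetical names, and `toy_score` is a crude stand-in (ratio of unique words) where a real pipeline would prompt an LLM for a quality rating.

```python
from typing import Callable, Iterable, Iterator


def filter_by_quality(
    docs: Iterable[str],
    score: Callable[[str], float],
    threshold: float = 0.5,
) -> Iterator[str]:
    """Yield only the documents whose score meets the threshold."""
    for doc in docs:
        if score(doc) >= threshold:
            yield doc


def toy_score(text: str) -> float:
    """Placeholder scorer: fraction of unique words in the text.

    Assumption for illustration only. In practice this is where you
    would call a local model or an API and parse a rating from it.
    """
    words = text.split()
    if not words:
        return 0.0
    return len(set(words)) / len(words)


docs = [
    "buy buy buy buy",
    "The model was trained on filtered web text",
]
kept = list(filter_by_quality(docs, toy_score, threshold=0.5))
```

The scorer is where the cost goes: it runs once per document, so with an LLM behind it the bill scales with the size of the raw dump.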