Hacker News
by beernet 1169 days ago
Are there any well-known companies that focus on collecting and providing data sets? I'm not talking about something like HuggingFace, but rather highly curated, high-quality data with guaranteed ownership rights (the hard part!), where usage is paid for. I feel like there is a rapidly growing market for that, with the advent of large foundation models and with regulators stepping in, particularly on data privacy.
1 comments

I think Import.io, Bright Data, and Zyte would fall into that category.
These are web scraping services. Data ownership is a grey zone at best, depending on which country you're in. Besides, copyrighted data might be scraped by accident. What I am proposing is much stronger than that: more like an "audited" dataset that comes with guarantees because its curation can be fully traced back.
I'm guessing you can pay one of these companies to meet these types of requirements.

I know that highly-regulated financial institutions that purchase web-scraped data have very strict rules about the data they buy.

The US has an organization dedicated to this: https://www.investmentdata.org