| HN Mirror



	by sqrt17 1653 days ago
	It takes substantial effort to build a good dataset, and proportionally more as it gets bigger, and people like big datasets because you can train more powerful models on them. So I am not surprised that people tend to gravitate towards datasets made by well-funded institutions. The alternative is either a small dataset that people heavily overfit (e.g. the MUC6 corpus, which was heavily used for coreference at a time when people cared more about getting high numbers than useful results) or something like the Universal Dependencies corpus, which is provided by a large consortium of smaller institutions.