


	by bhou 2526 days ago
	Ananas has been tested in production, processing terabytes of data on a daily basis (with Google Dataflow, but you can achieve the same thing with your own Spark cluster too). In terms of exploring large source files, the design principle is to paginate any kind of data that supports random access to records (for example CSV, logs, etc.). So when "exploring the data" of a CSV with 6M rows, Ananas will not load all 6M rows at once, but will read a few rows at a time for each page. For example, this early demo video shows a 755M CSV file being explored in seconds: https://www.youtube.com/watch?v=GwqZlhmei78&t=01m00s
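
To make the pagination idea concrete, here is a rough sketch in Python. It is not Ananas's actual code; the function names `build_page_index` and `read_page` are made up for illustration, and it assumes a UTF-8 CSV with a single header line and no newlines embedded inside quoted fields. One scan records the byte offset where each page starts, and after that any page can be fetched with a seek instead of reading the whole file.

```python
import csv
import io


def build_page_index(path, page_size):
    """Scan the file once and record the byte offset where each page starts.

    Hypothetical helper: assumes one record per physical line.
    """
    offsets = []
    with open(path, "rb") as f:
        f.readline()  # skip the header line
        row = 0
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:
                break
            if row % page_size == 0:
                offsets.append(pos)
            row += 1
    return offsets


def read_page(path, offsets, page, page_size):
    """Seek straight to a zero-based page and parse only that page's rows."""
    with open(path, "rb") as f:
        f.seek(offsets[page])
        # readline() returns b"" at end of file, so a short last page is fine.
        raw = b"".join(f.readline() for _ in range(page_size))
    return list(csv.reader(io.StringIO(raw.decode("utf-8"), newline="")))
```

With the index built once, `read_page(path, offsets, 2, 100)` would only touch the bytes for rows 200 to 299, so memory use stays proportional to the page size rather than the file size.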