HN Mirror


by mbb70 192 days ago

The bigness of your data has always depended on what you are doing with it.

Consider the following table of medical surgeries: date, physician_name, surgery_name, success.

"What are the top 10 most common surgeries?" - easy in bash

"Who are the top physicians (% success) in the last year for those surgeries?" - still easy in bash

"Which surgeries are most affected by physician experience?" - very hard in bash. It requires calculating, for every surgery, how many times that physician had already performed that surgery as of that day, then comparing low-experience and high-experience outcomes.
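To make the third question concrete, here is a rough Python sketch of the running-count idea. It is only an illustration under assumptions: the function names `experience_outcomes` and `experience_gap` and the `threshold` cutoff are made up, dates are assumed to be ISO strings (so they sort correctly as text), and `success` is assumed to be 0 or 1.

```python
from collections import defaultdict


def experience_outcomes(rows, threshold=2):
    """Split each surgery's outcomes into low- and high-experience buckets.

    rows: iterable of dicts with keys date, physician_name, surgery_name, success.
    threshold (hypothetical cutoff): a physician counts as "high" experience for a
    surgery once they have already performed it at least `threshold` times.
    """
    # (physician, surgery) -> how many times performed before the current row
    prior = defaultdict(int)
    # surgery -> bucket -> [successes, total]
    tally = defaultdict(lambda: {"low": [0, 0], "high": [0, 0]})

    # Process rows chronologically; same-day rows keep their input order.
    for row in sorted(rows, key=lambda r: r["date"]):
        key = (row["physician_name"], row["surgery_name"])
        bucket = "high" if prior[key] >= threshold else "low"
        stats = tally[row["surgery_name"]][bucket]
        stats[0] += int(row["success"])
        stats[1] += 1
        prior[key] += 1
    return tally


def experience_gap(tally):
    """Return surgery -> (high success rate - low success rate)."""
    gaps = {}
    for surgery, buckets in tally.items():
        low_ok, low_n = buckets["low"]
        high_ok, high_n = buckets["high"]
        if low_n and high_n:  # skip surgeries missing one of the buckets
            gaps[surgery] = high_ok / high_n - low_ok / low_n
    return gaps
```

Sorting the result of `experience_gap` by value would rank surgeries by how much experience seems to matter. The point is that this needs per-(physician, surgery) state carried across the whole sorted file, which is what makes it awkward in a bash pipeline.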

A researcher might see a smooth continuum of increasingly complex questions, but there are huge jumps in computational complexity between them. A 50GB dataset might be 'bigger' than a 2TB one if you are asking tough questions.

It's easier for a business to say "we use Spark for data processing" than "we build bespoke processing engines on a case by case basis".

2 comments

christophilus 191 days ago

50GB and 2TB are both sizes that SQLite supports and could handle. You could probably solve all of the problems you mentioned with simple tools on a single server, in the language of your choice.
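As a hedged illustration of the single-server approach, here is a small sketch using Python's built-in `sqlite3` module. The table name `surgeries`, the helper names, and the `threshold` parameter are assumptions for the example, and the window function requires SQLite 3.25 or newer. It has not been tried at 50GB or 2TB scale; it only shows that the "experience" question is expressible in one query.

```python
import sqlite3

# For each row, count earlier rows by the same physician for the same surgery,
# then compare average success above and below a (hypothetical) threshold.
QUERY = """
WITH ranked AS (
    SELECT surgery_name,
           success,
           COUNT(*) OVER (
               PARTITION BY physician_name, surgery_name
               ORDER BY date
               ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
           ) AS prior_count
    FROM surgeries
)
SELECT surgery_name,
       AVG(CASE WHEN prior_count >= :threshold THEN success END) AS high_rate,
       AVG(CASE WHEN prior_count <  :threshold THEN success END) AS low_rate
FROM ranked
GROUP BY surgery_name
ORDER BY surgery_name
"""


def success_by_experience(conn, threshold=2):
    """Return (surgery_name, high_rate, low_rate) tuples."""
    return conn.execute(QUERY, {"threshold": threshold}).fetchall()


def demo_connection():
    """Build a tiny in-memory table with made-up rows, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE surgeries ("
        "date TEXT, physician_name TEXT, surgery_name TEXT, success INTEGER)"
    )
    conn.executemany(
        "INSERT INTO surgeries VALUES (?, ?, ?, ?)",
        [
            ("2024-01-01", "A", "appendectomy", 0),
            ("2024-01-02", "A", "appendectomy", 0),
            ("2024-01-03", "A", "appendectomy", 1),
            ("2024-01-04", "A", "appendectomy", 1),
        ],
    )
    return conn
```

Calling `success_by_experience(demo_connection())` runs the query against the toy table; on a real file you would point `sqlite3.connect` at the database path instead.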


esafak 191 days ago

Sounds like a good fit for DuckDB.
