| The bigness of your data has always depended on what you are doing with it. Consider the following table of medical surgeries: date, physician_name, surgery_name, success. "What are the top 10 most common surgeries?" - easy in bash. "Who are the top physicians (% success) in the last year for those surgeries?" - still easy in bash. "Which surgeries are most affected by physician experience?" - very hard in bash: it requires calculating, for every surgery, how many times that physician had performed that surgery up to that day, then comparing low-experience and high-experience outcomes. A researcher might see a smooth continuum of increasingly complex questions, but there are huge jumps in computational complexity. A 50 GB dataset might be 'bigger' than a 2 TB one if you are asking tough questions. It's easier for a business to say "we use Spark for data processing" than "we build bespoke processing engines on a case-by-case basis". |
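
To make the jump in difficulty concrete, here is a rough Python sketch of the third question. It is only an illustration under several assumptions that are not in the comment above: dates are ISO strings (so they sort chronologically as text), success is stored as 0 or 1, and "experienced" means at least `threshold` prior performances of the same surgery by the same physician. The function name, the threshold, and the sample rows are all made up for the example.

```python
import csv
import io
from collections import defaultdict


def experience_effect(rows, threshold=2):
    """Return {surgery_name: (low_experience_rate, high_experience_rate)}.

    Hypothetical helper: `threshold` is an arbitrary cutoff for "experienced".
    A rate is None when no surgeries fell into that bucket.
    """
    # Sort by date so the running count only reflects earlier surgeries.
    # Assumes ISO dates; same-day surgeries keep their input order.
    ordered = sorted(rows, key=lambda r: r["date"])

    prior = defaultdict(int)  # (physician, surgery) -> times performed so far
    # surgery -> bucket -> [successes, total]
    tallies = defaultdict(lambda: {"low": [0, 0], "high": [0, 0]})

    for r in ordered:
        key = (r["physician_name"], r["surgery_name"])
        bucket = "high" if prior[key] >= threshold else "low"
        tally = tallies[r["surgery_name"]][bucket]
        tally[0] += int(r["success"])
        tally[1] += 1
        prior[key] += 1  # this surgery counts as experience for later ones

    def rate(tally):
        return tally[0] / tally[1] if tally[1] else None

    return {s: (rate(b["low"]), rate(b["high"])) for s, b in tallies.items()}


# Made-up sample data, for illustration only.
SAMPLE = """date,physician_name,surgery_name,success
2023-01-01,Dr A,appendectomy,0
2023-01-02,Dr A,appendectomy,0
2023-01-03,Dr A,appendectomy,1
2023-01-04,Dr A,appendectomy,1
2023-01-05,Dr B,appendectomy,1
"""

sample_rows = list(csv.DictReader(io.StringIO(SAMPLE)))
effect = experience_effect(sample_rows, threshold=2)
```

The sketch depends on a global sort by date plus a running counter per (physician, surgery) pair. That is trivial when everything fits in memory on one machine, and it is the part that gets awkward once the data has to be split across workers, which is why this question is "bigger" than the first two even on the same table.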