by koolba 3597 days ago
> But sometimes it's also very easy to please people. Big data: just insert 10M records in a database and suddenly everyone is happy because they now have big data :|

Since when is 10M records considered big data?

My go-to gauge for big data is that it can't fit in memory on a single machine. And since that means multiple TB[1] these days, most people don't really have big data.

[1]: Heck, you can even rent ~2TB for $14/hour! https://aws.amazon.com/ec2/instance-types/x1/
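
To put rough numbers on that gauge, here is a minimal sketch. The 1 KB average record size is purely an assumption for illustration, and `fits_in_memory` is a hypothetical helper, not anything from a library. It only compares raw data size against the RAM of one machine and ignores indexes and working-set overhead.

```python
TB = 10**12  # decimal terabyte, in bytes


def fits_in_memory(record_count, bytes_per_record, ram_bytes):
    """Return True if the raw data would fit in RAM on a single machine."""
    return record_count * bytes_per_record <= ram_bytes


# Assumed figures: 10M records at ~1 KB each, on a machine with ~2 TB of RAM.
records = 10_000_000
bytes_per_record = 1_000  # assumption, not from the thread
raw_bytes = records * bytes_per_record  # 10 GB of raw data

print(raw_bytes / TB)  # fraction of a terabyte the data occupies
print(fits_in_memory(records, bytes_per_record, 2 * TB))
```

Under those assumptions, 10M records come to about 10 GB, a small fraction of a ~2TB machine.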

7 comments

I get your point, but 10M records is big data depending on what you're doing with it. Not big on disk, but extremely unwieldy depending on how it's structured and how you need to query/manipulate it. I led internal product engineering at a large multinational for a long time, and we accrued so much technical debt as a result of having to handle the stupidest of edge cases, where queries against just a few million (or even thousands) of records took multiple seconds (in the worst cases, we had to schedule job execution because they took minutes) because of ludicrous joins spanning hundreds of tables and the imposition of convoluted business logic.

Most of that comes down to poor architecture overall, and most companies don't hire particularly good developers or DBAs (and most web developers aren't actually very good at manipulating data, relational or not), but it's the state of the union. That's "enterprise IT". That's why consultancies make billions fighting fires and fixing things that shouldn't be problems in the first place.

I think that is why he had the :| face at the end.
Oh haha. I thought that was a typo!
> big data is that it can't fit in memory on a single machine

A Lucene index can be much larger than your current RAM. It can be 100x that. The data will still be queryable. Lucene reads into memory only the data it needs in order to produce a sane result. Lucene is pretty close to being the industry standard for information retrieval.
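
The general idea of querying data that lives on disk without loading all of it can be sketched with a memory-mapped file. This is not Lucene's API or its index format. The fixed-width record layout and the helper names `write_records` and `read_record` are assumptions made up for the example.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical layout: each record is one little-endian 8-byte integer.
RECORD = struct.Struct("<q")


def write_records(path, values):
    """Write the values to disk as fixed-width binary records."""
    with open(path, "wb") as f:
        for v in values:
            f.write(RECORD.pack(v))


def read_record(path, index):
    """Fetch one record by position without reading the whole file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Only the part of the file touched here needs to be paged in.
            return RECORD.unpack_from(mm, index * RECORD.size)[0]


path = os.path.join(tempfile.mkdtemp(), "records.bin")
write_records(path, range(1000))
value = read_record(path, 123)
print(value)
```

The same lookup works whether the file is a few KB or far larger than RAM, since the operating system pages in only what the read touches.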

My definition is instead "when your data is not queryable using standard measures".

I literally heard that "Big Data is something that is too large to fit into an Excel spreadsheet". The speaker was serious.

I unsubscribed from that (non-tech) podcast.

I would have said when a single file is bigger than the maximum size of a disk, so say 4TB. It's why we used Map Reduce back in the 80's at British Telecom for billing systems: the combined logs would have been too big to fit on a single disk.
I'd say that if it can't fit in RAM but can still fit on a single SSD, it doesn't count as big data either.
Not sure how accurate that is, since you can buy 60TB SSDs these days.
ergo, not big data.
Yeah that's everybody's gauge if they actually work with it, which was the point.