Hacker News new | ask | show | jobs
by collyw 3597 days ago
As long as the database isn't relational, I guess.
2 comments

10 million rows in a relational database doesn't need to be bad nor is it big data.

Rows is a bad measure of "big" when it comes to data. A measurement of bytes and probably more specifically bytes per field and how many fields the records have, as this gives a better indication into the way this will be written and potentially searched.

10 million rows of 5 integer values is pittance for any relational database worth using in production. 10 million rows of 250 text columns would be horrendous for a relational database.

Someone once suggested to me that 'big data' begins when it doesn't fit in RAM in a single rack any more.
Yup, that's essentially what looking that byte-size means. However, just because it doesn't fit in memory might not make it big data if it's just poorly engineered.

But many times this happens because of wasted or bloated indexes that aren't useful. Or it happens when data types are picked incorrectly.

For example, I once worked on a database where the original developer used Decimal(23, 0) as a primary key. This was on MySql and that ended up taking up 11 bytes per row, versus a Long which would have just been 8. In one table, maybe not so bad but when you start putting those primary keys into foreign key relationships... we ended up with a 1 billion row table in MySql that had 4 of these columns in it. That might make it "big data" by that definition but it's also just bad design.

Another example in that same database was using text fields in mysql for storing JSON. Since text fields in mysql are stored as separate files, this meant that every table that had one (and we had several tables that housed multiple) ran in large IO and disk access issues.

"big" data is probably a bad term to use these days because of easy it is to accidentally create a large volume of data but not need a big data solution outside of the fact that it's not the business that needs it, it's the poorly implemented system that does.

But the real reason we talk about fitting in memory comes from the core of the issue: IO. Even a super large memory set could end up being slow if it's postgres and single threaded reader that's scanning a 500 GB index. AWS offers up to 60 GB/s memory bandwidth and we'd need it for this index, since that would still take almost 10 seconds to warm up the indexes in the first place.

>Since text fields in mysql are stored as separate files

Bwuh? Over in MS SQL you just go for an NVARCHAR and forget about it. What is the right way to store this data (if you really do need to store the JSON rather than just serializing it again when you get it out of the DB)

varchar is different than a text field in mysql: http://dev.mysql.com/doc/refman/5.7/en/blob.html

It stores text fields as blobs.

I suppose now the right way would be the json data type. It didn't exist when I was working with these servers though (or they were on a much older version of MySql) https://dev.mysql.com/doc/refman/5.7/en/json.html

That's soon going to be on the order of 100 terabytes, so there will be only a handful of companies doing big data ;-)
I'm only aware of servers up to 12TB. Care to elaborate?
He/she said a whole rack of servers. I actually took 30 servers of 2TB each and rounded up to 100. With 12TB per server it will already be over that.
10M rows in a relational database is a very low number (depending on the size of the row of course).
I know, it was a sarcastic follow up to the "now they have big data" part of the original comment.

"SQL doesn't scale". It needs to be in Mongo or whatever NoSQl database is in right now. I have heard all sorts of nonsense regarding "big data" in the last few years.

ahaha, i didn't read the sarcasm that time, sorry for replying with tmi