HN Mirror

by _a_a_a_ 1307 days ago

aaand here we go again.

DB guy with 25+ years experience. Summary: it depends.

> joins are never cheap

It depends: on table size, on indexes, and on how expensive the alternative is. Always!

> tables with billions of rows crossed with millions of rows just to find a single row with data is not something i would call cheap

indexes

> more often than not it is better to avoid joining large tables if you can live with duplicate data

1E9 x 1E6 = 1E15 (at worst anyway). A join via an index will save you colossal amounts of IO (though as ever, it depends).
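A rough sketch of the index point, assuming SQLite via Python's `sqlite3` module and a made-up `orders`/`customers` schema (none of these names come from the thread). It asks the engine for its plan when one row is fetched through a join; with keys on both sides you would expect two index lookups rather than anything resembling a billion-by-million scan. The exact plan wording varies between SQLite versions.

```python
import sqlite3

# Hypothetical schema: table and column names are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        amount INTEGER
    );
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(i, f"customer {i}") for i in range(1, 101)])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(i, (i % 100) + 1, i * 10) for i in range(1, 10_001)])

query = """
    SELECT o.id, c.name, o.amount
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    WHERE o.id = ?
"""

# The fourth column of each EXPLAIN QUERY PLAN row is the readable detail text.
plan = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + query, (4242,))]
result = conn.execute(query, (4242,)).fetchall()

for step in plan:
    print(step)   # each step should be a SEARCH (keyed lookup), not a SCAN
print(result)
```

Whether the same holds at a billion rows depends on the engine, the statistics and the hardware, so measure on real data before believing a toy.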

Problem here isn't this mostly clueless advice (discarding/archiving unnecessary data is the only good idea here, and it's not used as often as it should be). Problem is strong opinions put forth by someone who doesn't have the necessary experience, or understanding of what's going on under the hood. Denormalising is a useful tool that IME rarely gains you more than it loses you, but this 'advice' is just going to lead people down the wrong alley, and I'm tired of suchlike n00b advice strongly (and incorrectly and arrogantly) expressed on HN.

(edited to fix maths error)

2 comments

Akronymus 1307 days ago

There's also the possibility of filtering each source table first, then doing an inner join. Which can VASTLY cut down on computation. I assume GP assumed doing an outer join first, then filtering.

But those are details for the database engine to handle. And, as you said, indexes.
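To make the filter-then-join point concrete, here is a toy sketch in plain Python with invented data and a deliberately naive nested-loop join (a real engine would be far smarter). It counts how many row pairs get examined when filtering happens after the join versus before it.

```python
# Made-up data: 1,000 "orders" and 100 "customers"; all names are illustrative.
orders = [{"id": i, "customer_id": i % 100, "amount": i} for i in range(1000)]
customers = [{"id": i, "country": "NZ" if i % 10 == 0 else "AU"}
             for i in range(100)]


def nested_loop_join(left, right):
    """Naive inner join on customer_id; returns matches and pairs examined."""
    pairs_examined = 0
    matches = []
    for o in left:
        for c in right:
            pairs_examined += 1
            if o["customer_id"] == c["id"]:
                matches.append((o, c))
    return matches, pairs_examined


# Join everything first, then filter the joined rows.
joined, late_pairs = nested_loop_join(orders, customers)
late = [(o["id"], c["id"]) for o, c in joined
        if o["amount"] >= 990 and c["country"] == "NZ"]

# Filter each side first, then join only the survivors.
big_orders = [o for o in orders if o["amount"] >= 990]
nz_customers = [c for c in customers if c["country"] == "NZ"]
joined, early_pairs = nested_loop_join(big_orders, nz_customers)
early = [(o["id"], c["id"]) for o, c in joined]

print(late == early, late_pairs, early_pairs)  # True 100000 100
```

Same answer either way; the only difference is how much work was done to get there.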

_a_a_a_ 1307 days ago

FYI for others, such filtering is called predicate pushdown (I believe also called predicate hoisting sometimes). Example (and this is trivial but for illustration)

   select * from (select * from tbl) as subqry where subqry.col = 25

would be rewritten by any halfway decent optimiser to

   select * from (select * from tbl where tbl.col = 25) as subqry

(and FTR the outermost select * would be stripped off as well).

Good DB optimisers do a whole load of that and much more.
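For anyone who wants to poke at this, a small sketch assuming SQLite through Python's `sqlite3` module. The table mirrors the `tbl` above, but the index name, the `payload` column and the data are invented. It runs both forms of the query, checks they return the same rows, and prints each plan so you can see whether the filter reached the index.

```python
import sqlite3

# Hypothetical table mirroring `tbl`; the index and payload column are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INTEGER PRIMARY KEY, col INTEGER, payload TEXT)")
conn.execute("CREATE INDEX idx_tbl_col ON tbl (col)")
conn.executemany("INSERT INTO tbl (col, payload) VALUES (?, ?)",
                 [(i % 50, f"row {i}") for i in range(1000)])

outer_filter = "SELECT * FROM (SELECT * FROM tbl) AS subqry WHERE subqry.col = 25"
pushed_down = "SELECT * FROM (SELECT * FROM tbl WHERE tbl.col = 25) AS subqry"


def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is the readable detail text.
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


rows_outer = sorted(conn.execute(outer_filter).fetchall())
rows_pushed = sorted(conn.execute(pushed_down).fetchall())

print(rows_outer == rows_pushed, len(rows_outer))
print(plan(outer_filter))
print(plan(pushed_down))
```

If the optimiser is doing its job, both plans should mention `idx_tbl_col`, meaning the predicate was applied at the index instead of after reading the whole table. Other engines expose the same information through their own EXPLAIN output.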

Akronymus 1307 days ago

Yeah, I had to get quite well acquainted with query execution plans and the like a few years ago (and have forgotten most of it by now) because of diagnosing a SLOW query.

Joining onto either table a or table b is something that REALLY trips optimizers up.

jteppinette 1307 days ago

Wow, this comment comes across as being incredibly arrogant while providing zero value. nOOb lol

_a_a_a_ 1306 days ago

I thought I was being informative. I can't give hard&fast rules because (drumroll)... it depends. So I have tradeoffs to consider, and indexes got mentioned.

How else could I have posted better? Honest question.

jteppinette 1306 days ago

Because you didn’t actually refute anything the GP said, and gave bad advice, all while being incredibly negative and arrogant.

> this mostly clueless advice

> strong opinions put forth by someone who doesn't have the necessary experience, or understanding of what's going on under the hood

> I'm tired of suchlike n00b advice strongly (and incorrectly and arrogantly) expressed on HN

You continue to just say it depends without giving any actual scenarios. You make it sound like magic, but it’s not: “under x and y, do z except when u” is better than “it depends, I’m sick of all these noobs”.

Also, your main points are against denormalization and avoiding large table joins which are 100% rational arguments under certain workloads.

_a_a_a_ 1306 days ago

I refuted what he said by pointing out that 1E9 x 1E6 = 1E15. A billion row table denormalised with a million row table = 1000 trillion row table. How big's your disk array? How are you going to ensure correctness on update?

His was stupid advice and it should not have been given.

> You continue to just say it depends without giving any actual scenarios

It depends. Using your common sense and then a stopwatch is a good start. There are entire shelves of books on this, I won't repeat them.

> You make it sound like magic, but it’s not:

absolutely true!

> “under x and y, do z except when u” is better than

It's a multidimensional problem, including memory size, disk size, the optimiser, sizes of the particular tables joined, where the hotspot is, cost of updates to non-normalised tables, etc. I can't give general advice from here.

> Also, your main points are against denormalization and avoiding large table joins which are 100% rational arguments under certain workloads.

I said "Denormalising is a useful tool that IME rarely gains you more than it loses you,"

I don't accept your criticism.

jteppinette 1306 days ago

That’s not what denormalize means, how long have you been doing this again?

_a_a_a_ 1306 days ago

True, you normalise/denormalise data not tables as such; tables pop out of a normalisation process and denormalisation collapses them together. Perhaps if I'm still wrong you could put me right. And don't just point at the wiki article on it, please be specific.

To your question, probably longer than you but I've always more to learn.
