| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by solaxun 1871 days ago
	I understand your point generally but I don't understand this example. If you need data from 15 tables, you need to do those joins, regardless of prefiltering or subgrouping, right?

2 comments

tharkun__ 1870 days ago

Disclaimer: It's been a while, but my very first job out of uni was dealing with large queries like that for reporting and import/export out of our OLTP database directly.

15 tables would've been not very common for us, but something between 5 and 10 was normal. Many of those tables would've had millions upon millions of rows (while some were simple 'key tables' with only hundreds or a few thousand records)

The one thing that I learned really fast is that yes, "prefiltering/subgrouping" as the OP calls it, is very important. If you cut down on the number of rows that your query "starts out with" was very very important, as it cuts down on the amount of data that needs to be dealt with in the rest of the query. This was for an old Sybase ASE based system. IIRC, ASE would only be able to automatically optimize this across 4 clauses (might misremember the number and I left before they upgraded to the new newer version with a better optimizer), so ordering of your join clauses was important. If the filtering that cut down on the amount of data needed from other tables came first, your query would run way faster, than if the filters were way down with the rest of the joins.

Just think about it, if you start out with getting 6 million patient records and then start collecting 6 million records from the next table for a join and so on and so forth, that's way more data that needs to be read and churned through than if you can 'start on the other end' so to speak, whittle it down to say 50000 records that now need to be looked up in the patient table.

link

protomyth 1871 days ago

Well, yes but. Why do you need 15 tables for analysis queries and why isn't someone rolling those tables into something a bit easier with some backend process.

link

tomrod 1871 days ago

> why isn't someone rolling those tables into something a bit easier with some backend process.

You'd identified why we're all going to have jobs in 100 years.

Automation sounds great. It's exponentially augmenting to some users in a defined user space.

Until you get to someone like me, who looks at the production structure and goes: "this is wholly insufficient for what I need to build, but it has good bones, so I'm going to strip it out and rebuild it for my use case, and only my use case, because to wait for a team to prioritize it according to an arcane schedule will push my team deadlines exceedingly far."

This is why you don't roll everything into backend processes. Companies set up for production (high automation value ROI) and analytics (high labor value ROI) and has a hard time serving the mid tail. EVERYTHING on either direction works against the mid-tail -- security policies, data access policies, software approvals, you name it.

People, policy, and technology. These are the three pillars. If your org isn't performing the way it should, then by golly work at one of these and remember that technology is only one of three.

link

punnerud 1870 days ago

Tree pillars where you only can choose two to be perfect. Just like databases and CAP theorem

link

tomrod 1870 days ago

Technology is the easiest to adjust, ironically.

link

alextheparrot 1871 days ago

This is where I think the original data admins were deluding themselves. Expecting 1,400 analysts to write better code is a really non-trivial problem, but easy to proclaim.

An actual solution is creating pre-joined tables and having processes ("Hey your query took forever, have you considered using X?") or connectors (".getPrejoinedPatientTable()") that make sure those tables are being used in practice.

link

steve-chavez 1871 days ago

> why isn't someone rolling those tables into something a bit easier with some backend process.

To put it in more concrete terms(plain SQL): the tables could be aggregated on a set returning function or a view.

link

cgh 1870 days ago

Yes, in these situations materialized views with indexes are generally the correct answer.

link

CRConrad 1870 days ago

May I introduce: Protomyth, meet DW and ETL.

link