| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by JVerstry 3015 days ago
	I wouldn't be so sure. If data is partitioned on stock symbols, then there is already a grouping by subset happening at the partition level, which may considerably ease the job of aggregation (it depends on the number of symbols and it seems you have about 1500-1600). Considering there is a clustered index, it means values are saved "in order". Then, a page could easily have a min value = max value situation, and the saved count in meta-data would be valid (no need to scan all values in the page again). Should the index use a dictionary, then one only has to count the number of items in the corresponding entry without scanning the rows. Such information is stored in the index, not in the rows. I have no doubt about one's intention. It's just that I think we could do better to support the claim. If your where clause contained something like (shares mod 3) = 0, then I would be pretty sure all rows would be scanned, because such information is not aggregated at the page level. If possible, I would also check the execution plan for any incongruent values.

1 comments

nikita 3015 days ago

That’s right: partitioning on stock symbol allows to push down LIMIT 10 to each partition, however in this case with 1500 stock symbols it doesn’t buy us much. It’s actually possible to compute a full group by (without a limit) by on every partition and merge them at the end. Merging 1500 groups is computationally trivial.

Yes, shared mod 3 or other predicate would make it impossilbe to run this query in O(metadata). It would of course burn more instructions so we would have to have a bigger cluster to hit a trillion a second as well as have a complex explanation in the blog post why this matters.