Hacker News new | ask | show | jobs
by quizotic 3125 days ago
The prescription of changing sequential_page_cost to equal random_page_cost is certainly reasonable for SSD, but I wonder if the underlying issues aren't somewhat deeper and more interesting. One difference between a sequential scan and an index scan is the amount of data being scanned. PostgreSQL stores information horizontally as rows and a sequential scan will have to read in all column values of all rows. An index scan will read through all values of a _single_ column. The 50x performance difference _might_ be just that the whole row is 50x wider than the width of the indexed join column.

An interesting second factor relates to the nature of the SSD storage. With SSDs a read request will pull back a 4K page, even if the read request was smaller. So it's not quite right to say that a sequential read and a random read cost the same on SSD, particularly if the same 4K page must be read multiple times. I suspect that the particular index technique used by PostgreSQL tends to organize data such that successive indexed values reside in the same 4K SSD page. IOW, it's not so much that the cost of random SSD access is the same as sequential SSD access (though that's true), as it is that the PostgreSQL index mechanism doesn't require multiple reads of the same 4K page.

if a Hash-based index was used instead of a Btree-based index, and if the table width was narrower, the sequential scan might have outperformed the index scan.

3 comments

A sequential scan could also be faster if the join selectivity is poor. As an extreme example say if every id in event_type appeared in prop_keys for a given app. So scanning the index repeatedly would be a waste of time since you would have to scan the table anyway.
Which is the point of stats, right? To know when it's better to use an index vs just read the whole table because you need 50%+ of it anyway.
PostgreSQL uses 8K pages, so it won't read less than that. IIRC, it also has its own cache of recently read pages, so it won't have to read the same page over and over.
You can rebuild it with smaller pages, including 4kB, which may be beneficial for various reasons. The packages however stick to 8kB.

And yes, the database has it's own cache (aka shared buffers), on top of page cache (filesystem cache).

You're saying that during sequential scans, the time it takes per row is O(n) where n is the number of columns? I find that hard to believe. Can anyone confirm / deny this?
The number of columns in PostgreSQL is limited to 250-1600 [1] since a tuple (a row) can't span more than one page of memory. Since O-notation talks about asymptotic behavior, it doesn't really apply here.

But yes, tables with more columns normally take more time to scan sequentially. The complete tuple is always loaded (excluding the data of TOAST [2] attributes), there is no way to only load one column. This is one of the reasons that column-oriented databases can be faster than row-oriented databases [3].

[1] https://www.postgresql.org/message-id/42C3C382.5020108@cinec... [2] https://www.postgresql.org/docs/9.5/static/storage-toast.htm... [3] https://en.wikipedia.org/wiki/Column-oriented_DBMS

The time complexity for sequential scan takes O(r), because every row must be read once. Of course rows are made up of columns so you can also specify it as O(rc) where c is the average column length.