Hacker News new | ask | show | jobs
by patrickg_zill 2950 days ago
I am pretty sure that PG has had clustered indexes for a decade or more... ? e.g. https://www.postgresql.org/docs/9.1/static/sql-cluster.html

Or is this term referring to a different feature/method than this?

One thing not mentioned: PL/SQL vs. whatever the MySQL equivalent is.

3 comments

You're correct that PostgreSQL has had clustered indexes for quite a long time. The only difference here is that PostgreSQL allows a HEAP table. That said, you do have to re-cluster a PostgreSQL and that requires an exclusive lock on the table, which is obviously not idea for a massive table. MySQL will always cluster on the PRIMARY KEY or (if there isn't one) the first UNIQUE key[0], but as far as I can tell it always clusters on write. I don't see any way to configure the padding of the clustered index like you can on SQL Server, so I'm not sure how this is accomplished.

MySQL only supports SQL in procedures[1]. There is no PL/pgSQL[2] equivalent, and except for custom UDFs written in C/C++, there's no support for external procedures, either. PostgreSQL[3] supports PL/pgSQL, PL/Python, PL/Tcl, and PL/Perl in base, plus there's external modules for PL/Java, PL/Lua, PL/R, PL/sh, and PL/v8.

[0]: https://dev.mysql.com/doc/refman/5.7/en/innodb-index-types.h...

[1]: https://dev.mysql.com/doc/refman/8.0/en/adding-functions.htm...

[2]: https://www.postgresql.org/docs/current/static/plpgsql-overv...

[3]: https://www.postgresql.org/docs/current/static/xplang.html

This is not the same thing - it is rearranging an existing index for efficiency as a one-off process. It needs to be repeated when the data is substantially changed.

With a true clustered index the clustering property is as far as possible (it can get somewhat fragmented in the presence of random data) maintained during normal operation without the need for a full rebuild every now and then to keep the benefits for new data.

> When a table is being clustered, an ACCESS EXCLUSIVE lock is acquired on it. This prevents any other database operations (both reads and writes) from operating on the table until the CLUSTER is finished.

This makes that operation very nasty. For a large amount of data you are looking at locking your applications out of the database for some time, and the delay is relative to the total data size in the table being acted upon, not the amount of data that has recently arrived or changed.

I think it's different. From the docs, this is a one time update to the table. The table itself is not clustered, it's just ordered based on a clustered index, so selects by default should come in that order. But once an insert is done, it's again ouy of order.

Also, an order by using the cluster field most probably invokes an index scan, while a clustered table doesn't.