Hacker News new | ask | show | jobs
by egeozcan 3478 days ago
If you also didn't know what exactly partitioned tables are, here's a nice introduction from Microsoft:

https://technet.microsoft.com/en-us/library/ms190787(v=sql.1...

It is for the SQL server but I assume it would be mostly relevant. Please correct me if I'm wrong.

1 comments

So this is all about partitioning data into different storage files on the same server? What is the main benefit of that?
If you combine the partitions with tablespaces, you can put tables on multiple disks. Let's say you keep a record of all orders you have processed. During the day-to-day operation, you need, say, the last 2 months of data all the time, but the older data you only need for reporting here and then.

By partitioning, you can keep the recent data on a fast disk and the older data on slower disks while still being able to run reports over the whole dataset.

And once you really don't need the old data any more, you can just bulk-remove partitions which will get rid of everything in that partition without touching anything else.

Even then you don't split over tablespaces: By keeping the data that's changing often separate from the data that's static and is only read, then you gain some advantages in index management and disk load when vacuum runs as it mostly wouldn't have to touch the archive partitions.

> For example, if a current month of data is primarily used for INSERT, UPDATE, DELETE, and MERGE operations while previous months are used primarily for SELECT queries, managing this table may be easier if it is partitioned by month. This benefit can be especially true if regular maintenance operations on the table only have to target a subset of the data. If the table is not partitioned, these operations can consume lots of resources on an entire data set. With partitioning, maintenance operations, such as index rebuilds and defragmentations, can be performed on a single month of write-only data, for example, while the read-only data is still available for online access.

The "General Ledger Entry" table in most accounting systems ends up being millions to billions of rows. Except for rare circumstances, prior periods are read-only due to business rules.

Metadata operations on partitions can be very fast. One simple example is date based housekeeping. Deleting a month of data will be quite intensive on most databases, whereby dropping a partition from the table is effectively instant.

Partion switching is also fast. Say you have a summary table that is rolled up by month, but you want to recalculate the summaries every so often. You can build a month into a new table and then switch the new table for a partition in the summary table.

When the circumstances allow it ('cause there are some limitations on where it can be used), this pattern is HUGE. We've got a few places in our system that do this, and the optimization achieved an improvement of a couple orders of magnitude.
First, if your query used the partition key in its where clause, the database knows(can calculate) which partitions can have a result and which can not. This means smaller indexes/less data to scan to find the result.

In the MSSQL case - not sure about others, this is were I had to use it - you can also switch data segments between tables indexed over the same partition function and with the same DDL. So you recreate the existing table a second time, create all the required indexes on it (which is fast because the table is empty), and then you switch partitions between them basically via pointer manipulation. The empty partition is now in the normal table, the data partition in the recreated one. Then you drop table on the recreated table. This is much more IO efficient than a delete-from statement.

This switching of course allows for a lot of other fun stuff as well, where you switch out a partition with a couple million rows, then work on it in isolation, switch the partitions back and then only have to "replay" the few rows that hit that partition while they were switched. Which is easy because they are now in the shadow table which is not updated further.

It is of course data and application dependent if you can use these things without affecting your application; but if it is suitable, the gains can be immense.

So this won't help when you have something like a deleted flag and need to join on data that could be in either partition?
Yes and no. It won't help in the sense that the query planner can not identify the affected partitions. But it can help if the database supports parallel index scans, since instead of one big index, it has to scan n smaller indexes which can be done concurrently.
I have a scenario where we want to keep 12 months of data online. When you go to delete the 13th month of data the traditional way:

- Postgres has to scan the whole table to find the old data

- Postgres marks it as free, but doesn't give it back to the OS

Handling this the naive way winds up being both slow and unproductive. With table partitioning, I just go in and DROP TABLE data_2015_11 and get on with life. It's fast and returns space to the OS.