

by IpV8 3234 days ago

That's funny, because I was dying for it to be longer. The post felt like just an introduction. I'd love to see a part 2 that goes into more detail on actually implementing a sharding plan.

A major question for me as I consider sharding is what my application code will look like. Let's say I have a query like:

'select products.name from vendor inner join products on vendor.id = products.vendor where vendor.location = "USA"'

If I shard such that there are many products tables (one per vendor), what would my query look like?

4 comments

ves 3234 days ago

Your application code shouldn't have sharding concerns in its logic. To achieve this, you should introduce an abstraction layer. One such example is vitess[0], which is used at YouTube.

If that's too much work, an easy preliminary step is to add the abstraction layer in your application code. That gets you most of the benefits of a proxy in terms of keeping application logic clean, and it makes switching over later easy, but it is less powerful and less feature-complete.
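To make the in-application option concrete, here is a minimal sketch of such a layer in Python. Everything in it is an assumption for illustration: the `ShardRouter` class, the `make_shards` helper, the hash-by-vendor-id routing, and the in-memory SQLite databases standing in for real shards. It is not how vitess works; it only shows shard selection living in one place while the calling code stays free of sharding logic.

```python
import sqlite3
import zlib


class ShardRouter:
    """Hypothetical abstraction layer: maps a vendor id to one of N shard connections."""

    def __init__(self, connections):
        self.connections = list(connections)

    def shard_for(self, vendor_id):
        # crc32 is stable across processes, unlike the built-in hash() for strings.
        key = str(vendor_id).encode("utf-8")
        return self.connections[zlib.crc32(key) % len(self.connections)]

    def add_product(self, vendor_id, name):
        conn = self.shard_for(vendor_id)
        conn.execute(
            "INSERT INTO products (vendor, name) VALUES (?, ?)", (vendor_id, name)
        )
        conn.commit()

    def products_for_vendor(self, vendor_id):
        # Callers never see which shard was used.
        conn = self.shard_for(vendor_id)
        rows = conn.execute(
            "SELECT name FROM products WHERE vendor = ? ORDER BY name", (vendor_id,)
        ).fetchall()
        return [row[0] for row in rows]


def make_shards(n):
    """Create n in-memory SQLite databases as stand-ins for real shards (assumption)."""
    shards = []
    for _ in range(n):
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE products (vendor INTEGER, name TEXT)")
        shards.append(conn)
    return shards


router = ShardRouter(make_shards(4))
router.add_product(1, "widget")
router.add_product(2, "sprocket")
print(router.products_for_vendor(1))
```

Application code only calls `add_product` and `products_for_vendor`, so swapping the router for a proxy later should mostly mean replacing this one class. Note that simple modulo routing like this forces data to move whenever the shard count changes, which is one of the reasons a dedicated layer is more powerful.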

[0]: http://vitess.io/overview/#features


ozgune 3233 days ago

Reading through your comment again, I realize I completely missed the mark on your question.

If you use Citus, you don't have to make any changes in your application. You just need to remodel your data and define your tables' sharding column(s). Citus will take care of the rest. [1]

In other words, your app thinks it's talking to Postgres. Behind the covers, Citus shards the tables, routes and parallelizes queries. Citus also provides transactions, joins, and foreign keys in a distributed environment.

[1] Almost. Over the past two years, we've been adding features to make app integration seamless. With our upcoming release, we'll get there: https://github.com/citusdata/citus/issues/595


ozgune 3234 days ago

Thanks for your input (also the_duke)! If time permits, we may come up with a second blog post on this topic.

If I understood your example query, your application serves vendors and each vendor has different products. Is that correct?

You can approach this sharding question in one of two ways.

1. Merge the different product tables into one large products table and add a vendor column.

2. Model the product tables as "reference tables". This will replicate the product tables to all nodes in the cluster.

Without knowing more about your application / table schemas, I'd recommend the first approach. I'd also be happy to chat more if you drop us a line.
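As a rough illustration of the first approach, the sketch below builds a single merged `products` table with a `vendor` column and runs the query from the question unchanged apart from parameterizing the location. The schema, the sample rows, and the helper names `build_demo_db` and `product_names_by_location` are assumptions made up for the example, and an in-memory SQLite database stands in for Postgres/Citus, so nothing here is actually distributed.

```python
import sqlite3


def build_demo_db():
    """Build a toy schema: one merged products table with a vendor column (assumed layout)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE vendor (id INTEGER PRIMARY KEY, location TEXT);
        CREATE TABLE products (
            id INTEGER PRIMARY KEY,
            vendor INTEGER REFERENCES vendor(id),
            name TEXT
        );
        INSERT INTO vendor (id, location) VALUES (1, 'USA'), (2, 'Canada'), (3, 'USA');
        INSERT INTO products (vendor, name) VALUES
            (1, 'anvil'), (2, 'maple syrup'), (3, 'rocket skates');
        """
    )
    return conn


def product_names_by_location(conn, location):
    # Same join as in the question; only the location literal became a parameter.
    rows = conn.execute(
        "SELECT products.name FROM vendor "
        "INNER JOIN products ON vendor.id = products.vendor "
        "WHERE vendor.location = ? ORDER BY products.name",
        (location,),
    ).fetchall()
    return [row[0] for row in rows]


print(product_names_by_location(build_demo_db(), "USA"))
```

The point is that with one merged table the query keeps its original shape, and the vendor column is the natural candidate for the sharding column.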


the_duke 3234 days ago

Same here.

To me it read like a basic introductory post for a longer series.
