by karmaniverous 592 days ago

The key focus of EM is the implementation of a multi-entity data model along the lines of the Single Table Design Pattern.

Recall that, in DDB, an index key has two parts: a hash key and a range key. If you want to have many entities in the same table, then you need a way of distinguishing between different entity types, and a way of locating an individual record. In your primary index, those map to your hash and range keys, respectively: the hash key is your entity differentiator, and the range key is your entity id (which may come from a different record property for each entity type). If you follow the development of the article, you’ll see how this plays out with variously constructed keys across different indexes.
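
To make the hash/range split concrete, here is a minimal sketch. It is not EM's actual API: the `Item` type, the `idProperty` map, the `primaryKey` helper, and the `hashKey`/`rangeKey` attribute names are all hypothetical, chosen only to illustrate the idea.

```typescript
// Hypothetical illustration only; none of these names come from EM itself.
type Item = Record<string, string | number>;

// Which record property supplies the id for each entity type (assumed example data).
const idProperty: Record<string, string> = {
  user: "userId",
  email: "address",
};

// Build the primary-index key: hash key = entity type, range key = entity id.
function primaryKey(entity: string, item: Item): { hashKey: string; rangeKey: string } {
  const prop = idProperty[entity];
  if (prop === undefined) throw new Error(`unknown entity: ${entity}`);
  const id = item[prop];
  if (id === undefined) throw new Error(`missing ${prop} on ${entity}`);
  return { hashKey: entity, rangeKey: String(id) };
}

// Two entity types in one table, each taking its id from a different property.
const userKey = primaryKey("user", { userId: "abc123", firstName: "Ada" });
const emailKey = primaryKey("email", { address: "ada@example.com", userId: "abc123" });
```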

Now, forget EM sharding for a minute and let DDB manage your sharding. Say you launch your application with little data and a single shard. Over time your data scales & spills over onto additional shards. When you perform a search, DDB has no way of knowing which shards are relevant so it has to search ALL of them.

But from the application side, your data scaled over TIME. Therefore, if you know which shards were created when, you could limit a time-based search only to the shards that are relevant to the search parameters. And a LOT of searches involve a time window.

Within the context of EM, when I say a “shard”, I am talking about a unique hash key value like `user!1F`, where `user` is the entity type and `1F` is the shard key. These may or may not map to physical DDB shards, and the good news is that you don’t NEED to care… DDB will flex if you don’t.
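
As a rough sketch of the time-window idea: if you record when each shard scheme took effect, you can list only the hash keys that could hold records from a given window. The `ShardBump` shape, the `radix` field, and both helper names below are assumptions made for illustration, not EM's real config or API.

```typescript
// Hypothetical sketch; the ShardBump shape and helper names are assumptions, not EM's API.
interface ShardBump {
  timestamp: number; // ms since epoch when this bump took effect
  chars: number; // shard key length; 0 means unsharded
  radix: number; // number base of each shard key character (e.g. 16 for hex)
}

// Every shard key a bump can produce, e.g. chars=1, radix=4 -> "0".."3".
function shardKeys(bump: ShardBump): string[] {
  if (bump.chars === 0) return [""];
  const count = bump.radix ** bump.chars;
  return Array.from({ length: count }, (_, i) =>
    i.toString(bump.radix).toUpperCase().padStart(bump.chars, "0"),
  );
}

// Hash keys that could hold records created within [from, to].
function hashKeysForWindow(entity: string, bumps: ShardBump[], from: number, to: number): string[] {
  const sorted = [...bumps].sort((a, b) => a.timestamp - b.timestamp);
  const keys = new Set<string>();
  sorted.forEach((bump, i) => {
    const next = sorted[i + 1];
    const end = next ? next.timestamp : Infinity;
    // Keep the bump only if its active interval [timestamp, end) overlaps the window.
    if (bump.timestamp <= to && end > from) {
      for (const key of shardKeys(bump)) keys.add(key === "" ? entity : `${entity}!${key}`);
    }
  });
  return Array.from(keys);
}

const bumps: ShardBump[] = [
  { timestamp: 0, chars: 0, radix: 16 }, // launch: a single "shard"
  { timestamp: 1_000, chars: 1, radix: 4 }, // later bump: four shards
];
const early = hashKeysForWindow("user", bumps, 100, 500);
const late = hashKeysForWindow("user", bumps, 1_500, 2_000);
const spanning = hashKeysForWindow("user", bumps, 500, 1_500);
```

A window that falls entirely before the bump only touches the unsharded key, while a window that straddles the bump has to cover both schemes.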

EM has a lot of features that greatly streamline the dev experience when operating against a DDB table with a multi-entity data model. You don’t HAVE to use the sharding feature… it’s literally just a config item; everything else happens behind the scenes. But when you DO use it, EM splits a search across sharded data into MANY parallel searches, one per shard, then assembles the returns into a coherent result with a “page key” that is actually a compressed map of ALL the underlying page keys. You don’t have to care about THAT, either… just pass the compressed string back to EM and it will rehydrate the page keys & perform the next set of searches.
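
Here is a minimal sketch of that fan-out and the combined page key. It is an illustration under assumptions, not EM's implementation: an in-memory `Map` stands in for the table, `queryShard` and `queryShards` are hypothetical names, and the page key map is plain JSON where EM would compress it.

```typescript
// Hypothetical sketch; an in-memory Map stands in for the table, and none of these names are EM's API.
type Row = { hashKey: string; rangeKey: string };
type Page = { items: Row[]; pageKey?: string };

const table = new Map<string, Row[]>(); // hashKey -> rows sorted by rangeKey

// Query one shard: up to `limit` rows after `pageKey`, plus the next page key if more remain.
async function queryShard(hashKey: string, limit: number, pageKey?: string): Promise<Page> {
  const rows = (table.get(hashKey) ?? []).filter(
    (r) => pageKey === undefined || r.rangeKey > pageKey,
  );
  const items = rows.slice(0, limit);
  const last = items[items.length - 1];
  return rows.length > limit && last ? { items, pageKey: last.rangeKey } : { items };
}

// Query all shards in parallel. `compositeKey` is the serialized map of per-shard page keys
// from the previous call (plain JSON here; compression is omitted for brevity).
async function queryShards(hashKeys: string[], limit: number, compositeKey?: string): Promise<Page> {
  const prior: Record<string, string> | undefined = compositeKey
    ? JSON.parse(compositeKey)
    : undefined;
  // On later pages, skip shards that were already exhausted (absent from the map).
  const active = prior ? hashKeys.filter((k) => k in prior) : hashKeys;
  const pages = await Promise.all(active.map((k) => queryShard(k, limit, prior?.[k])));

  const next: Record<string, string> = {};
  active.forEach((k, i) => {
    const pk = pages[i]?.pageKey;
    if (pk !== undefined) next[k] = pk;
  });
  // Merge this round's per-shard pages into one sorted list.
  const items = pages
    .flatMap((p) => p.items)
    .sort((a, b) => (a.rangeKey < b.rangeKey ? -1 : a.rangeKey > b.rangeKey ? 1 : 0));
  return Object.keys(next).length > 0 ? { items, pageKey: JSON.stringify(next) } : { items };
}
```

The caller only ever sees one opaque `pageKey` string: pass it back to `queryShards` to get the next page, and stop when it comes back undefined.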

So you get to choose your own adventure… you can run every entity on a single “shard” or run in parallel. I’d just keep an eye out for any drop in performance at scale and add a shard bump when I see it.

Also worth noting: EM is actually platform-agnostic. There is a companion repo that contains the DDB-specific client. This is still a bit in flux btw so be kind lol. Anyway the point is that other platforms that don’t have AWS’ resource footprint may not handle sharding as well, and EM will be able to render effectively the same result.

Hope that answers your question!

P.S. Worth noting: in addition to searching across multiple SHARDS, an EM query can also search across multiple INDEXES. Say you want to query on “name” and you want to query both your firstName and lastName indexes with the same “name” value. With EM, this is a SINGLE query that returns a combined, paged, deduped, sorted result set. Handy.
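
A rough sketch of the multi-index case, again under assumptions: in-memory arrays stand in for the two index queries, `queryIndex` and `queryByName` are hypothetical names rather than EM's API, and paging is left out to keep it short.

```typescript
// Hypothetical sketch; in-memory data stands in for index queries, and no names here are EM's API.
type User = { userId: string; firstName: string; lastName: string };

const users: User[] = [
  { userId: "u1", firstName: "Taylor", lastName: "Smith" },
  { userId: "u2", firstName: "Jordan", lastName: "Taylor" },
  { userId: "u3", firstName: "Taylor", lastName: "Taylor" },
  { userId: "u4", firstName: "Alex", lastName: "Jones" },
];

// Stand-in for querying one index by exact value.
async function queryIndex(index: "firstName" | "lastName", value: string): Promise<User[]> {
  return users.filter((u) => u[index] === value);
}

// Query both name indexes with one value, then dedupe by id and sort the combined result.
async function queryByName(name: string): Promise<User[]> {
  const results = await Promise.all([
    queryIndex("firstName", name),
    queryIndex("lastName", name),
  ]);
  const byId = new Map<string, User>();
  // A record that matches both indexes (u3 here) is kept only once.
  for (const user of results.flat()) byId.set(user.userId, user);
  return Array.from(byId.values()).sort((a, b) =>
    a.userId < b.userId ? -1 : a.userId > b.userId ? 1 : 0,
  );
}
```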