by scottmessinger 4153 days ago
What are the performance characteristics of realtime push? Does the performance of inserts slow down with the number of subscriptions to change feeds? Or, is insert performance unrelated to subscriptions? Also, does the change feed only show the before/after or does it also show the query that was used to transform the data?

Slava @ rethink here.

The idea behind the architecture was that performance should be significantly better than rolling your own infrastructure, because the database has a lot of information that userland (from the database perspective) software doesn't.

The performance of inserts might slow down slightly (a matter of microseconds in insert latency) if you create many feeds. The database has to look at each insert and figure out if it applies to any of the feeds. This code is written in optimized C++ and is very fast. We're still running benchmarks, but we're shooting for performance levels where you (as a user) might not even be able to measure the difference.

The same applies for inserts that aren't affecting feeds (on a per table basis).

Same goes for throughput -- it might slow down slightly, but we're shooting for making the slowdown barely measurable if at all.

EDIT: in clustered environments, if you're subscribed to 1000 changefeeds on machine A and a write happens on machine B, we do constant work on B to send the changes to A and then A does all the work to figure out which changefeeds need to see it. TL;DR: We don't block out other writes for time proportional to the number of feeds.
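
A rough sketch of the split described in the EDIT above, in Python rather than the database's C++. Everything here (the `Node` class, `peers`, `messages_sent`, the shape of the change dict) is a hypothetical illustration, not RethinkDB's internals. It only shows that the writing machine sends one message per peer while the subscribing machine does the per-feed matching.

```python
class Node:
    """Hypothetical cluster member, for illustration only."""

    def __init__(self, name):
        self.name = name
        self.feeds = []         # (predicate, callback) pairs subscribed on this node
        self.peers = []         # other nodes that hold subscriptions
        self.messages_sent = 0  # counts writer-side work

    def subscribe(self, predicate, callback):
        self.feeds.append((predicate, callback))

    def write(self, change):
        # Writer side: one message per peer, no matter how many feeds the peer holds.
        for peer in self.peers:
            self.messages_sent += 1
            peer.receive(change)

    def receive(self, change):
        # Subscriber side: work proportional to the number of local feeds.
        for predicate, callback in self.feeds:
            if predicate(change):
                callback(change)


a, b = Node("A"), Node("B")
b.peers.append(a)

seen = []
for i in range(1000):
    # 1000 feeds on A, each interested in a different id (i=i binds the loop value).
    a.subscribe(lambda c, i=i: c["new_val"]["id"] == i, seen.append)

b.write({"old_val": None, "new_val": {"id": 42}})
```

In this toy version B sends a single message for the write, and A walks its 1000 feeds to find the one that matches.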

Are you doing anything clever to figure out e.g. which subset of changefeeds subscribed to a table might be interested in a given update?

Let's say I have a table containing data for many users, while each subscription only needs data for a single user. Instead of scanning through all the changefeeds, you could put subscriptions in a hashmap and figure out which ones to update in O(1) time rather than O(N) time in the number of subscriptions, per update.

Obviously this is much harder in the general case, but do you do anything along these lines?
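
A minimal sketch of the hashmap idea above, assuming every subscription is an equality filter on one field with hashable values. `FeedIndex`, `subscribe`, and `dispatch` are made-up names for illustration, not anything RethinkDB exposes.

```python
from collections import defaultdict


class FeedIndex:
    """Hypothetical index of equality subscriptions keyed on a single field."""

    def __init__(self, field):
        self.field = field
        self.by_value = defaultdict(list)  # field value -> list of callbacks

    def subscribe(self, value, callback):
        self.by_value[value].append(callback)

    def dispatch(self, row):
        # Average O(1) dict lookup, then work only for the matching callbacks.
        # .get() avoids creating empty entries for values nobody subscribed to.
        callbacks = self.by_value.get(row.get(self.field), ())
        for callback in callbacks:
            callback(row)
        return len(callbacks)


index = FeedIndex("user_id")
inbox = defaultdict(list)
for uid in range(10_000):
    # One subscription per user; each delivers into that user's inbox.
    index.subscribe(uid, inbox[uid].append)

notified = index.dispatch({"user_id": 7, "text": "hi"})
```

Here one update touches a single subscription out of 10,000. Range filters or arbitrary predicates would not fit this structure, which is the harder general case.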

_This code is written in optimized C++ and is very fast._

Can you elaborate?

_but we're shooting for performance levels where you (as a user) might not even be able to measure the difference._

Benchmarks are almost always skewed towards the preferred workload of the DB (they're like science experiments, the result is heavily biased). How are you ensuring this isn't the case with your benchmarks?

Also what are you benchmarking against? I'd like to see one against Aerospike.