by nikita 40 days ago
I'm a VP at Databricks and former CEO of Neon. Happy to answer performance-related or any other questions here.
4 comments

The linked blog article[1] says: "Unified transactional and analytical workloads: Lakebase integrates seamlessly with the Lakehouse, sharing the same storage layer across OLTP and OLAP. This makes it possible to run real-time analytics, machine learning, and AI-driven optimization directly on transactional data without moving or duplicating it."

Is the "without moving or duplicating" part actually a true statement? If the actual table state is only reconstructed by the pageserver, it's not like Spark can just read it from S3.

[1] https://www.databricks.com/blog/what-is-a-lakebase

How does it affect HA Postgres (replicas, consensus, etc.), especially with extensions like Citus?
This specific perf improvement is orthogonal to HA.

However, in general, disaggregating storage makes HA simpler and allows for things like zero-downtime patching: https://www.databricks.com/blog/zero-downtime-patching-lakeb...

Read replicas can be "shallow". You don't need to replicate all the data to create a replica. This allows them to be created very quickly (sub-second).

All the extensions still work. We don't support Citus today, but mostly because customers are not asking for it rather than due to technical limitations. We support lots of extensions: https://docs.databricks.com/aws/en/oltp/projects/extensions

Thanks for offering. In the graph labeled "Prod customer throughput: (higher is better)", eyeballing it, within a week you are seeing a ~2k qps increase in peak throughput over the previous week.

Operationally, how do you handle landing that large of a perf improvement? If my data store changed that much in a week it could break something.

Generally, the more throughput the system supports, the better. In this case we were hitting limits (btw, each operation is many queries of different sizes) and the customer observed higher latencies, which is typical if the system can't sustain the required throughput.

After this change latencies are back to normal and throughput increased.

Ahh, so it was a customer pain point of higher latency so they were happy to see latency go down and throughput go up. Good to hear.

Great write up, cheers to the people involved.

Hi Nikita. Can you share any of Neon's techniques for minimizing noisy-neighbor issues in the multi-tenant storage services? Thanks!
* Rate limiting on proxy in front of compute fleet

* Large tenants are broken up into shards, reducing hotspots

* Each shard is throttled to a fixed req/s rate

* We do not run pageservers at their redline in terms of CPU load, so there is some slack to take up bursts

* Capacity quotas which selectively throttle write traffic to the largest databases if they are competing with others for disk space, until the larger database is migrated away.
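
To make the sharding and per-shard throttling points concrete, here is a minimal sketch of how they could fit together. This is not Neon's actual implementation: the class names (`TokenBucket`, `ShardedThrottle`), the hash-based shard assignment, and the `rate`/`burst` parameters are all assumptions chosen for illustration. Time is passed in explicitly so the behavior is easy to reason about.

```python
import hashlib


class TokenBucket:
    """Hypothetical throttle: at most `rate` requests/s, bursts up to `burst`."""

    def __init__(self, rate: float, burst: float, now: float = 0.0):
        self.rate = rate
        self.burst = burst
        self.tokens = burst  # start full so an idle shard can absorb a burst
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class ShardedThrottle:
    """Hypothetical: one tenant split into shards, each with its own fixed rate."""

    def __init__(self, num_shards: int, rate: float, burst: float):
        self.buckets = [TokenBucket(rate, burst) for _ in range(num_shards)]

    def shard_for(self, tenant_id: str, key: str) -> int:
        # Assumption: a stable hash spreads a tenant's keys across its shards.
        digest = hashlib.sha256(f"{tenant_id}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % len(self.buckets)

    def allow(self, tenant_id: str, key: str, now: float) -> bool:
        return self.buckets[self.shard_for(tenant_id, key)].allow(now)


throttle = ShardedThrottle(num_shards=4, rate=2.0, burst=2.0)
for t in (0.0, 0.0, 0.0, 0.5):
    print(t, throttle.allow("tenant-a", "page-1", now=t))
```

Because each shard has its own bucket, a hot key only exhausts the budget of the shard it hashes to, and the other shards of the same tenant keep their full rate.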