Hacker News new | ask | show | jobs
by levkk 336 days ago
My last big co, we had a team of 10 who's entire job was to sync data from Postgres into Elastic. It would take weeks and fallover regularly due to traffic.

If we could have a DB that could do search and be a store of record, it would be amazing.

5 comments

They're different access patterns, though. Are there no concerns about performance and potentially blocking behavior? Decoupling OLTP and analytics is frequently done with good reason: 1/to allow the systems to scale independently, and 2/to help prevent issues with one component from impacting the other (i.e., contain blast radius). I wouldn't want a failure of my search engine to also take down my transaction system.
You don't need to. Customers usually deploy us on a standalone replica(s) on their Postgres cluster. If a query were to take it down, it would only take down the replica(s) dedicated to ParadeDB, leaving the primary and all other read replicas dedicated to OLTP safe.
Are you saying that the cluster isn't homogenous? It sounds like you're describing an architecture that involves a cluster that has two entirely different pieces of software on it, and whose roles aren't interchangeable.
Bear with me, this will be a bit of a longer answer. Today, there are two topologies under which people deploy ParadeDB.

- <some managed Postgres service> + ParadeDB. Frequently, customers already use a managed Postgres (e.g. AWS RDS) and want ParadeDB. In that world, they maintain their managed Postgres service and deploy a Kubernetes cluster running ParadeDB on the side, with one primary instance and some number of replicas. The AWS RDS primary sends data to the ParadeDB primary via logical replication. You can see a diagram here: https://docs.paradedb.com/deploy/byoc

In this topology, the OLTP and search/OLAP workloads are fully isolated from each other. You have two clusters, but you don't need a third-party ETL service since they're both "just Postgres".

- <self-hosted Postgres> + ParadeDB. Some customers, typically larger ones, prefer to self-host Postgres and want to install our Postgres extension directly. The extension is installed in their primary Postgres, and the CREATE INDEX commands must be issued on the primary; however, they may route reads only to a subset of the read replicas in their cluster.

In this topology, all writes could be directed to the primary, all OLTP read queries could be routed to a pool of read replicas, and all search/OLAP queries could be directed to another subset of replicas.

Both are completely reasonable approaches and depend on the workload. Hope this helps :)

Which of these two is the higher order bit?

* ParadeDB speaks postgres protocol

* These setups don't have a complex ETL pipeline

If you have a ETL pipeline specialized for PG logical replication (as opposed to generic JVM based Debizium/Kafka setups), you get some fraction of the same benefits. I'm curious about Conduit and its postgres plugin.

That leaves: ParadeDB uses vanilla postgres + rust extension. This is a technology detail. I was looking for an articulation of the customer benefit because of this technologically appealing architecture.

The value prop for customers vs Elasticsearch are:

- ACID w/ JOINs

- Real-time indexing under UPDATE-heavy workloads. Instacart wrote about this, they had to move away from Elasticsearch during COVID because of this problem: https://tech.instacart.com/how-instacart-built-a-modern-sear...

Beyond these two benefits, then the added benefits are:

- Infrastructure simplification (no need for ETL)

- Lower costs

Speaking the wire protocol is nice, but it's not worth much.

they both sound like postgres to me, just with different extensions
Since we both worked there: I can think of a few places at Segment where we'd have added more reporting/analytics/search if it weren't such a pain to set up a OLAP copy of our control plane databases. Remember how much engineering effort we spent on teams that did nothing but control plane database stuff?

Data plane is a different story, but not everything is 1m+ RPS.

It's not going to happen anytime soon, because you simply cannot cheat physics.

A system that supports OLAP/ad-hoc queries is going to need a ton of IOPs & probably also CPU capacity to do your data transformations. If you want this to also scale beyond the capacity limits of a single node, then you're going to run into distributed joins and network becomes a huge factor.

Now, to support OLTP at the same time, your big, distributed system needs to support ACID, be highly fault-tolerant, etc.

All you end up with is a system that has to be scaled in every dimension. It needs to support the maximum possible workloads you can throw at it, or else a random, expensive reporting query is going to DOS your system and your primary customer-facing system will be unusable at the same time. It is sort of possible, but it's going to cost A LOT of money. You have to have tons and tons of "spare" capacity.

Which brings us to the core of engineering -- anyone can build a system that burns dump trucks full of venture capital dollars to create the one-system-to-rule-them-all. But businesses that want to succeed need to optimize their costs so their storage systems don't break the bank. This is why the current status-quo of specialized systems that do one task well isn't going to change. The current technology paradigm cannot be optimized for every task simultaneously. We have to make tradeoffs.

I don't know. For me, I need

* a primary transactional DB that I can write fast, with ACID guarantees and a read-after-write guarantee, and allows failover

* one (or more) secondaries that are optimized for analytics and search. This should also tell me how caught up the system is, with the primary.

If they all can talk the same language (SQL) and can replicate from primary with no additional tools/technology (postgres replcation for example), I will take it any day.

It is about operational simplicity and not needing intimately to know multiple technologies. Granted, even if this is "just" postgresql, it really is not and all customizations will have their own tuning and whatnot, but the context is all still postgresql.

Yes, this will not magically solve the CAP theorem, but for most cases we don't need to care too much

Yeah, in general, I think a lot of businesses would love to skip ETL pipelines if possible / consolidate data. Postgres is a very much a neutral database to extend upon, maybe a wild analogy but it's the canola oil of databases
Total tangent, but I think "Canola is a neutral oil" is a lie. It's got the most distinctive (and in my opinion, bad) flavor of the common cooking oils.
What would you say is the most neutral oil then?
Sunflower oil? It seems to very reliably taste like nothing.
Personally I have Canola and Sunflower oil tied. Vegetable Oil I guess deserves a mention here too.
If canola oil tastes like something, it's really disgusting IMO. I kinda hate the stuff even though my dad made good money growing it. OTOH, the very sweet smell of the plant's flowers is pleasant enough if pretty basic and the honey is similar.
Shout out to grapeseed oil
Once upon a time, I was using postgres for OLTP and OLAP purposes combined with in-database transforms using TimescaleDB. I had a schema for optimized ingestion and then several aggregate views which produced a bunch of purpose-specific "materialized" tables for efficient analysis based on the ingestion tables.

Timescale had a nice way of abstracting away the cost of updating these views without putting too much load on ingestion (processing multiple TBs of data a time in a single instance with about 500Gb of data churn daily).

One db that could be interesting here is CrateDB. It's a Lucene based DB that supports the postgres wire protocol. So you can run SQL queries against it.

I've tried figuring out if it supports acting as a pg read-replica, which sounds to me like the ideal set up - but it doesn't seem to be supported.

I have no affiliation to them, just met the team at an event and thought it sounded cool.

One of the ParadeDB maintainers here -- Being PostgreSQL wire protocol compatible is very different from being built inside Postgres on top of the Postgres pages, which is what ParadeDB does. You still need the "T" in ETL, e.g. transforming data from your source into the format of the sink (in your example CrateDB). This is where ETL costs and brittleness come into play.

You can read more about it here: https://www.paradedb.com/blog/block_storage_part_one

Sounds very interesting! Unfortunately AGPL license makes it hard to bring into projects.
How so? Many popular projects are AGPL. MinIO, Grafana, etc.

We wrote about this here: https://www.paradedb.com/blog/agpl

So, I'm not versed enough in legal matters to be certain about this, so I tend to fallback to caution, but (A) customers I've worked with in the past seem to be wary of such copyleft licenses and (B) the contagious nature of such license would make me think twice about using it in a project of my own as well.

It would be nice to have such notion challenged but I'm not sure what would change my mind.

I would expect that most commercial companies that use Grafana would obtain a commercial license?