HN Mirror


by derefr 1050 days ago

> but it seems like a very weird configuration

On IaaS providers, you get "local scratch NVMe" presented to the guest as individual fixed-sized disks — presumably because they're being IOMMU-pass-through'ed from the host (or a JBOD direct-attached to the host.)

The sizes for these disks were standardized several generations ago, so they're at least presented to the guest as 375G slices (I'm guessing they might actually be partitions of a larger disk nowadays). To get "decent" amounts of local scratch storage for e.g. a serverless data-warehouse instance, you need "all you can get" of these small volumes, which on at least AWS and GCP is 24 of them (equalling ~9TB).

And that's just one guest. The host might have several such guests.
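As a rough back-of-the-envelope sketch of the numbers above: the 375G slice size and the 24-slice maximum come from the comment, while the guests-per-host counts and the use of decimal units (1 TB = 1000 GB) are assumptions made only for illustration.

```python
SLICE_GB = 375          # per-disk size quoted above ("375G"); decimal GB assumed
SLICES_PER_GUEST = 24   # maximum quoted above for AWS and GCP


def scratch_totals(guests_per_host):
    """Return (passthrough disk count, capacity in TB) for one host.

    guests_per_host is a hypothetical parameter, not a figure from any provider.
    """
    disks = SLICES_PER_GUEST * guests_per_host
    capacity_tb = disks * SLICE_GB / 1000
    return disks, capacity_tb


for guests in (1, 2, 4):
    disks, tb = scratch_totals(guests)
    print(f"{guests} guest(s): {disks} passthrough disks, ~{tb:g} TB")
```

The point is less the capacity than the device count: a single guest already needs 24 separate disks, and a host carrying a few such guests would have to pass through a multiple of that.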

(To be clear, neither AWS nor GCP is likely to be using libvirt anywhere in their stack. This is just to demonstrate the use-case.)

3 comments

candiddevmike 1050 days ago

A serverless data warehouse instance sounds like an oxymoron


derefr 1050 days ago

"Serverless" is a jargon term, with a specific meaning — basically "all state is canonically durable in some external system, usually one rooted in a SAN-based managed object store like S3; there are no servers that keep durable state that must be managed, only object-store bills to pay and spot instances temporarily spun up to fetch and process the canonically-at-rest state."

(This kind of architecture is actually "serverless", but in a possibly-arcane sense to someone who doesn't admin these sorts of systems: it's "serverless" in that your QoS isn't bounded by any "scaling factor" proportional to some number of running servers. You don't have to think about how many "servers" — or "instances" or "pods" or "containers" or whatever-else — you have running. Especially, as a customer of a "serverless" SaaS, you will only get billed for the workloads you actually run, rather than for the underlying SaaS-backend-side servers that are being temporarily reserved to run those workloads.)

Snowflake and BigQuery are examples of serverless data warehouse systems. You do a query; servers get reserved from a pool (or spun up if the pool is empty); those servers stream your at-rest data from its canonically-at-rest storage form to answer your query.

In a serverless data warehouse, as long as you still have the same server spun up and serving your queries, it'll have the data it streamed to serve your previous queries in its local disk and memory caches, making further queries on the same data "hot." The more local scratch NVMe you give these instances, the more stuff they can keep "hot" in a session to accelerate follow-on queries or looping-over-the-dataset subqueries.
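A minimal sketch of that "hot" effect, assuming the local scratch space behaves like an LRU cache in front of the object store. The `ScratchCache` class, the `fetch` callback, and the toy 4-byte "partitions" are all hypothetical; real warehouses have their own cache policies.

```python
from collections import OrderedDict


class ScratchCache:
    """Toy LRU cache standing in for local scratch NVMe (hypothetical)."""

    def __init__(self, capacity_bytes, fetch):
        self.capacity_bytes = capacity_bytes
        self.fetch = fetch            # callable that reads from the object store
        self.used_bytes = 0
        self.entries = OrderedDict()  # key -> bytes, least recently used first
        self.hits = 0
        self.misses = 0

    def read(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        data = self.fetch(key)             # "cold" read from at-rest storage
        self._admit(key, data)
        return data

    def _admit(self, key, data):
        if len(data) > self.capacity_bytes:
            return                         # too large to cache at all
        while self.used_bytes + len(data) > self.capacity_bytes:
            _, evicted = self.entries.popitem(last=False)
            self.used_bytes -= len(evicted)
        self.entries[key] = data
        self.used_bytes += len(data)


# A dict plays the role of the object store; each "partition" is 4 bytes.
object_store = {"part-0": b"a" * 4, "part-1": b"b" * 4, "part-2": b"c" * 4}
small = ScratchCache(8, object_store.__getitem__)    # fits 2 of 3 partitions
large = ScratchCache(12, object_store.__getitem__)   # fits the whole dataset

for _ in range(2):               # loop over the dataset twice
    for key in object_store:
        small.read(key)
        large.read(key)

print("small:", small.hits, "hits,", small.misses, "misses")
print("large:", large.hits, "hits,", large.misses, "misses")
```

With the smaller cache, a sequential loop over the dataset evicts each partition just before it is needed again, so the second pass is as cold as the first. With enough scratch space to hold everything, the second pass is served entirely from local storage.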


Eduard 1050 days ago

What do "canonically durable" and "canonically-at-rest storage" mean?


derefr 1050 days ago

Most database systems are canonically-online: the state lives on the instances, and you make backups of it, but these are never more canonical than what’s on the local online storage of the cluster (and usually less-so, because it’s offset back in time by at least a few seconds, if not hours.)

When a cluster-node permanent-faults (say, its DC burns down), you lose at least a few seconds of what you — and your customers — thought of as committed data.

In a canonically-at-rest DBMS, the only state that matters is the state in the object store (or other external, highly-replicated durable-storage abstraction). The cluster nodes are just ephemeral caches in front of the canonical at-rest data; and all writes must be pushed down to the at-rest representation before any other nodes in the cluster can see them, and before the write returns as successful to the client.
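The write path described above could be sketched roughly as follows. `ObjectStore` and `Node` are hypothetical in-memory stand-ins, and the per-key version check on reads is an assumed way of keeping node caches honest, not something the comment specifies.

```python
class ObjectStore:
    """In-memory stand-in for a durable, replicated object store (hypothetical)."""

    def __init__(self):
        self._objects = {}  # key -> (version, value)

    def put(self, key, value):
        version = self._objects.get(key, (0, None))[0] + 1
        self._objects[key] = (version, value)
        return version

    def head(self, key):
        """Return only the current version (a cheap metadata lookup)."""
        return self._objects[key][0]

    def get(self, key):
        return self._objects[key]


class Node:
    """Compute node whose local state is only a cache of the store."""

    def __init__(self, store):
        self.store = store
        self.cache = {}  # key -> (version, value)

    def write(self, key, value):
        version = self.store.put(key, value)  # push to at-rest storage first
        self.cache[key] = (version, value)    # then update the local cache
        return version                        # only now report success

    def read(self, key):
        current = self.store.head(key)
        cached = self.cache.get(key)
        if cached is not None and cached[0] == current:
            return cached[1]                  # cache is still canonical
        self.cache[key] = self.store.get(key)
        return self.cache[key][1]


store = ObjectStore()
a, b = Node(store), Node(store)
a.write("k", "v1")
first = b.read("k")           # b sees the write because it is in the store
a.write("k", "v2")
second = b.read("k")          # b's stale cache entry is detected and refreshed
replacement = Node(store)     # a and b could vanish; nothing committed is lost
recovered = replacement.read("k")
print(first, second, recovered)
```

Because a write is acknowledged only after the store has it, losing every node loses cached data but no committed data, which is the contrast with the canonically-online case.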


db48x 1050 days ago

Not stored in memory.


adql 1050 days ago

...the use case of "our architecture's idiotic limitations made it hit hypervisor limitations" ?


jacquesm 1050 days ago

That definitely wouldn't be the first time.


simcop2387 1050 days ago

Probably not normal partitions but NVMe namespaces instead, since that will also allow them to balance IOPS and such so that one customer doesn't affect another as much.
