by whalesalad 852 days ago
> You need an IKV account and a provisioned key-value store to start using IKV in production. Why? IKV is an embedded database which is built on top of a persistent stand-alone data layer (which needs resource allocation). To provision (provisioning time is usually less than 12 hrs)

This seems counterintuitive for an embedded store. A provisioning time of up to 12 hours is also wild.

2 comments

I’m guessing the 12hr provisioning time is because this is super early stage and there’s no self-serve interface available yet to provision it yourself?

In the larger picture I’m trying and failing to imagine the niche for the eventual product but that could be a lack of familiarity or imagination on my part. I’m guessing the OP is part of the team that’s working on this? If so, maybe you could elaborate on what specific problem this is solving? Additionally is there any possibility of self-hosting? Since writes do obviously involve network traffic, they’ll almost certainly be faster over a 6’ 10-Gbit SFP cable to the pool of NVMe drives sitting in the rack here.

Also, since the use case sounds like “datasets that can’t fit in RAM”, what’s the cold start latency like? Say I’ve pushed 10TB of data into IKV. How much does a given new node have to pull down into local storage before it can start reading from (potentially a shard of) the data?

Correct, we are super early so there is no self-serve yet.

The primary use case for this is serving features for ML inference (since eventual consistency is OK and sacrificing write latency for read latency is a fair tradeoff). Right now, this is done with a traditional client-server DB (Redis/DynamoDB/etc.), or if you're a big tech company that cares about latency you can implement this on your own (https://doordash.engineering/2022/05/03/how-we-applied-clien...).

As far as self-hosting goes - yes, writes will definitely be faster. IKV is fully open source so we're not opposed to it, we just haven't figured out the details yet (since self-hosting will mostly be useful for very large use cases).

At the core, we use in-memory hashmaps that reference memory-mapped files. So, when a dataset doesn't fit in RAM - it spills to disk automatically.
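
To make the "in-memory hashmap referencing a memory-mapped file" idea concrete, here is a minimal Python sketch. It is not IKV's code or API: the class name `MmapKV`, the file layout (values concatenated back to back), and the helper `demo` are all assumptions made up for illustration. The index lives in RAM as a dict of offsets, and the values live in a file that the OS pages in and out on demand, which is roughly what "spills to disk automatically" means here.

```python
import mmap
import os
import tempfile


class MmapKV:
    """Toy read-only store: dict index in RAM, values in a memory-mapped file.

    Hypothetical illustration only; not IKV's actual implementation.
    """

    def __init__(self, path, items):
        # key -> (offset, length) of the value inside the data file
        self._index = {}
        with open(path, "wb") as f:
            offset = 0
            for key, value in items.items():
                f.write(value)
                self._index[key] = (offset, len(value))
                offset += len(value)
        # Map the whole file read-only. The OS decides which pages stay in RAM.
        # Note: mmap raises ValueError on an empty file, so `items` must
        # contain at least one non-empty value.
        self._file = open(path, "rb")
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)

    def get(self, key):
        entry = self._index.get(key)
        if entry is None:
            return None
        offset, length = entry
        # Slicing an mmap object returns a bytes copy of that range.
        return self._map[offset:offset + length]

    def close(self):
        self._map.close()
        self._file.close()


def demo():
    # Build a tiny store in a temp directory and read from it.
    with tempfile.TemporaryDirectory() as d:
        store = MmapKV(
            os.path.join(d, "values.bin"),
            {"user:1": b"0.25,0.75", "user:2": b"0.5"},
        )
        try:
            return store.get("user:1"), store.get("missing")
        finally:
            store.close()
```

Reads never leave the process: a lookup is a dict hit plus a slice of mapped memory, and a page fault to local disk in the worst case.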

Cold start - the database is seeded with a "base image" that is built periodically by the backend. That's how a user can add new nodes to their cluster and still avoid any RPCs.

That being said, if you don't have 10TB of disk, you have to partition IKV (and by extension your application). We support partitioning by allowing documents (the data) to declare partitioning keys. If one shard/partition cannot fit on disk, the store won't start up.
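
As a rough sketch of what routing by a partitioning key and the "won't start up" check could look like: the code below is an assumption for illustration, not IKV's actual scheme. The function names `partition_for` and `check_shard_fits`, and the choice of hashing with SHA-256, are hypothetical.

```python
import hashlib


def partition_for(partitioning_key: str, num_partitions: int) -> int:
    """Map a document's partitioning key to a partition number.

    Uses a stable hash so every process computes the same partition for a key
    (Python's built-in hash() is randomized per process for strings).
    """
    digest = hashlib.sha256(partitioning_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def check_shard_fits(shard_bytes: int, disk_free_bytes: int) -> None:
    """Refuse to start if the assigned shard cannot fit on local disk."""
    if shard_bytes > disk_free_bytes:
        raise RuntimeError(
            f"shard needs {shard_bytes} bytes but only "
            f"{disk_free_bytes} bytes of disk are available"
        )
```

With a scheme like this, each application node would only need local disk for the partitions it serves, rather than for the full 10TB.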

It's a managed embedded store, i.e. someone can write data, forget about it, come back in a month with new hardware and still access all their data. You can't do that with a traditional embedded store (e.g. RocksDB or a local Redis instance).

There are data pipelines behind the scenes that distribute writes to the embedded store, and those need some provisioning time. And yes, we are super early.