by hoodoof 3411 days ago
AWS needs something like this.

The missing piece for the AWS serverless story is a database that is suitable for writing real world applications. DynamoDB is far from suitable for that task, which leaves AWS serverless with no good database.

5 comments

AWS has RDS. That's most certainly a database suitable for writing real-world applications, as it's MySQL.

Does serverless somehow mandate a non SQL solution?

RDS also supports PostgreSQL, SQL Server, Oracle, Aurora and MariaDB

http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Welcom...

RDS is server based - you need to pay to have an instance running per hour. That's not serverless. That's "serverful".
On the one hand, everything is server based at some level; it's just a question of how much is being hidden from you and managed by a third party.

On the other hand RDS hides a lot of the complexity from you. You don't have to pick an OS, apply updates, secure it, manage it, configure it, or patch it. There are some number of virtual servers out there that are nominally running your RDS cluster, but it's all pretty theoretical.

So I'm not entirely understanding your point.

> you need to pay to have an instance running per hour

You are paying to have instances running with every other DB service too; they may just break it out on your bill a bit differently. :)

The real issue with RDS for me isn't that they haven't removed the server part from the equation (they have), it's that they haven't removed the RDBMS from the equation. Schema changes, data migrations, replicas, sharding, scaling: All the hard parts of running a RDBMS are still there.

If Amazon could somehow make a magical service that accepted SQL queries and somehow returned my data, I'd be ecstatic - but the difference between that and RDS isn't the fact that they're letting me know how much ram the virtual server which is nominally running MySQL for me has.

I'm not sure how that differs from Azure Document DB? I have no inside info on this, but I'm pretty sure it runs on a server too. In the specific context of databases used for "serverless", clearly there are servers involved; it's simply that your application and ops team don't manage them.

What I'm getting at is, a hosted DB is a hosted DB. What makes SQL unsuitable for serverless?

Replying to myself here, I missed a key point. The issue you raise is that you're billed per hour, even when it's unused? That makes some amount of sense, but any data storage is going to come with a per hour bill - either for the instance of it, or the data within it.

Anyway, my bad, I now see your point :)

It's a cloud service just like Dynamo, the implementation specifics seem irrelevant here.

Touting "serverless" as some sort of mysticism that doesn't really mean anything useful doesn't really get anybody anywhere.

Yeah, I dabbled in DynamoDB for a recent project - couldn't really get my head around it - very strange sort of NoSQL database. The query language is incredibly arcane and wordy, and mostly inflexible.

Thinking of setting up an EC2 instance running RethinkDB or PouchDB for my project (and for future projects).

Cross datacenter replication is the missing piece from AWS. I wish they'd just roll out a hosted Cassandra or something identical
While probably not what you're looking for if you're mentioning Cassandra, RDS does let you have read replicas in any region.
You can use scylladb.com and set it up pretty easily. Stable, distributed and fast out of the box with a lot less maintenance.
> DynamoDB is far from suitable for that task

why?

DynamoDB would be pretty close if it just allowed null values.
DynamoDB is effectively useless for querying, except perhaps for some sort of highly specialised application able to fit within DynamoDB's strange and arcane query model.

What sort of database is effectively useless for querying?

Also they need to ditch the really, really confusing and limiting scaling model. For a database that advertises scaling as one of its key strengths, DynamoDB sure has a bad scaling story.

> What sort of database is effectively useless for querying?

Cassandra, Riak, Voldemort, HBase, Bigtable, Azure Table Storage, and many other implementations of wide column stores have similarly limited querying.

I'm also not sure what you mean by the limiting scaling model. I can go from 0 to 160k reads/second by turning a knob, and 160k is only the default limit (you can request higher limits).

It is not a document store. It's a wide column store. Use it for the right job and it does very well. Treat it like postgres and you are gonna have a hard time.

The price for that 160k is horrifying though, esp. if the requirement is bursty rather than continuous.
Which is why you turn the knob back down when you stop being bursty.

But yes, it's pricy. It may not be the best fit for some. Hopefully by the time you're taking 160k writes per second you have a solid business model. I mean, Twitter peaked at around 8000 tweets per second. What are you doing that requires 160k, and do you really need to be storing it?

It's probably an indication that your use-case is not a good fit for dynamo, or that you didn't adapt your use-case to dynamo and you're doing something "wrong", like trying to use it as a relational database. I've experienced some of these pains as part of my dynamo learning curve.

For example, by changing my query strategy I was able to reduce the provisioned write units from 1900 to 150 (write units dominate the cost).

Ignoring reserved prices, it is $10.40/hr (these are eventually consistent reads, so half the cost of consistent ones). That puts it roughly on par with an RDS postgres r3.8xlarge instance with 10k provisioned IOPS.

Sure, you likely have more than one table on RDS, so that cost is amortized, but when you get to the scale where you need 160k reads/s, you aren't going to have much more than that one dataset in a single instance.
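
To make the $10.40/hr arithmetic above concrete, here is a rough Python sketch. The price constant is an assumption (the provisioned-read rate of $0.0065 per hour per 50 read capacity units, picked because it reproduces the quoted figure), and the function names are made up for illustration. It also assumes items of 4 KB or less, so one strongly consistent read per second costs one read unit and an eventually consistent read costs half.

```python
# Assumed price: $0.0065 per hour for every 50 provisioned read capacity units.
# This constant is an assumption chosen to match the $10.40/hr figure quoted above.
PRICE_PER_50_RCU_HOUR = 0.0065


def read_capacity_units(reads_per_sec, eventually_consistent=True):
    """Read units needed for items up to 4 KB.

    Eventually consistent reads consume half a unit each.
    """
    return reads_per_sec / 2 if eventually_consistent else reads_per_sec


def hourly_cost(reads_per_sec, eventually_consistent=True):
    """Hourly cost in dollars of provisioning the given read rate."""
    rcu = read_capacity_units(reads_per_sec, eventually_consistent)
    return rcu / 50 * PRICE_PER_50_RCU_HOUR


print(f"eventually consistent: ${hourly_cost(160_000):.2f}/hr")
print(f"strongly consistent:   ${hourly_cost(160_000, eventually_consistent=False):.2f}/hr")
```

Under those assumptions, 160k eventually consistent reads/s is 80,000 read units, which comes to $10.40/hr; asking for consistent reads doubles it.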

It works well for a CQRS model, which helps with super high scale apps. But most devs want joins and don't want the discipline of managing the data duplication.
I just rolled out a feature on DynamoDB and when monitoring it, I look at one thing: provisioned capacity vs consumed capacity. That's all I have to care about. No CPU, RAM, disk space metrics. Usage can increase 4x and performance is flat. It's great.

The application is less flexible and required making a lot of decisions up front, but operationally it's fantastic.

For my application I have found it is more complex than provisioned vs consumed capacity. I get throttling all the time when consumed capacity is a third of provisioned capacity.

You also need to care about how DDB does its underlying partitioning. It would be nice to turn the knobs and be able to trust you will get X reads/sec and Y writes/sec, but that is only true per node! Unfortunately, DDB gives you zero information about how many nodes your DDB table is running on! (Yes you can guess pretty well if you keep track of your usage rate and do some math).

So when provisioning, you need to be aware that if you have 100 provisioned read ops, but you have data on 5 nodes, you really only have 20 reads/sec if one key gets hot.

I agree it's pretty easy operationally, but you can get burned if you don't know how it works under the hood.
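
The hot-key math above can be sketched in a few lines of Python. This is only an illustration under the assumption stated in the comment, that provisioned throughput is split evenly across nodes; it ignores any burst allowance, and the function names are hypothetical.

```python
def per_partition_limit(provisioned_per_sec, partitions):
    """Throughput available to a single partition, assuming an even split."""
    return provisioned_per_sec / partitions


def is_throttled(hot_key_rate, provisioned_per_sec, partitions):
    """A single key lives on one partition, so it only gets that partition's share."""
    return hot_key_rate > per_partition_limit(provisioned_per_sec, partitions)


# 100 provisioned reads/sec spread over 5 partitions.
print(per_partition_limit(100, 5))

# One key receiving 33 reads/sec: a third of the table's capacity, yet throttled.
print(is_throttled(33, 100, 5))
```

With 100 provisioned reads over 5 nodes, each node gets 20 reads/sec, so a single key taking 33 reads/sec is throttled even though the table as a whole is only at a third of its provisioned capacity.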

I just ping support when I want to know partitions. They also told me a little trick. If you create a kinesis stream for your table, the number of shards in the stream is the number of partitions.

But you're right, part of designing for DDB is picking a proper partition key so you don't end up with hot shards.

Databases in this category are some of the most popular ones in the world with good reason. The only way you can scale is to adopt a query-free architecture.

It feels tedious at first but once you develop some good habits and frameworks around denormalization it becomes easy to do that from day one.
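
A minimal Python sketch of what that kind of denormalization habit can look like: every write goes into one pre-shaped view per access pattern, so every read is a plain key lookup. Plain dicts stand in for key-value tables here, and all names (`orders_by_id`, `orders_by_customer`, `put_order`) are hypothetical.

```python
from collections import defaultdict

# One "table" per access pattern; dicts stand in for a key-value store.
orders_by_id = {}
orders_by_customer = defaultdict(list)


def put_order(order_id, customer_id, total):
    """Write the same order into every view that some read path needs."""
    order = {"order_id": order_id, "customer_id": customer_id, "total": total}
    orders_by_id[order_id] = order
    orders_by_customer[customer_id].append(order)


def get_order(order_id):
    """Single-key lookup, no query involved."""
    return orders_by_id[order_id]


def list_orders_for(customer_id):
    """Single-key lookup on the duplicated, customer-keyed view."""
    return orders_by_customer.get(customer_id, [])


put_order("o1", "c1", 25.0)
put_order("o2", "c1", 10.0)
print(get_order("o2"))
print([o["order_id"] for o in list_orders_for("c1")])
```

The cost is that the write path has to keep every copy in step, which is the discipline being described.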

>> The only way you can scale is to adopt a query-free architecture

This is not really the case. There are database systems that can handle large scale and complex queries, although usually at the price of providing reduced consistency guarantees.

Actually I guess the query language and indexing is pretty limiting too.