Hacker News
by 3pt14159 3991 days ago
You are speaking from ignorance with the voice of authority.

I worked on a rails app that handled a billion requests per day. The problem isn't performance of the web framework, those are easy to load balance and split into C or cache when you need it. The problem is scaling your database, keeping your data secure, and iterating to meet business goals with a growing codebase and infrastructure. A mess of stored procedures would restrain you from doing all three.

And I know, I worked on a codebase in 1999 that did this because of the "performance gains". It ended up bricking the project due to inability to iterate.

9 comments

> The problem is scaling your database, keeping your data secure, and iterating to meet business goals with a growing codebase and infrastructure. A mess of stored procedures would restrain you from doing all three.

Your argument has a non sequitur right here. A mess of [foo] is a mess; the layer it is in does not matter; the language it is in does not matter. A mess of application layer code is equally effective in preventing scale, security and effectiveness.

The original post is right. Web developers treat their databases poorly[1]. A database is an interface to your data that maintains integrity. Maintaining integrity almost always means stored procedures, as some validation is not expressible as relational integrity and basic type validation.
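To make the point above about validation that goes beyond relational integrity concrete, here is a minimal sketch. It assumes SQLite (through Python's standard `sqlite3` module) as a stand-in for PostgreSQL, and the `entries` table, the `no_overdraft` trigger and the `post` helper are all hypothetical names invented for illustration. The rule "an account's entries may never sum to less than zero" depends on other rows, so a column type or a per-row CHECK cannot express it; a trigger can.

```python
import sqlite3


def make_ledger():
    """Create an in-memory ledger whose trigger rejects overdrafts."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE entries (
            account TEXT    NOT NULL,
            amount  INTEGER NOT NULL CHECK (amount <> 0)  -- per-row rule
        );

        -- Cross-row rule: the sum of an account's entries may never go
        -- negative. A per-row CHECK constraint cannot see other rows.
        CREATE TRIGGER no_overdraft
        BEFORE INSERT ON entries
        BEGIN
            SELECT RAISE(ABORT, 'insufficient funds')
            WHERE (SELECT COALESCE(SUM(amount), 0)
                   FROM entries
                   WHERE account = NEW.account) + NEW.amount < 0;
        END;
        """
    )
    return conn


def post(conn, account, amount):
    """Try to record an entry; return False if the database refuses it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO entries (account, amount) VALUES (?, ?)",
                (account, amount),
            )
        return True
    except sqlite3.DatabaseError:
        return False
```

With this in place, depositing 100 and then trying to withdraw 150 fails no matter which client or language issues the INSERT, which is the sense in which the database itself "maintains integrity". In PostgreSQL the same idea would normally be written as a trigger function; that variant is not shown here.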

Now, if you are at the point where your database fully guarantees integrity of data going in and coming out, a REST interface is a small step away. This project is very welcome.

[1] The typical web developer treats a database as a data store. It is also a data store, but a well designed database is much more than that.

A mess is a mess, true, but some are easier to clean up than others.

GP is correct. Methods for scaling/optimizing the application layer are clear and well-known. Scaling the data layer is a huge challenge. This is why the market is filled with snake oil databases promising linear scalability and perfect consistency/reliability, etc.

Scaling the data layer is a huge challenge. No doubt. But calling databases that are designed for solving these problems "snake oil" undermines the huge amount of work that serious engineers have invested in this. No one has ever promised linear scalability and perfect consistency/reliability. No one.

Cassandra, HBase, CouchDB and even MongoDB have built in scalability as a first-order priority from day one and have been largely successful at it, e.g. iCloud, EA Online, PSN. Databases like this are a nightmare to work with for smaller datasets but work incredibly well with larger ones.

It's always a shame to see HN act like you scale vertically and magically every problem is solved.

> It's always a shame to see HN act like you scale vertically and magically every problem is solved.

When this is seen (and IME it's a pretty minority opinion) I think it's there as a reaction to the massive overuse and hype regarding a lot of newer-gen DBs. There's absolutely no doubt that there are good uses for them, but those cases are pretty niche compared to the level of their uptake.

You should read "The Innovator's Dilemma".
Is that comment intended to imply that companies will go under if they fail to deploy new technology that doesn't target their business needs?
MongoDB heavily implied the linear scalability, consistency, reliability bit in its early material [particularly in their marketing].

It's only in the past couple of years that they really started stating openly that it offers "tunable consistency", rather than burying it in a couple of places in the manual.

I purposely chose a non sequitur in the interests of speeding up the prose. It is an acceptable method frequently utilized in language. It roughly translates to:

"Than (what will certainly be given the expressiveness and level of abstraction they provide) a mess of stored procedures."

"The problem is scaling your database, keeping your data secure, and iterating to meet business goals with a growing codebase and infrastructure. A mess of stored procedures would restrain you from doing all three."

Perfectly expressed.

I'm always a little confused that people seem desperate to use the wrong tool, and then blame the tool. If you need to store normalized data and maintain integrity -- you'll end up with the equivalent of an SQL datastore (or, more likely, a system that is faster but subtly broken).

Sure, it's difficult to scale ACID. But if what you need is a way to serialize objects, you'll probably be better off with something like Gemstone/GLASS, a document store or some other kind of object database?

If your problem domain actually fits working with structured data, then using an SQL system makes a lot of sense. The obvious example for "web scale" here is Stack Overflow. Sure, their architecture has grown a little since it was 2xIIS+2xSQL Server -- but they got pretty far on just that.

The bigger issue is this idea that everything needs to live in one place. For the bulk of an application handling a billion requests / day I'd wager that most of that traffic is isolated to certain types of data.

I'd wager that because in almost every case I've ever seen it's true. You just don't tend to see every table in a normalized dataset bearing the traffic load.

If that is the case, rolling that particular piece of data out to a more easily scalable store will largely fix the problem, if caching, async writes and buffered writes didn't already.
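As a rough illustration of the "buffered writes" option mentioned above, here is a minimal sketch. It assumes a hypothetical `hits` table for a write-heavy page-view counter and uses SQLite through Python's standard `sqlite3` module as a stand-in for whatever store actually takes the writes; `BufferedWriter` and `make_hits_db` are invented names, not part of any library.

```python
import sqlite3


def make_hits_db():
    """Create an in-memory database with a hypothetical write-heavy table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE hits (page TEXT NOT NULL, user TEXT NOT NULL)")
    return conn


class BufferedWriter:
    """Collect rows in memory and write them in one transaction per batch."""

    def __init__(self, conn, batch_size=100):
        self.conn = conn
        self.batch_size = batch_size
        self.pending = []

    def write(self, page, user):
        # Cheap in-memory append; the database is touched only when
        # the buffer is full.
        self.pending.append((page, user))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        with self.conn:  # one transaction for the whole batch
            self.conn.executemany(
                "INSERT INTO hits (page, user) VALUES (?, ?)", self.pending
            )
        self.pending.clear()
```

The trade-off is the usual one: rows still sitting in `pending` are lost if the process dies before `flush()` runs, so this only suits data where that loss is tolerable.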

Everything else can very easily sit in PostgreSQL, avoid race conditions, maintain data integrity, have permissions controlled and be accessed from multiple languages directly without requiring an API layer. Then you can use a foreign data wrapper to let PG query that other data source (mongo, couchbase, redis, whatever) and join the results with the other data in the database just like it's all one big happy dataset.

As another poster said, a mess is a mess and honestly I don't know why he takes a shot at Rails since Rails has some of the best first class support for leveraging PostgreSQL features these days.

Wrote an entire post about it: http://www.brightball.com/ruby-postgresql/rails-gems-to-unlo...

We are exactly there. We're having to remove vast swathes of stored procedures and rewrite everything.
> The problem is scaling your database

There is only one database for everything in the business? Of course it doesn't scale. The problem you describe stems from solving every business request by adding yet another table to 'the' database.

It's a monolithic solution. It doesn't matter if you use database features or not. There is no difference between a mess of stored procedures and a mess of business logic classes. It's still a mess.

Web servers usually scale better than (traditional) databases, so it makes sense to not offload computation to the database, even if it means that there's an overhead.
That's very situational. Read scaling a database is easy. Write scaling a database is harder, and doing computational logic while write scaling a database is harder still. "Computation" is still a very broad word, though, and the intensity of those computations is a huge defining factor.
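One way to picture why reads scale more easily than writes is the common primary/replica layout, sketched below under stated assumptions: the node names are plain strings, `ReadWriteRouter` is a hypothetical class, and treating any statement that starts with SELECT as a read is a deliberate simplification.

```python
import itertools


class ReadWriteRouter:
    """Send reads round-robin to replicas and every write to one primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Fall back to the primary if no replicas are configured.
        self._replicas = itertools.cycle(replicas or [primary])

    def route(self, sql):
        # Simplification: anything starting with SELECT counts as a read.
        is_read = sql.lstrip().upper().startswith("SELECT")
        return next(self._replicas) if is_read else self.primary
```

Adding another replica to the list spreads the reads further, but every INSERT or UPDATE still lands on the single primary, which is why write-heavy tables are the ones that end up needing a different strategy.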

The problem boils down to the "the database" idea described earlier. There are very, very few normalized datasets that I've ever seen that have write scaling concerns on more than one or two tables.

Move those to a separate datastore that is built for it and you've largely solved your problem. Postgres can even connect to outside datastores to run queries against them for sake of reporting.

Web server codebases are typically also way easier to modify, unit test, with better tools and languages.
There is even pl/brainfuck, so as far as choice of languages goes, PG has you covered.
Once you get rid of your N+1s, the bottleneck in my experience (working with Rails since v1.2) has always been Rails / Ruby itself. It is so incredibly slow, even using just Metal (even Sinatra for that matter). The slowdown at the view level is significant.
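For readers unfamiliar with the "N+1" problem mentioned above, here is a minimal sketch of the pattern and its usual fix. It uses Python's standard `sqlite3` module rather than Rails, and the `authors` and `posts` tables and all function names are hypothetical.

```python
import sqlite3


def make_blog_db():
    """Create a tiny in-memory dataset: two authors, three posts."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
        INSERT INTO authors VALUES (1, 'ann'), (2, 'bob');
        INSERT INTO posts VALUES (1, 1, 'a1'), (2, 1, 'a2'), (3, 2, 'b1');
        """
    )
    return conn


def titles_n_plus_one(conn):
    """One query for the authors, then one more query per author."""
    authors = conn.execute("SELECT id, name FROM authors ORDER BY id").fetchall()
    queries = 1
    result = {}
    for author_id, name in authors:
        rows = conn.execute(
            "SELECT title FROM posts WHERE author_id = ? ORDER BY id",
            (author_id,),
        ).fetchall()
        queries += 1
        result[name] = [title for (title,) in rows]
    return result, queries


def titles_joined(conn):
    """The same data fetched with a single JOIN."""
    rows = conn.execute(
        """
        SELECT a.name, p.title
        FROM authors a LEFT JOIN posts p ON p.author_id = a.id
        ORDER BY a.id, p.id
        """
    ).fetchall()
    result = {}
    for name, title in rows:
        result.setdefault(name, [])
        if title is not None:  # authors without posts keep an empty list
            result[name].append(title)
    return result, 1
```

Both functions return the same mapping of author to titles; the first issues three queries for two authors and the second issues one. The query count of the first grows with the number of authors, which is the cost an ORM hides when associations are loaded lazily inside a loop.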

I always have a caching strategy (usually varnish in front of nginx) with Rails unless it's literally only supporting a handful of users, and anytime I need to support non-cacheable hits like writes to more than 50 or so concurrents I consider swapping in Node or Go or something reasonably performant to handle just those writes.

Lately I've been looking into Elixir as a Rails alternative for APIs for performance and scalability. I am very intrigued by a PostgreSQL based REST API.

The point is that when Rails gets too slow it is very easy to switch to something like caching or C (or Go, or whatever). Even if you just split it off at nginx or use a worker pool in a faster language. Or if you need lots of concurrency, use Go. Or even replace the Ruby code with one fairly nasty SQL statement or a single stored procedure.

The other 95% of your code can be slow Rails. You know those pages where a user adds another email address, or where they report a comment as being hateful, or where they select what language they want the app to be in, or where you have your teams page, or your business partners page, or your API docs and key registration / invalidation.

The database doesn't scale without pain though. You have joins, you're going to need to get rid of them. You have one table on a machine, you're going to need to split it. You have speedy reliable writes, you are going to have to either make do with inconsistency and possibly have a whole strategy to clean up the data after the fact, or lose the speediness.

I'm intrigued about shuffling the serialization of JSON to Postgres, but that is different than what the OP was talking about.

By the same logic though, at the point that heavy write load becomes a reality it's just as feasible to move the heavy write table to an isolated datastore and leave 95% of your data (structurally) in the PG. Even use a PG foreign data wrapper to connect to that new datastore to allow PG to continue any necessary queries against it.

I'm never going to argue for heavy stored procedure usage, but there are definitely times when it makes sense. More often still, it makes sense to use the features in your database instead of setting up multiple standalone systems for pubsub, search, JSON data, etc., when your database can do it all.

It's very similar to the "you can always switch the slow parts" point with Rails to move a part to Go. You can do it all in PostgreSQL and then when you actually reach a point where you've grown it into a bottleneck, move it out.

Postgres isn't SQL Server and it isn't Oracle and it isn't MySQL. It's Postgres. It's a tool that you choose because it fits your needs, not because somebody told you it was a good database. You choose it as part of your stack. If you are using PostgreSQL because you wanted a dumb datastore then you chose the wrong database and should probably reevaluate your options. That's like getting a Lamborghini to make grocery runs.

http://www.brightball.com/postgresql/why-should-you-learn-po...

I am a postgresql novice, but I've used the JSON serialization and it is indeed fast. But, here's my question:

When you do a 1-to-many join and return the same fields very many times, do the binary drivers optimize that, or is the data returned many times? With JSON serialization (or serializing to arrays), you only get the one row.
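The difference in result shape that the question describes can be sketched as follows. This does not answer what any particular driver does on the wire; it only shows the two row layouts. It assumes SQLite through Python's standard `sqlite3` module, with `group_concat` standing in for PostgreSQL's JSON or array aggregation, and the `users` and `emails` tables are hypothetical.

```python
import sqlite3


def make_users_db():
    """One user with three email addresses."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, bio TEXT);
        CREATE TABLE emails (id INTEGER PRIMARY KEY, user_id INTEGER, email TEXT);
        INSERT INTO users VALUES (1, 'ann', 'a long biography');
        INSERT INTO emails VALUES
            (1, 1, 'ann@example.com'),
            (2, 1, 'ann@example.org'),
            (3, 1, 'ann@example.net');
        """
    )
    return conn


def joined_rows(conn):
    """Plain join: the parent columns repeat once per child row."""
    return conn.execute(
        """
        SELECT u.id, u.name, u.bio, e.email
        FROM users u JOIN emails e ON e.user_id = u.id
        ORDER BY e.id
        """
    ).fetchall()


def aggregated_rows(conn):
    """Aggregated: one row per parent, children folded into one column."""
    return conn.execute(
        """
        SELECT u.id, u.name, u.bio, group_concat(e.email)
        FROM users u JOIN emails e ON e.user_id = u.id
        GROUP BY u.id, u.name, u.bio
        """
    ).fetchall()
```

Here `joined_rows` returns three rows, each carrying its own copy of `name` and `bio`, while `aggregated_rows` returns a single row whose last column holds all three addresses as one comma-separated string.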

If Facebook uses MySQL and PHP there is some truth in the comment.
To say that Facebook uses PHP and MySQL is to leave out the truth, honestly. They are a part of the stack, yes, but they aren't what makes the application scale to billions of requests. It would be like saying the local coffee shop's website using WordPress with a MySQL backend is using the same tech as Facebook. It's laughable.
They chose MySQL over a lot of other alternatives for some reason, and this reasoning can be applied to your use case.

> They are a part of the stack, yes, but they aren't what makes the application scale to billions of requests.

These are not just part of the stack, these are critical components within the stack.

To say that PHP/MySQL is just a "part of Facebook's stack" is laughable.

They are the core components of Facebook. Normal people understand that the characteristics of Facebook's architecture are unique to Facebook. They can get away with sharding/colocating data in ways that nobody else can. The rest of us have a tonne of integrated data that requires complex joins (whether at the application or database layer).

They are edge components of Facebook. Just from a brief interaction with FB recruiters, I learned they use a lot of Vertica in their back-office. Please don't propose that they are using MySQL for their main business when it's only powering app nodes which are just POPs fed by their real (internal) services. Approximately speaking.
I never mentioned performance or scaling as reasons for using a database's features, though they might be worth considering. The fact that I never said those words and it still fired you up says more about your experience than mine, I imagine.