Hacker News
by meanguy 5093 days ago
It seems like AppEngine is saving your ass at the moment but aren't you worried about scale? This is sort of a classic "storage not data" problem where you mapreduce raw data to a structured store for reporting. Are you really still querying everything live? When do you expect this to break down?
2 comments

It broke down for me on AppEngine. I had to move data out of the store to blobs, then use AppEngine queues to reduce the data into the store for reporting access.
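
For anyone who hasn't done this, here is a minimal sketch of the "reduce raw data into a reporting store" step in plain Python. It doesn't use any App Engine API. The record shape `(user_id, day, count)`, the batch layout, and the names `reduce_batch` and `merge_totals` are all made up for illustration, and the batches just stand in for whatever each queued task would read from a blob.

```python
from collections import defaultdict

def reduce_batch(raw_events):
    """Sum one batch of (user_id, day, count) records into per-day totals."""
    totals = defaultdict(int)
    for _user_id, day, count in raw_events:
        totals[day] += count
    return dict(totals)

def merge_totals(summary, batch_totals):
    """Fold one batch's totals into the running summary (the reporting store)."""
    for day, count in batch_totals.items():
        summary[day] = summary.get(day, 0) + count
    return summary

# Hypothetical raw records, split into batches the way queued tasks might see them.
batches = [
    [("u1", "day-1", 3), ("u2", "day-1", 2)],
    [("u1", "day-2", 5), ("u3", "day-1", 1)],
]

summary = {}
for batch in batches:
    merge_totals(summary, reduce_batch(batch))
```

Reports then read the small `summary` rows instead of walking the raw records again.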

Basically they promised me what they promised you and, after I got past a few TB of real data, the whole thing blew up.

Also, what "front end user apps" are you unable to write on AppEngine itself that require something like EC2? Splatting data out the HTTP hole was the least of my worries.

I'm not sure what you mean. Do you mean "why would a team choose EC2 over App Engine?" I never claimed that you are unable to do anything specific on App Engine.
Your post confused me because it said a lot of things about App Engine's datastore that conflicted with my direct experience. Khan Academy is one of the few sites that I'm excited about at the moment, so I'm concerned.

I chose AppEngine because I was very much aware of the issues around big data and I thought I could avoid having to deal with it. I came away from your post with the feeling that you may be underestimating what you're up against. Step one: look at your data size and querying cost every day!

Right now you can access the datastore externally via the remote_api shim or an API you put on your app. Performance isn't great. (An OData-style HTTP interface to the datastore seems like an obvious addition.)

Specific to my query: you say you're excited about Google's EC2 equivalent. I'd be more excited about the managed Hadoop that's likely the next step along your dev path whether you're aware of it yet or not. Custom mapreduce operations against the Google App Engine datastore, ironically, really suck and are really expensive.

So... was this general excitement, or is there something specific you want to do with App Engine that you can't do yet? And have you estimated the transactional costs of walking across your full record set, even if they gave you access to it?

You're likely going to find yourself stuffing at least some things in a SQL store and talking to that.

Ah. Well, first of all, we have already gone through the pain of building a pipeline to export the majority of our heavy data analytics to a Hadoop/Hive setup on EC2. So, yes, we only use App Engine's mapreduce in certain cases where it makes sense.

However, what I'm specifically referring to in this blog post is the ability to keep relying on App Engine's datastore for the everyday work involved in serving our application (forget the mapreduce stuff) while gaining more flexibility to run non-App Engine pieces of software on the virtual servers without suffering the App Engine-to-EC2 latency pain.

A trivial example would be Lucene (right now we have to run it on EC2 and communicate back'n'forth). Another example would be our own memcached servers that we control the size of.

That's exactly my point in the post. This won't break down in App Engine's datastore, which is unique.

We don't have to worry about it, because performance scales with the size of each query's result set, not the size of our data.

On any other system, you'd have to worry about it. That's a (for now) unique opportunity Google has with Compute Engine, IMHO.
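
To illustrate the "scales with the result set" claim, here is a toy model in Python. It is only a sketch: a sorted list of `(key, entity_id)` tuples stands in for an index, and this is not a description of how the datastore is actually implemented. The names `query_by_key` and `build_index` and the filler data are invented for the example.

```python
import bisect

def query_by_key(index, key):
    """Return (entity_ids, rows_touched) for index entries whose key equals `key`.

    `index` is a sorted list of (key, entity_id) tuples.
    """
    # Binary search to the first entry for this key; (key,) sorts before (key, id).
    i = bisect.bisect_left(index, (key,))
    ids = []
    touched = 0
    while i < len(index):
        touched += 1
        if index[i][0] != key:
            break  # past the last match, stop scanning
        ids.append(index[i][1])
        i += 1
    return ids, touched

def build_index(filler_rows):
    """Three matching rows plus `filler_rows` unrelated ones (all hypothetical)."""
    rows = [("math", n) for n in range(3)]
    rows += [("other-%07d" % n, n) for n in range(filler_rows)]
    return sorted(rows)

small = build_index(1_000)
large = build_index(200_000)
small_ids, small_touched = query_by_key(small, "math")
large_ids, large_touched = query_by_key(large, "math")
```

Both queries scan the same number of index rows even though one index is 200 times larger. The binary search to find the starting point does still grow with the logarithm of the index size, so "independent of data size" is an approximation in this model.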