

by meanguy 5142 days ago

Your post confused me because it said a lot of things about App Engine's datastore that conflicted with my direct experience. Khan Academy is one of the few sites that I'm excited about at the moment, so I'm concerned.

I chose App Engine because I was very much aware of the issues around big data and thought I could avoid having to deal with them. I came away from your post with the feeling that you may be underestimating what you're up against. Step one: look at your data size and querying cost every day!

Right now you can access the datastore externally via the remote_api shim or an API you put on your app. Performance isn't great. (An OData-style HTTP interface to the datastore seems like an obvious addition.)

Specific to my query: you say you're excited about Google's EC2 equivalent. I'd be more excited about the managed Hadoop that's likely the next step along your dev path whether you're aware of it yet or not. Custom mapreduce operations against the Google App Engine datastore, ironically, really suck and are really expensive.

So... was this general excitement, or is there something specific you want to do with App Engine that you can't do yet? And have you estimated the transactional costs of walking across your full record set, even if they gave you access to it?
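For a sense of what that estimate looks like, here is a back-of-the-envelope sketch. The function name, the entity count, and the per-read price are all placeholders picked for illustration, not actual billing figures; plug in the numbers from your own dashboard.

```python
def full_scan_cost(entity_count, reads_per_entity=1, price_per_100k_reads=0.07):
    """Rough dollar cost of reading every entity in the datastore once.

    All parameters are assumptions for illustration:
    - entity_count: total number of stored entities
    - reads_per_entity: billed read operations per entity fetched
    - price_per_100k_reads: hypothetical price in dollars, not a quoted rate
    """
    total_reads = entity_count * reads_per_entity
    return total_reads / 100_000 * price_per_100k_reads


if __name__ == "__main__":
    # Hypothetical record set of 500 million entities.
    cost = full_scan_cost(entity_count=500_000_000)
    print(f"${cost:,.2f} per full pass over the record set")
```

Multiply the result by how often you would need a full pass (daily, per report, per experiment) and the figure gets uncomfortable quickly.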

You're likely going to find yourself stuffing at least some things in a SQL store and talking to that.

1 comment

kamens 5142 days ago

Ah. Well, first of all, we have already gone through the pain of building a pipeline to export the majority of our heavy data analytics to a Hadoop/Hive setup on EC2. So, yes, we only use App Engine's mapreduce in certain cases where it makes sense.

However, what I'm specifically referring to in this blog post is the ability to keep relying on App Engine's datastore for the everyday work involved in serving our application (forget the mapreduce stuff) while gaining more flexibility to run non-App Engine pieces of software on the virtual servers without suffering the App Engine-to-EC2 latency pain.

A trivial example would be Lucene (right now we have to run it on EC2 and communicate back'n'forth). Another example would be our own memcached servers that we control the size of.