Ask HN: Help me decide on an architecture for building my startup
6 points by starterkit 4595 days ago
I am planning to start my startup, and it is a kind of social app. I am currently researching approaches to building the app, but there is too much to consider, so I am asking for your help in deciding on an architecture. (What follows comes from my own research on the topic.)

My main concern is storing data.

- Use an RDBMS: PostgreSQL, because I need location-based features, so I could use PostGIS.

- Use Couchbase or Cassandra. Why? Because building an activity stream on an RDBMS may be slow due to joins. In NoSQL I will mostly not be able to use joins, so I will design denormalized tables for the activity streams (maybe I am wrong). Couchbase also seems fast, and it includes memcached, so a caching layer and a scaling layer will be there on install. As far as I understand, Couchbase can return data as JSON, which would reduce the work of converting data to JSON in the backend app, where I am going to use Python/Django.

- I need search, so I am going to use Elasticsearch. In that case Couchbase could be a good fit, since it has a plugin for ES, which may speed up development.

What would you recommend? (I have never built anything like this before, so my questions may seem stupid to you.) If you have any suggestions, please share them. Maybe I am thinking too much about performance and scaling, and I should just stick with PostgreSQL at the beginning.
3 comments

TL;DR: Couchbase is cool but has some considerations, so evaluate it; Cassandra is better spoken to by someone with experience there; and PostgreSQL is still great if you have a relational dataset. Start with what you know if it works and go from there. ---

There is a huge number of factors that go into it, but I'll give you some opinion. :)

If you know PostgreSQL and how to work with it today (and the others are new to you), stay with it for now until you know it isn't the right tool. Trying to learn a new methodology and tool while also starting a company and trying to gain traction, etc., isn't always the best way to go. Also think about when/if you need to hire: how long will it take to find someone with experience in tool X, or to train someone on it?

As for Couchbase vs Cassandra vs PostgreSQL: all have their pros and cons, and it will boil down to your use cases, dataset and complete tech stack (e.g. some SDKs are less mature than others).

I have been a huge Couchbase fan and user for a few years now, going back to Membase. However, I'll be honest: while our current primary datastore is Couchbase, we are moving away from it because of the amount of time we spend solving issues that just shouldn't exist. To get this out of the way, I love CB's scale-out ability and performance; it is stupid simple overall and works very well. Mongo could learn a few things from Couchbase about making the scale-out process easier (and I think they are). We also use the Couchbase-to-Elasticsearch integration, and it works pretty damn well, but again it is still maturing. In our recent evaluations we found that simply by moving off Couchbase we can replace ES for 60-70% of what we use it for. That means I can cut my ES resources down to the 30-40% of use cases where it is really needed and save some cash, while still getting the same results and performance.

There are a number of things to consider when using CB as your datastore, and while we are moving away from it, I think it is worth a solid look.

First, if you store a lot of documents that are small in size but that you want keyed for near-instant access, Couchbase can require far more machine resources than it really should (i.e. it gets expensive fast). This is because every key plus its metadata (56 bytes for 2.2, I believe) must be stored in your bucket RAM, and once the keys and metadata exceed 50-60% of the available RAM, you're in trouble in a few ways. So if you define the bucket to be 2 GB, all keys and metadata must fit within roughly 50% of that (1 GB). Of course, you can keep scaling up/out to increase that size, but as I said, costs start to become a factor.

A fair rebuttal is to restructure the data into larger values and a smaller number of keys. However, now you run into a second issue: while views are awesome, we have seen that they have quite a way to go before they are truly a final solution, and they have diminishing returns if you have too many of them. The typical answer is then to start merging views, returning larger data sets, and doing more and more work on the Couchbase client side (API etc.) to filter results. I'm not saying that is always bad, just something to consider.

Couchbase also limits you to no more than 10 buckets per cluster (and in my experience, with more than 5 your CPU utilization goes up quite a bit, so you generally need more CPU). That means if you need document segmentation that is more than just a "type" field on a document, this can quickly become an issue.

Lastly, all of our APIs are in node.js, and frankly CB's node library has a way to go before it is really ready for high transaction volumes. We have found that it leaks memory when you have sustained high transaction volumes (this is with node 0.10.22), so we have resorted to writing a lot of larger tasks directly in C to get around it; while I actually enjoy doing that, it is time-consuming and not an efficient use of our bootstrapped resources.

I read a lot of what the CB team is doing and I think they are working hard to fix almost every one of my points, so just weigh your entire stack first. And please don't consider this a bash against CB; it is anything but. I think their technology is pretty damn cool, it just has to fit your use case properly, like any technology.
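
To make the key + metadata point concrete, here is a rough back-of-the-envelope sketch. The 56 bytes of metadata and the 50% threshold are the figures quoted above (not official sizing numbers), and the 40-byte average key length is purely a hypothetical value for illustration.

```python
# Rough estimate of how many keys fit in a bucket before key + metadata
# crosses the threshold described above. All constants are assumptions
# taken from the comment, not official Couchbase sizing guidance.

METADATA_BYTES = 56   # per-key metadata ("56 bytes for 2.2 I believe")
THRESHOLD = 0.5       # keys + metadata should stay under ~50% of bucket RAM


def max_keys(bucket_ram_bytes: int, avg_key_bytes: int) -> int:
    """Return the approximate number of keys that fit under the threshold."""
    budget = int(bucket_ram_bytes * THRESHOLD)
    return budget // (avg_key_bytes + METADATA_BYTES)


if __name__ == "__main__":
    two_gb = 2 * 1024 ** 3
    # Hypothetical average key length of 40 bytes.
    print(max_keys(two_gb, 40))
```

Under those assumptions a 2 GB bucket tops out at roughly 11 million keys, no matter how small the documents themselves are, which is why lots of tiny documents get expensive.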

As for Cassandra, I am nowhere near an expert or even a good novice here, so someone else can give you the good and the bad there. I do know from reading that it has grown in favor quite a bit and that its redundancy and reliability are quite good. We evaluated it recently and felt it would be a good solution; however, we had a hard time fitting our use case into it. I fully admit that may be our own limitations more than Cassandra's.

PostgreSQL is great, especially if you need highly relational data. In general, I would still favor an RDBMS if your dataset is highly relational, so this depends more on what your data looks like and how it gets used. Performance is good when the schema is designed right, but it is hard to match the performance of Couchbase; everything has a trade-off. If I needed that performance in places but my data was highly relational, I might look at putting Couchbase in front of the RDBMS as a persistent cache, which makes recovery easier on the DB when there is a fault.
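
A minimal sketch of that "cache in front of the RDBMS" idea (often called cache-aside), assuming stand-ins so it runs anywhere: sqlite3 plays the RDBMS and a plain dict plays the key-value store. The table, the `get_user` and `update_city` helpers, and the key format are all hypothetical; a real setup would talk to PostgreSQL and a Couchbase bucket instead.

```python
import json
import sqlite3

# Stand-ins: sqlite3 plays the RDBMS, a dict plays the key-value cache.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
db.execute("INSERT INTO users VALUES (1, 'alice', 'Berlin')")
db.commit()

cache = {}  # hypothetical cache; in production this would be a remote store


def get_user(user_id):
    """Read through the cache, falling back to the database on a miss."""
    key = f"user:{user_id}"
    if key in cache:
        return json.loads(cache[key])
    row = db.execute(
        "SELECT id, name, city FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    if row is None:
        return None
    user = {"id": row[0], "name": row[1], "city": row[2]}
    cache[key] = json.dumps(user)  # populate the cache for the next reader
    return user


def update_city(user_id, city):
    """Write to the database first, then invalidate the cached copy."""
    db.execute("UPDATE users SET city = ? WHERE id = ?", (city, user_id))
    db.commit()
    cache.pop(f"user:{user_id}", None)
```

The database stays the source of truth, so losing the cache only costs you some extra reads while it warms back up.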

In the end it's still all about your use case, dataset, tech stack and what you need it to do.

Thanks for the great explanation of Couchbase's disadvantages :). I have read a lot about Couchbase and how good it is, but real production is never ideal, so thanks.
Anytime. Good luck with everything.
Design your stack around where the organization is growing from a business perspective: early-stage startups are very fast-moving organizations. Early on, you have to design your stack in a way that tolerates failed assumptions, which means supporting frequently changing requirements while at the same time leaving enough room to breathe and grow.

Scaling equates to specialization: you specialize in a specific area of your stack because your startup has grown in that direction. At a very early stage there's not much growth, only a period of intense validation, so don't overthink scale right now.

If you are using PostgreSQL, just stick with it. You can ship things faster and troubleshoot better with stuff you know. In terms of fetching data, you can design your schema so that you don't have to do any joins at all. Twitter still uses MySQL to this very day; they've customized the core engine for their purposes. The point is: don't overthink storage for now. No one knows yet where your startup will grow :)
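
As a rough illustration of a join-free read path for an activity stream in a plain relational database, here is a sketch that uses sqlite3 as a stand-in for PostgreSQL. The `feed` table, its columns, and the `publish`/`read_feed` helpers are hypothetical, and this is just one common denormalization approach (fan-out on write), not a description of how Twitter or anyone else actually does it.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical denormalized feed table: one row per (recipient, event),
# with the actor's display name copied in at write time.
db.execute(
    """CREATE TABLE feed (
           recipient_id INTEGER,
           created_at   INTEGER,
           actor_name   TEXT,
           verb         TEXT,
           object_title TEXT
       )"""
)
db.execute("CREATE INDEX feed_by_recipient ON feed (recipient_id, created_at)")


def publish(follower_ids, created_at, actor_name, verb, object_title):
    """Fan out on write: copy the event into each follower's feed."""
    db.executemany(
        "INSERT INTO feed VALUES (?, ?, ?, ?, ?)",
        [(f, created_at, actor_name, verb, object_title) for f in follower_ids],
    )
    db.commit()


def read_feed(recipient_id, limit=20):
    """Single-table, index-backed read; no joins needed."""
    return db.execute(
        "SELECT actor_name, verb, object_title FROM feed "
        "WHERE recipient_id = ? ORDER BY created_at DESC LIMIT ?",
        (recipient_id, limit),
    ).fetchall()
```

The trade-off is more writes and some duplicated data (a renamed user means stale `actor_name` values unless you rewrite them), in exchange for cheap reads.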

Build and design a pleasant-to-use RESTful HTTP API server with a matching client: in my experience this is very, very helpful. At a very early stage, building an API server allows you to pivot relatively quickly. Having a "proxy" in front of your database is practically developer UX for fast-changing business requirements, and it avoids "database code hell", i.e. random projects doing random things to your database.

Imagine you're building this huge web app, but the users clearly want and need a mobile equivalent. Or what if the users want some sort of on-site installation for an enterprise version? Suddenly you're not a B2C startup anymore and you're moving into B2B.

An API server helps you do tons of things that enable you to ship applications faster, and it makes your startup very flexible, since you can isolate and maintain this very large part of your product.
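
A bare-bones sketch of the "API server as a proxy for your database" idea, using only the Python standard library (a WSGI callable plus sqlite3 as a stand-in database). The `/places` route, the table, and the `call` helper are hypothetical names for illustration; in practice you would likely reach for a framework rather than raw WSGI.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE places (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO places (name) VALUES ('cafe'), ('park')")
db.commit()


def app(environ, start_response):
    """Tiny WSGI app: the only code path that touches the database."""
    path = environ.get("PATH_INFO", "")
    method = environ.get("REQUEST_METHOD", "GET")
    if method == "GET" and path == "/places":
        rows = db.execute("SELECT id, name FROM places ORDER BY id").fetchall()
        body = json.dumps([{"id": r[0], "name": r[1]} for r in rows])
        status = "200 OK"
    else:
        body = json.dumps({"error": "not found"})
        status = "404 Not Found"
    payload = body.encode("utf-8")
    start_response(status, [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(payload))),
    ])
    return [payload]


def call(path, method="GET"):
    """Invoke the app in-process, the way a test client would."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    chunks = app({"PATH_INFO": path, "REQUEST_METHOD": method}, start_response)
    return captured["status"], json.loads(b"".join(chunks))
```

To serve it over HTTP you could pass `app` to `wsgiref.simple_server.make_server("", 8000, app)` and call `serve_forever()`. Web, mobile and any future clients then all go through the same endpoints, so the storage behind them can change without touching the clients.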

Thanks. By the way, as an experienced developer, what would your stack be if you were starting your own? After reading the comments here and posts in other places, I am going to stick with Python/Django/PostgreSQL+PostGIS/Elasticsearch, with django-tastypie for the REST backend and maybe django-activity-stream for the activity stream to begin with. Since I am a team of one for now :), I do not want to reinvent the wheel.
I can't comment on your storage stuff (we've had to develop our own to fit our specific needs), but Elasticsearch is a good option to start out with for search. You can check out the Elasticsearch case studies here: http://www.elasticsearch.org/case-study/
Can you tell me about yours, if possible, and what kind of product you are building?