Hacker News new | ask | show | jobs
What we learned from building custom analytics over Amazon Redshift (alooma.com)
53 points by itamarwe 3851 days ago
6 comments

Probably the biggest challenge I had in my last startup was setting up the analytics platform. Obviously building an analytics platform wasn't the top priority (probably the last).

So you put Google Analytics on your website, then you need the tracking code from Facebook for the ads. And you add Flurry because it provides different analytics (e.g: Funnels and Cohort Analysis). We also logged on the DB some actions difficult to capture on analytics. Then you need mobile ads, so you add in your app the framework to track the install. And, I forgot, you have another platform to track your mailing lists.

The final result: for every business question you have data in at least 3 different platform that gives you three different answers. Even worst, when you are still struggling with the product your "hard tech" co-founder could not understand why you push to re-implement the analytics framework (not necessarily building it in house) while your business co-founder decide to don't ask questions and doesn't understand why you want better data. If it looks like a recipe for disaster is because it is.

We really need better analytics platform, easy to use like Google Analytics and MixPanel for the beginning but with the ability to build custom analysis for the growing biz.

I'm a big fan of [Segment](https://segment.com) who do a great job of putting an abstraction layer on top of the massive selection of analytics platforms you're likely to end up using. A particularly nice touch is some of their plans having a feature where you can replay your entire event history into new analytics tools you add, allowing you to almost immediately start analysing all your data with them.
That's not quite right what I have in mind. It solve part of the issue, the maintainability of 10+ integrations, but it is not really the biggest pain point.

The biggest pinpoint is having 10 different service that gives you quite the same answer but:

1) they cover different needs and you need a patchwork to customize to what you need (and when you are small it is not even sure you know what you need so the patchwork is biggest than needed).

2) they measure things in slightly different way making comparison difficult.

3) they have vastly different interfaces, thus requiring a lot of time to learn them (I like to learn new things, but my time is valuable and better spent on product features)

4) they create a mess in your code. This is the problem segment try to solve. It is relevant but not top priority. Actually, I'm not totally sure of the effect of inserting a level of abstraction doesn't actually make more complex to solve point 1 and 2 (it is not a critic, just a real problem I faced when I tried to abstract the analytics call in the app with few functions).

Segment actually provides you access to the raw data so if you can layer a BI tool on top of it, then you have solved your main pain point.
We'll soon have an integration with Segment too. In fact, you can already use webhooks to send your Segment data to Alooma.
Dude, I'm experiencing exactly this right now. This is the most obnoxious tech thing ever. All I can find out there are stories of homegrown metrics systems made out of duct tape and chewing gum, or SaaS solutions that inevitably have shortcomings.
This is not, as billed, about building a custom analytics solution.

This is SQL as BI, which is great for small technical teams, but hardly a good or scalable solution.

If anything, this is a vague argument to have a data warehouse. Data warehousing is only an intermediate step in any analytics or BI solution and the hard problems are not in moving data or writing one-off queries.

What kind of solution would you suggest? What is missing?
BI or an analytics solution would be more all-encompassing than an ETL tool (Alooma) and a database platform (Redshift).

ETL is much more than just moving data around. I haven't dug too deeply into Alooma's offerings, but it does appear to have workflows for more than just data from system A to destination B. Some of the important things to look at in ETL are change-tracking for slowly changing dimensions, business logic transformations, and mashing up data from disparate sources against conformed dimensions. Master Data Management is its own practice, but is intimately involved in ETL. Some ETL tools include Alooma, SSIS, Kettle.

Beyond ETL there is the data warehouse platform which usually consists of a RDBMS, sometimes with an OLAP layer as well. OLAP is distinguished from something like Redshift, which is a columnstore RDBMS, by including a metadata layer with the data. MDX is probably the granddaddy of OLAP languages. Think something that a pivot table can speak to directly. The data warehouse can really be any RDBMS product.

At the data warehouse layer, there's much more than just flattened source tables. A good deal of effort needs to go into the dimensional modelling for end-user consumption and reporting. We could take an aside for Inmon vs Kimball here, but it's spurious, because Inmon recommends dimensional data marts for end user consumption, just like Kimball - the data warehouse is never exposed in the Inmon methodology and dimensional data modelling is required for either.

If we are utilizing an OLAP engine, then there is a lot of measure definition to be done, encapsulating a lot of the logic that is displayed in the single-purpose queries of the article's examples. The data model will be based on the dimensional model from the data warehouse, but there's a lot more metadata in an OLAP database, which helps make self-service reporting much easier. Many more people are comfortable building a pivot table than even a simple SELECT statement in SQL. Some OLAP engines are SSAS and Mondrian.

Regardless of whether we have an OLAP solution, we need a presentation layer (typically several, as there are different types of reporting needs, and different strata of sophistication among end users). This is something like Tableau, Qlik, Quicksight, Cognos, Power BI (all prior products purport to cover varying degrees of the data ETL and modelling process in addition to providing a presentation layer), Crystal Reports, and SSRS are just some samples of many in the commercial space (I'm not as familiar with open source presentation layers).

This is just a high level overview of the major components of a BI or analytics solution, and fairly barebones. The article spoke to ETL a little, and at best hit tangentially on data warehousing and presentation.

Was going to suggest Google BigQuery (faster, cheaper, more scalable) but then looked at the rest of your workflow, and it looks like ETL would get more complicated, and the visualization options wouldn't be as extensive (although BQ does seem to support Tableau). On balance, not worth the hassle.

Looks pretty neat, job well done!

Hi melted! I have to agree with you that BigQuery is probably faster, cheaper and more scalable than Redshift for this and other use cases. Specially given the "join multiple datasets" angle.

But why would you say that ETL would be more complicated? Specially since BigQuery is the best way to get raw Google Analytics data (for premium customers).

Re: Visualization options - All of the options mentioned by Itamar work well with BigQuery, except one (Quicksight) - and I know for sure how much re:dash, Mode, Looker and Tableau love Bigquery.

Anyways, great article - and I love the real use case queries.

Because the actual analytics (the part that BigQuery provides) is maybe 20% of this solution, and judging by the slides their ETL is very easy to use. What would I even use on Google Cloud to do ETL? Dataflow? Javascript UDFs? Something else? All of that seems clunky compared to what these guys are offering. And they have a bunch of data sources available "out of the box" that would be a hassle to deal with manually.

Another issue with BigQuery seems to be unpredictability of cost. One typo somewhere and you can easily run up a bill in tens of thousands of dollars because your dashboard isn't caching something. In a similar situation Redshift will merely get slow.

Clunky ETL: Please correct me if I'm wrong, but what I saw about ETLs in the article "One example we encounter quite often is that Mixpanel stores timestamps in seconds, while Redshift expects timestamps in milliseconds." - that kind of transformations I would much rather run inside BigQuery in a couple seconds than going through a whole pipeline. Other things I could outside, just as what they are doing now - but I didn't see any RS specific advantages for transformations?

Cost: Cost should be way below other solutions - and to prevent problems BigQuery now has cost controls at a user and project levels: https://cloud.google.com/bigquery/cost-controls

Thanks for your comments!

Thanks!
Nothing here about "what you learned". This is a straight up tutorial on how to use your product!
A lot of this does showcase how to use Alooma's product, true.

However, I hope that the SQL queries provided would be helpful to anyone who would like to build it on their own.

FD: I am Alooma's CTO

So, what did you learn? That bit is completely missing from the article.. Surely not how to write a (very simple) SQL query?
What data warehouse to choose, what are the different visualization tools available, what events to send, how to setup Redshift, how to choose the keys for the tables, and yes, how to recreate popular dashboards with simple SQL queries.

Of course, we also have more complex queries for analysis, that we will share at a future post.

It's actually hilarious that your SDK's are just modified versions of Mixpanels'...

https://www.alooma.com/docs/integration/ios-sdk

Makes integration a breeze :)
so if I use Alooma android SDK, I can push event data straight into RedShift ?

I'm not very clear about how to use Alooma.

Yes, exactly. You'll push events from Alooma's Android SDK and they'll land in your Redshift in seconds. You can have a look at the documentation here: https://www.alooma.com/docs/integration/android-sdk
I couldn't really understand how you replaced Mixpanel and other analytics services with Redshift. AFAIK Redshift is optimized for bulk inserts so the micro-batch approach (a few seconds) is not is not optimal and in fact quite expensive. Also the data in Mixpanel is not structured and if you're using JSON in Redshift, you can't take advantage of the columnar storage format which is actually the main point of using Redshift. Do you care to explain how you solve these issues?
Hi, Alooma uses batch size appropriate to the data volume, the bigger the data volume is, the bigger the batch size will be.

Alooma not loading into a JSON column, but provides mapping UI and transformation layer to in order structure and normalize the data into different appropriate columns, thus leverages the columnar storage properties.

FD: I am Alooma's CTO

*mAPI = AloomaAPI.getInstance(context, “<Company_Name>.alooma.io”)

no api key ?

this looks interesting - but it would be nice to be able to get started immediately without "request a demo".