by ransom1538 3483 days ago
For what it is worth, every company I have worked for - and almost every company I know - builds their own bizarre stats system. With each presentation I attend (the last one being Uber), the ideas for storing columnar data get even nuttier. Frankly, I gave up. Now I just install New Relic Insights and I can run queries, have dashboards, and infinite scale. I understand that Slack has scale - but why on earth hook together 30 random technologies and become an analytics company too?

Warning: I build bizarre stats systems for a living :)

I totally get where you are coming from. Right now I'm thinking about a web API that feeds data into Kafka, to be processed (in Python, maybe Go?), stored in Cassandra and later on be the target of large Spark jobs. Oh, and by the way, I need to present this info through pretty graphs and tables - Pandas will come in handy!

Sometimes it's better to just use what someone else has built, let them think about the implementation, the storage, the traffic and the maths... Here is where a third-party solution falls apart: a) Costs. Data analysis is stupid expensive. b) ... and this is the important one: your sales/consumer-facing teams want some extra numbers, literally the sort of thing that only fits your business. The solution you decided on doesn't support that use case, and you are now stuck with an inflexible solution.

New Relic Insights is OK for some use cases, but completely useless for the majority of analytics I need to serve. If it fits your bill, great! Save yourself A LOT of time and life span... Just keep everyone else in the business away from it, or they will start asking for things you can't give :)

I am super curious. Most analytics questions I run into are: give me a month over month, which test won, why is x happening, etc. These could be solved with just some SQL queries. What questions do you run into where you need Kafka + Pig + Fig + Hive + all messaged with Scribe + Redshift? Doesn't that make it even more difficult to answer questions?
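
To make the "just some SQL queries" point concrete, here is a minimal sketch of a month-over-month report against a single database. The `daily_stats` table, its `day` and `signups` columns, and the sample rows are all made up for illustration; it uses SQLite only because it ships with Python.

```python
import sqlite3


def monthly_totals(conn):
    """Sum the hypothetical `signups` column per calendar month."""
    return conn.execute(
        """
        SELECT strftime('%Y-%m', day) AS month, SUM(signups) AS total
        FROM daily_stats
        GROUP BY strftime('%Y-%m', day)
        ORDER BY month
        """
    ).fetchall()


def month_over_month(rows):
    """Attach the change versus the previous month (None for the first month)."""
    report = []
    previous = None
    for month, total in rows:
        change = None if previous is None else total - previous
        report.append((month, total, change))
        previous = total
    return report


def demo():
    # In-memory database with invented sample data.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE daily_stats (day TEXT, signups INTEGER)")
    conn.executemany(
        "INSERT INTO daily_stats VALUES (?, ?)",
        [("2016-01-10", 100), ("2016-01-20", 50),
         ("2016-02-05", 180), ("2016-03-01", 120)],
    )
    return month_over_month(monthly_totals(conn))
```

Calling `demo()` gives one row per month with its total and the difference from the month before. Whether this holds up obviously depends on the data fitting in one database, which is the whole debate here.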

It's not so much the implementation details that worry me. Well, I do get worried if we end up building a vastly complex beast, but what I REALLY worry about is having data available for whatever eventual scenario might pop up. It's true that tools (like New Relic) answer a lot of questions, but data in these systems isn't usually available for you to play with; you're constricted to their sandbox. Even if it is available, a lot of the time the data is built and stored in a way that only makes sense when used through their system (with good reasons, performance being the best one).

A lot of the time these systems are built not only to serve business insight and stats. One of our main systems needed to answer two requirements: a) better/faster analysis for us; b) serve as a machine learning platform to serve better content to our users.

a) complements b) perfectly, as we collect data for analysis, that same data feeds into other areas of the business that help our users, on the fly.

You could argue there are solutions out there that satisfy a) perfectly, but the learnings from doing a) are what made b) possible.

Even if you're happy with a solution like New Relic (and by all means, I'm sure it's a good product, we use New Relic a lot!), what happens when someone has an idea like... oh I don't know... "can we build something that looks at the past 7 days' worth of data and flags up any metric that moves away from the standard deviation line? Also, can you then match that against historic data and identify patterns/catch false positives?"... Just an actual, factual example that I'm working on as well.
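
For a rough idea of the first half of that request, here is a minimal sketch, not the actual system. It assumes one value per day for a single metric and flags any day that sits more than a chosen number of standard deviations away from the mean of the previous 7 days. The function name, the `threshold` parameter and the sample numbers are all hypothetical; the matching against historic data and false-positive handling is not covered.

```python
from statistics import mean, stdev


def flag_outliers(values, window=7, threshold=2.0):
    """Return indexes of values that deviate from the trailing window.

    A value is flagged when it is more than `threshold` sample standard
    deviations away from the mean of the `window` values before it.
    The first `window` values are never flagged (not enough history).
    """
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if abs(values[i] - mu) > threshold * sigma:
            flagged.append(i)
    return flagged


# Invented daily values: a quiet week, one spike, then back to normal.
daily_metric = [100, 102, 98, 101, 99, 100, 103, 150, 101]
spikes = flag_outliers(daily_metric)
```

Note that the day after the spike is not flagged here, because the spike itself inflates the standard deviation of the trailing window. That kind of side effect is part of why the false-positive question comes up at all.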

This is a valid question. In some cases it has to do with the amount of data you're working with. Most database management systems have made progress in aggregating large amounts of data, but in many cases it is still necessary to distribute the workload, which in turn creates the need to build out the rest of the distributed system.

With that being said, and to your point, I would not be surprised if these systems were often over-engineered when a SQL query could get the job done.

> With that being said, and to your point, I would not be surprised if these systems were often over-engineered when a SQL query could get the job done.

Redshift takes SQL queries.

I've actually been running an analytics company for a few years (http://parse.ly), and as a result of seeing what you're describing and thinking it was pretty strange -- that many companies have "not invented here" syndrome about analytics -- we actually turned our data collection and event enrichment infrastructure into a fully-managed cloud service. It's called Parse.ly Data Pipeline, and is described here: http://parse.ly/data-pipeline.

Together with cloud SQL tools like BigQuery or Redshift, it gets rid of the need to build a "full analytics stack" on your own. You can license the data collection/enrichment from us (we've already scaled it to billions of monthly events), you can use our clean starting schema (over 100 enriched fields per event), and then you can pipe the data into a fully-managed analytics warehouse, or just analyze it in raw form. Then you can actually spend all your time focusing on insights, rather than fussing about data collection clusters, pipelines, ETLs, etc.

I would love to hear what you think of the idea; it was launched just a few months ago.

I'm surprised too. I work at companies that have their own data center, so I can't use New Relic, Datadog, etc. I'm really surprised there aren't more free open source analytics platforms for small projects. I'm going to start one when I "get some spare time". lol.

Anyone know of anything out there?

Using paid tools has nothing to do with whether or not you have your own datacenter.

If you have a small project (what is "small"?) you just deal with Google Analytics or direct SQL queries against the single database you have. You don't need fancy tools.

The two free options I can think of are Piwik and Snowplow Analytics. They clearly suffer from being "free open source" when compared to the paid tools out there.

Actually I forgot that I played with Nagios vs Graphite - if anyone knows of other backends similar to those, that would be appreciated.

The kind of "stats" collected by New Relic are only one of many inputs to a data warehouse like the one Slack is describing. You can't import your MySQL databases into New Relic, for example.