by joaodlf 3487 days ago
Warning: I build bizarre stats systems for a living :)

I totally get where you are coming from. Right now I'm thinking about a web API that feeds data into Kafka, to be processed (in Python, maybe Go?), stored in Cassandra, and later targeted by large Spark jobs. On top of that, I need to present this info through pretty graphs and tables. Pandas will come in handy!

Sometimes it's better to just use what someone else has built and let them think about the implementation, the storage, the traffic and the maths... But here is where a third-party solution falls apart: a) Costs. Data analysis is stupid expensive. b) ... and this is the important one: your sales/consumer-facing teams want some extra numbers, literally the sort of thing that only fits your business. The solution you decided on doesn't support that use case, and you are now stuck with an inflexible solution.

New Relic Insights is OK for some use cases, but completely useless for the majority of analytics I need to serve. If it fits your bill, great! Save yourself A LOT of time and life span... Just keep everyone else in the business away from it, or they will start asking for things you can't give :)


I am super curious. Most analytics questions I run into are things like: give me a month over month, which test won, why is x happening, etc. These could be solved with just some SQL queries. What questions do you run into where you need Kafka + Pig + fig + Hive, all messaged with Scribe + Redshift? Doesn't that make it even more difficult to answer questions?

It's not so much the implementation details that worry me. Well, I do get worried if we end up building a vastly complex beast, but what I REALLY worry about is having data available for whatever eventual scenario might pop up. It's true that tools (like New Relic) answer a lot of questions, but data in these systems isn't usually available for you to play with; you're confined to their sandbox. Even if it is available, a lot of the time the data is built and stored in a way that only makes sense when used through their system (with good reasons, performance being the best one).

A lot of the time these systems are built not only to serve business insight and stats. One of our main systems needed to meet two requirements: a) better/faster analysis for us; b) serve as a machine learning platform to serve better content to our users.

a) complements b) perfectly, as we collect data for analysis, that same data feeds into other areas of the business that help our users, on the fly.

You could argue there are solutions out there that satisfy a) perfectly, but what we learned from doing a) is what made b) possible.

Even if you're happy with a solution like New Relic (and by all means, I'm sure it's a good product, we use New Relic a lot!), what happens when someone has an idea like... oh I don't know... "can we build something that looks at the past 7 days' worth of data and flags up any metric that moves away from the standard deviation line? Also, can you then match that against historic data and identify patterns/catch false positives?"... Just an actual, factual example that I'm working on as well.
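To make that request a bit more concrete, here is a rough sketch of just the first half (flagging a metric that strays from its trailing 7-day window). The function name `flag_outliers`, the threshold of 2 standard deviations, and the sample `signups` numbers are all made up for illustration; this is not the actual system described above, and it does not attempt the historic pattern matching or false-positive part.

```python
from statistics import mean, stdev


def flag_outliers(daily_values, window=7, threshold=2.0):
    """Return indices of days whose value lies more than `threshold`
    sample standard deviations from the mean of the preceding `window` days.

    Hypothetical helper: the window and threshold are illustrative defaults.
    """
    flagged = []
    for i in range(window, len(daily_values)):
        history = daily_values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: treat any change at all as a deviation.
            if daily_values[i] != mu:
                flagged.append(i)
            continue
        if abs(daily_values[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Made-up daily metric with one obvious spike at index 8.
signups = [100, 102, 98, 101, 99, 103, 100, 101, 160, 99]
print(flag_outliers(signups))
```

Note that the spike itself inflates the window for the following days, which is one reason a naive version like this tends to need the extra false-positive handling mentioned in the request.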

This is a valid question. In some cases it has to do with the amount of data you're working with. Most database management systems have made progress in aggregating large amounts of data, but in many cases it is still necessary to distribute the workload, which in turn creates the need to build out the rest of the distributed system.

With that being said, and to your point, I would not be surprised if these systems were often over-engineered when a SQL query could get the job done.

> With that being said, and to your point, I would not be surprised if these systems were often over-engineered when a SQL query could get the job done.

Redshift takes SQL queries.
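For the "month over month" kind of question mentioned upthread, a plain SQL query can look roughly like this. This is only a sketch run against an in-memory SQLite database so it is self-contained; the `events` table, its columns, and the numbers are invented for the example, and the date functions would need adjusting for Redshift or any other SQL dialect.

```python
import sqlite3

# Hypothetical table and data, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (day TEXT, signups INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        ("2016-01-10", 40),
        ("2016-01-25", 60),
        ("2016-02-03", 70),
        ("2016-02-20", 80),
        ("2016-03-05", 120),
    ],
)

# Aggregate per month, then self-join each month to the previous one.
MONTH_OVER_MONTH = """
WITH monthly AS (
    SELECT strftime('%Y-%m', day) AS month, SUM(signups) AS total
    FROM events
    GROUP BY month
)
SELECT cur.month, cur.total, prev.total AS previous_total
FROM monthly AS cur
LEFT JOIN monthly AS prev
  ON prev.month = strftime('%Y-%m', date(cur.month || '-01', '-1 month'))
ORDER BY cur.month
"""

rows = conn.execute(MONTH_OVER_MONTH).fetchall()
for month, total, previous_total in rows:
    print(month, total, previous_total)
```

The first month has no predecessor, so its `previous_total` comes back as `None`.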