by ransom1538 3483 days ago
I am super curious. Most analytics questions I run into (give me a month-over-month, which test won, why is x happening, etc.) could be solved with just some SQL queries. What questions do you run into where you need Kafka + Pig + Fig + Hive + all messaged with Scribe + Redshift? Doesn't that stack make it even more difficult to answer questions?
2 comments

It's not so much the implementation details that worry me. Well, I do get worried if we end up building a vastly complex beast, but what I REALLY worry about is having data available for whatever scenario might eventually pop up. It's true that tools (like New Relic) answer a lot of questions, but the data in these systems usually isn't available for you to play with; you're confined to their sandbox. Even if it is available, a lot of the time the data is built and stored in a way that only makes sense when used through their system (for good reasons, performance being the best one).

A lot of the time these systems are built to serve more than business insight and stats. One of our main systems needed to meet two requirements: a) better/faster analysis for us; b) serve as a machine learning platform to serve better content to our users.

a) complements b) perfectly: as we collect data for analysis, that same data feeds into other areas of the business that help our users, on the fly.

You could argue there are solutions out there that satisfy a) perfectly, but what we learned from doing a) is what made b) possible.

Even if you're happy with a solution like New Relic (and by all means, I'm sure it's a good product, we use New Relic a lot!), what happens when someone has an idea like... oh I don't know... "can we build something that looks at the past 7 days' worth of data and flags up any metric that moves away from the standard deviation line? Also, can you then match that against historic data and identify patterns/catch false positives?"... Just an actual, factual example that I'm working on as well.
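To make that example concrete, here is a minimal sketch of the first half of the request, assuming "moves away from the standard deviation line" means "sits more than some number of standard deviations from the mean of the previous 7 days". The function name `flag_outliers`, the `threshold` of 2.0, and the sample numbers are all made up for illustration; matching against historic data to catch false positives is not covered.

```python
from statistics import mean, stdev

def flag_outliers(history, window=7, threshold=2.0):
    """Return the indices of values that sit more than `threshold`
    standard deviations from the mean of the preceding `window` values.

    `history` is a list of daily metric values, oldest first.
    All names and defaults here are illustrative assumptions.
    """
    flagged = []
    for i in range(window, len(history)):
        past = history[i - window:i]   # the trailing 7-day window
        mu = mean(past)
        sigma = stdev(past)            # sample standard deviation
        if sigma == 0:
            # A perfectly flat window gives no spread to compare against;
            # this sketch simply skips it.
            continue
        z = (history[i] - mu) / sigma
        if abs(z) > threshold:
            flagged.append(i)
    return flagged

# Hypothetical daily values for one metric; day 7 is an obvious spike.
daily = [100, 102, 98, 101, 99, 100, 100, 150, 101]
spikes = flag_outliers(daily)
```

The point is less the arithmetic than the access: something like this needs the raw daily values, which is exactly what a closed sandbox tends not to hand over.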

This is a valid question. In some cases it has to do with the amount of data you're working with. Most database management systems have made progress in aggregating large amounts of data, but in many cases it is still necessary to distribute the workload, which in turn creates the need to build out the rest of the distributed system.

With that being said, and to your point, I would not be surprised if these systems were often over-engineered when a SQL query could get the job done.

> With that being said, and to your point, I would not be surprised if these systems were often over-engineered when a SQL query could get the job done.

Redshift takes SQL queries.
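For reference, the "month over month" question from the original post fits in one query. This is only a sketch: it runs against an in-memory SQLite database as a stand-in, the `events` table and its rows are invented, and `strftime`/`date` are SQLite functions, so the date handling would need translating for Redshift or any other engine.

```python
import sqlite3

# In-memory SQLite as a stand-in warehouse; table and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        ("2015-01-05", 100.0),
        ("2015-01-20", 50.0),
        ("2015-02-03", 180.0),
        ("2015-03-11", 90.0),
    ],
)

# Sum per month, then join each month to the one before it.
MONTH_OVER_MONTH = """
WITH monthly AS (
    SELECT strftime('%Y-%m', day) AS month, SUM(amount) AS total
    FROM events
    GROUP BY month
)
SELECT cur.month,
       cur.total,
       prev.total             AS prev_total,
       cur.total - prev.total AS change
FROM monthly AS cur
LEFT JOIN monthly AS prev
  ON prev.month = strftime('%Y-%m', date(cur.month || '-01', '-1 month'))
ORDER BY cur.month
"""

rows = conn.execute(MONTH_OVER_MONTH).fetchall()
for month, total, prev_total, change in rows:
    print(month, total, prev_total, change)
```

The first month has no predecessor, so its `prev_total` and `change` come back as NULL (`None` in Python).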