by greggyb 3740 days ago
Analytics 101: Choosing the right database is the wrong first step.

I was excited when I saw 'choosing the right data model' as one of the rules, but they are talking about the data models the DB uses internally. The data model that matters is how you model the data you have to analyze. I have my biases, but I'd argue that a dimensional model would be a good starting point if we're really at a 101 level, with extensions to the model left for future classes/development.
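For anyone who hasn't seen one, here is a minimal sketch of what a dimensional (star schema) model can look like, using Python's built-in sqlite3 module. The table names, column names, and sample rows are all made up for illustration; a real model would be driven by the questions your organization needs to answer.

```python
import sqlite3


def build_star_schema(conn):
    """Create a tiny star schema: one fact table and two dimension tables.

    All table names, columns, and rows are hypothetical examples.
    """
    conn.executescript("""
        CREATE TABLE dim_date (
            date_key INTEGER PRIMARY KEY,
            year     INTEGER,
            month    INTEGER
        );
        CREATE TABLE dim_product (
            product_key INTEGER PRIMARY KEY,
            name        TEXT,
            category    TEXT
        );
        -- The fact table holds measures plus a foreign key per dimension.
        CREATE TABLE fact_sales (
            date_key    INTEGER REFERENCES dim_date(date_key),
            product_key INTEGER REFERENCES dim_product(product_key),
            quantity    INTEGER,
            amount      REAL
        );
    """)
    conn.executemany(
        "INSERT INTO dim_date VALUES (?, ?, ?)",
        [(20150101, 2015, 1), (20150201, 2015, 2)],
    )
    conn.executemany(
        "INSERT INTO dim_product VALUES (?, ?, ?)",
        [(1, "Widget", "Hardware"), (2, "Gadget", "Hardware"), (3, "Manual", "Books")],
    )
    conn.executemany(
        "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
        [
            (20150101, 1, 2, 20.0),
            (20150101, 3, 1, 5.0),
            (20150201, 2, 1, 15.0),
            (20150201, 1, 1, 10.0),
        ],
    )


def sales_by_category_and_month(conn):
    """Typical dimensional query: join facts to dimensions, then aggregate."""
    return conn.execute("""
        SELECT p.category, d.year, d.month, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_date    AS d ON d.date_key = f.date_key
        JOIN dim_product AS p ON p.product_key = f.product_key
        GROUP BY p.category, d.year, d.month
        ORDER BY p.category, d.year, d.month
    """).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_star_schema(conn)
    for row in sales_by_category_and_month(conn):
        print(row)
```

The point of the shape is that most business questions reduce to "sum a measure from the fact table, sliced by attributes from the dimensions", which is why it tends to be approachable for end users.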

Starting with the end in mind is very important. When I look at this, I think in terms of data culture. Who needs to be able to do what in your organization? What types of questions will you need to answer most often? What types of questions will you need to support in an ad-hoc manner?

To many organizations, "analytics" means arithmetic with complex filtering and business logic: traditional BI, essentially. To others, "analytics" means R code monkeys. To others still, it may mean specifically visualizations and the presentation layer. There are many interpretations of the word. Regardless of the interpretation, process and culture need to be understood before the technology.

For a rough analogy, it's like saying "Software development 101: Choosing the right programming language". Sure, it matters, but it's more important to understand what your software needs to support and what the primary use cases are.

Ninja edit: Grammar.

3 comments

I agree with all of this and I think it also carries over to software design too: People tend to think about which design pattern, from some preconceived menu of design patterns, they will need, instead of applying common sense to how the business workflow will look with the desired outputs, and then working backwards from there, usually with some Occam's Razor sprinkled in to make sure you don't overdo it with design mumbo jumbo.

What this has taught me over time is that the best systems, whether for data modeling and data storage, or for software, are systems that go to great lengths to ensure it is extremely cheap and easy to reconfigure, redesign, scrap everything and try something new, and adapt your designs to the real life pain points you didn't anticipate.

I've been very frustrated in the last several years because you see so many people trying to shoehorn this sort of idea into so-called "Agile" methods, but those methods put the emphasis in the wrong place. They depict agility as a property of the team of human beings, and of the particular tasks and schedules of the humans involved. They don't do anything to improve the agility of the underlying software systems, and you can have the most Agile team ever but still get burnt by committing to a bad and hard-to-change design, even if you're generating gobs of story points or other nonsense.

One of the most important and powerful planning tools is to prototype your architecture. Begin doing actual work with it. "Beta test" it with a limited set of actual business users. Collect data, like you're profiling, and make these decisions with evidence about your actual use case, not trite analogies or generalizations of it.

And place a premium on tools that demonstrably make it easy to scrap an underperforming design and replace it with a different design.

Modelling your data dimensionally is becoming less and less popular in the analytics space.

It used to be the standard in data warehouses, but now the trend is to leave the data unstructured and use query tools, e.g. Drill, or do multiple ETL passes into structured versions. But even the structured versions would not be relationally modelled to any great extent. Data scientists typically want access to data right now, not in a few months when your database guy has finished modelling it.

Dimensional modeling, though often associated with a traditional waterfall methodology, is not tied to this delivery methodology.

It remains the most understandable model to the largest population of end users. Analytics is a very broad term, as I mentioned in my original post, and the audience is huge. If "analytics" to you implies an audience of primarily data savvy end users, then dimensional modeling may not hold as much value.

I tend to find that the data scientists at our clients still do a lot more data wrangling than data science when they don't have clean models to work with as a baseline.

Additionally, there's a big difference between exploratory analysis of new data sources where access and low latency are key, and well-known domains that have fairly predictable needs. The former have a habit of transforming into the latter. Dimensional models remain one of the most efficient physical structures of data for a read-dominant workload.

Long story short, I think it's worthwhile to both of us to look outside our bubbles. I work for a BI and data science consultancy. My focus specifically is in core BI workloads, and so I'm definitely overexposed to more traditional modeling techniques. I can guarantee you, though, that the cycle time for a usable pilot that includes a fully realized dimensional model is more on the order of a handful of weeks than months in a typical delivery.

What's your bubble?

Any recommendations? Maybe start a flowchart for collaborative editing? Please link me.