Hacker News
by zippy5 1811 days ago
This was wonderfully written, and if you’re gonna start a data team, this is how you do it. But I can see that I’m the only one who thought it was crazy to start a data team in the first place.

This company makes 10M and spends 3M on the team and infrastructure to make data a core competency?

The vast majority of wins discussed were poorly differentiated web / mobile / supply chain analytics, which they could have gotten and set up with 3rd party software for an order of magnitude less.

I can only imagine what this hypothetical startup could have learned if they had spent that money actually talking to customers and running more experiments.

I’ve heard people talk about data as the new oil, but for most companies it’s a lot closer to uranium: hard to find people who can handle / process it correctly, nontrivial security/liabilities if PII is involved, expensive to store, and a generally underwhelming return on effort relative to the anticipated utility.

My takeaway was that startups benefit tremendously from a data advisor role to get the data competency, as well as the educational and cultural benefits, but realistically the data infrastructure and analytics at that scale should have been bought, not built. Obviously there are a couple of exceptions, such as regulatory reasons like HIPAA compliance, for which building in-house can be the right choice if no vendor fits your use case.

5 comments

As someone who reaches for code if they need to blow their nose, what is a 3rd party vendor going to supply that “English-to-SQL translators” won’t?

(I have not finished the article, but the idea that devs / data scientists can be replaced by some vendors makes me wonder what I have missed)

Edit: Also love the Uranium quote :-)

So my assumption is that for a given business model, like e-commerce or a SaaS business, much of the highest value analysis is fairly standardized and can be templated. For example, breaking down conversion rate by weekly cohort is something that can pretty easily be done in Google Analytics.
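To make "conversion rate by weekly cohort" concrete, here is a minimal sketch in plain Python. The `signups` rows, the field layout, and the Monday-based cohort definition are all made up for illustration; a real tool would pull these from its own event data.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical sample data: (user_id, signup_date, converted)
signups = [
    ("u1", date(2021, 3, 1), True),
    ("u2", date(2021, 3, 3), False),
    ("u3", date(2021, 3, 7), True),
    ("u4", date(2021, 3, 8), False),
    ("u5", date(2021, 3, 10), False),
    ("u6", date(2021, 3, 14), True),
]


def week_start(d):
    """Return the Monday of the week containing d (assumed cohort key)."""
    return d - timedelta(days=d.weekday())


def conversion_by_weekly_cohort(rows):
    """Map each signup week to the share of its users who converted."""
    totals = defaultdict(int)
    converted = defaultdict(int)
    for _user, signup, did_convert in rows:
        cohort = week_start(signup)
        totals[cohort] += 1
        if did_convert:
            converted[cohort] += 1
    return {c: converted[c] / totals[c] for c in sorted(totals)}


rates = conversion_by_weekly_cohort(signups)
for cohort, rate in rates.items():
    print(cohort.isoformat(), f"{rate:.0%}")
```

The logic is a group-by and a division, which is why it templates so well across businesses with the same model.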

The problem with English-to-SQL translators, or most coders in general, is the assumptions we make, in particular about the underlying data. For example, say we want to join two tables, so we write a query to join on two columns and call it correct, which it is from a logical or schema perspective. However, null values, defaults like 0, many-to-one vs one-to-one relationships, issues with instrumentation such as networking timeouts or bot detection, etc. can all impact the downstream metrics. My point is that when there are 500 lines of SQL in a query, such as those mentioned in the article, there are a lot of ways to be mostly correct but cumulatively wrong.
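A small sketch of the join problem, using Python's built-in sqlite3 with invented `orders` and `sessions` tables (the names and numbers are hypothetical). The join is valid from a schema perspective, yet a NULL key silently drops one order and a many-to-one relationship double counts another.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, user_id INTEGER, amount REAL);
    CREATE TABLE sessions (session_id INTEGER, user_id INTEGER);

    -- Order 3 has no user_id, as if instrumentation dropped it
    INSERT INTO orders VALUES (1, 10, 100.0), (2, 20, 50.0), (3, NULL, 25.0);
    -- User 10 has two sessions, so sessions-to-user is many-to-one
    INSERT INTO sessions VALUES (1, 10), (2, 10), (3, 20);
""")

# Ground truth: revenue straight from the orders table.
true_revenue = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# The "logically correct" join on the shared key.
joined_rows, joined_revenue = conn.execute("""
    SELECT COUNT(*), SUM(o.amount)
    FROM orders AS o
    JOIN sessions AS s ON s.user_id = o.user_id
""").fetchone()

print("orders:", 3, "joined rows:", joined_rows)
print("true revenue:", true_revenue, "joined revenue:", joined_revenue)
```

In this toy data the dropped row and the duplicated row cancel out in the row count, so a quick "same number of rows before and after" check passes while the revenue figure is still off. That is the kind of mostly-correct error that piles up across a long query.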

Like many sufficiently popular open source tools, 3rd party vendors get battle tested, issues get found before you hit them, and they can justify devoting more resources to rigorously ensuring correctness than the average analyst has the time or energy to do, because their business depends on you trusting the outputs.

I’m not saying you couldn’t do all this yourself. But given the sheer number of analytics tools that are reasonably priced, you might have chosen to spend your time on something more specialized like a recommendation system.

Can you point me at some of the vendors? I am missing a chunk of knowledge, I suspect.

Or is this, for example, people taking Google Analytics and producing analysis on top of that?

Highly recommend Heap [1] - they have a neat approach that doesn’t require you to ‘decide’ which analytics you want to track ahead of time.

Disclaimer: I was an early engineer at Heap.

[1] https://heap.io/

Heap might be good but they are crazy expensive. We were quoted something like a quarter million dollars. Good luck getting that signed off, plus you still need quite technical analysts to run the thing.

I've found https://contentsquare.com/ to be much better received by juniors and seniors alike, and it's a fraction of the cost of Heap.

I don’t know the specifics of what you were quoted, but a quarter million dollars (guessing per year?) does strike me as high.

Were you a later-stage startup by chance? The price point for pre-Series-C startups should be much, much lower.

Ah, so these do do web analytics on users - ok. That makes much more sense.

Very happy Heap customer here. Been using it since 2016 or so and brought it from my last company to my current startup. Autocapture is magic.

+2 on that! Would love to know what you think is worth investigating @zippy5

+1. @Zippy - May I ask for some of the vendors you refer to, please?

Also love the Uranium analogy.

So for example, the author saw that the supply chain team had difficulty managing the complexity and scale of their analysis, in large part due to the scalability of their spreadsheet solution. I would have pushed them to use Airtable, which is basically a more scalable spreadsheet. By choosing the data pipeline route, the people who understand how to improve the supply chain model and the history of decisions that went into it, as well as previous missteps, now have limited ability to experiment with improving it. In my experience, every rewrite of a system loses something in translation, which makes me think that in the author’s example the analysts’ lives got better but the quality of the supply chain model may have gotten worse.

In the long run, there is plenty of useful logistics software that should do everything they want, but the most important thing is to empower the people with domain expertise in the data to be as close to the solution as possible. Better decisions are more often a result of better information/experience than of better analysis. Unfortunately I haven’t studied these vendors well enough to make any suggestions, though I believe that the solutions are well defined enough to write textbooks on them, which suggests to me that existing software and I would mostly implement similar methodologies.

On the marketing and product analytics tools, I think 80% of the problems boil down to measuring conversion rates and then comparing those rates across different contexts to select the contexts that improve those rates.
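As a rough illustration of "comparing rates across contexts", here is a sketch of a two-proportion z-test in plain Python. The test itself is my choice of a standard technique, not something any particular vendor is known to use, and the channel names and counts are invented.

```python
import math


def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Compare two conversion rates with a pooled two-proportion z-test.

    Returns (rate_a, rate_b, z, two_sided_p_value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both contexts convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_a, p_b, z, p_value


# Hypothetical counts: 120 of 1000 visitors from email converted,
# versus 90 of 1000 visitors from search ads.
rate_email, rate_ads, z, p_value = two_proportion_z(120, 1000, 90, 1000)
print(f"email={rate_email:.1%} ads={rate_ads:.1%} z={z:.2f} p={p_value:.3f}")
```

The normal approximation assumes reasonably large samples in each context; with small counts an exact test would be a safer choice.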

Another user mentioned Heap, which is a great product if you know you don’t know what contextual data is meaningful but you suspect that it’s partially in how users interact with other parts of your website. Personally I’d use Heap judiciously, since I suspect there will be limitations to how useful the historical data will be in the future, and collecting everything is expensive. One limitation is that site interactions are only part of the potentially important context. Another limitation is that startups change rapidly, so their historical data often depreciates in terms of providing insight into their current problems. For an extreme example, I’m sure Zoom’s conversion data before and during the pandemic look completely different. But even a small tweak to Google’s search algorithm could totally change what type of customer finds your site.

Personally I’d advocate talking to customers, potential customers, and other stakeholders to understand what is important and measure that. Most companies currently do the opposite, where they take a lot of measurements and then try to figure out what’s important. The first approach can probably be done in Google Analytics. The second I might try with Amplitude, which is what I imagine a tool like Heap will eventually try to evolve into.

The hardest person to help with data in the organization is the CEO, because really they use data as a form of sales tool and for reporting. The closest I have seen a tool come to doing this in a way the CEO could mostly self-serve is Sisu Data. Though it’s the CEO, so it’s probably reasonable to hire some help anyway.

Lastly, data warehouses were the gold standard in the early 2010s, but Presto is a better fit these days for companies whose data is distributed across many different places.

> spends 3M on the team and infrastructure

You're making a pretty big assumption on the cost of the team & infrastructure there. This company could have 100+ people with that kind of revenue (I've worked at a company this size before). The data team is only about 6 people. The cost of the data team & infrastructure is likely less than $1M.

Having unique data is quite valuable. If your organisation can make decisions based on signals that other people can't detect then it can gain a decisive edge.

I do wonder at the anecdotes in this article though. In businesses that I've seen, the data team is usually the biggest impediment to a data-driven culture because they have databases full of numbers and no real grasp of how that links to the decision making process that makes the business money.

Beefing up the team doesn't help. In data, as in business more generally, the important thing is not to guess what job you're doing but to spend a lot of time talking to customers about what job they need done. If the data team is where that work happens in a business then that can be helpful - but the grunt work of SQL/reporting/basic analysis is almost never where the value comes from.

> My takeaway was that startups benefit tremendously from a data advisor role to get the data competency, as well as the educational and cultural benefits, but realistically the data infrastructure and analytics at that scale should have been bought, not built.

I really like your takeaway about data teams at tech companies. They try to make "data" a core competency of their business, at huge cost for fixed value.

I also appreciated the very subtle implication that the OP is shrouding empire building under an otherwise informative growth story.

> it’s a lot closer to uranium

Love this analogy!