by lifeisstillgood 1811 days ago
As someone who reaches for code if they need to blow their nose, what is a 3rd party vendor going to supply that an “English-to-SQL translator” won’t do?

(I have not finished the article, but the idea that devs / data scientists can be replaced by some vendors makes me wonder what I have missed)

Edit: Also love the Uranium quote :-)


So my assumption is that for a given business model, like e-commerce or a SaaS business, much of the highest-value analysis is fairly standardized and can be templated. For example, breaking down conversion rate by weekly cohort is something that can pretty easily be done in Google Analytics.
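
To give a sense of how templated that kind of analysis is, here is a minimal sketch of a weekly cohort conversion breakdown. The `SIGNUPS` records and the `weekly_cohort_conversion` helper are made up for illustration, and grouping by ISO week of signup is an assumption about how the cohorts are defined.

```python
from collections import defaultdict
from datetime import date

# Hypothetical signup records: (user_id, signup_date, converted)
SIGNUPS = [
    ("u1", date(2021, 1, 4), True),
    ("u2", date(2021, 1, 5), False),
    ("u3", date(2021, 1, 7), True),
    ("u4", date(2021, 1, 11), False),
    ("u5", date(2021, 1, 13), True),
    ("u6", date(2021, 1, 14), False),
    ("u7", date(2021, 1, 15), False),
]


def weekly_cohort_conversion(signups):
    """Return {(iso_year, iso_week): conversion_rate} keyed by signup week."""
    totals = defaultdict(int)
    converted = defaultdict(int)
    for _user_id, signup_date, did_convert in signups:
        iso = signup_date.isocalendar()
        cohort = (iso[0], iso[1])  # ISO year and ISO week number
        totals[cohort] += 1
        converted[cohort] += int(did_convert)
    return {cohort: converted[cohort] / totals[cohort] for cohort in sorted(totals)}


rates = weekly_cohort_conversion(SIGNUPS)
for cohort, rate in rates.items():
    print(cohort, round(rate, 3))
```

The logic is the same for almost any funnel, which is why off-the-shelf tools can ship it as a standard report.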

The problem with English-to-SQL translators, or most coders in general, is the assumptions we make, in particular about the underlying data. For example, say we want to join two tables, so we write a query that joins on two columns and call it correct, which it is from a logical or schema perspective. However, null values, defaults like 0, many-to-one vs. one-to-one relationships, instrumentation issues such as network timeouts or bot detection, etc. can all impact the downstream metrics. My point is that when there are 500 lines of SQL in a query, such as those mentioned in the article, there are a lot of ways to be mostly correct at each step but cumulatively wrong.
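
A small sketch of the join problem, using an in-memory SQLite database. The `orders` and `sessions` tables and their contents are invented for illustration; the point is only that a join that is valid against the schema can still inflate or drop data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER, user_id INTEGER, amount REAL);
CREATE TABLE sessions (session_id INTEGER, user_id INTEGER);
-- Order 3 has a NULL user_id, e.g. from an instrumentation gap.
INSERT INTO orders VALUES (1, 10, 100.0), (2, 11, 50.0), (3, NULL, 25.0);
-- User 10 has two sessions, so orders-to-sessions is not one-to-one.
INSERT INTO sessions VALUES (1, 10), (2, 10), (3, 11);
""")

# Ground truth: total revenue straight from the orders table.
true_revenue = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# Schema-correct join: duplicates order 1 (two matching sessions)
# and silently drops order 3 (NULL never equals anything).
joined_revenue = conn.execute("""
    SELECT SUM(o.amount)
    FROM orders o
    JOIN sessions s ON o.user_id = s.user_id
""").fetchone()[0]

# One possible fix: collapse sessions to one row per user first,
# and LEFT JOIN so orders without a matching user are kept.
safe_revenue = conn.execute("""
    SELECT SUM(o.amount)
    FROM orders o
    LEFT JOIN (
        SELECT user_id, COUNT(*) AS n_sessions
        FROM sessions
        GROUP BY user_id
    ) s ON o.user_id = s.user_id
""").fetchone()[0]

print(true_revenue, joined_revenue, safe_revenue)
```

Here the naive join reports 250.0 against a true total of 175.0. Neither number looks obviously wrong in isolation, which is how errors like this survive inside a long query.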

Like many sufficiently popular open source tools, 3rd party vendors get battle tested, issues get found by others before they reach you, and they can justify devoting more resources to rigorously ensuring correctness than the average analyst has the time or energy to do, because their business depends on you trusting the outputs.

I’m not saying you couldn’t do all this yourself. But given the sheer number of analytics tools that are reasonably priced, you might choose to spend your time on something more specialized, like a recommendation system.

Can you point me at some of the vendors? I am missing a chunk of knowledge, I suspect.

Or is this, for example, people taking Google Analytics and producing analysis on top of that?

Highly recommend Heap [1] - they have a neat approach that doesn’t require you to ‘decide’ which analytics you want to track ahead of time.

Disclaimer: I was an early engineer at Heap.

[1] https://heap.io/

Heap might be good but they are crazy expensive. We were quoted something like a quarter million dollars. Good luck getting that signed off, plus you still need quite technical analysts to run the thing.

I've found https://contentsquare.com/ to be much better received by juniors and seniors alike, and it's a fraction of the cost of Heap.

I don’t know the specifics of what you were quoted, but a quarter million dollars (guessing per year?) does strike me as high.

Were you a later-stage startup by chance? The price point for pre-Series-C startups should be much, much lower.

That's odd. Why would you charge more for a post-Series-C startup or enterprise versus a pre-Series-C one?

Ah, so these do do web analytics on users - ok. That makes much more sense.

Very happy Heap customer here. Been using it since 2016 or so and brought it from my last company to my current startup. Autocapture is magic.

+2 on that! Would love to know what you think is worth investigating, @zippy5.

+1. @Zippy - May I ask for some of the vendors you refer to, please?

Also love the Uranium analogy.

So for example, the author saw that the supply chain team had difficulty managing the complexity and scale of their analysis, in large part because their spreadsheet solution did not scale. I would have pushed them to use Airtable, which is basically a more scalable spreadsheet. By choosing the data pipeline route, the people who understand how to improve the supply chain model and the history of decisions that went into it, as well as previous missteps, now have limited ability to experiment with improving it. In my experience, every rewrite of a system loses something in translation, which makes me think that in the author’s example the life of the analysts got better but the quality of the supply chain model may have gotten worse.

In the long run, there is plenty of useful logistics software that should do everything they want, but the most important thing is to empower the people with domain expertise in the data to be as close to the solution as possible. Better decisions are more often a result of better information/experience than of better analysis. Unfortunately I haven’t studied these vendors well enough to make any suggestions, though I believe that the solutions are well defined enough to write textbooks on them, which suggests to me that existing software and I would mostly implement similar methodologies.

On the marketing and product analytics tools, I think 80% of the problems boil down to measuring conversion rates and then comparing those rates across different contexts to select for the contexts that improve those rates.
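
As a rough sketch of what "comparing those rates across contexts" can look like in practice, here is a pooled two-proportion z-score. This is one common way to check whether a difference between two contexts is larger than noise, not something the article or any particular vendor prescribes, and the function name and the counts below are hypothetical.

```python
import math


def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z-score for the difference in conversion rate between contexts A and B.

    conv_* are conversion counts, n_* are visitor counts. Uses the pooled
    standard error, which assumes both contexts share one underlying rate
    under the null hypothesis.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (rate_b - rate_a) / std_err


# Hypothetical counts: context A converts 100 of 1000, context B 130 of 1000.
z = two_proportion_z(100, 1000, 130, 1000)
print(round(z, 2))
```

A |z| above roughly 2 is the usual rule of thumb for a difference worth a closer look, though it says nothing about whether the underlying data was collected correctly.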

Another user mentioned Heap, which is a great product if you know you don’t know what contextual data is meaningful but you suspect that it’s partially in how users interact with other parts of your website. Personally I’d use Heap judiciously, since I suspect there will be limitations to how useful the historical data will be in the future, and collecting everything is expensive. One limitation is that site interactions are only part of the potentially important context. Another limitation is that startups change rapidly, so their historical data often depreciates in terms of providing insight into their current problems. For an extreme example, I’m sure Zoom’s conversion data before and during the pandemic look completely different. But even a small tweak to Google’s search algorithm could totally change what type of customer finds your site.

Personally I’d advocate talking to customers, potential customers, and other stakeholders to understand what is important, and then measuring that. Most companies currently do the opposite: they take a lot of measurements and then try to figure out what’s important. The first approach can probably be done in Google Analytics. For the second, I might try Amplitude, which is what I imagine a tool like Heap will eventually try to evolve into.

The hardest person to help with data in the organization is the CEO, because really they use data as a form of sales tool and for reporting. The closest I have seen a tool come to doing this in a way the CEO could mostly self-serve is Sisu Data. Though it’s the CEO, so it’s probably reasonable to hire some help anyway.

Lastly, data warehouses were the gold standard in the early 2010s, but Presto is a better fit these days for companies whose data is distributed across many different places.