by shubb 1391 days ago
The related problem that I see actually more often is the "you don't have big data" problem.

You know, in data science, you see people spending hours writing pandas scripts that replicate a few clicks in Excel for a one-off analysis. You see datasets of a few gigabytes being processed with Spark when SQL would be fine. You see ML techniques being thrown at questions that could be answered simply and reliably with basic statistical tests.

Especially in the B2C space a lot of companies, departments, products don't actually have a lot of customers and certainly not many decision makers. The N number is always going to be low. You can just talk to people. Let's say you are doing pretty well and running a SaaS with 1000 corporate customers paying a million each - that's a billion dollars in revenue - you can just talk to them. Certainly you can just talk to every single person who signs the cheque and those are the only people that matter.

And which is easier - putting together a thorough suite of A/B tests or getting some real customers to use your app on video and talking to them about what they are finding annoying, useful, missing? I see fewer people do that than you'd think.

7 comments

frankly, there’s only a tiny handful of these mythical saas “1000 users each paying 1 million dollars” companies. the vast, vast majority of saas startups are serving millions of “users” - i put that in quotes because these aren’t real users or customers. they are real people checking out your product - but they aren’t users or customers.

if you set up a gas station near the off ramp of some major interstate, say I-65 North, you will see cars pulling in to fill up on gas. maybe buying a coffee. now, these aren’t your customers in the traditional sense of a Target or Walmart customer. Because you will never see them again. They were driving from town A to town B via the interstate - they started running out of gas and needed to refuel, so they are in your gas station now. Once they gas up, off they go. They aren’t going to come back to you and establish a customer relationship or something. We’ve all been to tons of gas stations on the interstate and we’ll probably never go back to the same one twice - unless we are plying the same route every day like a truck driver. So the task is to find and convert these truck drivers, who are the true repeat customers.

I was working on an Android app which had like millions of unique cookies. When they hired me they said we have millions of users. No you don’t. If you put out an Android app in some popular domain, say news, entertainment, tax accounting etc - people will download and “use” your app. they are checking it out. they aren’t users, in the sense they aren’t using it every day or want to have a relationship with you, pay subscription etc. conversion stats are minuscule, like 0.01%. So maybe 1 out of 10000 users is the truck driver. The vast majority will never ever use your app again. To do data science with these millions of rows of user interactions and find some nuggets just because you know your way around pandas or sklearn is a fool’s pursuit. To ask foolish questions of your data, like why are all these people churning, is silly - they aren’t your users, they haven’t converted, they are just checking it out. In that sense, it’s a waste of time and resources to do so much data crunching. Look at actual conversions, which are probably a few thousand people, not millions. Reach out to those thousands and maybe a few tens will give feedback and then continue to iterate on the product based on that.
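
To put rough numbers on the funnel described above, here is a back-of-the-envelope sketch. The install count and the reply rate are made-up assumptions for illustration; only the "1 out of 10000" conversion figure comes from the comment.

```python
def one_in(population: int, n: int) -> int:
    """Expected count when roughly one in every n members qualifies."""
    return population // n


# Hypothetical install base; the comment only says "millions".
installs = 20_000_000

# Conversion of 0.01%, i.e. 1 out of 10000 (figure from the comment).
converted = one_in(installs, 10_000)

# Assumed reply rate of 1 in 50 (2%) when reaching out for feedback.
responders = one_in(converted, 50)

print(f"installs:   {installs:>10,}")
print(f"converted:  {converted:>10,}")
print(f"responders: {responders:>10,}")
```

Under these assumptions the population worth analysing is a few thousand rows, and the population worth talking to is a few dozen people, which is small enough to handle without any big-data tooling.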

This is a really good analogy (gas station customers) that I haven’t heard before. I’ve often tried to describe this ‘low intent’ group but never had a good way to make it relatable.
it gets the point across, but it raises other interesting questions.

sure, most customers at an interstate gas station will only visit once or twice, but that doesn't necessarily mean they are less important to the business than the truck drivers that fill up every day. maybe the bulk of revenue actually comes from one-time customers. this could be a case where attracting new customers is more important than retaining the current ones.

>the vast, vast majority of saas startups are serving millions of “users”

There are tons of B2B saas, including regional ones, that only serve a small number of customers way under millions.

Even more problematically, if you have a free service that could attract any kind of automation (e.g. an API SaaS with a free trial) then you're also going to get a lot of "users" who seem to be the "truck drivers" given a black-box usage profile, but who will also never actually convert. They use some free part of your service a lot, but they're not and never will be interested in any paid part of your service.

Maybe a close analogy would be: truck drivers who stop at your rest stop every time they come by... just to use the washroom. But who never go into the store itself.

Unfortunately, your reality-driven approach has ~zero emotional appeal for most managers, exec's, and alpha-data-scientist wanna-be's.
Data has CYA appeal.
Needing to CYA also has pretty low emotional appeal for managers, exec's, and alpha-data-scientist wanna-be's. (Until it's just about too late, obviously.)

And recall Mark Twain's old quip about lies & statistics. The more & bigger data that the folks who control the data & analysis have, the easier it is to make sure that those meet their own emotional & political needs.

Wasn't that Will Rogers?
Yep.
Why? Inadequately “technical”?
I think there are lots of reasons why.

One possible reason: no one whose job it is to write Python scripts was ever promoted for making an Excel spreadsheet when that is the simpler and more practical approach. And no manager of people who write Python scripts is going to be able to use that Excel spreadsheet to sell "I need more responsibility and head count." People tend to follow incentives, rather than focusing on making wise decisions.

> People tend to follow incentives, rather than focusing on making wise decisions.

This is the key issue. Solving it isn't easy -- it requires people who are wise, and wisdom is a scarce commodity.

Even wise people likely follow the incentives. What is wise about doing something that your employer doesn’t reward in exchange for doing something that they will reward?
It's wise to do what's morally right, regardless of the consequences.
Excel has a history of forced format updates, breaking compatibility. I know people who banned it because they got tired of marching to MS's upgrade beat.

Python 2 to 3 upgrade aside, can’t really say the same about the language.

There are a number of good arguments out there that might violate an engineer's perception, which one might call a cognitive data model built through training and experience.

There is no theory that makes any given engineering path “wiser” than others. Just engineers chasing incentives to be engineers.

Libraries introduce breaking changes, too. I’ve been bit by silent default changes in Pandas, for example. To me that’s kind of striking because I also wouldn’t consider myself a major user of the library.
- "You just talked to them and concluded this? What certainty can you have in this conclusion, and how can we trust that you didn't just want it to be true from the start?"

A few slides showing the data, a boring 10 minutes about methodology, and finally the conclusion bring an air of reliability that you can't replicate with knowledge alone instead of data.

Our field is filled with people who want to use the most technical approach possible to solve a non-issue; their paychecks probably depend on it.
Talking to people is not going to help you either. You end up getting a lot of noise and making sense of what you hear is difficult. When you keep probing you will get to hear stuff that's not really critical and often just made up because you ask too many questions. Classical trap of market research.
ycombinator startup school disagrees and says it's one of the two CRITICAL things founders must have a hand in.

Of course you need to interpret it but it's incredibly important and I do not think you really know what you are talking about.

https://www.ycombinator.com/library/6g-how-to-talk-to-users

Almost all the major fails I have seen in my career have been some derivative of not understanding your users.

At the very least, I feel talking to users will give you decent hypotheses to test.

The creation of hypotheses is often glossed over as a trivial first step in scientific or data-driven decision making, but in fact, that's where the magic lives.

> I do not think you really know what you are talking about.

Nice strawman. I have never said to never talk to your users, but to pretend that using data is meaningless and you should follow some bullshit and vague "good argument" instead is just sheer foolishness.

That depends on how big the differences you're looking for are.

When you've got an early product, there are probably things you can do that 2x as many people will like as dislike. Even a small set of customers will be good for discovering this. When you've got a mature product, you should be optimizing around the edges and need a large sample size to find those 1% wins.

Likewise if you don't have scale, there are a lot of well-known best practices that probably improve your site by 5-10%. You probably don't have sufficient volume to test those ideas, so following general best practices is a good idea. But if you have scale, you can and should A/B test the heck out of everything. And then do it again in a couple of years in case the answer changes.
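
To make the sample-size point concrete, here is a minimal sketch using the standard normal-approximation formula for comparing two proportions. The 5% baseline conversion rate, the 5% significance level and the 80% power are my assumptions, not figures from the thread, and reading "a big difference" as a doubled conversion rate is also an assumption.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(p_control: float, p_variant: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed in EACH arm of a two-sided A/B test.

    Normal approximation for two proportions:
    n = (z_{1-alpha/2} + z_{power})^2 * (p1*q1 + p2*q2) / (p1 - p2)^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    effect = p_variant - p_control
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


baseline = 0.05  # assumed baseline conversion rate

# Early product: a change that doubles conversion (5% -> 10%).
n_big_win = sample_size_per_arm(baseline, baseline * 2)

# Mature product: a 1% relative win (5% -> 5.05%).
n_small_win = sample_size_per_arm(baseline, baseline * 1.01)

print(f"doubling conversion: ~{n_big_win:,} users per arm")
print(f"1% relative lift:    ~{n_small_win:,} users per arm")
```

Under these assumptions the doubling is detectable with a sample in the hundreds per arm, while the 1% win needs a sample in the millions per arm. That matches the comment: small customer bases can only find large effects.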

It's this data-led uncritical thinking that destroyed Facebook
Talking to customers might uncover some things you haven't even thought about.
The same thing is true with pure data analysis. Unless you have never analyzed data in depth in your life, that should be pretty obvious.
You have to do both. You can't just look at data & you can't just talk to users / customers without looking back at data.
of course, but then following "good argument" is just ignoring data in the original article, which is nonsense.
> You end up getting a lot of noise and making sense of what you hear is difficult. When you keep probing you will get to hear stuff that's not really critical and often just made up because you ask too many questions.

Yeah, but this just means qualitative data is challenging, not that it's useless. You have to be careful when asking questions that you're asking useful questions and not leading people into telling you what they think you want to hear (or going off on useless rabbit trails like what they think the product should be instead of what the problem they want the product to solve is).

While I agree with your suggested outcome for some or many, a product designer or manager who is skilled at asking questions, going deeper, removing distractions, asking why continuously, and empathizing while not seeming judgey can garner really good insights.

I am guessing it's like what you see with a psychologist and a patient on TV: the customer must feel comfortable enough to open up, then the floodgates can open.

Go talk with your customer service. Oops, so much rotation nobody cares, everyone is cheating KPIs.
You both make good arguments; there must be a middle here. I doubt you can uncover what your customer wants very well without just talking to them, but maybe they wind up misleading you sometimes. Using A/B testing to discover that a customer wants a whole different paradigm isn't possible.
This is true; what customers SAY they want doesn't necessarily correlate with what they will actually use or pay for.

I mean I worked on an app where in one part, the end user could upload CSV files to be used. What they SAID they wanted was basically a full data management system and RESTful API to enforce constraints, data validation, record retrieval and updating, etc. What they probably wanted was an Excel sheet. I dislike how my employer was like "yeah sure if you pay for it" to them.

> what customers SAY they want doesn't necessarily correlate with what they will actually use

A key cause of this in many cases is that the stake-holders you talk to do not work closely with the end users of the system. Talking to the right people can help a lot, though unfortunately as a 3rd party this is not usually anywhere near your realm of control.

The other issue is them knowing what they have and wish to store, but not knowing what outputs are going to be needed down the line. That is harder to fix, but having some good industry knowledge within your company can be a great help on such matters – you can then sometimes preempt client needs if the people holding that knowledge are keeping an active eye on changes (for instance new/planned regulations that might be coming into force in X weeks/months/years).

Very dated thinking. Suggest you read up on Lean Customer Development (for example).
been doing product design most of my life in top corporations, so I'll pass on your opinion.
> You know, in data science, you see people spending hours writing pandas scripts that replicate a few clicks in Excel for a one-off analysis

I mean, having an Excel doc at all usually implies hour(s) of work formatting the data in a structured manner. Sometimes collective decades of work depending on how much heavy lifting your 15GB .xlsx is doing.

This is why I've adopted R and Python for the data work I do. I have a bunch of exported data (CSV files) that I use. Manipulating the structure and format is 90% of the work. I wrote the scripts once, now I can reuse that for everything instead of playing games getting those CSV files (dates in particular) to play nicely.

Even a one-off analysis is actually FASTER in Pandas because I've already done the work of farting around with the formatting. Now I can just write the necessary analysis code, rather than deal with the formatting.

That said, my data analytics work is seriously small potatoes compared to many. But I can write a quick pivot table using Dplyr faster than I can do it in Excel.
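
As a rough illustration of the "write the formatting scripts once, then reuse them" workflow, here is a sketch that uses only the Python standard library instead of pandas or dplyr. The column names (`date`, `region`, `amount`), the list of date formats and the sample data are all hypothetical.

```python
import csv
import io
from collections import defaultdict
from datetime import date, datetime

# Assumed date formats found in the (hypothetical) CSV exports.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")


def parse_date(text: str) -> date:
    """Try each known format in turn; fail loudly on anything else."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def load_rows(csv_file):
    """Reusable loader: yields rows with dates and amounts normalized."""
    for row in csv.DictReader(csv_file):
        yield {
            "day": parse_date(row["date"]),
            "region": row["region"],
            "amount": float(row["amount"]),
        }


def pivot_by_month_and_region(rows):
    """Small pivot table: month -> region -> summed amount."""
    table = defaultdict(lambda: defaultdict(float))
    for row in rows:
        table[row["day"].strftime("%Y-%m")][row["region"]] += row["amount"]
    return {month: dict(regions) for month, regions in table.items()}


# Made-up export with three different date formats.
SAMPLE = """date,region,amount
2023-01-05,north,10.0
17/01/2023,south,5.5
20230120,north,2.5
2023-02-01,south,4.0
"""

pivot = pivot_by_month_and_region(load_rows(io.StringIO(SAMPLE)))
for month, regions in sorted(pivot.items()):
    print(month, regions)
```

Once `parse_date` and `load_rows` exist, each new one-off analysis only needs a new function like `pivot_by_month_and_region`; the formatting work is not repeated.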

Often that work exists regardless of whether a table of processed data that engineering formatted and schema-fied is dumped out to Excel or queried over SQL into Pandas...

I've seen this myself: the person who "naively" downloads that table and plays around in Excel finds interesting things that the person who was using Pandas hadn't, because the code to manipulate columns and do certain types of calcs is actually more time consuming to write and modify than making a bunch of new columns in Excel with a bunch of formulas!

A good data scientist will have a more rigorous approach to their notebooks and practice reuse and so on... but that's not necessarily easy.

> the person who "naively" downloads that table and plays around in Excel finds interesting things that the person who was using Pandas hadn't,...

I think they call that serendipity. Never underestimate its power.

https://didgets.substack.com/p/data-science-and-serendipity

I say this about every other day at work (we even have only internal users so it's part of their job to talk to us). So far impact: zero....
Would you say the big data threshold moves every year?

That would explain why people think <1TB is big data.

>Would you say the big data threshold moves every year?

It moves with Moore's law. Big data is anything that cannot reasonably fit into memory for a single server, so yes that number is well over 1TB now.

I know this isn't the correct definition but I think of "big data" as the set of data which takes me more than 15 minutes to query on average with a moderately complex Postgres SQL join on well indexed information. I use JSONB in Postgres regularly and have indices on that too. So far I have gotten really far with increasing Postgres work_mem to a gig or more, a fast SSD, and strategically placed materialized views. These kinds of operations in Pandas make my computer billow smoke by comparison.
I don’t think many give much thought to what it really means. They just use the term because it sounds cool, either to themselves or to their superiors. Same as with Machine learning.
What used to be 'big data' is now just 'normal data'.

https://didgets.substack.com/p/big-data

Why not both…
Maybe what I wrote comes off a bit one sided - I'm really urging people to do what actually makes sense in their specific context - which can be both!