by _fourzerofour 2176 days ago
I can tell a story. I used to work for an HVAC installation company, pretty small in terms of staff, but we subcontracted a lot. I was initially brought on as a mechanical engineering intern, but moved on to sales engineering when I found an interesting statistical relationship.

A large factor in quotes to clients was the underlying cost of air conditioning equipment in our niche, and often a game of sales intel was played between suppliers and competing contractors (like us) for a given job site. Favorites were picked, and we could get royally screwed in a quote, losing the end customer's business to a competitor.

Fortunately, we had years of purchasing information. It turns out that as varied as air conditioners are across brands and technical dimensions, when you have years of accounts' line items and unused quotes, you don't get a dimensionality issue. Since we operated in a clear-cut niche, this was especially true. We could forecast, within a margin of error of two per cent, exactly what any of our suppliers would quote us (or our competitors!) for a job long before they could turn it around. Huge strategic advantage.

This was the watershed moment for me when I realized even basic multiple linear regression was a scarily powerful tool when used correctly.
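For anyone who wants to see what "basic multiple linear regression" looks like in practice, here is a minimal sketch in plain Python. It is not the model from the story: the features (`capacity_kw`, `efficiency`, `premium_brand`), the coefficients, and the data are all made up for illustration, and the fit uses ordinary least squares via the normal equations.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit(rows, y):
    """Ordinary least squares: solve (X'X) beta = X'y, intercept included."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    Xty = [sum(x[i] * t for x, t in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)

def predict(beta, row):
    """Predicted quote for one feature row."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], row))

# Synthetic data only: hypothetical features and an invented price formula.
random.seed(0)
rows, prices = [], []
for _ in range(200):
    capacity_kw = random.uniform(2, 20)          # hypothetical feature
    efficiency = random.uniform(3, 6)            # hypothetical feature
    premium_brand = float(random.randint(0, 1))  # hypothetical 0/1 indicator
    noise = random.gauss(0, 25)
    rows.append((capacity_kw, efficiency, premium_brand))
    prices.append(500 + 120 * capacity_kw + 80 * efficiency
                  + 300 * premium_brand + noise)

# Fit on the first 150 "past quotes", check on the remaining 50.
beta = fit(rows[:150], prices[:150])
errors = [abs(predict(beta, r) - p) / p
          for r, p in zip(rows[150:], prices[150:])]
mape = sum(errors) / len(errors)
print("coefficients:", [round(b, 1) for b in beta])
print("mean absolute percentage error on held-out quotes:", round(mape, 4))
```

The held-out error is the number to watch: a low error on quotes the model never saw is what would make a forecast like this worth trusting. Real purchasing data would need categorical encoding for brands and a check for outliers before anything like this applies.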


That is cool when you put it like that. Uncovering useful hidden relationships sounds romantic. Thanks for posting.
And incredibly boring. The usual estimate is that data science is 80% data wrangling: finding, collecting, and cleaning up data. The term "data scientist" replaced "data miner", because miners are looking for gold. Scientists are obsessed with finding out the nature of reality, gold or mud. They will do seriously boring stuff to set things up so that reality is revealed.
It is only boring if you do it the boring way.

If the data cleaning follows standard patterns, you should already have scripts to offload that kind of work to. If not, then there are some incredibly interesting decisions hidden underneath. Take text: Should character casing be preserved? What should the unit of representation be (word or character)? How should the data be filtered, given the quality vs. quantity trade-off?

All of those are non-trivial questions that take a lot of thought to reason through. You are correct that the modelling is only a small part of a data scientist's day-to-day job.

But the rest of it is boring in the same way that coding is boring. It doesn't involve grand epiphanies or discoveries, but there is a joy similar to the daily grind of "code -> hit a bug / violate constraints -> follow the trace / problem -> figure out a sensible solution" that a lot of software engineers love.
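To make the text-cleaning questions above concrete, here is a rough sketch where each decision (casing, unit of representation, filtering) becomes an explicit parameter. The function name `preprocess` and its parameters are hypothetical, and the length threshold is a crude stand-in for a real quality filter.

```python
def preprocess(texts, lowercase=True, unit="word", min_tokens=3):
    """Tokenize texts, with each cleaning decision exposed as a parameter.

    lowercase:  discard character casing or preserve it
    unit:       "word" (whitespace split) or "char" (one token per character)
    min_tokens: drop texts shorter than this (quality vs. quantity)
    """
    if unit not in ("word", "char"):
        raise ValueError("unit must be 'word' or 'char'")
    kept = []
    for text in texts:
        if lowercase:
            text = text.lower()
        tokens = text.split() if unit == "word" else list(text)
        if len(tokens) >= min_tokens:
            kept.append(tokens)
    return kept

docs = ["Air Conditioner quote attached", "ok", "See Revised Quote"]
print(preprocess(docs))                                  # short text dropped
print(preprocess(docs, lowercase=False, min_tokens=1))   # keep everything
```

Raising `min_tokens` trades quantity for quality: fewer texts survive, but the ones that do carry more content. Making the choices parameters rather than hard-coding them also means they can be compared against each other downstream.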