by mcemilg 1622 days ago
ML/DS positions are highly competitive these days. I don't get why ML positions require harder interview preparation than other CS positions when the work is similar. People expect you to know a lot of theory, from statistics and probability to algorithms and linear algebra. I am OK with knowing the basics of these topics, which are the foundations of ML and DL. But I don't get why you would be asked about eigenvectors and challenging algorithm problems for an ML Engineering position when you have already proven yourself with a Master's degree and enough professional experience. I am not defending my PhD there. We will just build some DL models, maybe read some DL papers, and maybe try to implement some of them. The theory is only 10% of the job; the rest is engineering, data cleaning, etc. Honestly, I am looking for a soft way to get back to Software Engineering.

In part because ML fails silently by design. Even if the code runs flawlessly with no errors, the outputs could be completely bunk, useless, or even harmful, and you won't have any idea if that is true just from watching The Number go down during training. It's not enough to know how to build it; you also need to know how it works. It's the difference between designing the JWST and assembling it.
ML doesn't just fail silently by design. Because ML is based on error minimization, it fails in a way that is maximally hard to tell from random garbage. This is, remarkably, a subtlety that is lost on most people, which is a real surprise. My introduction to this was in structural biology, where you always do hold-outs and check the performance on the hold-out set because overfitting is such a problem.
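For what it's worth, the hold-out check described above can be sketched in a few lines of plain Python. Everything here is made up for illustration: the data is random noise, the "model" is a 1-nearest-neighbour memorizer, and the names `make_points`, `predict_1nn` and `accuracy` are hypothetical. The point is only that the training score looks perfect while the hold-out score sits near chance.

```python
import random

random.seed(0)

def make_points(n):
    # Features are random and labels are coin flips, so there is nothing to learn.
    return [((random.random(), random.random()), random.randint(0, 1))
            for _ in range(n)]

def predict_1nn(train, x):
    # Memorizing "model": return the label of the closest training point.
    nearest = min(
        train,
        key=lambda p: (p[0][0] - x[0]) ** 2 + (p[0][1] - x[1]) ** 2,
    )
    return nearest[1]

def accuracy(train, data):
    hits = sum(predict_1nn(train, x) == y for x, y in data)
    return hits / len(data)

train = make_points(300)
holdout = make_points(300)

# On the training set every point is its own nearest neighbour.
train_acc = accuracy(train, train)
# On unseen points the labels carry no signal, so this lands near 0.5.
holdout_acc = accuracy(train, holdout)

print(f"train accuracy:    {train_acc:.2f}")
print(f"hold-out accuracy: {holdout_acc:.2f}")
```

Judged only on the training number, this model is flawless; only the hold-out set shows it learned nothing.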
Absolutely, the result you get is "the best you can do given the baked-in assumptions". But of course the assumptions can be wrong. And it takes time to learn how to evaluate and revise your assumptions in any analytical field, hard or soft.
But the OP was asking something different: why should someone focus so heavily on theory when, by the way, DL theory is very far from solid, and trial and error is the common way of operating in ML and AI?

The "model is in place, but I have no clue what it's doing, so it can fail without me understanding when and how" argument is a straw man. Especially for supervised learning, where we have labels for the data, it is immediately clear whether the output of the model is "bunk, useless, or even harmful". There is no "fail silently by design".

I have been working in the field for almost 20 years, in academia and in industry, and it is not as if I start every PCA thinking about eigenvectors and eigenvalues. If you asked me now, without preparation, what those are, I would land somewhere between approximately right and wrong. But I have fit many, many very accurate models.

You are considering only the technical aspects of the model. While of course important to understand, those are less interesting when considering potential harms than the downstream effects of the inference pipeline, particularly when it comes to interpretations of outputs. What is absolutely the worst possible MO is to offload the interpretation portion of a pipeline to a machine using proxy metrics without an exceptional model which justifies the approach unequivocally.

For instance, if we put an MSE loss function on a classification NN with sigmoid outputs, and used a classification dataset, we could generate an entire zoo of "many, many very accurate models" as measured by MSE. But once your model returns outputs, how do you interpret them to predict a label for some input data? You could hack some algorithm together (eg argmax of the highest value) which is indistinguishable from the "correct" procedure but the described probabilities are so incorrect that no ML professional would be comfortable trusting anything it says, not least because of the violation of the condition that the probabilities are non-negative and sum to one. But being able to explain why we use MSE or cross-entropy or any other loss function and which output activations (hint: and probability distributions) they are typically associated with actually has a very deep origin in the foundations of probability theory which blows open a whole new way of thinking about statistical modelling that is not made available in any of the programs whose materials I've been exposed to.
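A tiny sketch of the sum-to-one point, in plain Python. The logits are hypothetical numbers, and `sigmoid`, `softmax` and `argmax` are hand-rolled helpers rather than any library's API. Independent sigmoid outputs are each in (0, 1) but nothing forces them to sum to one, while softmax outputs do; the argmax is the same in both cases, which is why the problem is easy to miss.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

logits = [2.0, 1.0, -1.0]  # hypothetical final-layer outputs for 3 classes

sigmoid_scores = [sigmoid(z) for z in logits]
softmax_probs = softmax(logits)

print("sigmoid scores:", [round(s, 3) for s in sigmoid_scores],
      "sum =", round(sum(sigmoid_scores), 3))
print("softmax probs: ", [round(p, 3) for p in softmax_probs],
      "sum =", round(sum(softmax_probs), 3))
print("same argmax:", argmax(sigmoid_scores) == argmax(softmax_probs))
```

Both give the same predicted label here, but only the softmax row can be read as a distribution over the classes.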

"But being able to explain why we use MSE or cross-entropy or any other loss function and which output activations (hint: and probability distributions) they are typically associated with actually has a very deep origin in the foundations of probability theory which blows open a whole new way of thinking about statistical modelling that is not made available in any of the programs whose materials I've been exposed to. "

What is the "very deep origin"? What is this "new way of thinking"? And what's so wrong with using argmax to make a classifier, if I don't care about estimating probabilities and just want the answer?

A lot of processes downstream to inference benefit from having a minimum of care put into the system design. We're talking 80/20 rule stuff here. It's a simple reorientation vs a janky argmax-classifier, but results in assumptions being obeyed broadly, in a max-entropy sense.

The key insight is that all prediction models can equally be framed as energy-based models (y = f(x) -> E = g(x, y)), and the job of ML is to estimate the joint distribution of x and y with suitable max-entropy surrogate distributions, performing MLE on this variational distribution against some training data. All the math in the theory follows from this (perhaps excluding causal stuff, but I am not familiar enough with those techniques to say for sure). Things get a little more complicated when you consider e.g. autoencoders, but the above still holds.
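One standard, concrete instance of the MLE framing (not the full energy-based argument above, just the textbook special cases): with a fixed-variance Gaussian as the assumed distribution, the negative log-likelihood is squared error up to a scale and a constant, and with a Bernoulli it is exactly binary cross-entropy. The sketch below checks this numerically in plain Python; the targets and predictions are arbitrary made-up numbers and the function names are hypothetical.

```python
import math

def gaussian_nll(y, mu, sigma=1.0):
    # Negative log density of N(mu, sigma^2) evaluated at y.
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)

def squared_error(y, mu):
    return (y - mu) ** 2

def bernoulli_nll(y, p):
    # Negative log probability of label y in {0, 1} under Bernoulli(p).
    return -math.log(p if y == 1 else 1.0 - p)

def binary_cross_entropy(y, p):
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

# Regression case: NLL minus half the squared error is the same constant everywhere.
targets = [0.3, -1.2, 2.5]
preds = [0.1, -1.0, 2.0]
const = 0.5 * math.log(2 * math.pi)
gauss_gaps = [gaussian_nll(y, m) - 0.5 * squared_error(y, m)
              for y, m in zip(targets, preds)]

# Classification case: Bernoulli NLL and binary cross-entropy coincide.
labels = [1, 0, 1]
probs = [0.9, 0.2, 0.6]
bce_gaps = [bernoulli_nll(y, p) - binary_cross_entropy(y, p)
            for y, p in zip(labels, probs)]

print("Gaussian NLL - 0.5 * squared error:", gauss_gaps, "constant:", const)
print("Bernoulli NLL - BCE:", bce_gaps)
```

So minimizing MSE is maximum likelihood under a Gaussian assumption, and minimizing cross-entropy is maximum likelihood under a Bernoulli (or categorical) one, which is where the usual pairing of loss and output activation comes from.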

Obviously with the choice of a poor surrogate distribution, your predictions will on average be worse. Yes, even if you don't care about probabilities and just want max-likelihood predictions -- your predictions will on average be worse. By construction, analysis proceeds by framing the problem as this and following through. A janky argmax-classifier is not exempted from this -- it, too, already implies a surrogate distribution, but you know, statistically speaking, it's probably a pretty bad one. So it makes sense to put a tiny bit more effort to get way closer to representing the space that your data lives in.

Naturally, you could easily find a janky model that outperforms some relatively unoptimized principled model on a specific use case, and many do get lucky with this. But the principled model has a lot more headroom specifically in terms of the information it can hold, because if the design is more or less correct to the problem specification then the inductive bias built into the model matches closely with the structure of the data which is observed.

Very little of ML is "principled" (e.g., taking into account the probability distributions, priors, bounds on the values of parameters, etc.). Most of the time it is actually a brute-force approach that lets modelers avoid "thinking" about probability distributions, transformations, etc.

I did a lot of the "principled" modeling you talk about, in Stan, TMB, and JAGS back in the day. But outside of the need for an "explanation" of model behavior, which is a scientific need much more than an engineering need, I would almost always favor a "brutish" approach for prediction in industry. (Mind you, not having an explanation here does not mean having no idea what the model does; it is relative to the relationship between x and y, both in how we arrive at the parameter estimates and in how we interpret the parameters themselves.) My reasons are (1) convenience; (2) accuracy, which is almost always better for ML models even with unprincipled methods; and (3) outside of proper causal inference, predictions are what matter, and even when people demand an "interpretation", causality is just a guess anyway when the data and model are not up to that kind of analysis.

Do you have a reference to a paper that demonstrates the empirical superiority of energy-based models to well-tuned "janky argmax-classifiers"? I find it a little hard to believe there's a free lunch here given the relative popularity of basic argmax stuff – if energy-based models were obviously better, it seems like they'd be used more. But I am open to evidence on this point!
What you described seems to me pretty standard in ML and even more in statistical modeling. Maybe because I am coming from applied math and statistics.
> In part because ML fails silently by design.

That's why there's so much iteration and feedback gathering (e.g. A/B tests) as a part of DS/ML, which incidentally is rarely a part of the interview loop.

Anyone who claims they can get a good model the first time they train it is dangerously optimistic. Even the "how it works" aspect has become more and more marginal due to black boxing.

I'm sure this happens, but do you think the problem is actually one of mathematical savvy?

My guess would be that more machine learning projects go off the rails for want of understanding the data or the {business, research} problem.

In my experience the bulk of the problem is insufficient monitoring. ML systems need heavy monitoring and should be sending lots of metrics to something like Prometheus/Grafana. There should also be validation/consistency checks for all data pipelines and feature transformations. And you should strongly avoid duplicating logic for things like feature preprocessing. I've seen people implement the "same" feature preprocessing pipeline twice (once in Python, once in Java), and it is very common to keep finding edge-case bugs for a long time, especially when those bugs only slightly impact model behavior.
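A minimal sketch of the kind of consistency check meant here, assuming a hypothetical feature that buckets a value to the nearest integer. `bucket_python` uses Python's built-in `round()`, which rounds halves to the nearest even integer; `bucket_java_style` imitates Java's `Math.round` as `floor(x + 0.5)` (an approximation written in Python for illustration, not the Java implementation itself). `parity_mismatches` is a made-up helper that runs both on the same inputs and reports disagreements.

```python
import math

def bucket_python(x):
    # Python's round() sends exact halves to the nearest even integer.
    return round(x)

def bucket_java_style(x):
    # Approximates Java's Math.round, which rounds exact halves upward.
    return math.floor(x + 0.5)

def parity_mismatches(inputs, impl_a, impl_b):
    # Run both implementations on identical inputs and collect disagreements.
    return [(x, impl_a(x), impl_b(x)) for x in inputs if impl_a(x) != impl_b(x)]

samples = [0.5, 1.5, 2.5, 3.2]
mismatches = parity_mismatches(samples, bucket_python, bucket_java_style)

for x, a, b in mismatches:
    print(f"input {x}: python-style -> {a}, java-style -> {b}")
```

Only the exact halves 0.5 and 2.5 disagree, which is the sort of edge case that shifts a feature slightly and goes unnoticed unless a parity test like this runs against both code paths.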

Another issue is the proliferation of data pipelines. The more distinct pipelines you have, the more painful they become to monitor. It is much better to minimize the number of pipelines and build views on a small number of them. I think proliferation of models is a similar issue. It is often easier to build 4 models instead of 1 multi-task model, but monitoring and operational tasks grow more and more painful as you manage more models.

Not necessarily mathematical savvy though a lot of deeper understanding can follow from a strong grasp on the fundamentals. I think it has more to do with the alignment between intuitions and outcomes, and this is not taught well in most academic programs as far as I can tell.
And you learn literally zero about a candidate's ability to understand when and why things work by asking questions about eigenvectors. Someone can understand what an eigenvector is and still not have any clue about how you figure out a system is working, why it's working, what is likely to happen in production, how you test the limits of your method's ability to generalize, how you take a real problem and find something that you can productively use ML on, etc.

People say things like "you need to know how it works" but "it" doesn't work using your knowledge of eigenvectors. If you want to test how "it" works, test that, literally. Put up a model on the board and a dataset. Ask people about what might happen when you apply one to the other. What changes they would make in response to changes in the data. What they would do in response to the following training curves, budget limitations, etc.

These interviews are terrible and they select for people that regurgitate facts.

The "trivia" tests, when used (IMO) correctly, are not for testing whether or not the candidate recognizes the term and can regurgitate a definition. I prefer to listen to how they phrase their response to get a sense of the intuition behind the understanding of the concept as well as how it may fit in to a larger mathematical framework (i.e. their internal model for mathematical analysis).

I am not looking for someone to answer the question correctly, but to answer the question in a way that demonstrates deeper insights, which helps immensely in research settings as re-using properties of mathematical constructs in novel ways is often how theory and practice both are advanced.

I would be much less interested in someone giving a precise definition of eigenvalues than in someone describing them in a way that shows they understand e.g. what can be deduced about an operator when one of its eigenvalues is zero.
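To make the zero-eigenvalue example concrete, here is a small sketch in plain Python with a hand-picked 2x2 matrix (the matrix, the helper names and the null vector are all chosen for illustration). A zero eigenvalue means the determinant is zero, so the operator is not invertible, and some nonzero vector gets mapped to zero.

```python
import math

A = [[2.0, 4.0],
     [1.0, 2.0]]  # second row is half the first, so the rows are dependent

def determinant_2x2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def eigenvalues_2x2(m):
    # Roots of the characteristic polynomial t^2 - trace*t + det.
    # Assumes the discriminant is non-negative, which holds for this matrix.
    trace = m[0][0] + m[1][1]
    disc = math.sqrt(trace ** 2 - 4 * determinant_2x2(m))
    return (trace - disc) / 2, (trace + disc) / 2

def matvec(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]

low, high = eigenvalues_2x2(A)
null_vector = [2.0, -1.0]  # eigenvector for the zero eigenvalue

print("eigenvalues:", low, high)
print("determinant:", determinant_2x2(A))
print("A @ null_vector:", matvec(A, null_vector))
```

Because `A` sends a nonzero vector to zero, it collapses a whole direction of the input space, and no inverse can recover that lost information.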

Maybe "eigenvectors" is a bad example, because it's a pretty foundational linear algebra concept.

But there is a threshold where it stops being a test of foundational knowledge and starts being a test of arbitrary trivia, and favors whoever has the most free time to study and memorize said trivia.

The difference between trivia and meaty knowledge is somewhat contextually dependent, but an understanding of how core probability and statistics concepts are integrated into the framework of machine learning by means of linear algebra and the other analytical tools is pretty damn useful to have substantive conversations about ML design decisions. Helps when everyone in the team speaks that language to keep up the momentum.
Having recently completed an MLE interview loop successfully at a top company, I'm wondering where you are getting asked complicated linear algebra questions in interviews?
Hopefully you aren't equating "eigenvectors" to "complicated linear algebra question".

But I agree, a lot of MLE roles don't get asked such things.

I think the OP's guide is closer to interviews I've seen for PhD programs.

> Hopefully you aren't equating "eigenvectors" to "complicated linear algebra question".

They explicitly say something harder than eigenvectors in the GP.

I was imagining something involving the spectral theorem or something like that, ie. beyond the most basic linear algebra.

OP's guide seems to cover plenty of things I'd expect someone to learn in undergrad. I think I touched on almost all of this, except for stuff involving JAX and recent CNN architectures, both of which can easily be supplemented online.

A reason for such requirements is similar to the reason software engineers need to grind LeetCode: supply and demand. Prestigious companies get hundreds, if not thousands, of applications every day. The companies can afford to look for candidates who have raw talent, such as the capability of mastering many concepts and being able to solve hard mathematical problems in a short time. Case in point, you may not need to use eigenvectors directly in the job, but the concept is so essential in linear algebra and I as a hiring manager would expect a candidate to explain and apply it in their sleep. That is, knowing eigenvectors is an indirect filter for people who are deeply geeky. Is it the best strategy for a company? That's up for discussion. I'm just explaining the motives behind such requirements.
I can’t help but think there’s been a ton of filters used in the past to figure out if someone is deeply geeky, and we’ll continue to invent more in the future.

It’s really looking like another rat race. Especially since there’s no central authority, every hiring manager has the potential to invent their own filter, and make it arbitrarily harder or easier based on supply and demand (and then the filter drifts away from the intended purposes).

It will be a rat race when there are so many interview books and courses and websites. It was not a rat race before 2005, when there were only two reasons that someone could solve problems like Pirate Coins or Queen Killing Infidel Husbands: the person is so mathematically mature that such problems are easy for them, or the person is so geeky that they read Scientific American or Gardner's columns and remembered everything they read.
You're missing the third category: people like myself who absolutely love these kinds of riddles and destroy them in a few minutes, without that having any bearing on their actual work abilities.

I don't think I'm a bad engineer, but I'm certainly not the rock star you absolutely need for your team. When it comes to this kind of "cleverness" test, though, I'm really, really good.

I had the "Queen Killing Infidel Husbands" problem (under another name) in an interview last year and aced it in a few minutes. I didn't know about "Pirate Coins", but when I read your comment HN said it was posted "35 minutes ago" and now it says "40 minutes", which means I googled the problem, figured out the solution, and then found the answer online to check that I was right in less than 6 minutes, all while putting my son to bed!

It's really sad because there are many engineers much better at their job than me who will get rejected because of pointless tests like this…

If somebody asked me logic/brainteaser questions like that, I would politely stop them, explain that if they're asking me that question I'm not a good match for the company, and if they would like to ask a better question, I'm open to it, but otherwise, we can end the application process now. I did that recently with a junior eng who asked me a leetcode question literally with the same exact test data as the leetcode page. I ended up explaining to the CEO that at the very least his engineers should be creative enough to come up with different test data, but that realistically, if "recognize the need for, and implement binary search in 45 minutes" is your go-to question, I'm not gonna be a match at your company.

I had to fight my way into Google by doing every bit of prep and practice to solve stupid questions and code quicksort, but when I joined, nothing I did in the 12 years I was there required any of that. And I wrote high-performance programs that ran on millions of cores (I did know some folks who needed that skill, like the search engine developers, or the maps engine, or the core scheduling algorithms in Borg). The entire time I was there I tried to get people to understand that the questions they were asking are just not good indicators of programming ability, but it was repeatedly pointed out that the goal is to minimize false-positive hires.

I do admire your ability to solve problems like that quickly, always wished I could.

> If somebody asked me logic/brainteaser questions like that, I would politely stop them, explain that if they're asking me that question I'm not a good match for the company

This is exactly what I started to do after I was asked a leetcode-based question for a SRE manager position.

It turned out that by making my "profile" clear, I stopped getting bullshit interviews and started getting ones more aligned with actual daily work.

The Queen problem first showed up in a Putnam Math Contest. If you solved it in no time, then you're mathematically talented, which puts you in the first category.
I'm not questioning the fact that I'm kind of gifted when it comes to mathematics (I actually ranked #72 in a nation-wide math contest in France when I was 10), but you were talking about "maturity" and not innate skill. Since I don't have a math degree and I haven't done math in more than a decade, I'm definitely far from "mature" from any mathematical perspective that could matter for a job.

And after ten years working in the industry, I can assure you that it is not a skill I can leverage a lot in my job…

But if there is an abundance of supply, the company has to use some kind of filter.

Testing for geekiness and the ability to solve tricky coding and math problems seems like a rational way to do that.

If companies were starving for talent because 'nobody could pass the test' - it would be another thing.

But they have to set the bar on something, somewhere.

I can't speak to AI/ML but I would imagine it might be hard to hire there, given the very deep and broad concepts, alongside grungy engineering.

I've rarely had such fascination and interest in a field that I would never actually want to work in.

There’s an abundance of supply of people with masters degrees in machine learning? How’s that possible? I thought this shit was supposed to be hard.

Has humanity just scaled way too hard or something, because if we’re having an abundance of supply in difficult cutting edge fields to the point where they also have their own version of Leetcode, then what hope do average people have of getting any job in this world?

Or, is it at all possible that companies are disrespecting the candidate pool by being stingy and picky?

Maybe the truth is gray.

I currently work as an ML engineer and have interviewed on both sides for some well known companies.

The absolute demand in number of people is small compared to the field's popularity. It would not surprise me at all if many computer science master's programs had a majority of their students studying machine learning. I remember in undergrad we had to ration computer science classes due to too much demand from students. I think the number of CS majors at my school tripled over a couple of years.

The number of ML engineers needed is much smaller than the total number of software engineers. When a lot of students decide ML is the coolest thing, we get an imbalanced CS pool with too many people wanting to do ML. Especially since, for ML to work, you normally need good data engineering, backend engineering, and infra, and the actual ML is only a small subset of the service using ML.

At the same time, the supply of experienced ML engineers is still low due to the recent growth of the field. Hiring ML engineers with 5+ years of professional experience is more challenging. The main place where supply is excessive is new graduates.

> There’s an abundance of supply of people with masters degrees in machine learning? How’s that possible? I thought this shit was supposed to be hard.

I think it's just a matter of proliferation of these types of programs, as well as a large supply of students.

Also, the average qualification of people working in ML is probably no longer a Ph.D, like it used to be. This is arguably because deep learning techniques require less involved math to understand, and are more focused on computational methods that work well.

So the field has probably saturated. When I got involved with ML for the first time (well, really, statistical signal processing) in the mid 2000s, the field was kind of dead, and very highly qualified postdocs had a tough time finding jobs.

> There’s an abundance of supply of people with masters degrees in machine learning? How’s that possible?

I don't know for ML, but there are almost 12k Masters CS degrees awarded per year and 1.1k PhDs. If my university is any indication, then there's a good portion of those that are ML or doing some sort of ML in their research. But even if it was just 10%, that's a lot of people per year that are being added. This is just the US btw.

https://datausa.io/profile/cip/computer-science-110701

> Case in point, you may not need to use eigenvectors directly in the job, but the concept is so essential in linear algebra and I as a hiring manager would expect a candidate to explain and apply it in their sleep.

Exactly. Whenever eigenvectors come up during interviews, it’s usually in the context of asking a candidate to explain how something elementary like principal components analysis works. If they claim on their CV to understand PCA, then they’d better understand what eigenvectors are. If not, it means they don’t actually know how PCA works, and the knowledge they profess on their CV is superficial at best.

That said, if they don’t claim to know PCA or SVD or other analysis techniques requiring some (generalized) form of eigendecomposition, then I won’t ask them about eigenvectors. But given how fundamental these techniques are, this is rare.
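For readers who want the PCA/eigenvector connection spelled out, here is a bare-bones sketch in plain Python: center the data, form the covariance matrix, and find its dominant eigenvector, which is the first principal component. The five 2D points are made up (roughly along y = 2x), power iteration stands in for a proper eigensolver, and `covariance_2d` and `top_eigenpair` are hypothetical helper names; a real pipeline would use a library routine instead.

```python
import math

# Made-up 2D data lying roughly along the line y = 2x.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]

def covariance_2d(pts):
    # Sample covariance matrix of the centered data.
    n = len(pts)
    mx = sum(p[0] for p in pts) / n
    my = sum(p[1] for p in pts) / n
    sxx = sum((p[0] - mx) ** 2 for p in pts) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pts) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in pts) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]

def matvec(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]

def top_eigenpair(m, iterations=50):
    # Power iteration: repeatedly apply the matrix and renormalize.
    v = [1.0, 0.0]
    for _ in range(iterations):
        w = matvec(m, v)
        norm = math.hypot(w[0], w[1])
        v = [w[0] / norm, w[1] / norm]
    mv = matvec(m, v)
    eigenvalue = v[0] * mv[0] + v[1] * mv[1]  # Rayleigh quotient for a unit vector
    return eigenvalue, v

cov = covariance_2d(points)
eigenvalue, direction = top_eigenpair(cov)

# Share of total variance captured by the first principal component.
explained = eigenvalue / (cov[0][0] + cov[1][1])

print("first principal direction:", direction)
print("slope of that direction:  ", direction[1] / direction[0])
print("explained variance ratio: ", explained)
```

The direction that comes out has a slope close to 2, matching how the data was laid out, and the eigenvalue is the variance along it. That is the sense in which knowing PCA implies knowing eigenvectors.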

Given that PCA is heavily antiquated these days, I'd say that asking your candidates to know algebraic topology (the basis behind many much more effective non linear DR algorithms like UMAP) is far better... But in spite of the field having long ago advanced beyond PCA, you're still using it to gatekeep.
The initialization strategy for UMAP is important enough that asking about that in practice is probably more important than anything out of Ghrist's book as an interview question

cf. https://twitter.com/hippopedoid/status/1356906342439669761

UMAP (and t-SNE) aren't the same as PCA. UMAP is pretty close to t-SNE, and I think expanding PCA (Principal Component Analysis) and t-SNE (t-distributed Stochastic Neighbor Embedding) explains the difference. Neighbor embedding is a visualization technique and not the same as determining principal components. PCA preserves global properties while t-SNE and UMAP don't. They are good techniques for _visual_ dimensionality reduction, but they aren't going to tell you the dominant eigenvectors of the data, or _dimensionality reduction_. This is a bit of a pet peeve of mine.

There's some more in this SE post https://stats.stackexchange.com/questions/238538/are-there-c...

> asking your candidates to know algebraic topology

Congratulations, you've eliminated 99% of the ML research community.

and yet we're also told that tech companies can't get enough people.
> But I don't get why you would be asked about eigenvectors and challenging algorithm problems for an ML Engineering position when you have already proven yourself with a Master's degree and enough professional experience.

People know pity passes exist for Master's degrees. You can't trust that someone actually knows what they should know just because they have a degree. Ditto professional experience. The entire reason FizzBuzz exists is because people with years of professional experience can't program.

We aren't talking about FizzBuzz here, but rather the fashionable practice of subjecting people to 4-6 hours of grilling on "medium-to-hard" problems that you absolutely cannot fail, or even be slightly halting in your delivery on. And which can only be effectively prepared for by investing substantial amounts of time in by-the-book cramming.

On top of the fact that these problems are often poorly selected, poorly communicated, conducted under completely unrealistic time pressure, often as pile-ons (with 3-4 strangers as if just to add pressure and distraction), and (these days) over video conferencing (so you have to stare in the camera and pretend to make eye contact with people while supposedly thinking about your problem, on top of shitty acoustics), etc, etc.

It's just fucking ridiculous.

I'm quite happy these places make it so clear they're not places I would be happy to work. I always ask about the interview process and tell the recruiters I'm not interested if they expect really lengthy processes. I'm fine with things dragging out if they have additional questions after initial interviews, but not if their default starting position is that they need that.
I figure the best way to prepare for an ML job is to pull out the nastiest working rat’s nest of if statements you’ve ever written & claim it was autogenerated by an adversarial network (which was you fighting with your coworkers over your spaghetti code).
This really made me laugh, thanks.