by shikharja 2485 days ago
@soVeryTired, the challenge showcased in the test does involve a lot of wrangling, handling missing data points, spotting bias and identifying the right features for the regression model. The challenge is designed to allow for a candidate's creativity.

Do you think the data set we have used doesn't do that to the extent you'd expect from a data scientist?

Any assessment that directly provides data sets - even with "gotcha"s like missing values - is testing, based on the conventional wisdom, at most 20% of a real-world data science workflow. And IMO it's the least critical 20%.

The only good end-to-end "technical" data science assessment I can think of is to pose a broad question or business problem that's addressable by applying data science techniques to publicly available data. But a nontrivial version of that assessment would take half a day on the very low end, and long assessments anti-select against good candidates.

IMO, when it comes to evaluating data scientists, the only thing that online coding assessments are good for is to ensure that they can perform basic coding and data manipulation tasks. (I'd include tasks like web scraping, image manipulation, API calls, and ORM stuff in this category). Everything else needs to be evaluated in person.

Do you think candidates looking for Data Science jobs would be open to performing a half-day exercise?

We optimized these challenges to let candidates show as much of their skills as they can in a timed window, without killing their creativity. I'd be curious to know what you think is a good way to interview data scientists.

I think some candidates would be open to performing a half-day exercise. But the best candidates wouldn't, which is what drives the anti-selection I mentioned in my previous comment. More broadly, I don't think it's realistic to create an assessment that's representative of real-world data science workflows without being onerous enough to exclude good candidates.

If representative isn't an option, highly correlated is the next best thing. In practice, for my team specifically, this means screening for math aptitude and general business acumen during a phone screen, data manipulation (moderately complex SQL + tidyverse/data.table/pandas) during a "take-home", and delving more into problem solving approach, model selection and validation, etc. during an onsite. Broad business questions (e.g., "How does a life insurance company make money?") and communication skills generally weed out the candidates who picked up the bare minimum math and programming background through Kaggle + MOOCs.

As an aside, I absolutely think that the sort of assessment in the OP kills creativity. I care a lot about whether a candidate would think to include covariates like Internet usage and segmented urban population when predicting mortality rates; I don't care at all whether they're able to write the trivial amount of code that's needed to include those covariates in a model, given a data set that already contains them.
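
To make the "trivial amount of code" concrete, here is a rough sketch of what adding those covariates looks like once the data set already contains them. Everything in it is an assumption for illustration: the column names (`internet_usage`, `urban_share`, `mortality_rate`), the synthetic rows, and the hypothetical `fit_ols` helper, which is a plain least-squares fit written with the standard library only. It is not the data or the model from the assessment in the OP.

```python
# Illustrative only: synthetic rows and hypothetical column names,
# not the data set from the assessment discussed above.

def fit_ols(rows, target, covariates):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    # Design matrix with a leading 1.0 for the intercept.
    X = [[1.0] + [float(r[c]) for c in covariates] for r in rows]
    y = [float(r[target]) for r in rows]
    k = len(X[0])
    # Augmented system [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    coefs = [A[i][k] / A[i][i] for i in range(k)]
    return dict(zip(["intercept"] + list(covariates), coefs))


# Synthetic data: mortality is constructed as an exact linear function
# of the two covariates, so the fit is easy to sanity-check.
points = [(10, 30), (40, 35), (55, 60), (70, 50), (90, 80)]
rows = [
    {"internet_usage": i, "urban_share": u,
     "mortality_rate": 12.0 - 0.05 * i - 0.02 * u}
    for i, u in points
]

# "Including the covariate" is one more name in a list.
baseline = fit_ols(rows, "mortality_rate", ["internet_usage"])
extended = fit_ols(rows, "mortality_rate", ["internet_usage", "urban_share"])
print(baseline)
print(extended)
```

The only difference between the two fits is an extra entry in the covariate list. Deciding that the entry belongs there is the part worth evaluating.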

Typically take-home tests in any SWE field don't account for everything as it's a layer of screening: that's what an on-site is for (even better: adjust the on-site to address the results of the take-home).

I completely agree, and that's exactly my point: the median data science screening process tries to be all-encompassing to an extent that would seem ridiculous (for a "take-home" assessment) in any other technical field.