Why do you think candidates for data science jobs aren't qualified relative to o...

yummyfajitas · on March 6, 2017

I've probably interviewed about 70-100 such people in the past year and a half. Exactly 1 such person was qualified (I hired him). The issue in my view is the following: people who know both statistics and computer science are extremely rare.

People who actually understand statistics are rare. I can probably weed out 1/3 to 1/2 of candidates simply by asking what a p-value is, or what precision/recall are (this includes people who said they worked in search).

Of the ones who know basic stats, most are neither good at nor interested in programming. They just want to use existing libraries to crunch numbers in a Jupyter notebook, then hand that off to the developers.

Finding a person who can come up with a predictive model, understand what they did, optimize it without breaking it's statistical validity and deploy it to production is very hard.

(If you can do this, I'm hiring in Pune and Delhi. Email in my profile.)

halflings · on March 6, 2017

Ignoring what a p-value is does not mean that you don't know statistics. p-tests are not some inherent statistical property, they're just a useful model for significance. People coming from a CS background most likely didn't have to deal with p-values, but they can still be good at linear algebra or bayesian statistics.

(not sure I can defend somebody that does not know what precision/recall are)

Bootvis · on March 6, 2017

Regarding precision/recall, I've a background in financial econometrics and this is the first time I encounter the terms.

kenjackson · on March 6, 2017

I think the problem is that certain subfields use different terminology to mean similar or identical concepts. For example, while I'm in software, I tend to hear the terms sensitivity and specificity. They are historically medical terms. They aren't identical to recall/precision, but I think you can derive one set from the other.

antognini · on March 6, 2017

The fundamental thing to know is the confusion matrix. There are about a dozen terms for various descriptors of the matrix, but they all can be calculated if you know the confusion matrix. The Wikipedia page has a great table to describe them all:

https://en.wikipedia.org/wiki/Confusion_matrix

You can see from that that sensitivity and recall are the same thing, but specificity and precision are not.

Bootvis · on March 6, 2017

That certainly seems likely and a good thing to keep in mind when you're giving or taking an interview!

jknoepfler · on March 6, 2017

It's literally the first thing you learn in data science / machine learning coursework about evaluating model performance. It would probably be better to ask the candidate to whiteboard a set of metrics for evaluating model performance rather than ask for the definition of a pair of words, but the concept is practically the for-loop of data science.

Edit: note that I'm not saying you need this to add roi as an analyst for a business!

achompas · on March 6, 2017

For those downvoting this comment -- it's absolutely true that model performance is discussed at length early in ML courses (usually in the context of the bias-variance tradeoff).

My only quibble would be that precision + recall are one set of evaluation metrics applicable to classification tasks. Modelers can absolutely use other loss functions.

Additionally, precision/recall do not map nicely to regression problems, so people use other metrics (RMSE, MAE, etc.).

Bootvis · on March 6, 2017

I haven't taken a lot of data science classes but I'm not sure that's true. If you start with linear regression the mean squared error would make more sense. I actually searched through "The Elements of Statistical Learning" and the word 'recall' is not used in this sense at all.

mjn · on March 6, 2017

The jargon does vary by subfield and community, along with the actual measures used (sometimes it's just a different name, but sometimes practices are different as well). Precision/recall are terms from information retrieval that migrated into the CS-flavored portion of machine learning, but are not as common in the stats-flavored portion of ML, in part because some statisticians consider them poor measures of classifier performance [1]. Hence they don't show up in the Hastie/Tibshirani/Friedman book you mention, which is written by three authors solidly on the stats side of ML. It does occasionally mention some equivalent terms, e.g. Ctrl+F'ing through a PDF, I see that in Chapter 9 it borrows the sensitivity/specificity metrics used in medical statistics, where sensitivity is a synonym for recall (but specificity is not the same thing as precision). It looks like the book more often uses ROC curves, though, which have their own adherents and detractors.

[1] This paper is the one that most often gets cited as background by people who don't like recall/precision as metrics: http://dspace2.flinders.edu.au/xmlui/bitstream/handle/2328/2...

mehaveaccount · on March 7, 2017

People don't pay for linear regressions. They pay for discrete things: what is my best option among my three clear courses of action. Linear regression can be a tiny piece of a larger argument in favor or against one option or the other, but that alone doesn't make money.

Bootvis · on March 7, 2017

That's obvious but not at all what I responded to in my post.

I responded to the claim that ML courses start with the definition of precision and recall. In my admittedly limited experience those courses start with linear regression and mean squared errors. After that, there is so much generalization possible and that doesn't include precision/recall.

You make money by solving someone's problems, making money by stating definitions is only done on TV quizzes.

halflings · on March 14, 2017

That's OK. The article was talking about somebody interviewing for a search-related position (where precision and recall are usually what you are optimizing for). I guess they might be called differently in econometrics?

yummyfajitas · on March 6, 2017

I usually tailor my questions to whatever the person said they did. If they say they did search, I ask about precision recall. If they discuss hypothesis testing, I'll ask about p-values or other approaches.

I'd happily take a Bayesian answer if they preferred that, but that hasn't happened very often.

x0x0 · on March 7, 2017

I can't imagine how you could learn Bayesian stats w/o starting with a basic inference class, which will cover p-values at length.

halflings · on March 14, 2017

I did Bayesian stats without ever discussing p-values. Not al classes discuss them. (my background is ML, where p-values are not as useful as say in biology or any field relying on experiments with control groups)

dandermotj · on March 6, 2017

You're certainly not going to study Bayesian statistics without knowing (or at least having studied) what a p-value is.

AstralStorm · on March 7, 2017

Actually p-values are used way less often in Bayesian statistics than frequentist ones. The latter rely on statistical tests more.

Bayesian stats tend to use likelihood ratios or Bayes factors instead of p-values for hypothesis testing.

The trick in all cases is that you're comparing to expected results given some prior distribution. Most people use a dumb prior (e.g. Gaussian) and then they're confused when the numbers make no sense as data is multimodal or heavy tailed, thus mismodelled.

dandermotj · on March 7, 2017

I studied statistics - my point was that statistics is taught in a linear manner, starting with distributions and hypothesis testing (p-values) and then move onto more advanced treatments like Bayesian stats.

maus42 · on March 7, 2017

That happens in statistics programs. However, I have a ML-heavy minor in CS, and based on the ML course contents at our CS dept I've seen, I'm not sure if the all their CS majors go through the the full canonical statistics curriculum, nor that they were intended to. At least the ML courses had quite much introductory probability and statistics as far as ML applications were concerned, so I understood the implication was they didn't assume that the students would have already done the similar stuff in statistics (though it certainly helped), and I can't remember a single mention of p-value there.

And then there's this, that even if your intro to probability course everywhere covers the classic statistics with p-values and hypothesis testing and frequentist confidence intervals and so on, you are not necessarily going to use them that much. I calculated some p-values and other tests with R for some example datasets a couple of years ago and never seen them since in coursework, everything we've done after that has been more or less fully Bayesian. The concepts are still fresh[1] in my mind mostly because I read some statistics blogs, such as Andrew Gelman's [2]. The irony is that Gelman does not exactly love frequentist framework, he just mentions its concepts often enough.

[1] or not totally forgotten

[2] http://andrewgelman.com/

dwpdwpdwpdwpdwp · on March 6, 2017

I'm surprised people bother spending so much energy looking for someone who is both a statistician and a computer scientist knowing they are so rare. There are so many more statisticians who can at least communicate and work effectively with developers and vice versa. Why not just compose a team? I feel like just like other professionals have assistants, statisticians should have them too, and they'd be focused on the computer science and deployment of the applied statistics.

ska · on March 6, 2017

This is a classic problem that shows up equally with lots of related areas: numerical work, statistics, ML, signal processing etc.

"just compose a team" sounds easy, doesn't it? Unfortunately there are lots of failure modes involving different parts of the team not really understanding what each other are trying to do, let alone what they are doing, and subtle errors getting by people who don't know what to look for. So, you can find such teams and some of them work well but a lot of them don't.

So an alternate is to try and find or create domain experts who mix all the appropriate skills, but this is hard and in the extreme case involves chasing down unicorns.

Companies and industries flop back and forth between preferring different approaches - right now a lot of people are talking about "data scientists" as one of the latter, but it will likely change over time as it always does.

It's a hard problem, and it shows.

dwpdwpdwpdwpdwp · on March 6, 2017

As an engineer, I usually know better than to use the phrase I hate to hear: "why don't you just..."

pbhjpbhj · on March 6, 2017

Surely "Why don't you just... ?" is an exceptionally good phrase to use. In practice people mean "Just do ... !", which is very different. The why question, however, gets to the heart of an issue, it's a short hand for "The obvious solution appears to be that ... but I imagine you tried that and have a reason not to do things that way, what are those reasons?". It's a direct learning-centred enquiry that ekes out the kernel of complexity of a situation relying on the wisdom of the person it's aimed at.

So why don't you just use the phrase "Why don't you just ...?"?

olivierva · on March 6, 2017

I couldn't agree more. I just got hired to 'productionize' some proof of concept developed by data scientists in jupyter notebooks. The first thing I did was hiring a Python developer (no data science experience) to start cleaning up the code and a devops to put the infrastructure in place. Second step: I went to the data science department and sat down with them and I taught them how to program properly: test driven development, version control, code reviews (git pull requests) and continuous integration. They all have PhD's so it's not that they would have any trouble learning anything new. They thought it was great. Result: all their new code now directly goes (via code review and CI unit/end2end testing) into pre-production and after sign off from the product owners into production (quite often the same day). I just do not understand why instead of trying to find the perfect person for the job people don't just hire someone to teach them how to do the programming part of their job properly. Good teams are cross functional, diverse and have a strong focus on transferring knowledge.

SatvikBeri · on March 6, 2017

This is a good question. I've tried both approaches, and currently favor going after the rare multi-skilled hire. In general, I have seen many cases where one person who has a small-medium amount of experience/ability in both is a lot more productive than two specialists.

> There are so many more statisticians who can at least communicate and work effectively with developers and vice versa.

Not in my experience. You need to design your data infrastructure to promote easy analysis, and you need to design your models to scale well according to the amount of data you're working with. There are also many cases where a project will require mostly engineering work for a while, and then mostly analysis/statistics work–there are ways to handle this with specialists of course, but there's generally a significant switching cost.

Also people with a combination of statistics & programming aren't that rare–IMO it's more that employers tend to search for both degrees, when instead you should be trying to evaluate the skills directly.

SatvikBeri · on March 6, 2017

I have similar experiences. To add some color: I find that for data science tasks, someone who knows statistics & can program is much, much more productive than someone who only knows one. Part of that is because data science has to do their own product management–the question you ask next changes quite rapidly depending on the results of a single query.

That said, most companies should probably be hiring data engineers rather than data scientists–for most "data science" jobs I've seen, almost no statistics is actually necessary/useful.

drxzcl · on March 6, 2017

This is also very true! Usually we hire AI/stats folks and do a heck of a lot of training to get them up to speed on the development side of things. You can do it the other way around, but math is a lot harder to pick up outside of formal education than computer stuff.

johan_larson · on March 9, 2017

When you say "know statistics", how high are you placing the bar? Lots of people are forced to take one stats class (for non-specialists) in college, and it goes up from there.

ska · on March 6, 2017

Two main factors make data science stick out a little for me right now, although it isn't unique.

One is that there is buzz & excitement around "data science" right now. Nothing specific to this area, but in my experiences this creates a large number of under- or un-qualified applicants. It also creates an environment for companies to desire to hire a role they are not well qualified to hire for. It is really difficult to hire well for roles you don't understand well.

The second thing is that extremely few people are actually ready for this sort of job straight out of an academic program. A related Ph.D. or post doc plus a few years solid training in industry can make you a great candidate, but the academic work alone usually isn't even remotely close. There is confusion about this among both candidates (don't know what they don't know) and hiring managers (don't know what they are actually looking for).

Add to that an oversupply of academic credentials relative to academic jobs and you have a problem. If you are a large company with a well defined data science program and a defined "entry level" data science role, if you take skill development and training seriously and have the senior staff for it, well then you are fine taking strong academic candidates and turning them into talented data scientists. If you are a less experienced company looking for scientists to solve a problem you don't fully understand, you may be in for a pretty rough ride.

drxzcl · on March 6, 2017

Not OP, but I think many companies aren't qualified to judge who is and who isn't a qualified candidate, at least for the first hires. This turns the whole thing into a "market for lemons".

I've helped a few organisations solve this bootstrap problem by helping out with candidate selection and interviews, but many other just don't ask for help.