The Big List of 205 Data Science Interview Questions

January 18th, 2017 by lewis

Here are our favorite data science interview questions.

If you’re a hiring manager, select the interview questions based on the competencies you’re evaluating.
If you’re a candidate, prepare and practice using this common list of data science interview questions.

Probability and Statistics Interview Questions

Explain what regularization is and why it is useful.
How would you validate a predictive model of a quantitative outcome variable that you built using multiple regression?
Explain what precision and recall are. How do they relate to the ROC curve?
How can you prove that one improvement you’ve brought to an algorithm is really an improvement over not doing anything?
What is root cause analysis?
Are you familiar with price optimization, price elasticity, inventory management, competitive intelligence? Give examples.
What is statistical power?
Explain what resampling methods are and why they are useful. Also explain their limitations.
Is it better to have too many false positives, or too many false negatives? Explain.
What is selection bias, why is it important and how can you avoid it?
Imagine a test with a true positive rate of 100% and false positive rate of 5%. Imagine a population with a 1/1000 rate of having the condition the test identifies. Given a positive test, what is the probability of having that condition?
What is the normal distribution? Give an example of some variable that follows this distribution.
What about log-normal?
Explain what a long tailed distribution is and provide three examples of relevant phenomena that have long tails. Why are they important in classification and prediction problems?
How do you check if a distribution is close to Normal? Why would you want to check it? What is a QQ plot?
Give examples of data that follow neither a Gaussian nor a log-normal distribution.
Do you know what the exponential family is?
Do you know the Dirichlet distribution? The multinomial distribution?
What is the Law of Large Numbers? The Central Limit Theorem?
Why are they important for Statistics?
What summary statistics do you know?
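
As a worked sketch of the screening-test question above (100% true positive rate, 5% false positive rate, 1/1000 prevalence), Bayes' rule can be applied directly. The function name `posterior_given_positive` is hypothetical and chosen only for illustration; exact fractions are used so the arithmetic is easy to check by hand.

```python
from fractions import Fraction


def posterior_given_positive(prevalence, true_positive_rate, false_positive_rate):
    """Return P(condition | positive test) using Bayes' rule.

    Hypothetical helper for illustration; all arguments are probabilities.
    """
    # Probability of a positive test AND having the condition.
    true_pos = prevalence * true_positive_rate
    # Probability of a positive test AND NOT having the condition.
    false_pos = (1 - prevalence) * false_positive_rate
    return true_pos / (true_pos + false_pos)


# Numbers taken from the question in the list above.
p = posterior_given_positive(Fraction(1, 1000), Fraction(1), Fraction(5, 100))
print(p, float(p))
```

The result is 20/1019, or roughly 2%: even with a perfectly sensitive test, most positives are false positives when the condition is this rare.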

Data Modeling Interview Questions

What are the most important skills for a data scientist to have?
What types of data are important for business needs?
What data would you go after and start working on?
What are the assumptions required for linear regression?
When you get a new data set, what do you do with it to see if it will suit your needs for a given project?
How do you handle big data sets?
How do you detect outliers?
How do you control model complexity?
How do you model a quantity you can’t observe?
You have one model and want to find the best set of parameters for this model. How would you do that?
How would you look for the best parameters? Do you know something else apart from grid search?
What is Cross-Validation?
What is 10-Fold CV?
What is the difference between holding out a validation set and doing 10-Fold CV?
How do you know if your model overfits?
How do you assess the results of a logistic regression?
Which evaluation metrics do you know? Any apart from accuracy?
Which is better: Too many false positives or too many false negatives?
What are precision and recall?
What is a ROC curve? What is AUROC (AUC)? How do you interpret the curve and the AUROC?
Do you know about Concordance or Lift?
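
For the cross-validation questions above, a minimal sketch of how k-fold splitting partitions a data set may help when practicing an answer. The helper `k_fold_indices` is hypothetical, and it assumes the rows are already in random order (no shuffling is done here); in practice a library routine would normally be used instead.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds.

    Hypothetical helper: assumes the data are already shuffled.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, validation
        start = stop


# 10-fold CV on 20 samples: every sample is used for validation exactly once.
folds = list(k_fold_indices(20, 10))
for train, validation in folds[:2]:
    print(len(train), validation)
```

A single hold-out split corresponds to using just one of these (train, validation) pairs, whereas 10-fold CV fits and scores the model on all ten and averages the results.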

Data Science Process Interview Questions

How would you create a taxonomy to identify key customer trends in unstructured data?
Python or R — Which one would you prefer for text analytics?
Which technique is used to predict categorical responses?
What is logistic regression? Or, state an example of when you have used logistic regression recently.
What are Recommender Systems?
Why does data cleaning play a vital role in analysis?
Differentiate between univariate, bivariate and multivariate analysis.
What do you understand by the term Normal Distribution?
What is Linear Regression?
What is Interpolation and Extrapolation?
What is power analysis?
What is K-means? How can you select K for K-means?
What is Collaborative filtering?
What is the difference between Cluster and Systematic Sampling?
Are expected value and mean value different?
What does P-value signify about the statistical data?
Do gradient descent methods always converge to the same point?
What are categorical variables?
How can you make data normal using the Box-Cox transformation?
What is the difference between Supervised Learning and Unsupervised Learning?
Explain the use of Combinatorics in data science.
Why is vectorization considered a powerful method for optimizing numerical code?
What is the goal of A/B Testing?
What is an Eigenvalue and Eigenvector?
What is Singular Value Decomposition?
What is Gradient Descent?
How can outlier values be treated?
How can you assess a good logistic model?
How can you iterate over a list and also retrieve element indices at the same time?
During analysis, how do you treat missing values?
Explain the Box-Cox transformation in regression models.
Can you use machine learning for time series analysis?
Write a function that takes in two sorted lists and outputs a sorted list that is their union.
What is the difference between Bayesian Inference and Maximum Likelihood Estimation (MLE)?
What is Regularization and what kind of problems does regularization solve?
What is multicollinearity and how can you overcome it?
What is the curse of dimensionality?
How do you decide whether your linear regression model fits the data?
What is the difference between squared error and absolute error?
What is Machine Learning?
How are confidence intervals constructed and how will you interpret them?
How will you explain logistic regression to an economist, physician scientist and biologist?
How can you overcome Overfitting?
Differentiate between wide and tall data formats.
Is Naïve Bayes bad? If yes, in what respects?
How would you develop a model to identify plagiarism?
Can you outline the steps in an analytics project?
Have you heard of CRISP-DM (Cross Industry Standard Process for Data Mining)?
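
Two of the coding questions above (merging two sorted lists, and iterating with indices) lend themselves to a short sketch. The function name `sorted_union` is hypothetical, and the sketch assumes "union" means each value appears once in the output even if it occurs in both inputs; an interviewer may intend a plain merge that keeps duplicates, so it is worth asking.

```python
def sorted_union(a, b):
    """Merge two sorted lists into one sorted list without duplicate values.

    Hypothetical helper: assumes both inputs are sorted in ascending order.
    Runs in O(len(a) + len(b)) time using the two-pointer merge step.
    """
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        # Take the smaller head element and advance that list's pointer.
        if a[i] <= b[j]:
            candidate = a[i]
            i += 1
        else:
            candidate = b[j]
            j += 1
        # Skip values equal to the last one emitted (the de-duplication step).
        if not result or result[-1] != candidate:
            result.append(candidate)
    # At most one of the two slices below is non-empty.
    for candidate in a[i:] + b[j:]:
        if not result or result[-1] != candidate:
            result.append(candidate)
    return result


merged = sorted_union([1, 3, 5, 7], [2, 3, 6, 7, 8])

# enumerate() yields (index, element) pairs while iterating over a list.
for index, value in enumerate(merged):
    print(index, value)
```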

Data Science Machine Learning Interview Questions

What is your favorite ML algorithm and why?
Describe the regression problem. Is it supervised learning? Why?
What is linear regression? Why is it called linear?
Discuss the bias-variance tradeoff.
What is Ordinary Least Squares Regression? How can it be learned?
Can you derive the OLS Regression formula? (For the one-step solution.)
Do we always need the intercept term? When do we need it and when do we not?
What is collinearity and what do you do about it? How do you remove multicollinearity?
What if the design matrix is not full rank?
What is overfitting a regression model? What are ways to avoid it?
What is Ridge Regression? How is it different from OLS Regression? Why do we need it?
What is Lasso regression? How is it different from OLS and Ridge?
What are the assumptions required for linear regression?
You would like to find significant features. How would you do that?
You fit a multiple regression to examine the effect of a particular feature. The feature comes back insignificant, but you believe it is significant. Why can it happen?
How do you check if the regression model fits the data well?
Can you describe what the classification problem is?
What is the simplest classification algorithm?
What classification algorithms do you know? Which one do you like the most?
What is a decision tree?
What are some business reasons you might want to use a decision tree model?
How do you build it?
What impurity measures do you know?
Describe some of the different splitting rules used by different decision tree algorithms.
Is a big bushy tree always good? Why would you want to prune it?
Is it a good idea to combine multiple trees?
What is Random Forest? Why is it good?
What is logistic regression?
How do we train a logistic regression model?
How do we interpret its coefficients?
What is an Artificial Neural Network?
How do you train an ANN? What is backpropagation?
How does a neural network with three layers (one input layer, one inner layer and one output layer) compare to a logistic regression?
What is deep learning? What is a CNN (Convolutional Neural Network) or an RNN (Recurrent Neural Network)?
What is Regularization?
Which problem does Regularization try to solve?
What does it mean (practically) for a design matrix to be “ill-conditioned”?
When might you want to use ridge regression instead of traditional linear regression?
What is the difference between the L1 and L2 regularization?
Why (geometrically) does LASSO produce solutions with zero-valued coefficients (as opposed to ridge)?
What is the purpose of dimensionality reduction and why do we need it?
Are dimensionality reduction techniques supervised or not? Are all of them (un)supervised?
What ways of reducing dimensionality do you know?
Is feature selection a dimensionality reduction technique?
What is the difference between feature selection and feature extraction?
Is it beneficial to perform dimensionality reduction before fitting an SVM? Why or why not?
What is Principal Component Analysis (PCA)? What is the problem it solves? How is it related to eigenvalue decomposition (EVD)?
What’s the relationship between PCA and SVD? When is SVD better than EVD for PCA?
Under what conditions is PCA effective?
Why do we need to center data for PCA and what can happen if we don’t do it? Do we need to scale data for PCA?
Is PCA a linear model or not? Why?
Do you know other Dimensionality Reduction techniques?
What is Independent Component Analysis (ICA)? What’s the difference between ICA and PCA?
Suppose you have a very sparse matrix where rows are highly dimensional. You project these rows on a random vector of relatively small dimensionality. Is it a valid dimensionality reduction technique or not?
Have you heard of Kernel PCA or other non-linear dimensionality reduction techniques? What about LLE (Locally Linear Embedding) or t-SNE (t-distributed Stochastic Neighbor Embedding)?
What is Fisher Discriminant Analysis? How is it different from PCA? Is it supervised or not?
What is the difference between a convex function and non-convex?
What is Gradient Descent Method?
Will Gradient Descent methods always converge to the same point?
What is a local optimum?
Is it always bad to have local optima?
What is Newton’s method?
What kind of problems are well suited for Newton’s method? BFGS? SGD?
What are “slack variables”?
Describe a constrained optimization problem and how you would tackle it.
What is NLP? How is it related to Machine Learning?
How would you turn unstructured text data into structured data usable for ML models?
What is the Vector Space Model?
What is TF-IDF?
Which distances and similarity measures can we use to compare documents? What is cosine similarity?
Why do we remove stop words? When do we not remove them?
Language Models. What are N-grams?
What is Curse of Dimensionality? How does it affect distance and similarity measures?
What are the problems of large feature space? How does it affect different models, e.g. OLS? What about computational complexity?
What dimensionality reductions can be used for preprocessing the data?
What is the difference between density-sparse data and dimensionally-sparse data?
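
For the vector space model and cosine similarity questions above, a minimal sketch using raw term counts may be useful. The function `cosine_similarity` and the example strings are hypothetical; the sketch assumes whitespace tokenization and lowercase folding, with no stop-word removal or TF-IDF weighting, both of which a real pipeline would likely add.

```python
import math
from collections import Counter


def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between two documents as bag-of-words count vectors.

    Hypothetical helper: tokenizes on whitespace after lowercasing.
    """
    counts_a = Counter(doc_a.lower().split())
    counts_b = Counter(doc_b.lower().split())
    # Only terms present in both documents contribute to the dot product.
    shared_terms = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[term] * counts_b[term] for term in shared_terms)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Convention chosen here for empty documents.
    return dot / (norm_a * norm_b)


print(cosine_similarity("data science", "data mining"))
print(cosine_similarity("ridge regression", "decision tree"))
```

Because the score depends only on the angle between the count vectors, a document and a copy of itself repeated twice have a similarity of 1, which is one reason cosine similarity is often preferred over Euclidean distance for comparing documents of different lengths.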

Data Science Culture Fit Interview Questions

Which is your favorite machine learning algorithm and why?
In which data science libraries in Python and R do your strengths lie?
What kind of data is important for specific business requirements and how, as a data scientist, will you go about collecting that data?
Tell us about the biggest data set you have processed to date and the kind of analysis you used it for.
Which data scientists do you admire the most and why?
Suppose you are given a data set. What will you do with it to find out whether it suits the business needs of your project?
What were the business outcomes or decisions for the projects you worked on?
What unique skills do you think you can add to our data science team?
Which are your favorite data science startups?
Why do you want to pursue a career in data science?
What have you done to upgrade your skills in analytics?
What has been the most useful business insight or development you have found?
How will you explain an A/B test to an engineer who does not know statistics?
When does parallelism help your algorithms run faster and when does it make them run slower?
How can you ensure that you don’t analyse something that ends up producing meaningless results?
How would you explain to senior management in your organization why a particular data set is important?
Is more data always better?
What are your favorite imputation techniques to handle missing data?
What are your favorite data visualization tools?