## The Big List of 205 Data Science Interview Questions

**January 18th, 2017** by lewis

Here are our favorite data science interview questions.

- If you’re a hiring manager, select the interview questions based on the competencies you’re evaluating.
- If you’re a candidate, prepare and practice using this common list of data science interview questions.

# Probability and Statistics Interview Questions

- Explain what regularization is and why it is useful.
- How would you validate a predictive model of a quantitative outcome variable that you built using multiple regression?
- Explain what precision and recall are. How do they relate to the ROC curve?
- How can you prove that one improvement you’ve brought to an algorithm is really an improvement over not doing anything?
- What is root cause analysis?
- Are you familiar with price optimization, price elasticity, inventory management, competitive intelligence? Give examples.
- What is statistical power?
- Explain what resampling methods are and why they are useful. Also explain their limitations.
- Is it better to have too many false positives, or too many false negatives? Explain.
- What is selection bias, why is it important and how can you avoid it?
- Imagine a test with a true positive rate of 100% and false positive rate of 5%. Imagine a population with a 1/1000 rate of having the condition the test identifies. Given a positive test, what is the probability of having that condition?
- What is the normal distribution? Give an example of some variable that follows this distribution.
- What about log-normal?
- Explain what a long-tailed distribution is and provide three examples of relevant phenomena that have long tails. Why are they important in classification and prediction problems?
- How do you check whether a distribution is close to normal? Why would you want to check it? What is a Q-Q plot?
- Give examples of data that follows neither a Gaussian nor a log-normal distribution.
- Do you know what the exponential family is?
- Do you know the Dirichlet distribution? The multinomial distribution?
- What is the Law of Large Numbers? The Central Limit Theorem?
- Why are they important for Statistics?
- What summary statistics do you know?
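
The diagnostic-test question above (100% true positive rate, 5% false positive rate, 1/1000 prevalence) is an application of Bayes' theorem. Here is a minimal sketch of the arithmetic. The function name `posterior_given_positive` is hypothetical, chosen for illustration, and exact fractions are used so the result is not blurred by floating-point rounding.

```python
from fractions import Fraction


def posterior_given_positive(prevalence, true_positive_rate, false_positive_rate):
    """Return P(condition | positive test) using Bayes' theorem."""
    # Probability of a positive test from someone who has the condition.
    true_pos = prevalence * true_positive_rate
    # Probability of a positive test from someone who does not have it.
    false_pos = (1 - prevalence) * false_positive_rate
    return true_pos / (true_pos + false_pos)


# Rates taken from the question text.
p = posterior_given_positive(Fraction(1, 1000), Fraction(1), Fraction(5, 100))
print(p, float(p))
```

The result is 20/1019, just under 2%. A positive test is far more likely to be a false positive than a true one because the condition is so rare.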

# Data Modeling Interview Questions

- What are the most important skills for a data scientist to have?
- What types of data are important for business needs?
- What data would you go after and start working on?
- What are the assumptions required for linear regression?
- When you get a new data set, what do you do with it to see if it will suit your needs for a given project?
- How do you handle big data sets?
- How do you detect outliers?
- How do you control model complexity?
- How do you model a quantity you can’t observe?
- You have one model and want to find the best set of parameters for this model. How would you do that?
- How would you look for the best parameters? Do you know any methods apart from grid search?
- What is Cross-Validation?
- What is 10-Fold CV?
- What is the difference between holding out a validation set and doing 10-Fold CV?
- How do you know if your model overfits?
- How do you assess the results of a logistic regression?
- Which evaluation metrics do you know? Anything apart from accuracy?
- Which is better: Too many false positives or too many false negatives?
- What are precision and recall?
- What is a ROC curve? What is the area under the ROC curve (AUC)? How do you interpret the curve and the AUC?
- Do you know about Concordance or Lift?
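
For the cross-validation questions above, it can help to see how k folds partition a data set. The following is a minimal sketch in plain Python. The helper name `k_fold_indices` is hypothetical, and it assumes the rows are already in random order. Shuffle first if they are not.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, val_idx) index lists for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(25, k=10))
for train_idx, val_idx in folds:
    print(len(train_idx), len(val_idx))
```

Each sample lands in the validation fold exactly once, so the model is evaluated on every row. A single hold-out split evaluates on only one fixed subset.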

# Data Science Process Interview Questions

- How would you create a taxonomy to identify key customer trends in unstructured data?
- Python or R — Which one would you prefer for text analytics?
- Which technique is used to predict categorical responses?
- What is logistic regression? Or, give an example of when you have used logistic regression recently.
- What are Recommender Systems?
- Why does data cleaning play a vital role in analysis?
- Differentiate between univariate, bivariate and multivariate analysis.
- What do you understand by the term Normal Distribution?
- What is Linear Regression?
- What is Interpolation and Extrapolation?
- What is power analysis?
- What is K-means? How can you select K for K-means?
- What is Collaborative filtering?
- What is the difference between Cluster and Systematic Sampling?
- Are expected value and mean value different?
- What does a p-value signify about the statistical data?
- Do gradient descent methods always converge to the same point?
- What are categorical variables?
- How can you make data normal using the Box-Cox transformation?
- What is the difference between Supervised Learning and Unsupervised Learning?
- Explain the use of Combinatorics in data science.
- Why is vectorization considered a powerful method for optimizing numerical code?
- What is the goal of A/B Testing?
- What is an Eigenvalue and Eigenvector?
- What is Singular Value Decomposition?
- What is Gradient Descent?
- How can outlier values be treated?
- How can you assess a good logistic model?
- How can you iterate over a list and also retrieve element indices at the same time?
- During analysis, how do you treat missing values?
- Explain the Box-Cox transformation in regression models.
- Can you use machine learning for time series analysis?
- Write a function that takes in two sorted lists and outputs a sorted list that is their union.
- What is the difference between Bayesian Inference and Maximum Likelihood Estimation (MLE)?
- What is Regularization and what kind of problems does regularization solve?
- What is multicollinearity and how can you overcome it?
- What is the curse of dimensionality?
- How do you decide whether your linear regression model fits the data?
- What is the difference between squared error and absolute error?
- What is Machine Learning?
- How are confidence intervals constructed and how will you interpret them?
- How will you explain logistic regression to an economist, physician scientist and biologist?
- How can you overcome Overfitting?
- Differentiate between wide and tall data formats.
- Is Naïve Bayes bad? If so, in what respects?
- How would you develop a model to identify plagiarism?
- Can you outline the steps in an analytics project?
- Have you heard of CRISP-DM (Cross Industry Standard Process for Data Mining)?
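
Two of the questions above are small coding exercises: merging two sorted lists into their sorted union, and iterating over a list while retrieving element indices. Below is a minimal sketch of one possible answer. It assumes "union" means the set union, so duplicates are dropped. The function name `sorted_union` is hypothetical.

```python
def sorted_union(a, b):
    """Merge two sorted lists into one sorted list without duplicates."""
    result = []
    i = j = 0
    # Walk both lists in step, always taking the smaller head element.
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            candidate = a[i]
            i += 1
        elif b[j] < a[i]:
            candidate = b[j]
            j += 1
        else:  # equal heads: take one copy and advance both lists
            candidate = a[i]
            i += 1
            j += 1
        if not result or result[-1] != candidate:
            result.append(candidate)
    # At most one of the two lists still has elements left.
    for rest in (a[i:], b[j:]):
        for x in rest:
            if not result or result[-1] != x:
                result.append(x)
    return result


merged = sorted_union([1, 2, 2, 5], [2, 3, 5, 8])

# enumerate() yields (index, element) pairs while iterating.
for index, value in enumerate(merged):
    print(index, value)
```

The merge runs in time linear in the combined length of the inputs, which is the main advantage over concatenating and re-sorting.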

# Data Science Machine Learning Interview Questions

- What is your favorite ML algorithm and why?
- Describe the regression problem. Is it supervised learning? Why?
- What is linear regression? Why is it called linear?
- Discuss the bias-variance tradeoff.
- What is Ordinary Least Squares Regression? How can it be learned?
- Can you derive the OLS Regression formula? (For one-step solution)
- Do we always need the intercept term? When do we need it and when do we not?
- What is collinearity and what to do with it? How to remove multicollinearity?
- What if the design matrix is not full rank?
- What is overfitting a regression model? What are ways to avoid it?
- What is Ridge Regression? How is it different from OLS Regression? Why do we need it?
- What is Lasso regression? How is it different from OLS and Ridge?
- What are the assumptions required for linear regression?
- You would like to find significant features. How would you do that?
- You fit a multiple regression to examine the effect of a particular feature. The feature comes back insignificant, but you believe it is significant. Why might this happen?
- How do you check whether the regression model fits the data well?
- Can you describe what the classification problem is?
- What is the simplest classification algorithm?
- What classification algorithms do you know? Which one do you like the most?
- What is a decision tree?
- What are some business reasons you might want to use a decision tree model?
- How do you build it?
- What impurity measures do you know?
- Describe some of the different splitting rules used by different decision tree algorithms.
- Is a big bushy tree always good? Why would you want to prune it?
- Is it a good idea to combine multiple trees?
- What is Random Forest? Why is it good?
- What is logistic regression?
- How do we train a logistic regression model?
- How do we interpret its coefficients?
- What is an Artificial Neural Network?
- How do you train an ANN? What is backpropagation?
- How does a neural network with three layers (one input layer, one inner layer and one output layer) compare to a logistic regression?
- What is deep learning? What is a CNN (Convolutional Neural Network) or an RNN (Recurrent Neural Network)?
- What is Regularization?
- Which problem does Regularization try to solve?
- What does it mean (practically) for a design matrix to be “ill-conditioned”?
- When might you want to use ridge regression instead of traditional linear regression?
- What is the difference between the L1 and L2 regularization?
- Why (geometrically) does LASSO produce solutions with zero-valued coefficients (as opposed to ridge)?
- What is the purpose of dimensionality reduction and why do we need it?
- Are dimensionality reduction techniques supervised or not? Are all of them (un)supervised?
- What ways of reducing dimensionality do you know?
- Is feature selection a dimensionality reduction technique?
- What is the difference between feature selection and feature extraction?
- Is it beneficial to perform dimensionality reduction before fitting an SVM? Why or why not?
- What is Principal Component Analysis (PCA)? What is the problem it solves? How is it related to eigenvalue decomposition (EVD)?
- What’s the relationship between PCA and SVD? When is SVD better than EVD for PCA?
- Under what conditions is PCA effective?
- Why do we need to center data for PCA and what can happen if we don’t do it? Do we need to scale data for PCA?
- Is PCA a linear model or not? Why?
- Do you know other Dimensionality Reduction techniques?
- What is Independent Component Analysis (ICA)? What’s the difference between ICA and PCA?
- Suppose you have a very sparse matrix where rows are highly dimensional. You project these rows on a random vector of relatively small dimensionality. Is it a valid dimensionality reduction technique or not?
- Have you heard of Kernel PCA or other non-linear dimensionality reduction techniques? What about LLE (Locally Linear Embedding) or t-SNE (t-distributed Stochastic Neighbor Embedding)?
- What is Fisher Discriminant Analysis? How is it different from PCA? Is it supervised or not?
- What is the difference between a convex function and non-convex?
- What is Gradient Descent Method?
- Will Gradient Descent methods always converge to the same point?
- What is a local optimum?
- Is it always bad to have local optima?
- What is Newton’s method?
- What kind of problems are well suited for Newton’s method? BFGS? SGD?
- What are “slack variables”?
- Describe a constrained optimization problem and how you would tackle it.
- What is NLP? How is it related to Machine Learning?
- How would you turn unstructured text data into structured data usable for ML models?
- What is the Vector Space Model?
- What is TF-IDF?
- Which distances and similarity measures can we use to compare documents? What is cosine similarity?
- Why do we remove stop words? When do we not remove them?
- Language models: what are N-grams?
- What is Curse of Dimensionality? How does it affect distance and similarity measures?
- What are the problems of large feature space? How does it affect different models, e.g. OLS? What about computational complexity?
- What dimensionality reductions can be used for preprocessing the data?
- What is the difference between density-sparse data and dimensionally-sparse data?
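
The questions above on the Vector Space Model and document similarity can be made concrete with a small example. This is a minimal sketch of cosine similarity over raw term counts. It assumes whitespace tokenization and applies no TF-IDF weighting or stop-word removal. The function name `cosine_similarity` is defined here for illustration and is not taken from any library.

```python
import math
from collections import Counter


def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between two documents using raw term counts."""
    # Represent each document as a sparse term-frequency vector.
    tf_a = Counter(doc_a.lower().split())
    tf_b = Counter(doc_b.lower().split())
    # Only terms present in both documents contribute to the dot product.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document has no direction to compare
    return dot / (norm_a * norm_b)


print(cosine_similarity("the cat sat on the mat", "the cat lay on the rug"))
```

Because the measure depends only on the angle between the vectors, a long document and a short one with the same word proportions score as near-identical. This is one reason it is often preferred to Euclidean distance for text.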

# Data Science Culture Fit Interview Questions

- Which is your favorite machine learning algorithm and why?
- Which data science libraries in Python and R are you strongest in?
- What kind of data is important for specific business requirements and how, as a data scientist, will you go about collecting that data?
- Tell us about the biggest data set you have processed to date and what kind of analysis you used it for.
- Which data scientists do you admire the most and why?
- Suppose you are given a data set. What will you do with it to find out whether it suits the business needs of your project?
- What were the business outcomes or decisions for the projects you worked on?
- What unique skills do you think you can add to our data science team?
- Which are your favorite data science startups?
- Why do you want to pursue a career in data science?
- What have you done to upgrade your skills in analytics?
- What has been the most useful business insight or development you have found?
- How will you explain an A/B test to an engineer who does not know statistics?
- When does parallelism help your algorithms run faster and when does it make them run slower?
- How can you ensure that you don’t analyse something that ends up producing meaningless results?
- How would you explain to senior management in your organization why a particular data set is important?
- Is more data always better?
- What are your favorite imputation techniques to handle missing data?
- What are your favorite data visualization tools?

*Sources*

http://www.kdnuggets.com/2016/02/21-data-science-interview-questions-answers.html/3

https://www.dezyre.com/article/100-data-science-interview-questions-and-answers-general-for-2016/184

http://blog.udacity.com/2015/04/data-science-interview-questions.html

http://www.datasciencecentral.com/profiles/blogs/66-job-interview-questions-for-data-scientists

http://www.itshared.org/2015/10/data-science-interview-questions.html

*Photo credit: Sebastiaan ter Burg*