Handling Missing Values

blogging
jupyter
Data Preprocessing
Short Notes
Author

Kashish Mukheja

Published

Saturday, 27 May 2023

Short Notes: Handling missing values in a dataframe.


A dataset can have columns containing null values, and these fall into three categories of missingness: 1. Missing Completely at Random (MCAR), 2. Missing at Random (MAR), 3. Missing Not at Random (MNAR).

Null values in cases (1) and (2) are not useful for insights and inferences, and should be replaced (imputed) with some other value. Which value to use depends on several factors, such as the column's data type and distribution. This post focuses on numerical and categorical columns with missing values.

We can handle missing values using any of the techniques below:

  1. Dropping rows or columns - This can discard valuable information in the data, so it is usually not the recommended approach. Dropping every row that contains a missing value is known as listwise deletion.

  2. Replacing missing values with the mean or the median, i.e., P50 (for continuous data) - Outliers can distort the mean, so replacing missing values with the median is often the safer option.

  3. Replacing missing values with the mode (for categorical data) - This applies only to categorical columns, and may or may not work depending on the dataset you’re dealing with. It completely ignores the effect that the other features (e.g., as measured by feature importance or tree interpretation) have on the target variable.

  4. Replacing missing values using a KNN model - The k-nearest neighbors algorithm imputes a missing value from the points in the training set that most closely resemble the incomplete row. The non-null features are used to predict the features that have null values.

  5. Multivariate Imputation - This imputes the null values based on the other columns in the dataset. It therefore assumes that the features with missing values are related in some way to the other feature columns. A well-known form of this is Multiple Imputation by Chained Equations (MICE).
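
Techniques 1 to 3 can be sketched with pandas as follows. This is a minimal illustration, not a recommended pipeline: the toy dataframe and its column names (`age`, `city`) are made up for the example, and the 120.0 entry is a deliberate outlier to show how the mean and the median differ.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names and values are assumptions.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0, 120.0],         # continuous, one null
    "city": ["Pune", "Delhi", None, "Delhi", "Delhi"],  # categorical, one null
})

# Technique 1: listwise deletion drops every row with at least one null.
dropped = df.dropna()

# Technique 2: fill the continuous column with its mean or its median.
# The outlier (120.0) pulls the mean up to 54.0; the median stays at 35.5.
mean_filled = df["age"].fillna(df["age"].mean())
median_filled = df["age"].fillna(df["age"].median())

# Technique 3: fill the categorical column with its most frequent value.
# mode() returns a Series (there can be ties), so take the first entry.
mode_filled = df["city"].fillna(df["city"].mode()[0])

print(len(df), "rows before dropping,", len(dropped), "after")
print("mean fill:", mean_filled.iloc[1], "| median fill:", median_filled.iloc[1])
print("mode fill:", mode_filled.iloc[2])
```

On this toy data, dropping rows loses two of the five records, which is the information loss mentioned in technique 1.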

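Techniques 4 and 5 can be tried with scikit-learn. The sketch below assumes scikit-learn is installed and uses `KNNImputer` and `IterativeImputer`. Note that `IterativeImputer` is an experimental, MICE-inspired estimator that returns a single imputation by default, so it is a stand-in here rather than a full multiple-imputation procedure. The toy array is made up so that the second column is exactly twice the first.

```python
import numpy as np
# IterativeImputer is experimental; this import must come first to enable it.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

# Hypothetical toy data: second column = 2 * first column, with one null.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, np.nan],
    [4.0, 8.0],
    [5.0, 10.0],
])

# Technique 4: KNN imputation. The two nearest rows (by the non-null first
# column) are [2, 4] and [4, 8], so the null becomes the mean of 4 and 8.
knn = KNNImputer(n_neighbors=2)
X_knn = knn.fit_transform(X)

# Technique 5: multivariate (chained-equation style) imputation. Each column
# with nulls is modelled as a function of the other columns.
mice_like = IterativeImputer(max_iter=10, random_state=0)
X_mice = mice_like.fit_transform(X)

print("KNN imputed value:", X_knn[2, 1])  # 6.0
print("Iterative imputed value:", X_mice[2, 1])
```

Both imputers rely on the assumption stated in technique 5: the column with missing values must be related to the other columns, otherwise the neighbors or the regression carry no signal.
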
References:

  1. https://medium.com/analytics-vidhya/a-beginners-guide-to-multivariate-imputation-fe4ae5591544
  2. https://pub.towardsai.net/handling-missing-data-for-advanced-machine-learning-b6eb89050357
  3. https://www.numpyninja.com/post/mice-and-knn-missing-value-imputations-through-python