Missing values in data?
One of the major problems data professionals come across while working with large datasets is missing values. These missing values can degrade the performance and output of data models, so it's necessary to deal with them before feeding the data to machine learning algorithms. Let's look at some of the methods we can use to handle this problem.
1. Complete Case Analysis (CCA): This is one of the simplest ways to deal with missing data. In this method, we remove any row in which a value is missing. It should only be used when the data is missing at random, i.e., without any pattern. For example, if we have data for 12 months and an entire month's data is unavailable, we cannot use this method; it works when data is missing for only a few scattered days. (A pandas sketch follows the lists below.)
Advantages of using CCA:
- As we just need to drop rows, it's very easy to implement.
- If the data is missing at random, the distribution of the remaining data is not impacted.
Disadvantages of using CCA:
- It can discard large portions of the dataset when many values are missing.
- The removed rows might contain useful information, which can lead to wrong results from the model.
- Since all rows with missing values are removed from the training dataset, the model will be confused if it encounters a missing value in the future.
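Here's a minimal pandas sketch of CCA (the column names and values are made up for illustration):

import pandas as pd

data = pd.DataFrame({
    'age': [25, None, 31, 47],
    'salary': [50000, 62000, None, 58000],
})

# Drop every row that contains at least one missing value.
complete_cases = data.dropna()
print(complete_cases)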
2. Mean/Median Imputation: Consider a column containing the marks of 50 students for a class. Marks range from 0 to 100, and suppose 3 of the 50 students have no value in their marks column. In this case, we can use the mean or the median to replace the missing values.
Mean = (sum of all observations) / (total number of observations)
data['age'] = data['age'].fillna(data['age'].mean())
Median = the value at the ((total number of observations + 1) / 2)th position when the observations are sorted
data['age'] = data['age'].fillna(data['age'].median())
We can use either of them depending on the distribution: the mean suits roughly symmetric data, while the median is more robust to skew and outliers.
Advantages:
- Simple to use. Users can choose between the mean and the median.
Disadvantages:
- It changes the shape of the distribution, since every missing entry is assigned the same value.
- It changes the correlation with other parameters in the data frame. For example, a student might have been absent from the exam, so their marks are missing, but this technique still assigns a value to the marks column.
3. Arbitrary Value Imputation: This is mainly used for categorical data with missing values. For example, suppose a column contains the values cat and dog, but some entries are missing. Here we can assign a value like NA, missing, or none to indicate that these entries are different from all the other values in the column. The same idea can be applied to numerical data as well. The main goal of this technique is to distinguish the rows with missing values from the rows without them. (A sketch follows the lists below.)
Advantages:
- Easy to use.
Disadvantages:
- Change in correlation.
- Change in the shape of the distribution.
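Here's a minimal pandas sketch of arbitrary value imputation (the column names and sentinel values are made up for illustration):

import pandas as pd

data = pd.DataFrame({
    'pet': ['cat', 'dog', None, 'cat'],
    'age': [2, None, 5, 3],
})

# Categorical column: flag missing entries with an explicit label.
data['pet'] = data['pet'].fillna('missing')

# Numerical column: an out-of-range sentinel marks imputed rows
# (this assumes a real age can never be -1).
data['age'] = data['age'].fillna(-1)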
4. Most Frequent Value (Mode) Imputation: When we have categorical data, we cannot apply the mean or median, so we use the mode instead, i.e., the value that occurs most often in the column is used to fill the missing entries.
It's easy to implement this technique, but it does distort the distribution by over-representing the most frequent category.
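Here's a one-line pandas sketch of mode imputation (the 'pet' column is made up for illustration; mode() can return several values when there is a tie, so we take the first):

data['pet'] = data['pet'].fillna(data['pet'].mode()[0])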
5. Random Imputation: Random imputation is a technique where we randomly choose a value from the observed values in a column and assign it to a missing entry.
Consider a column whose observed values are 10, 13, 23, 15, 34, 26, 12, and 21, with 2 entries missing. Using random imputation, we fill each missing entry with a value drawn at random from these observed values. This can be applied to both numerical and categorical data. (A sketch follows the list below.)
Advantages:
- Well suited for linear models.
- The imputed values are drawn from the training data, so the original distribution is preserved.
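Here's a minimal pandas sketch of random imputation, reusing the values from the example above:

import numpy as np
import pandas as pd

data = pd.DataFrame({'marks': [10, 13, None, 23, 15, 34, None, 26, 12, 21]})

# Fill each missing entry with a value drawn at random (with
# replacement) from the observed values in the same column.
missing = data['marks'].isna()
observed = data['marks'].dropna()
data.loc[missing, 'marks'] = np.random.choice(observed, size=missing.sum())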
6. KNN Imputation: K-nearest-neighbors imputation is used to fill missing values in a table where a value depends on more than one column. It's particularly useful when the relationships between features are non-linear or involve complex interactions. Here, K denotes the number of neighbors we consider when estimating a missing value. (A sketch follows the lists below.)
Advantages:
- It can easily be used where the mean or median is not appropriate.
- It can be used on both categorical and numerical data.
Disadvantages:
- Finding the right value of K can be a challenging task.
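Here's a minimal sketch using scikit-learn's KNNImputer on made-up numerical data; each missing value is replaced by the average of that feature over the K nearest rows:

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))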
7. MICE (Multiple Imputation by Chained Equations): This technique is considered one of the most accurate ways to fill in missing values. To understand it, consider 3 columns named A, B, and C, each of which has some missing data.
In step 1, we replace each missing value with the mean of its respective column.
In the next step, we remove the imputed mean values from A (making them missing again), since we will predict them using the other 2 columns.
Now we split the data into 2 parts, where A is the output and B and C are the training features. Based on the values of B and C, the missing values of column A are predicted; we can use linear regression or a decision tree to do so. Once A is filled in, we use columns A and B to predict the missing values in C (again removing its mean values and making them missing). And when we have C, we apply the same procedure to B. By doing so we obtain new values for columns A, B, and C.
To refine the result, we can repeat the above steps, for example once with linear regression and once with a random forest, and compare the two sets of imputed values; the smaller the difference between them, the better the predicted values.
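scikit-learn's IterativeImputer is built on this chained-equations idea (it produces a single imputation rather than multiple ones); here's a minimal sketch on made-up data (note the experimental import it requires):

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, np.nan, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each column with missing values is modelled as a function of the
# others, cycling through the columns until the values stabilize.
imputer = IterativeImputer(max_iter=10, random_state=0)
print(imputer.fit_transform(X))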