Encoding Qualitative Data for Classifiers
- Author: Bojun Feng
Intro
Recently I've been working with classifiers on qualitative data. Although I am using XGBoost, which has built-in encoding ability, I noticed there are quite a few ways to encode qualitative data, each with different pros and cons. This post contains some of my notes as well as copy-paste code snippets, mostly generated by GPT.
Setup
Let's begin by setting up the environment, assuming Python is already installed.
First install the necessary libraries:
pip install numpy pandas scikit-learn category_encoders
Next, let's set up the data we need, and define a list of columns to encode:
import pandas as pd
import numpy as np

data = {'Color': ['Red', 'Blue', 'Blue', 'Blue', 'Blue'],
        'Size': ['S', 'S', 'M', 'M', 'M'],
        'Mixed': ['A', 0, 1.0, True, np.nan]}
df = pd.DataFrame(data)

# Filling nulls with a placeholder and converting all values to strings
df.fillna('Null', inplace=True)
df = df.astype(str)

# Defining four identical lists, one for each encoding method
columns_for_one_hot = ['Color', 'Size', 'Mixed']
columns_for_label = ['Color', 'Size', 'Mixed']
columns_for_binary = ['Color', 'Size', 'Mixed']
columns_for_frequency = ['Color', 'Size', 'Mixed']
Now the dataframe df should look something like this:
Color Size Mixed
0 Red S A
1 Blue S 0
2 Blue M 1.0
3 Blue M True
4 Blue M Null
We are ready to go!
One-Hot Encoding
Intuition: Creates a new column for each unique category value, filled with 1s and 0s indicating the presence of the value.
Code:
import pandas as pd

def apply_one_hot_encoding(df, columns):
    return pd.get_dummies(df, columns=columns)

df_one_hot = apply_one_hot_encoding(df, columns_for_one_hot)
Original Data:
Color Size Mixed
0 Red S A
1 Blue S 0
2 Blue M 1.0
3 Blue M True
4 Blue M Null
After One-Hot Encoding:
Color_Blue Color_Red Size_M Size_S Mixed_0 Mixed_1.0 Mixed_A Mixed_Null Mixed_True
0 0 1 0 1 0 0 1 0 0
1 1 0 0 1 1 0 0 0 0
2 1 0 1 0 0 1 0 0 0
3 1 0 1 0 0 0 0 0 1
4 1 0 1 0 0 0 0 1 0
Pros:
- No Implicit Ordering: This method doesn't imply any order or priority among the categories.
- Intuitive: Makes a lot of sense to a human - each column corresponds to a value.
- Widely Used: Compatible with many types of models and a standard approach in many applications.
Cons:
- Many New Dimensions: Can significantly increase the dataset's dimensionality, which slows down computation.
- Sparse Matrix: Mostly produces sparse matrices, which can be inefficient for certain types of computation.
This method is best for columns with only a few unique values. It suits categorical variables that do not carry an inherent order, and models that do not handle categorical variables natively, such as linear regression. It also works well with deep learning / neural networks, which cope with large dimensions relatively well.
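To make the dimensionality concern concrete, here is a tiny illustrative snippet (the UserId column is made up for illustration, not part of the example data above) showing how the column count grows with cardinality:

import pandas as pd

# Hypothetical high-cardinality column: 1,000 unique IDs
high_card = pd.DataFrame({'UserId': [f'user_{i}' for i in range(1000)]})

# One-hot encoding creates one new column per unique value
encoded = pd.get_dummies(high_card, columns=['UserId'])
print(encoded.shape)  # (1000, 1000)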
Label Encoding
Intuition: Assigns a unique integer to each category value.
Code:
from sklearn.preprocessing import LabelEncoder

def apply_label_encoding(df, columns):
    df_encoded = df.copy()
    le = LabelEncoder()
    for col in columns:
        df_encoded[col] = le.fit_transform(df_encoded[col])
    return df_encoded

df_label = apply_label_encoding(df, columns_for_label)
Original Data:
Color Size Mixed
0 Red S A
1 Blue S 0
2 Blue M 1.0
3 Blue M True
4 Blue M Null
After Label Encoding:
Color Size Mixed
0 1 1 2
1 0 1 0
2 0 0 1
3 0 0 4
4 0 0 3
Pros:
- No New Dimensions: Efficient in terms of space, as it requires only one column regardless of the number of categories.
- Intuitive: Makes a lot of sense to a human - each number represents a unique value.
Cons:
- Implies Order: Can incorrectly imply that there is an ordered relationship between categories (e.g. Red > Blue).
This method is best for categorical variables with an inherent order like 'low', 'medium', 'high'. It's also useful with tree-based algorithms, which can handle the notion of 'order' in their split decisions, such as Decision Trees, Random Forests, and Gradient Boosted Trees. However, the implied order may mislead order-sensitive models like linear regression if applied to features with no inherent order. Note that LabelEncoder assigns integers alphabetically, so if the real order matters you may want an explicit mapping instead, as sketched below.
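A minimal sketch of such an explicit mapping (the size_order values are made up for illustration):

import pandas as pd

# Explicit ordinal mapping so the codes follow the real order,
# rather than LabelEncoder's alphabetical order ('high' < 'low' < 'medium')
sizes = pd.DataFrame({'Size': ['low', 'medium', 'high', 'medium']})
size_order = {'low': 0, 'medium': 1, 'high': 2}
sizes['Size_encoded'] = sizes['Size'].map(size_order)
print(sizes)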
Binary Encoding
Intuition: Similar to one-hot but converts everything to binary after label encoding, creating one column for each bit.
Code:
import category_encoders as ce

def apply_binary_encoding(df, columns):
    be = ce.BinaryEncoder(cols=columns)
    return be.fit_transform(df)

df_binary = apply_binary_encoding(df, columns_for_binary)
Original Data:
Color Size Mixed
0 Red S A
1 Blue S 0
2 Blue M 1.0
3 Blue M True
4 Blue M Null
After Binary Encoding:
Color_0 Color_1 Size_0 Size_1 Mixed_0 Mixed_1 Mixed_2
0 0 1 0 1 0 0 1
1 1 0 0 1 0 1 0
2 1 0 1 0 0 1 1
3 1 0 1 0 1 0 0
4 1 0 1 0 1 0 1
Pros:
- Fewer New Dimensions: Offers a good compromise between one-hot encoding and label encoding by reducing the number of new columns.
- Preserves Information: More information is retained as compared to pure label encoding, without an implied order.
Cons:
- Not Intuitive: It is hard to intuitively understand what each column means.
This method is best for unordered categorical data with a moderately high number of unique values. It can essentially be viewed as a trade-off between interpretability and number of dimensions: it is harder to understand, but you have fewer columns to worry about.
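A rough back-of-the-envelope comparison of the column counts (assuming binary encoding needs roughly one column per bit of the category's ordinal code, which is how category_encoders behaves by default):

import math

# Columns needed for n categories: one-hot vs. binary encoding
for n in [4, 16, 256, 1024]:
    one_hot_cols = n                            # one column per category
    binary_cols = math.ceil(math.log2(n + 1))   # one column per bit
    print(f"{n} categories -> one-hot: {one_hot_cols}, binary: {binary_cols}")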
Frequency Encoding
Intuition: Replaces categories with their frequencies.
Code:
def apply_frequency_encoding(df, columns):
    df_freq = df.copy()
    for col in columns:
        freq = df[col].value_counts().to_dict()
        df_freq[col] = df[col].map(freq)
    return df_freq

df_freq = apply_frequency_encoding(df, columns_for_frequency)
Original Data:
Color Size Mixed
0 Red S A
1 Blue S 0
2 Blue M 1.0
3 Blue M True
4 Blue M Null
After Frequency Encoding:
Color Size Mixed
0 1 2 1
1 4 2 1
2 4 3 1
3 4 3 1
4 4 3 1
Pros:
- No New Dimensions: Only requires one column, similar to label encoding.
- Captures Frequency Information: The encoded values can be informative, since they reflect how often each category occurs.
Cons:
- No Distinction: Different categories might end up with the same frequency despite being radically different.
- Risk of Overfitting: In cases where frequency directly correlates with the target variable, it might lead to overfitting.
This method is best for categorical variables where the frequency of occurrence of values is important. It's particularly useful when the number of unique values is large and when the frequency distribution is not uniform across different values.
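To reduce the overfitting/leakage risk, one common pattern (a sketch, not part of the snippet above) is to compute frequencies on the training split only and then map them onto the test split:

import pandas as pd

train = pd.DataFrame({'Color': ['Red', 'Blue', 'Blue', 'Blue']})
test = pd.DataFrame({'Color': ['Red', 'Green']})

# Frequencies come from the training data only
freq = train['Color'].value_counts().to_dict()
train['Color_freq'] = train['Color'].map(freq)
test['Color_freq'] = test['Color'].map(freq).fillna(0)  # unseen categories -> 0
print(test)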
Complete Code
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import category_encoders as ce

# Setting up the data
data = {'Color': ['Red', 'Blue', 'Blue', 'Blue', 'Blue'],
        'Size': ['S', 'S', 'M', 'M', 'M'],
        'Mixed': ['A', 0, 1.0, True, np.nan]}
df = pd.DataFrame(data)

# Filling nulls with a placeholder and converting all values to strings
df.fillna('Null', inplace=True)
df = df.astype(str)

# Defining four identical lists, one for each encoding method
columns_for_one_hot = ['Color', 'Size', 'Mixed']
columns_for_label = ['Color', 'Size', 'Mixed']
columns_for_binary = ['Color', 'Size', 'Mixed']
columns_for_frequency = ['Color', 'Size', 'Mixed']

# One-Hot Encoding Function
def apply_one_hot_encoding(df, columns):
    return pd.get_dummies(df, columns=columns)

# Label Encoding Function
def apply_label_encoding(df, columns):
    df_encoded = df.copy()
    le = LabelEncoder()
    for col in columns:
        df_encoded[col] = le.fit_transform(df_encoded[col])
    return df_encoded

# Binary Encoding Function
def apply_binary_encoding(df, columns):
    be = ce.BinaryEncoder(cols=columns)
    return be.fit_transform(df)

# Frequency Encoding Function
def apply_frequency_encoding(df, columns):
    df_freq = df.copy()
    for col in columns:
        freq = df[col].value_counts().to_dict()
        df_freq[col] = df[col].map(freq)
    return df_freq

# Applying the encoding methods
df_one_hot = apply_one_hot_encoding(df, columns_for_one_hot)
df_label = apply_label_encoding(df, columns_for_label)
df_binary = apply_binary_encoding(df, columns_for_binary)
df_freq = apply_frequency_encoding(df, columns_for_frequency)

# Printing the results
print("One-Hot Encoded Data:\n", df_one_hot)
print("\nLabel Encoded Data:\n", df_label)
print("\nBinary Encoded Data:\n", df_binary)
print("\nFrequency Encoded Data:\n", df_freq)
Bonus - Algorithms by Number of Features
Large Number of Features (1,000 - 100,000+)
- Deep Learning / Neural Networks: Excel in handling high-dimensional data, especially in domains like image and speech recognition.
- Dimensionality Reduction Techniques: Techniques like PCA or UMAP are often used to reduce the dimensionality of the data while retaining most of the information (a short PCA sketch follows this list).
- Regularized Linear Models: Models like LASSO or Ridge regression can handle high-dimensional data by adding a regularization term that discourages overfitting.
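As a quick illustration of the dimensionality-reduction idea, here is a minimal PCA sketch applied to the one-hot encoded frame from the complete code above (the choice of 3 components is arbitrary):

from sklearn.decomposition import PCA

# Project the one-hot encoded columns down to 3 components
pca = PCA(n_components=3)
reduced = pca.fit_transform(df_one_hot)
print(reduced.shape)  # (5, 3)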
Medium Number of Features (100 - 10,000)
- Gradient Boosting Machines (like XGBoost, LightGBM, CatBoost): These algorithms handle a moderate number of features well and are good at capturing complex patterns in data (see the sketch after this list).
- Random Forests: This ensemble method is effective for a medium number of features and is robust to overfitting.
- Support Vector Machines: With appropriate kernel choice, SVMs can be effective for medium-sized feature spaces, especially in cases where the data is not linearly separable.
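Since this post started with XGBoost, here is a hedged sketch of feeding the one-hot encoded frame into an XGBoost classifier. It assumes pip install xgboost, and the labels y are made up purely for illustration:

import xgboost as xgb

y = [0, 1, 1, 0, 1]  # hypothetical binary target for the five rows
model = xgb.XGBClassifier(n_estimators=10, max_depth=2)
model.fit(df_one_hot, y)
print(model.predict(df_one_hot))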
Small Number of Features (< 1000)
- Logistic Regression and Linear Regression: For simple, low-dimensional datasets, these models can be very effective. They are easy to implement and interpret (see the sketch after this list).
- Decision Trees: Can be used effectively with a small number of features. They are interpretable and can model non-linear relationships.
- Naive Bayes: This algorithm is particularly suited for small datasets and can perform well even with a limited amount of data.
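And a matching sketch for the low-dimensional case, fitting a logistic regression on the label-encoded frame from the complete code above (again with a made-up target):

from sklearn.linear_model import LogisticRegression

y = [0, 1, 1, 0, 1]  # hypothetical binary target for the five rows
clf = LogisticRegression()
clf.fit(df_label, y)
print(clf.predict(df_label))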
Mixed Dimensionality
- Ensemble Methods: Combining different models (like a random forest, a few neural networks, and gradient boosting machines) can yield good results on datasets with mixed dimensionality.
- Feature Engineering: If no method works well, try transforming the dataset into a higher- or lower-dimensional space to reduce it to one of the previous cases.