Encoding Qualitative Data for Classifiers
- Author: Bojun Feng
Intro
Recently I've been working with classifiers on qualitative data. Although I am using XGBoost, which has built-in encoding ability, I noticed there are quite a few ways to encode qualitative data, each with different pros and cons. This post contains some of my notes as well as copy-paste code snippets, mostly generated by GPT.
Setup
Let's begin by setting up the environment, assuming Python is already installed.
First install the necessary libraries:
pip install numpy pandas scikit-learn category_encoders
Next, let's set up the data we need, and define a list of columns to encode:
import pandas as pd
import numpy as np

data = {'Color': ['Red', 'Blue', 'Blue', 'Blue', 'Blue'],
        'Size': ['S', 'S', 'M', 'M', 'M'],
        'Mixed': ['A', 0, 1.0, True, np.nan]}
df = pd.DataFrame(data)

# Filling nulls with a placeholder and converting all values to strings
df.fillna('Null', inplace=True)
df = df.astype(str)

# Defining four identical lists, one for each encoding method
columns_for_one_hot = ['Color', 'Size', 'Mixed']
columns_for_label = ['Color', 'Size', 'Mixed']
columns_for_binary = ['Color', 'Size', 'Mixed']
columns_for_frequency = ['Color', 'Size', 'Mixed']
Now the dataframe df should look something like this:
Color Size Mixed
0 Red S A
1 Blue S 0
2 Blue M 1.0
3 Blue M True
4 Blue M Null
We are ready to go!
One-Hot Encoding
Intuition: Creates a new column for each unique category value, filled with 1s and 0s indicating the presence of the value.
Code:
import pandas as pd

def apply_one_hot_encoding(df, columns):
    return pd.get_dummies(df, columns=columns)

df_one_hot = apply_one_hot_encoding(df, columns_for_one_hot)
Original Data:
Color Size Mixed
0 Red S A
1 Blue S 0
2 Blue M 1.0
3 Blue M True
4 Blue M Null
After One-Hot Encoding:
Color_Blue Color_Red Size_M Size_S Mixed_0 Mixed_1.0 Mixed_A Mixed_Null Mixed_True
0 0 1 0 1 0 0 1 0 0
1 1 0 0 1 1 0 0 0 0
2 1 0 1 0 0 1 0 0 0
3 1 0 1 0 0 0 0 0 1
4 1 0 1 0 0 0 0 1 0
Pros:
- No Implicit Ordering: This method doesn't imply any order or priority among the categories.
- Intuitive: Makes a lot of sense to a human - each column corresponds to a value.
- Widely Used: Compatible with many types of models and a standard approach in many applications.
Cons:
- Many New Dimensions: Can significantly increase the dataset's dimensionality, which slows down computation.
- Sparse Matrix: Mostly produces sparse matrices, which can be inefficient for certain types of computation.
This method is best for columns with only a few unique values. It suits categorical variables that do not carry an inherent order, and models that do not handle categorical variables natively, such as linear regression. It also works well with deep learning / neural networks, which cope with large dimensions relatively well.
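To make the dimensionality concern concrete, here is a tiny illustrative snippet (the UserId column is made up for illustration, not part of the example data above) showing how the column count grows with cardinality:

import pandas as pd

# Hypothetical high-cardinality column: 1,000 unique IDs
high_card = pd.DataFrame({'UserId': [f'user_{i}' for i in range(1000)]})

# One-hot encoding creates one new column per unique value
encoded = pd.get_dummies(high_card, columns=['UserId'])
print(encoded.shape)  # (1000, 1000)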
Label Encoding
Intuition: Assigns a unique integer to each category value.
Code:
from sklearn.preprocessing import LabelEncoder

def apply_label_encoding(df, columns):
    df_encoded = df.copy()
    le = LabelEncoder()
    for col in columns:
        df_encoded[col] = le.fit_transform(df_encoded[col])
    return df_encoded

df_label = apply_label_encoding(df, columns_for_label)
Original Data:
Color Size Mixed
0 Red S A
1 Blue S 0
2 Blue M 1.0
3 Blue M True
4 Blue M Null
After Label Encoding:
Color Size Mixed
0 1 1 2
1 0 1 0
2 0 0 1
3 0 0 4
4 0 0 3
Pros:
- No New Dimensions: Efficient in terms of space, as it requires only one column regardless of the number of categories.
- Intuitive: Makes a lot of sense to a human - each number represents a unique value.
Cons:
- Implies Order: Can incorrectly imply that there is an ordered relationship between categories (e.g. Red > Blue).
This method is best for categorical variables with an inherent order like 'low', 'medium', 'high'. It's also useful with tree-based algorithms, which can handle the notion of 'order' in their split decisions, such as Decision Trees, Random Forests, and Gradient Boosted Trees. However, the implied order may mislead order-sensitive models like linear regression if applied to features with no inherent order. Note that LabelEncoder assigns integers alphabetically, so if the real order matters you may want an explicit mapping instead, as sketched below.
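A minimal sketch of such an explicit mapping (the size_order values are made up for illustration):

import pandas as pd

# Explicit ordinal mapping so the codes follow the real order,
# rather than LabelEncoder's alphabetical order ('high' < 'low' < 'medium')
sizes = pd.DataFrame({'Size': ['low', 'medium', 'high', 'medium']})
size_order = {'low': 0, 'medium': 1, 'high': 2}
sizes['Size_encoded'] = sizes['Size'].map(size_order)
print(sizes)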
Binary Encoding
Intuition: Similar to one-hot but converts everything to binary after label encoding, creating one column for each bit.
Code:
import category_encoders as ce

def apply_binary_encoding(df, columns):
    be = ce.BinaryEncoder(cols=columns)
    return be.fit_transform(df)

df_binary = apply_binary_encoding(df, columns_for_binary)
Original Data:
Color Size Mixed
0 Red S A
1 Blue S 0
2 Blue M 1.0
3 Blue M True
4 Blue M Null
After Binary Encoding:
Color_0 Color_1 Size_0 Size_1 Mixed_0 Mixed_1 Mixed_2
0 0 1 0 1 0 0 1
1 1 0 0 1 0 1 0
2 1 0 1 0 0 1 1
3 1 0 1 0 1 0 0
4 1 0 1 0 1 0 1
Pros:
- Fewer New Dimensions: Offers a good compromise between one-hot encoding and label encoding by reducing the number of new columns.
- Preserves Information: More information is retained as compared to pure label encoding, without an implied order.
Cons:
- Not Intuitive: It is hard to intuitively understand what each column means.
This method is best for unordered categorical data with a moderately high number of unique values. It can essentially be viewed as a trade-off between interpretability and number of dimensions: it is harder to understand, but you have fewer columns to worry about.
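A rough back-of-the-envelope comparison of the column counts (assuming binary encoding needs roughly one column per bit of the category's ordinal code, which is how category_encoders behaves by default):

import math

# Columns needed for n categories: one-hot vs. binary encoding
for n in [4, 16, 256, 1024]:
    one_hot_cols = n                            # one column per category
    binary_cols = math.ceil(math.log2(n + 1))   # one column per bit
    print(f"{n} categories -> one-hot: {one_hot_cols}, binary: {binary_cols}")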
Frequency Encoding
Intuition: Replaces categories with their frequencies.
Code:
def apply_frequency_encoding(df, columns):
    df_freq = df.copy()
    for col in columns:
        freq = df[col].value_counts().to_dict()
        df_freq[col] = df[col].map(freq)
    return df_freq

df_freq = apply_frequency_encoding(df, columns_for_frequency)
Original Data:
Color Size Mixed
0 Red S A
1 Blue S 0
2 Blue M 1.0
3 Blue M True
4 Blue M Null
After Frequency Encoding:
Color Size Mixed
0 1 2 1
1 4 2 1
2 4 3 1
3 4 3 1
4 4 3 1
Pros:
- No New Dimensions: Only requires one column, similar to label encoding.
- Captures Frequency Information: The encoded values can be informative, since they reflect how often each category occurs.
Cons:
- No Distinction: Different categories might end up with the same frequency despite being radically different.
- Risk of Overfitting: In cases where frequency directly correlates with the target variable, it might lead to overfitting.
This method is best for categorical variables where the frequency of occurrence of values is important. It's particularly useful when the number of unique values is large and when the frequency distribution is not uniform across different values.
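To reduce the overfitting/leakage risk, one common pattern (a sketch, not part of the snippet above) is to compute frequencies on the training split only and then map them onto the test split:

import pandas as pd

train = pd.DataFrame({'Color': ['Red', 'Blue', 'Blue', 'Blue']})
test = pd.DataFrame({'Color': ['Red', 'Green']})

# Frequencies come from the training data only
freq = train['Color'].value_counts().to_dict()
train['Color_freq'] = train['Color'].map(freq)
test['Color_freq'] = test['Color'].map(freq).fillna(0)  # unseen categories -> 0
print(test)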
Complete Code
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import category_encoders as ce

# Setting up the data
data = {'Color': ['Red', 'Blue', 'Blue', 'Blue', 'Blue'],
        'Size': ['S', 'S', 'M', 'M', 'M'],
        'Mixed': ['A', 0, 1.0, True, np.nan]}
df = pd.DataFrame(data)

# Filling nulls with a placeholder and converting all values to strings
df.fillna('Null', inplace=True)
df = df.astype(str)

# Defining four identical lists, one for each encoding method
columns_for_one_hot = ['Color', 'Size', 'Mixed']
columns_for_label = ['Color', 'Size', 'Mixed']
columns_for_binary = ['Color', 'Size', 'Mixed']
columns_for_frequency = ['Color', 'Size', 'Mixed']

# One-Hot Encoding Function
def apply_one_hot_encoding(df, columns):
    return pd.get_dummies(df, columns=columns)

# Label Encoding Function
def apply_label_encoding(df, columns):
    df_encoded = df.copy()
    le = LabelEncoder()
    for col in columns:
        df_encoded[col] = le.fit_transform(df_encoded[col])
    return df_encoded

# Binary Encoding Function
def apply_binary_encoding(df, columns):
    be = ce.BinaryEncoder(cols=columns)
    return be.fit_transform(df)

# Frequency Encoding Function
def apply_frequency_encoding(df, columns):
    df_freq = df.copy()
    for col in columns:
        freq = df[col].value_counts().to_dict()
        df_freq[col] = df[col].map(freq)
    return df_freq

# Applying the encoding methods
df_one_hot = apply_one_hot_encoding(df, columns_for_one_hot)
df_label = apply_label_encoding(df, columns_for_label)
df_binary = apply_binary_encoding(df, columns_for_binary)
df_freq = apply_frequency_encoding(df, columns_for_frequency)

# Printing the results
print("One-Hot Encoded Data:\n", df_one_hot)
print("\nLabel Encoded Data:\n", df_label)
print("\nBinary Encoded Data:\n", df_binary)
print("\nFrequency Encoded Data:\n", df_freq)
Bonus - Algorithms by Number of Features
Large Number of Features (1,000 - 100,000+)
- Deep Learning / Neural Networks: Excel in handling high-dimensional data, especially in domains like image and speech recognition.
- Dimensionality Reduction Techniques: Techniques like PCA or UMAP are often used to reduce the dimensionality of the data while retaining most of the information (a short PCA sketch follows this list).
- Regularized Linear Models: Models like LASSO or Ridge regression can handle high-dimensional data by adding a regularization term that discourages overfitting.
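As a quick illustration of the dimensionality-reduction idea, here is a minimal PCA sketch applied to the one-hot encoded frame from the complete code above (the choice of 3 components is arbitrary):

from sklearn.decomposition import PCA

# Project the one-hot encoded columns down to 3 components
pca = PCA(n_components=3)
reduced = pca.fit_transform(df_one_hot)
print(reduced.shape)  # (5, 3)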
Medium Number of Features (100 - 10,000)
- Gradient Boosting Machines (like XGBoost, LightGBM, CatBoost): These algorithms handle a moderate number of features well and are good at capturing complex patterns in data (see the sketch after this list).
- Random Forests: This ensemble method is effective for a medium number of features and is robust to overfitting.
- Support Vector Machines: With appropriate kernel choice, SVMs can be effective for medium-sized feature spaces, especially in cases where the data is not linearly separable.
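Since this post started with XGBoost, here is a hedged sketch of feeding the one-hot encoded frame into an XGBoost classifier. It assumes pip install xgboost, and the labels y are made up purely for illustration:

import xgboost as xgb

y = [0, 1, 1, 0, 1]  # hypothetical binary target for the five rows
model = xgb.XGBClassifier(n_estimators=10, max_depth=2)
model.fit(df_one_hot, y)
print(model.predict(df_one_hot))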
Small Number of Features (< 1000)
- Logistic Regression and Linear Regression: For simple, low-dimensional datasets, these models can be very effective. They are easy to implement and interpret (see the sketch after this list).
- Decision Trees: Can be used effectively with a small number of features. They are interpretable and can model non-linear relationships.
- Naive Bayes: This algorithm is particularly suited for small datasets and can perform well even with a limited amount of data.
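And a matching sketch for the low-dimensional case, fitting a logistic regression on the label-encoded frame from the complete code above (again with a made-up target):

from sklearn.linear_model import LogisticRegression

y = [0, 1, 1, 0, 1]  # hypothetical binary target for the five rows
clf = LogisticRegression()
clf.fit(df_label, y)
print(clf.predict(df_label))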
Mixed Dimensionality
- Ensemble Methods: Combining different models (like a random forest, a few neural networks, and gradient boosting machines) can yield good results on datasets with mixed dimensionality.
- Feature Engineering: If no method works well, try transforming the dataset into a higher- or lower-dimensional space to reduce it to one of the previous cases.