Kash’s Portfolio - Encoding Columns in a DataFrame

Short Notes: Encoding columns in a dataframe

While there are numerous existing blogs detailing OneHotEncoding, LabelEncoding, and other encoding techniques, this blog will specifically concentrate on efficiently encoding one or multiple columns of a dataframe in a single operation. This is achieved through the use of the ColumnTransformer API provided by scikit-learn.

Let’s begin 😀

Importing Libraries

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

Let’s create a dummy dataframe

We will create a dataframe named employees_df with the columns field, salary, avg_years_of_exp, and gender_category. The gender_category column holds whichever of Male/Female has the higher proportion in that particular field.

employees_df = pd.DataFrame({
    'field': ['Tech', 'Finance', 'HR', 'Marketing', 'Sales','BioTech'],
    'salary': ['high', 'high', 'low', 'medium', 'medium', 'high'],
    'avg_years_of_exp': [4, 6, 5, 8, 8, 10],
    'gender_category': ['Male', 'Female', 'Female', 'Male', 'Male', 'Female'],  # majority gender (Male or Female) in each field
})

field and gender_category are non-ordinal (nominal) categorical features.
salary is an ordinal categorical feature.
avg_years_of_exp looks like a categorical feature as well, but in a realistic dataset with thousands of records, possibly stored as floating-point values, it would not be treated as one. We could create a year_experience_range column containing ranges of experience (e.g., 0-3, 4-6, etc.) and treat that as a categorical feature, but we will ignore that for now.
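
Although we skip it in this post, the binning idea for avg_years_of_exp could be sketched with pd.cut. The bin edges and labels below are assumptions chosen only for illustration, and the labels read naturally only for whole-number years.

```python
import pandas as pd

employees_df = pd.DataFrame({
    'field': ['Tech', 'Finance', 'HR', 'Marketing', 'Sales', 'BioTech'],
    'avg_years_of_exp': [4, 6, 5, 8, 8, 10],
})

# Assumed bin edges; pd.cut uses right-inclusive intervals by default:
# (0, 3], (3, 6], (6, 9], (9, 12]
bins = [0, 3, 6, 9, 12]
labels = ['0-3', '4-6', '7-9', '10-12']

employees_df['year_experience_range'] = pd.cut(
    employees_df['avg_years_of_exp'],
    bins=bins,
    labels=labels,
    include_lowest=True,  # so a value of exactly 0 lands in the first bin
)
```

The new column is an ordered pandas Categorical, so it could later be encoded like any other ordinal feature.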

Creating Ordinal Feature and OrdinalEncoder

Ordinal refers to a column that is categorical but whose values follow a natural order or hierarchy. For instance: (1) rank as 1, 2, or 3; (2) salary as low, medium, or high; (3) height as tall, taller, tallest, and so on.

ordinal_feature = ['salary']
ordinal_transformer = OrdinalEncoder()
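
One thing to keep in mind: with no arguments, OrdinalEncoder sorts string categories alphabetically, so salary is encoded as high=0, low=1, medium=2. If the numeric codes should follow the real ranking, the order can be passed through the categories parameter. Below is a minimal sketch on the salary column alone; the name salary_df is only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

salary_df = pd.DataFrame({'salary': ['high', 'high', 'low', 'medium', 'medium', 'high']})

# Default behaviour: categories are sorted alphabetically (high, low, medium)
default_encoder = OrdinalEncoder()
default_codes = default_encoder.fit_transform(salary_df)

# Explicit order: one list of categories per encoded column, lowest rank first
ordered_encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
ordered_codes = ordered_encoder.fit_transform(salary_df)

print(default_codes.ravel())
print(ordered_codes.ravel())
```

The rest of this post keeps the default encoder, so the outputs shown further down use the alphabetical codes.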

Creating Non-Ordinal Features and OneHotEncoder

non_ordinal_categorical_features = ['field', 'gender_category']
non_ordinal_categorical_transformer = OneHotEncoder(handle_unknown="ignore")
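
To see what handle_unknown="ignore" does, here is a small sketch in which the encoder is fitted on three fields and then asked to transform a category it has never seen. The value 'Legal' and the names train_df and new_df are hypothetical and used only for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_df = pd.DataFrame({'field': ['Tech', 'Finance', 'HR']})
new_df = pd.DataFrame({'field': ['HR', 'Legal']})  # 'Legal' was not seen during fit

encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train_df)

# The default output is a sparse matrix; convert it to a dense array for display
encoded_new = encoder.transform(new_df).toarray()
print(encoder.categories_)
print(encoded_new)
```

The unseen category becomes a row of all zeros instead of raising an error, which is what happens with the default handle_unknown="error".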

Creating Column Transformer

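We pass each transformer to ColumnTransformer as a (name, transformer, columns) tuple, so that ordinal_transformer is applied to ordinal_feature and non_ordinal_categorical_transformer is applied to non_ordinal_categorical_features.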

column_transformer = ColumnTransformer(transformers=[
    ('ordinal', ordinal_transformer, ordinal_feature),
    ('non_ordinal_category', non_ordinal_categorical_transformer, non_ordinal_categorical_features)],
                                      remainder='drop')

remainder='drop' (the default) drops all remaining columns that are not assigned to any transformer. If you want to keep the remaining columns as they are, pass remainder='passthrough'.

pd.DataFrame(column_transformer.fit_transform(employees_df))

	0	1	2	3	4	5	6	7	8
0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	1.0
1	0.0	0.0	1.0	0.0	0.0	0.0	0.0	1.0	0.0
2	1.0	0.0	0.0	1.0	0.0	0.0	0.0	1.0	0.0
3	2.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	1.0
4	2.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	1.0
5	0.0	1.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0

As you can see, we cannot tell which output column corresponds to which value from the original dataframe. To fix this, we will make a couple of tweaks.

Creating the final Transformer with Columns intact and understandable

non_ordinal_categorical_transformer = OneHotEncoder(sparse_output=False, handle_unknown="ignore") # New code added

# Note: sparse_output=False is required because pandas output (set via set_output below) does not support sparse data. The prefixed column names come from the pandas output itself.

column_transformer = ColumnTransformer(transformers=[
    ('ordinal', ordinal_transformer, ordinal_feature),
    ('non_ordinal_category', non_ordinal_categorical_transformer, non_ordinal_categorical_features)],
                                      remainder='drop') # This remains same

column_transformer.set_output(transform='pandas') # New code added

ColumnTransformer(transformers=[('ordinal', OrdinalEncoder(), ['salary']),
                                ('non_ordinal_category',
                                 OneHotEncoder(handle_unknown='ignore',
                                               sparse_output=False),
                                 ['field', 'gender_category'])])


df_pandas = column_transformer.fit_transform(employees_df)
df_pandas

	ordinal__salary	non_ordinal_category__field_BioTech	non_ordinal_category__field_Finance	non_ordinal_category__field_HR	non_ordinal_category__field_Marketing	non_ordinal_category__field_Sales	non_ordinal_category__field_Tech	non_ordinal_category__gender_category_Female	non_ordinal_category__gender_category_Male
0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	1.0
1	0.0	0.0	1.0	0.0	0.0	0.0	0.0	1.0	0.0
2	1.0	0.0	0.0	1.0	0.0	0.0	0.0	1.0	0.0
3	2.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	1.0
4	2.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	1.0
5	0.0	1.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0
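
The one-hot columns are now self-explanatory, but ordinal__salary still shows only numeric codes. One way to recover what each code means is to inspect the fitted encoder through named_transformers_. This is a minimal sketch that refits the same transformer; salary_mapping is just an illustrative name.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

employees_df = pd.DataFrame({
    'field': ['Tech', 'Finance', 'HR', 'Marketing', 'Sales', 'BioTech'],
    'salary': ['high', 'high', 'low', 'medium', 'medium', 'high'],
    'gender_category': ['Male', 'Female', 'Female', 'Male', 'Male', 'Female'],
})

column_transformer = ColumnTransformer(transformers=[
    ('ordinal', OrdinalEncoder(), ['salary']),
    ('non_ordinal_category', OneHotEncoder(handle_unknown='ignore'), ['field', 'gender_category'])],
    remainder='drop')
column_transformer.fit(employees_df)

# Fitted sub-transformers are available by the names given in the tuples
ordinal_encoder = column_transformer.named_transformers_['ordinal']

# categories_ holds one array per encoded column; the position is the code
salary_categories = ordinal_encoder.categories_[0]
salary_mapping = {category: float(code) for code, category in enumerate(salary_categories)}
print(salary_mapping)
```

Here the mapping comes out as high=0.0, low=1.0, medium=2.0, which matches the values in the ordinal__salary column above.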