While there are numerous existing blogs detailing OneHotEncoding, LabelEncoding, and other encoding techniques, this blog will specifically concentrate on efficiently encoding one or multiple columns of a dataframe in a single operation. This is achieved through the use of the ColumnTransformer API provided by scikit-learn.

Let’s begin 😀


Importing Libraries

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

Let’s create a dummy dataframe

We will create a dataframe named employees_df with the columns field, salary, avg_years_of_exp, and gender_category. The gender_category column holds either Male or Female, whichever has the higher proportion in that particular field.

employees_df = pd.DataFrame({
    'field': ['Tech', 'Finance', 'HR', 'Marketing', 'Sales','BioTech'],
    'salary': ['high', 'high', 'low', 'medium', 'medium', 'high'],
    'avg_years_of_exp': [4, 6, 5, 8, 8, 10],
    'gender_category': ['Male', 'Female', 'Female', 'Male', 'Male', 'Female'], # max(Male, Female) gender for each field
})

  1. field and gender_category are non-ordinal categorical features.
  2. salary is an ordinal categorical feature.
  3. avg_years_of_exp looks like a categorical feature as well, but in a realistic dataset with thousands of records, possibly stored as floating-point values, it would not be treated as one. We could create a year_experience_range column containing ranges of experience (e.g., 0-3, 4-6) and treat that as a categorical feature, but we will ignore that for now.
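
Although the post does not use it further, the range column mentioned in point 3 could be sketched roughly as follows. The bin edges and labels here are illustrative assumptions, not values from the original dataset.

```python
import pandas as pd

# Same experience values as in employees_df
experience = pd.Series([4, 6, 5, 8, 8, 10], name='avg_years_of_exp')

# Assumed bin edges and labels, chosen only for illustration.
# With the default right=True, each bin includes its right edge: (0, 3], (3, 6], (6, 10]
year_experience_range = pd.cut(
    experience,
    bins=[0, 3, 6, 10],
    labels=['0-3', '4-6', '7-10'],
)

print(year_experience_range.tolist())
```

The result is a pandas categorical column, which could then be encoded like any other categorical feature.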

Creating Ordinal Feature and OrdinalEncoder

An ordinal feature is a categorical column whose values have a natural order or hierarchy. For instance: (1) rank as 1, 2, or 3; (2) salary as low, medium, or high; (3) height as tall, taller, or tallest.

ordinal_feature = ['salary']
ordinal_transformer = OrdinalEncoder()
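
One thing to be aware of: with default settings, OrdinalEncoder sorts the categories alphabetically, so high, low, and medium become 0, 1, and 2, which does not match their real-world order. If the order matters for your model, one option is to pass it explicitly through the categories parameter. The snippet below is a minimal sketch of the difference; salary_df and ordered_encoder are names introduced here for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

salary_df = pd.DataFrame({'salary': ['high', 'high', 'low', 'medium', 'medium', 'high']})

# Default behaviour: categories are sorted alphabetically -> high=0, low=1, medium=2
default_codes = OrdinalEncoder().fit_transform(salary_df).ravel().tolist()

# Explicit order: one list per encoded column, from lowest to highest
ordered_encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
ordered_codes = ordered_encoder.fit_transform(salary_df).ravel().tolist()

print(default_codes)
print(ordered_codes)
```

The rest of this post keeps the default encoder, so the outputs shown later use the alphabetical codes.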

Creating Non Ordinal Feature and OneHotEncoder

non_ordinal_categorical_features = ['field', 'gender_category']
non_ordinal_categorical_transformer = OneHotEncoder(handle_unknown="ignore")
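
To see what handle_unknown="ignore" does, here is a small sketch. The 'Legal' category is made up for illustration; it stands in for any value that shows up at transform time but was not present during fit.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'field': ['Tech', 'Finance', 'HR']})
unseen = pd.DataFrame({'field': ['Legal']})  # hypothetical category not seen during fit

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# The default output is sparse; toarray() converts it to a dense NumPy array
encoded_unseen = encoder.transform(unseen).toarray()

print(encoder.categories_)  # learned categories, sorted alphabetically
print(encoded_unseen)       # the unseen category becomes a row of zeros
```

Without handle_unknown="ignore", the default setting ("error") would raise an error for the unseen value instead.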

Creating Column Transformer

We pass ordinal_transformer and non_ordinal_categorical_transformer to the ColumnTransformer, each together with the list of columns it should encode.

column_transformer = ColumnTransformer(transformers=[
    ('ordinal', ordinal_transformer, ordinal_feature),
    ('non_ordinal_category', non_ordinal_categorical_transformer, non_ordinal_categorical_features)],
                                      remainder='drop')

  • remainder='drop' drops all remaining columns that do not need to be transformed. If you want to keep the remaining columns as they are, you may provide remainder='passthrough'.

Fitting this transformer on employees_df and transforming it produces the following output:

0 1 2 3 4 5 6 7 8
0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0
1 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
2 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0
3 2.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0
4 2.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0
5 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0

As you can see, it is hard to tell which column represents which value from the original dataframe. To fix this, we will make a couple of tweaks.

Creating the final Transformer with Columns intact and understandable

non_ordinal_categorical_transformer = OneHotEncoder(sparse_output=False, handle_unknown="ignore") # New code added

# Note: sparse_output=False is required because set_output(transform='pandas') below cannot build a dataframe from sparse output.

column_transformer = ColumnTransformer(transformers=[
    ('ordinal', ordinal_transformer, ordinal_feature),
    ('non_ordinal_category', non_ordinal_categorical_transformer, non_ordinal_categorical_features)],
                                      remainder='drop') # This remains same

column_transformer.set_output(transform='pandas') # New code added
ColumnTransformer(transformers=[('ordinal', OrdinalEncoder(), ['salary']),
                                ('non_ordinal_category',
                                 OneHotEncoder(handle_unknown='ignore',
                                               sparse_output=False),
                                 ['field', 'gender_category'])])
df_pandas = column_transformer.fit_transform(employees_df)
ordinal__salary non_ordinal_category__field_BioTech non_ordinal_category__field_Finance non_ordinal_category__field_HR non_ordinal_category__field_Marketing non_ordinal_category__field_Sales non_ordinal_category__field_Tech non_ordinal_category__gender_category_Female non_ordinal_category__gender_category_Male
0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0
1 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
2 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0
3 2.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0
4 2.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0
5 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
