Encoding Columns in a DataFrame

Author

Kashish Mukheja

Published

Thursday, 08 June 2023

Short Notes: Encoding columns in a dataframe

Many blogs already cover OneHotEncoding, LabelEncoding, and other encoding techniques. This one focuses on encoding one or more columns of a dataframe in a single operation, using the ColumnTransformer API provided by scikit-learn.

Let’s begin 😀


Importing Libraries

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

Let’s create a dummy dataframe

We will create a dataframe named employees_df with the columns field, salary, avg_years_of_exp, and gender_category. The gender_category column holds either Male or Female, whichever has the higher proportion in that particular field.

employees_df = pd.DataFrame({
    'field': ['Tech', 'Finance', 'HR', 'Marketing', 'Sales', 'BioTech'],
    'salary': ['high', 'high', 'low', 'medium', 'medium', 'high'],
    'avg_years_of_exp': [4, 6, 5, 8, 8, 10],
    'gender_category': ['Male', 'Female', 'Female', 'Male', 'Male', 'Female'],  # majority gender (Male or Female) in each field
})
  1. field and gender_category are non-ordinal categorical features.
  2. salary is an ordinal categorical feature.
  3. avg_years_of_exp looks like a categorical feature as well, but in a realistic dataset with thousands of records, and possibly floating-point values, it would not be treated as one. We could create a year_experience_range column containing ranges of experience (e.g., 0-3, 4-6, etc.) and treat that as a categorical feature, but we will ignore that for now.
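Although the post skips it, the range column mentioned in point 3 could be built with pandas' `pd.cut`. The following is only a minimal sketch: the bin edges, the labels, and the variable name `year_experience_range` are assumptions chosen for illustration, not part of the original dataframe.

```python
import pandas as pd

# Hypothetical bin edges and labels, chosen only for illustration
experience_bins = [0, 3, 6, 9, 12]
experience_labels = ['0-3', '4-6', '7-9', '10-12']

avg_years_of_exp = pd.Series([4, 6, 5, 8, 8, 10], name='avg_years_of_exp')

# With the default right=True, each bin includes its upper edge: (0, 3], (3, 6], ...
year_experience_range = pd.cut(
    avg_years_of_exp, bins=experience_bins, labels=experience_labels
)
print(year_experience_range.tolist())
```

The resulting column is a pandas Categorical, so it could then be handled by an ordinal or one-hot encoder like any other categorical feature.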

Creating Ordinal Feature and OrdinalEncoder

An ordinal feature is a categorical column whose values have a natural order or hierarchy. For instance: (1) rank as 1, 2, or 3; (2) salary as low, medium, or high; (3) height as tall, taller, or tallest; and so on.

ordinal_feature = ['salary']
ordinal_transformer = OrdinalEncoder()
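One thing worth knowing here: when no categories are given, OrdinalEncoder sorts the values it finds, so for strings the codes follow alphabetical order (high, low, medium) rather than the intended hierarchy. If the order low < medium < high matters for your model, it can be passed explicitly through the `categories` parameter. The sketch below compares both behaviours on a small standalone dataframe; the names `salary_df` and `ordered_encoder` are made up for this example, and the rest of the post keeps using the default encoder.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

salary_df = pd.DataFrame({'salary': ['high', 'high', 'low', 'medium', 'medium', 'high']})

# Default behaviour: categories are sorted, so high -> 0, low -> 1, medium -> 2
default_codes = OrdinalEncoder().fit_transform(salary_df)

# Explicit order: low -> 0, medium -> 1, high -> 2
ordered_encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
ordered_codes = ordered_encoder.fit_transform(salary_df)

print(default_codes.ravel())
print(ordered_codes.ravel())
```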

Creating Non Ordinal Feature and OneHotEncoder

non_ordinal_categorical_features = ['field', 'gender_category']
non_ordinal_categorical_transformer = OneHotEncoder(handle_unknown="ignore")
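To see what handle_unknown="ignore" does, here is a small sketch that fits an encoder on three fields and then transforms a category it never saw. The value 'Legal' and the variable names are hypothetical and used only for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'field': ['Tech', 'Finance', 'HR']})
unseen = pd.DataFrame({'field': ['Legal']})  # hypothetical category not seen during fit

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# The unknown category is encoded as a row of zeros instead of raising an error
unseen_encoded = encoder.transform(unseen).toarray()
print(unseen_encoded)
```

Without handle_unknown="ignore", the default setting ("error") would raise a ValueError on the unseen value.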

Creating Column Transformer

We pass ordinal_transformer and non_ordinal_categorical_transformer to the ColumnTransformer, each paired with a name and the list of columns it should be applied to.

column_transformer = ColumnTransformer(transformers=[
    ('ordinal', ordinal_transformer, ordinal_feature),
    ('non_ordinal_category', non_ordinal_categorical_transformer, non_ordinal_categorical_features)],
                                      remainder='drop')
  • remainder='drop' drops all remaining columns that are not listed in any transformer. If you want to keep the remaining columns as they are, you may provide remainder='passthrough'.
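As a quick illustration of the passthrough option, the sketch below uses a reduced, made-up dataframe with only two columns. The names `df` and `passthrough_transformer` are assumptions for this example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    'salary': ['high', 'low', 'medium'],
    'avg_years_of_exp': [4, 5, 8],
})

passthrough_transformer = ColumnTransformer(
    transformers=[('ordinal', OrdinalEncoder(), ['salary'])],
    remainder='passthrough',  # keep avg_years_of_exp untouched
)

# Transformed columns come first, passed-through columns are appended after them
result = passthrough_transformer.fit_transform(df)
print(result)
```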
pd.DataFrame(column_transformer.fit_transform(employees_df))
0 1 2 3 4 5 6 7 8
0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0
1 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
2 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0
3 2.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0
4 2.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0
5 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0

As you can see, the output no longer tells us which column corresponds to which value in the original dataframe. To fix that, we will make a couple of tweaks.

Creating the final Transformer with Columns intact and understandable

non_ordinal_categorical_transformer = OneHotEncoder(sparse_output=False, handle_unknown="ignore") # New code added

# Note: sparse_output=False makes the encoder return a dense array, which is required because pandas output (set_output(transform='pandas')) does not support sparse matrices.

column_transformer = ColumnTransformer(transformers=[
    ('ordinal', ordinal_transformer, ordinal_feature),
    ('non_ordinal_category', non_ordinal_categorical_transformer, non_ordinal_categorical_features)],
                                      remainder='drop') # This remains same

column_transformer.set_output(transform='pandas') # New code added
ColumnTransformer(transformers=[('ordinal', OrdinalEncoder(), ['salary']),
                                ('non_ordinal_category',
                                 OneHotEncoder(handle_unknown='ignore',
                                               sparse_output=False),
                                 ['field', 'gender_category'])])
df_pandas = column_transformer.fit_transform(employees_df)
df_pandas
ordinal__salary non_ordinal_category__field_BioTech non_ordinal_category__field_Finance non_ordinal_category__field_HR non_ordinal_category__field_Marketing non_ordinal_category__field_Sales non_ordinal_category__field_Tech non_ordinal_category__gender_category_Female non_ordinal_category__gender_category_Male
0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0
1 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
2 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0
3 2.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0
4 2.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0
5 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0