Short Notes: Encoding columns in a dataframe
While numerous blogs already cover OneHotEncoding, LabelEncoding, and other encoding techniques, this one concentrates on efficiently encoding one or more columns of a dataframe in a single operation, using the ColumnTransformer API provided by scikit-learn.
Let’s begin 😀
Importing Libraries
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
Let’s create a dummy dataframe
We will create a dataframe named employees_df with the columns field, salary, avg_years_of_exp, and gender_category. The gender_category column holds either Male or Female, whichever has the higher proportion in that particular field.
employees_df = pd.DataFrame({
    'field': ['Tech', 'Finance', 'HR', 'Marketing', 'Sales', 'BioTech'],
    'salary': ['high', 'high', 'low', 'medium', 'medium', 'high'],
    'avg_years_of_exp': [4, 6, 5, 8, 8, 10],
    'gender_category': ['Male', 'Female', 'Female', 'Male', 'Male', 'Female'],  # max(Male, Female) gender for each field
})
field and gender_category are non-ordinal categorical features.
salary is an ordinal categorical feature.
avg_years_of_exp looks like a categorical feature as well, but in the bigger picture, with thousands of records and possibly floating-point values, it would not be treated as one. We could create a year_experience_range column holding ranges of experience (e.g., 0-3, 4-6, etc.) and treat that as a categorical feature, but we will ignore that for now.
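Although the post sets the idea aside, the range column mentioned above could be sketched with pd.cut. The bin edges, the labels, and the name year_experience_range are assumptions chosen for illustration only, not part of the original dataframe.

```python
import pandas as pd

# Same experience values as in employees_df
experience = pd.Series([4, 6, 5, 8, 8, 10], name="avg_years_of_exp")

# Hypothetical bin edges and labels, picked only for this sketch
bins = [0, 3, 6, 9, 12]
labels = ["0-3", "4-6", "7-9", "10-12"]

# Each bin includes its right edge by default; include_lowest=True
# also lets the first bin include the value 0
year_experience_range = pd.cut(experience, bins=bins, labels=labels, include_lowest=True)
print(year_experience_range.tolist())
```

The result is a pandas categorical column that could then be encoded like any other categorical feature.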
Creating Ordinal Feature and OrdinalEncoder
An ordinal feature is a categorical column whose values have a natural order or hierarchy. For instance: (1) rank as 1, 2, or 3; (2) salary as low, medium, or high; (3) height as tall, taller, or tallest; and so on.
ordinal_feature = ['salary']
# With default settings, OrdinalEncoder sorts categories alphabetically (high=0, low=1, medium=2)
ordinal_transformer = OrdinalEncoder()
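One thing worth noting: with default settings, OrdinalEncoder assigns codes in alphabetical order, so the codes for salary do not follow low < medium < high. If the true ranking matters to the downstream model, the order can be passed through the categories parameter. The sketch below uses a small standalone dataframe (salary_df is a name made up for this example) to compare both behaviours.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Standalone copy of the salary column, for illustration
salary_df = pd.DataFrame({"salary": ["high", "high", "low", "medium", "medium", "high"]})

# Default behaviour: categories sorted alphabetically -> high=0, low=1, medium=2
default_codes = OrdinalEncoder().fit_transform(salary_df).ravel()

# Explicit order: one list per encoded column, from lowest to highest
ordered_encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
ordered_codes = ordered_encoder.fit_transform(salary_df).ravel()

print(default_codes.tolist())  # [0.0, 0.0, 1.0, 2.0, 2.0, 0.0]
print(ordered_codes.tolist())  # [2.0, 2.0, 0.0, 1.0, 1.0, 2.0]
```

The rest of this post keeps the default encoder, so the outputs shown later use the alphabetical codes.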
Creating Non Ordinal Feature and OneHotEncoder
non_ordinal_categorical_features = ['field', 'gender_category']
non_ordinal_categorical_transformer = OneHotEncoder(handle_unknown="ignore")
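To illustrate what handle_unknown="ignore" does, here is a minimal sketch in which the encoder meets a category it never saw during fitting. The train and unseen dataframes and the 'Legal' value are made up for this example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"field": ["Tech", "Finance", "HR"]})
unseen = pd.DataFrame({"field": ["Legal"]})  # hypothetical category absent from training

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# The unknown category is encoded as a row of all zeros instead of raising an error
encoded = encoder.transform(unseen).toarray()
print(encoder.categories_[0].tolist())  # ['Finance', 'HR', 'Tech']
print(encoded.tolist())                 # [[0.0, 0.0, 0.0]]
```

With the default handle_unknown="error", the same transform call would raise a ValueError instead.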
Creating Column Transformer
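We pass each transformer to the ColumnTransformer as a (name, transformer, columns) tuple, pairing ordinal_transformer and non_ordinal_categorical_transformer with the columns they should encode.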
column_transformer = ColumnTransformer(transformers=[
    ('ordinal', ordinal_transformer, ordinal_feature),
    ('non_ordinal_category', non_ordinal_categorical_transformer, non_ordinal_categorical_features)],
    remainder='drop')
remainder='drop' drops all remaining columns that do not need to be transformed. If you want to keep the remaining columns as they are, you can pass remainder='passthrough'.
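As a rough sketch of the passthrough option, the example below encodes only salary and lets avg_years_of_exp through untouched. The three-row dataframe and the name passthrough_transformer are assumptions made for this illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Small illustrative dataframe
df = pd.DataFrame({
    "salary": ["high", "low", "medium"],
    "avg_years_of_exp": [4, 5, 8],
})

passthrough_transformer = ColumnTransformer(
    transformers=[("ordinal", OrdinalEncoder(), ["salary"])],
    remainder="passthrough",  # keep columns that are not listed above
)

# Transformed columns come first, passed-through columns are appended after them
result = passthrough_transformer.fit_transform(df)
print(result.shape)  # (3, 2)
```

The first output column holds the encoded salary and the second holds the original avg_years_of_exp values.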
pd.DataFrame(column_transformer.fit_transform(employees_df))
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 1 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 2.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 5 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
As you can see, we cannot really tell which output column corresponds to which value from the original dataframe. To fix that, we will make a couple of tweaks.
Creating the final Transformer with Columns intact and understandable
non_ordinal_categorical_transformer = OneHotEncoder(sparse_output=False, handle_unknown="ignore")  # New code added
# Note: sparse_output=False is required because pandas output does not support sparse data.
column_transformer = ColumnTransformer(transformers=[
    ('ordinal', ordinal_transformer, ordinal_feature),
    ('non_ordinal_category', non_ordinal_categorical_transformer, non_ordinal_categorical_features)],
    remainder='drop')  # This remains the same
column_transformer.set_output(transform='pandas')  # New code added
ColumnTransformer(transformers=[('ordinal', OrdinalEncoder(), ['salary']), ('non_ordinal_category', OneHotEncoder(handle_unknown='ignore', sparse_output=False), ['field', 'gender_category'])])
df_pandas = column_transformer.fit_transform(employees_df)
df_pandas
| | ordinal__salary | non_ordinal_category__field_BioTech | non_ordinal_category__field_Finance | non_ordinal_category__field_HR | non_ordinal_category__field_Marketing | non_ordinal_category__field_Sales | non_ordinal_category__field_Tech | non_ordinal_category__gender_category_Female | non_ordinal_category__gender_category_Male |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 1 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 2.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 5 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |