A Report by Kashish Mukheja
Introduction:
In this data analysis report, we explore datasets downloaded from the 2017-March 2020 Pre-Pandemic Examination Data of the NHANES survey. We downloaded the following five datasets:
- P_ACQ.XPT - Pre-Pandemic Questionnaire Acculturation Data
- P_AUX.XPT - Pre-Pandemic Examination Audiometry Data
- P_DEMO.XPT - Pre-Pandemic Demographics Variables and Sample Weights Data
- P_BPXO.XPT - Pre-Pandemic Examination Blood Pressure - Oscillometric Measurement Data
- P_BMX.XPT - Pre-Pandemic Examination Body Measures Data
After data cleaning, we merged the variables from all of the above sources into a single dataset. Our goal is to gain insights into the relationships and patterns present within the data and to create interactive visualizations that facilitate a deeper understanding of the dataset. Throughout the analysis, we pay careful attention to data privacy and ethical considerations, ensuring that our conclusions are drawn responsibly and transparently.
Import Libraries
We’ll start by importing the necessary libraries.
import pandas as pd
import numpy as np
from bokeh.io import output_notebook, show
from bokeh.layouts import gridplot, column
from bokeh.models import (ColumnDataSource, HoverTool, CustomJS,
                          Slider, TextInput, DataTable, TableColumn)
from bokeh.plotting import figure
Data Preparation and Merging
In this section, we detail the process of preparing and merging data from multiple XPT files into a single dataframe.
Summary:
We began by loading the data from five different XPT files, each representing a distinct dataset: P_ACQ, P_AUX, P_DEMO, P_BPXO, and P_BMX. These datasets contain information related to acculturation, audiometry, demographics, blood pressure examination, and body measures, respectively.
To ensure data quality, we performed two key data preparation steps.
Firstly, we removed columns with more than 25% null values from each dataframe, ensuring that we retained only relevant and informative features.
Secondly, we eliminated any duplicate rows within each dataframe to prevent redundancy and maintain data integrity.
After completing the data preparation steps for each individual dataframe, we merged them based on a common identifier, the “SEQN” column. This facilitated the creation of a comprehensive dataset containing consolidated information from all five datasets.
The resulting merged dataframe, named merged_df, provides a unified view of the data, enabling further analysis and exploration of relationships between different variables across the datasets. This consolidated dataset serves as a valuable resource for gaining insights and making informed decisions in subsequent data analysis tasks.
# Read P_ACQ.XPT into a dataframe
df_acq = pd.read_sas('P_ACQ.XPT', format='xport')

# Read P_AUX.XPT into a dataframe
df_aux = pd.read_sas('P_AUX.XPT', format='xport')

# Read P_DEMO.XPT into a dataframe
df_demo = pd.read_sas('P_DEMO.XPT', format='xport')

# Read P_BPXO.XPT into a dataframe
df_bpxo = pd.read_sas('P_BPXO.XPT', format='xport')

# Read P_BMX.XPT into a dataframe
df_bmx = pd.read_sas('P_BMX.XPT', format='xport')

# List of dataframes
dfs = [df_acq, df_aux, df_demo, df_bpxo, df_bmx]

# Remove columns with > 25% null values and remove duplicated rows
for df in dfs:
    # Keep a column only if at least 75% of its values are non-null
    # (cast to int since thresh is a count of required non-null values)
    df.dropna(thresh=int(len(df) * 0.75), axis=1, inplace=True)
    # Remove duplicated rows
    df.drop_duplicates(inplace=True)

# Merge dataframes on "SEQN" column
merged_df = pd.merge(df_acq, df_aux, on='SEQN', how='inner')
merged_df = pd.merge(merged_df, df_demo, on='SEQN', how='inner')
merged_df = pd.merge(merged_df, df_bpxo, on='SEQN', how='inner')
merged_df = pd.merge(merged_df, df_bmx, on='SEQN', how='inner')
merged_df
Understanding the Data
Before performing any data visualization, it is important to understand the dataframe, specifically what type of data is present. This will tell us whether any further data cleaning is required.
merged_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4605 entries, 0 to 4604
Data columns (total 82 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SEQN 4605 non-null float64
1 AUAEXSTS 4605 non-null float64
2 AUXOTSPL 4276 non-null float64
3 AUXLOEXC 4276 non-null float64
4 AUXLOIMC 4276 non-null float64
5 AUXLOCOL 4276 non-null float64
6 AUXLOABN 4276 non-null float64
7 AUXROTSP 4276 non-null float64
8 AUXROEXC 4276 non-null float64
9 AUXROIMC 4276 non-null float64
10 AUXROCOL 4276 non-null float64
11 AUXROABN 4276 non-null float64
12 AUXTMEPR 3813 non-null float64
13 AUXTPVR 3985 non-null float64
14 AUXTWIDR 3813 non-null float64
15 AUXTCOMR 3815 non-null float64
16 AUXTMEPL 3813 non-null float64
17 AUXTPVL 3983 non-null float64
18 AUXTWIDL 3815 non-null float64
19 AUXTCOML 3818 non-null float64
20 AUAEAR 4251 non-null float64
21 AUAMODE 4246 non-null float64
22 AUAFMANL 4246 non-null float64
23 AUAFMANR 4246 non-null float64
24 AUXU1K1R 4209 non-null float64
25 AUXU500R 4206 non-null float64
26 AUXU1K2R 4209 non-null float64
27 AUXU2KR 4208 non-null float64
28 AUXU3KR 4203 non-null float64
29 AUXU4KR 4204 non-null float64
30 AUXU6KR 4201 non-null float64
31 AUXU8KR 4201 non-null float64
32 AUXU1K1L 4201 non-null float64
33 AUXU500L 4200 non-null float64
34 AUXU1K2L 4201 non-null float64
35 AUXU2KL 4200 non-null float64
36 AUXU3KL 4200 non-null float64
37 AUXU4KL 4200 non-null float64
38 AUXU6KL 4200 non-null float64
39 AUXU8KL 4199 non-null float64
40 AUAREQC 4245 non-null float64
41 AUATYMTR 4605 non-null object
42 AUALEQC 4245 non-null float64
43 AUATYMTL 4605 non-null object
44 SDDSRVYR 4605 non-null float64
45 RIDSTATR 4605 non-null float64
46 RIAGENDR 4605 non-null float64
47 RIDAGEYR 4605 non-null float64
48 RIDRETH1 4605 non-null float64
49 RIDRETH3 4605 non-null float64
50 RIDEXMON 4605 non-null float64
51 DMDBORN4 4605 non-null float64
52 SIALANG 4605 non-null float64
53 SIAPROXY 4605 non-null float64
54 SIAINTRP 4605 non-null float64
55 FIALANG 4352 non-null float64
56 FIAPROXY 4352 non-null float64
57 FIAINTRP 4352 non-null float64
58 WTINTPRP 4605 non-null float64
59 WTMECPRP 4605 non-null float64
60 SDMVPSU 4605 non-null float64
61 SDMVSTRA 4605 non-null float64
62 INDFMPIR 4037 non-null float64
63 BPAOARM 4605 non-null object
64 BPAOCSZ 3996 non-null float64
65 BPXOSY1 3985 non-null float64
66 BPXODI1 3985 non-null float64
67 BPXOSY2 3965 non-null float64
68 BPXODI2 3965 non-null float64
69 BPXOSY3 3934 non-null float64
70 BPXODI3 3934 non-null float64
71 BPXOPLS1 3624 non-null float64
72 BPXOPLS2 3606 non-null float64
73 BPXOPLS3 3578 non-null float64
74 BMDSTATS 4605 non-null float64
75 BMXWT 4503 non-null float64
76 BMXHT 4498 non-null float64
77 BMXBMI 4489 non-null float64
78 BMXLEG 4298 non-null float64
79 BMXARML 4347 non-null float64
80 BMXARMC 4345 non-null float64
81 BMXWAIST 4295 non-null float64
dtypes: float64(79), object(3)
memory usage: 2.9+ MB
- Columns AUATYMTR, AUATYMTL, and BPAOARM have object dtype; we inspect the value counts of the latter two below.
merged_df['AUATYMTL'].value_counts()
AUATYMTL
b'Type A' 3583
b'' 360
b'88888' 252
b'Type AS' 154
b'Type C' 99
b'99999' 85
b'Type B' 36
b'Type AD' 36
Name: count, dtype: int64
merged_df['BPAOARM'].value_counts()
BPAOARM
b'R' 3626
b'' 962
b'L' 17
Name: count, dtype: int64
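Note that pandas’ read_sas returns SAS character columns as raw bytes, which is why the values above carry a b'' prefix. A minimal cleanup sketch follows; decoding as 'utf-8' and treating the empty, 88888, and 99999 codes as missing are assumptions for illustration, not steps from the original notebook:
# Decode byte strings and map placeholder codes to missing values
# (the encoding and the set of placeholder codes are assumptions)
for col in ['AUATYMTR', 'AUATYMTL', 'BPAOARM']:
    merged_df[col] = merged_df[col].str.decode('utf-8').str.strip()
    merged_df[col] = merged_df[col].replace({'': np.nan, '88888': np.nan, '99999': np.nan})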
Interactive Visualizations from Merged DataFrame
This section creates interactive visualizations using the Bokeh package. The code works with the merged_df DataFrame and aims to provide a visual exploration of selected columns.
# Output to notebook
output_notebook()

# Select only the columns from the top correlated pairs
selected_columns = ['AUXU1K2R', 'AUXU1K1R', 'AUXU1K2L', 'AUXU1K1L', 'WTMECPRP', 'WTINTPRP',
                    'AUXROEXC', 'AUXROTSP', 'RIDRETH3', 'RIDRETH1', 'AUXLOEXC', 'AUXOTSPL']

# Create a copy of the DataFrame to avoid the SettingWithCopyWarning
df_subset = merged_df[selected_columns].copy()

# Drop rows with NaN values from the copied DataFrame
df_subset.dropna(inplace=True)

# Convert DataFrame to ColumnDataSource
source = ColumnDataSource(df_subset)

# Scatter plot; keep the glyph renderer so the slider callback can resize the circles
scatter = figure(title="Scatter Plot", x_axis_label='AUXU1K2R', y_axis_label='AUXU1K1R')
scatter_renderer = scatter.circle('AUXU1K2R', 'AUXU1K1R', size=10, source=source,
                                  color='navy', legend_label='Correlation Value')
scatter_hover = HoverTool(tooltips=[("AUXU1K2R", "@AUXU1K2R"), ("AUXU1K1R", "@AUXU1K1R")])
scatter.add_tools(scatter_hover)

# Line plot; keep the glyph renderer so the slider callback can change the line width
line = figure(title="Line Plot", x_axis_label='WTMECPRP', y_axis_label='WTINTPRP')
line_renderer = line.line('WTMECPRP', 'WTINTPRP', source=source, line_width=2, line_color="orange")

# Convert numerical values to strings for the categorical x-range
x_values = [str(value) for value in df_subset['AUXROEXC'].unique()]

# Bar plot
bar = figure(title="Bar Plot", x_range=x_values, x_axis_label='AUXROEXC', y_axis_label='AUXROTSP')
bar.vbar(x='AUXROEXC', top='AUXROTSP', width=0.5, source=source, color='navy', legend_label='AUXROTSP')
bar.xaxis.major_label_orientation = 1.2

# Convert numerical values to strings for the categorical ranges
x_values = [str(value) for value in df_subset['RIDRETH3'].unique()]
y_values = [str(value) for value in df_subset['RIDRETH1'].unique()]

# Heatmap
heatmap = figure(title="Heatmap", x_range=x_values, y_range=y_values,
                 x_axis_label='RIDRETH3', y_axis_label='RIDRETH1')
heatmap.rect(x='RIDRETH3', y='RIDRETH1', width=1, height=1, source=source,
             fill_color='blue', line_color=None)
heatmap.xaxis.major_label_orientation = 1.2

# Stacked bar chart
stacked_bar = figure(title="Stacked Bar Chart", x_range=x_values,
                     x_axis_label='RIDRETH3', y_axis_label='AUXU1K1R')
stacked_bar.vbar_stack(stackers=selected_columns[:2], x='RIDRETH3', width=0.5,
                       color=['blue', 'orange'], source=source,
                       legend_label=['AUXU1K2R', 'AUXU1K1R'])

# Slider for controlling circle size in scatter plot
slider_circle_size = Slider(start=1, end=20, value=10, step=1, title="Circle Size")

# JavaScript callback to update circle size
# (pass the glyph renderer, not the figure, so .glyph resolves in JS)
callback_update_circle_size = CustomJS(args=dict(circle=scatter_renderer), code="""
    circle.glyph.size = cb_obj.value;
""")
slider_circle_size.js_on_change('value', callback_update_circle_size)

# Slider for controlling line width in line plot
slider_line_width = Slider(start=1, end=5, value=2, step=0.5, title="Line Width")

# JavaScript callback to update line width
callback_update_line_width = CustomJS(args=dict(line=line_renderer), code="""
    line.glyph.line_width = cb_obj.value;
""")
slider_line_width.js_on_change('value', callback_update_line_width)

# Text input for filtering bar plot
text_input_filter = TextInput(value='', title='Filter Bar Plot')

# JavaScript callback to filter the shared data source
callback_filter_bar = CustomJS(args=dict(source=source), code="""
    var data = source.data;
    var filterValue = cb_obj.value.toLowerCase().trim();  // lowercase, trimmed input
    var new_data = {'AUXU1K2R': [], 'AUXU1K1R': [], 'AUXU1K2L': [], 'AUXU1K1L': [], 'WTMECPRP': [],
                    'WTINTPRP': [], 'AUXROEXC': [], 'AUXROTSP': [], 'RIDRETH3': [], 'RIDRETH1': [],
                    'AUXLOEXC': [], 'AUXOTSPL': []};  // initialize new data object
    console.log("Entered value:", cb_obj.value);

    // Choose a column to iterate over (assuming it has the same length as the others)
    var columnLength = data['AUXU1K2R'].length;

    // Check that all columns have the same length
    for (var key in data) {
        if (data[key].length !== columnLength) {
            console.error("Columns have different lengths.");
            return;
        }
    }

    // Iterate over the rows
    for (var i = 0; i < columnLength; i++) {
        // Check whether any column contains the filter value (case-insensitive)
        var matchFound = false;
        for (var key in data) {
            if (data[key][i].toString().toLowerCase().includes(filterValue)) {
                matchFound = true;
                break;
            }
        }
        // If a match is found, include the corresponding row in the new data
        if (matchFound) {
            for (var key in new_data) {
                new_data[key].push(data[key][i]);
            }
        }
    }

    // Update the source with the filtered data
    source.data = new_data;
    source.change.emit();
""")
text_input_filter.js_on_change('value', callback_filter_bar)

# Data table
columns = [
    TableColumn(field=column, title=column) for column in selected_columns
]
data_table = DataTable(source=source, columns=columns, width=800, height=280)

# Combine plots into a grid layout
plots_grid = gridplot([[scatter, line], [bar, heatmap], [stacked_bar]])

# Combine widgets into a column layout
widgets_column = column(slider_circle_size, slider_line_width, text_input_filter)

# Combine plots grid, widgets column, and data table into a single layout
layout = column(plots_grid, widgets_column, data_table)

# Show the result
show(layout)
Summary:
The provided code performs the following tasks:
Output Target: Calls output_notebook() so the visualizations render inline in the notebook (a sketch after this summary shows how to write them to an HTML file such as “interactive_visualizations.html” instead).
Select Relevant Columns: Defines a list of columns (selected_columns) to be used for creating visualizations. These columns are selected based on their high correlation values (a sketch of how such pairs can be derived follows this summary).
Subset Data: Creates a copy of merged_df containing only the selected columns (df_subset). It drops any rows with missing values to ensure data integrity.
Convert to ColumnDataSource: Converts the subset DataFrame (df_subset) into a ColumnDataSource object (source), which is a requirement for creating Bokeh plots.
Create Visualizations:
- Scatter Plot: Plots the relationship between ‘AUXU1K2R’ and ‘AUXU1K1R’ columns.
- Line Plot: Shows the trend between ‘WTMECPRP’ and ‘WTINTPRP’ columns.
- Bar Plot: Displays the distribution of ‘AUXROTSP’ values across different levels of ‘AUXROEXC’.
- Heatmap: Represents the relationship between ‘RIDRETH3’ and ‘RIDRETH1’ columns using color intensity.
- Stacked Bar Chart: Illustrates the comparison between ‘AUXU1K2R’ and ‘AUXU1K1R’ across various levels of ‘RIDRETH3’.
- Data Table: Provides a tabular view of the selected columns for reference.
Combine Plots: Organizes the visualizations into a grid using gridplot, stacks the widgets with column, and appends the data table to form the final layout.
Show the Result: Renders the layout in the notebook, enabling interaction with the visualizations.
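If a standalone HTML file is preferred, Bokeh can write the same layout to disk instead of the notebook. A minimal sketch follows; the file name comes from the draft above, while the page title is an assumption:
from bokeh.io import output_file, save

# Write a self-contained HTML file instead of rendering inline
output_file("interactive_visualizations.html", title="NHANES interactive visualizations")  # title is illustrative
save(layout)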
This section of code facilitates the exploration and analysis of the data by providing interactive visualizations that allow for deeper insights into the relationships and distributions within the dataset.
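For reference, the “top correlated pairs” mentioned above can be derived directly from the merged data. The snippet below is a minimal sketch of one way to do so, not the original selection code; ranking by absolute correlation and keeping the ten strongest pairs are illustrative choices:
# Rank absolute pairwise correlations among the numeric columns
corr = merged_df.corr(numeric_only=True).abs()
# Keep each pair once by masking everything but the strict upper triangle
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
top_pairs = upper.stack().sort_values(ascending=False)
print(top_pairs.head(10))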
Insights and Reflections
Insights:
Scatter Plot Analysis: The scatter plot reveals a strong positive correlation between ‘AUXU1K2R’ and ‘AUXU1K1R’. This suggests that there is a linear relationship between these two variables, indicating that as one variable increases, the other tends to increase as well (a quick numeric check appears after these observations).
Line Plot Observation: The line plot shows a consistent trend between ‘WTMECPRP’ and ‘WTINTPRP’, indicating a stable relationship between the two survey weights across the sample.
Bar Plot Interpretation: The bar plot demonstrates the distribution of ‘AUXROTSP’ values across different categories of ‘AUXROEXC’. It highlights potential variations in ‘AUXROTSP’ based on the levels of ‘AUXROEXC’, providing insights into potential factors influencing ‘AUXROTSP’.
Heatmap Exploration: The heatmap visualizes the relationship between ‘RIDRETH3’ and ‘RIDRETH1’, showcasing potential patterns or clusters within the data. It allows for the identification of any trends or disparities across different demographic groups.
Stacked Bar Chart Insight: The stacked bar chart provides a comparison of ‘AUXU1K2R’ and ‘AUXU1K1R’ across various levels of ‘RIDRETH3’. It enables the exploration of differences or similarities in the distribution of these variables within different demographic groups.
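As noted above, the scatter-plot observation can be checked numerically. This snippet is an added sketch, not part of the original analysis, and the printed format is illustrative:
# Quantify the claimed positive correlation between the two audiometry columns
r = df_subset['AUXU1K2R'].corr(df_subset['AUXU1K1R'])
print(f"Pearson r between AUXU1K2R and AUXU1K1R: {r:.3f}")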
Reflections on Privacy and Ethics:
Data Privacy Concerns: While analyzing and visualizing data, it’s crucial to ensure the protection of individuals’ privacy. Even though the dataset may be anonymized, the insights gained from the analysis could potentially lead to re-identification of individuals, especially in sensitive demographic variables like race or ethnicity (‘RIDRETH3’ and ‘RIDRETH1’). Care must be taken to avoid unintentional disclosure of sensitive information.
Ethical Considerations: Ethical issues may arise when interpreting and communicating the results of data analysis. It’s essential to present findings accurately and responsibly, avoiding any biased interpretations or misrepresentations. Additionally, decisions based on data analysis should prioritize fairness and equity to avoid perpetuating existing inequalities or biases present in the data.
Informed Consent: If the data used for analysis contains personally identifiable information, obtaining informed consent from the individuals involved is crucial. Individuals should be informed about how their data will be used and have the option to opt out if they wish to protect their privacy.
Data Security: Ensuring the security of the data throughout the analysis process is paramount. Measures such as encryption, access controls, and secure data storage should be implemented to prevent unauthorized access or data breaches.
Transparency and Accountability: Transparency in data analysis practices, including documentation of methodologies and assumptions, fosters accountability and trust. Researchers should be transparent about the sources of data, analytical techniques employed, and any limitations or biases present in the data.
In conclusion, while data analysis provides valuable insights, it’s essential to approach it ethically and responsibly, considering the potential privacy implications and ensuring that the rights and dignity of individuals are respected throughout the process.
Conclusion:
In conclusion, this data analysis project has provided valuable insights into the relationships and distributions within the dataset. Through the creation of interactive visualizations, we have uncovered meaningful patterns and trends that contribute to a better understanding of the data. However, it’s important to recognize the ethical implications of data analysis and to prioritize privacy and fairness in our approach. By adhering to ethical guidelines and promoting transparency, we can ensure that our data analysis efforts contribute positively to the field while upholding the rights and dignity of individuals.