import matplotlib.pyplot as plt # We import matplotlib to generate bar plots
import numpy as np
from scipy.stats import skew
from scipy.stats import spearmanr
A Report by …
@Author: Kashish Mukheja
Introduction
In this report, we explore the NHANES adult male and female datasets to analyze the relationship between various anthropometric measurements and body mass index (BMI). The analysis includes handling missing values, comparing BMI distributions between the two genders, calculating Pearson’s and Spearman’s correlation coefficients, and visualizing them as correlation heatmaps.
Data Preparation
To begin, let’s import the necessary libraries, numpy in this case, and then use numpy.genfromtxt to read the CSV files nhanes_adult_male_bmx_2020.csv and nhanes_adult_female_bmx_2020.csv into numpy matrices named male and female.
# Read the CSV files into numpy matrices
male = np.genfromtxt('nhanes_adult_male_bmx_2020.csv', delimiter=',')
female = np.genfromtxt('nhanes_adult_female_bmx_2020.csv', delimiter=',')
male.shape
(4082, 7)
The male and female matrices hold the data for adult males and females, respectively, each with seven columns: 1. weight (kg), 2. standing height (cm), 3. upper arm length (cm), 4. upper leg length (cm), 5. arm circumference (cm), 6. hip circumference (cm), and 7. waist circumference (cm).
These matrices will serve as the foundation for our further analysis.
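One possible source of the NaN values handled later in this report is a header row: np.genfromtxt parses every field as a float by default, so non-numeric text becomes NaN. The sketch below illustrates this on a tiny in-memory CSV. The column names and values are made up for illustration, and whether the NHANES files actually contain such a header is an assumption, not something confirmed here.

```python
import io
import numpy as np

# Hypothetical two-column CSV with a header row (illustrative values only)
csv_text = "weight,height\n98.8,182.3\n74.3,184.2\n"

# Default parsing: the header fields cannot be converted to float,
# so the first row of the result is all NaN
raw = np.genfromtxt(io.StringIO(csv_text), delimiter=',')

# skip_header=1 drops the first line before parsing
clean = np.genfromtxt(io.StringIO(csv_text), delimiter=',', skip_header=1)

print(raw.shape, clean.shape)
```

If the files do have a header, skipping it at load time would be an alternative to filtering NaN rows afterwards.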
Calculating and Adding Body Mass Index (BMI)
In this section, we compute the BMI for each participant based on their weight and standing height, and then augment the matrices with an eighth column to store these BMI values.
To calculate the body mass index (BMI) for each participant, we divide weight (in kg) by the square of standing height (in m). Since heights are recorded in cm, we divide them by 100 first:
\[ \text{BMI} = \frac{\text{weight}}{(\text{standing height} / 100)^2} \]
We apply this formula to both male and female participants and append the result as an eighth column of each matrix.
# Calculate BMI for males and add as the eighth column
male_bmi = male[:, 0] / (male[:, 1] / 100)**2
male = np.column_stack((male, male_bmi))
# Calculate BMI for females and add as the eighth column
female_bmi = female[:, 0] / (female[:, 1] / 100)**2
female = np.column_stack((female, female_bmi))
male[1], female[1], male.shape, female.shape
(array([ 98.8 , 182.3 , 42. , 40.1 ,
38.2 , 108.2 , 120.4 , 29.72922633]),
array([ 97.1 , 160.2 , 34.7 , 40.8 ,
35.8 , 126.1 , 117.9 , 37.83504078]),
(4082, 8),
(4222, 8))
Now, the male and female matrices have been updated to include an eighth column storing the BMI values for each participant. We can proceed with further analysis or visualization using these augmented data.
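As a quick sanity check of the formula, the sketch below recomputes the BMI for the two rows printed above, using only their weight and standing height. The variable names are chosen for illustration.

```python
import numpy as np

# Weight (kg) and standing height (cm) of the two rows printed above
weight = np.array([98.8, 97.1])
height_cm = np.array([182.3, 160.2])

# BMI = weight / (height in metres)^2
bmi = weight / (height_cm / 100) ** 2
print(bmi)
```

The results should agree with the eighth entries of the printed rows.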
Visualizing BMI Distribution by Gender
This section illustrates the distribution of BMI values for adult males and females through histograms. The code utilizes matplotlib to create two subplots within a single figure, each displaying the BMI distribution for males and females, respectively.
# Create a figure and two subplots
fig, axs = plt.subplots(2, 1, figsize=(8, 6))
# Plot histogram for male BMIs
axs[0].hist(male[:, 7], bins=20, color='blue', alpha=0.7)
axs[0].set_title('Male BMI Distribution')
axs[0].set_xlabel('BMI')
axs[0].set_ylabel('Frequency')
# Plot histogram for female BMIs
axs[1].hist(female[:, 7], bins=20, color='magenta', alpha=0.7)
axs[1].set_title('Female BMI Distribution')
axs[1].set_xlabel('BMI')
axs[1].set_ylabel('Frequency')
# Calculate appropriate x-axis limits
min_bmi = min(np.nanmin(male[:, 7]), np.nanmin(female[:, 7]))
max_bmi = max(np.nanmax(male[:, 7]), np.nanmax(female[:, 7]))
# Set x-axis limits to be the same for both subplots
# (plt.xlim would only affect the most recently used subplot)
if not np.isnan(min_bmi) and not np.isnan(max_bmi):
    for ax in axs:
        ax.set_xlim(min_bmi, max_bmi)
# Adjust layout to prevent overlap
plt.tight_layout()
# Show the plot
plt.show()
To create a single figure with two histograms, one for male BMIs and the other for female BMIs, we used matplotlib.pyplot.subplots to create two subplots in one figure.
- We set the number of histogram bins to 20 for each subplot and make the x-axis limits identical for both subplots.
- We use np.nanmin() and np.nanmax() to calculate the minimum and maximum BMI values, respectively, while ignoring NaN values.
- Finally, we use plt.tight_layout() to adjust the layout to prevent overlap of labels and titles, and then display the plot using plt.show().
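Identical x-axis limits make the two histograms comparable by position, but with bins=20 each histogram still chooses its own bin edges from its own data range. A minimal sketch of one way to share the edges as well is shown below, on synthetic data. The arrays a and b are stand-ins for the two BMI columns, not the NHANES values.

```python
import numpy as np

# Synthetic stand-ins for the two BMI columns (not the real data)
rng = np.random.default_rng(0)
a = rng.normal(29, 6, 500)
b = rng.normal(30, 8, 500)
a[0] = np.nan  # simulate a missing value

# Common range across both groups, ignoring NaN
lo = min(np.nanmin(a), np.nanmin(b))
hi = max(np.nanmax(a), np.nanmax(b))

# 21 edges define 20 bins shared by both histograms
edges = np.linspace(lo, hi, 21)
counts_a, _ = np.histogram(a[~np.isnan(a)], bins=edges)
counts_b, _ = np.histogram(b[~np.isnan(b)], bins=edges)
```

The same edges array can be passed as the bins argument of hist in both subplots, so that bar widths and positions line up.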
Comparing BMI Distributions between Genders
This section visualizes the distribution of BMI values for adult males and females through a box-and-whisker plot. The code utilizes matplotlib’s boxplot function to generate a comparative visualization, allowing for a straightforward comparison between the two genders.
# Remove NaN values from male and female BMI data
male_bmi_clean = male[~np.isnan(male[:, 7]), 7]
female_bmi_clean = female[~np.isnan(female[:, 7]), 7]
# Combine cleaned male and female BMI data into a list
bmi_data = [male_bmi_clean, female_bmi_clean]
# Create a horizontal box-and-whisker plot
plt.boxplot(bmi_data, labels=['Male', 'Female'], vert=False)
# Add title and labels
plt.title('BMI Distribution by Gender')
plt.xlabel('BMI')
plt.ylabel('Gender')
# Show the plot
plt.show()
We removed NaN values from the BMI data before creating the box-and-whisker plot. Had we passed the raw data without removing NaN values, the result would have been an incorrect or empty plot.
The medians for males and females (the central lines of the boxes) are almost identical, somewhere between 25 and 30.
The maximum values are also similar, both falling in the 65-70 range.
Overall, the summary points of the two box plots are close to each other. The exact values are computed in the next section.
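The effect of NaN values on the quantities a box plot is built from can be seen in a small sketch. The array below is hypothetical and only serves to show that ordinary aggregates propagate NaN, while their NaN-aware counterparts skip it.

```python
import numpy as np

# Hypothetical BMI-like values with one missing entry
x = np.array([22.5, 27.0, np.nan, 31.5, 40.0])

# The ordinary median is NaN as soon as one value is missing
plain_median = np.median(x)

# Five-number summary (min, Q1, median, Q3, max) ignoring NaN
summary = np.nanpercentile(x, [0, 25, 50, 75, 100])
print(plain_median, summary)
```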
Summary of BMI Distribution Aggregates
This section presents a comparative analysis of the basic numerical aggregates for male and female BMI distributions. Measures of location, dispersion, and shape are calculated and reported in a clear and readable format, providing insights into the characteristics of the BMI distributions for both genders.
# Compute aggregates for male BMI
male_mean = np.mean(male_bmi_clean)
male_median = np.median(male_bmi_clean)
male_min = np.min(male_bmi_clean)
male_max = np.max(male_bmi_clean)
male_std = np.std(male_bmi_clean)
male_iqr = np.percentile(male_bmi_clean, 75) - np.percentile(male_bmi_clean, 25)
male_skew = skew(male_bmi_clean)
# Compute aggregates for female BMI
female_mean = np.mean(female_bmi_clean)
female_median = np.median(female_bmi_clean)
female_min = np.min(female_bmi_clean)
female_max = np.max(female_bmi_clean)
female_std = np.std(female_bmi_clean)
female_iqr = np.percentile(female_bmi_clean, 75) - np.percentile(female_bmi_clean, 25)
female_skew = skew(female_bmi_clean)
# Report the aggregates
print("BMI Statistics:\n")
print("Female Male")
print(f"{female_mean:.2f} {male_mean:.2f} Mean")
print(f"{female_median:.2f} {male_median:.2f} Median")
print(f"{female_min:.2f} {male_min:.2f} Min")
print(f"{female_max:.2f} {male_max:.2f} Max")
print(f"{female_std:.2f} {male_std:.2f} Std")
print(f"{female_iqr:.2f} {male_iqr:.2f} IQR")
print(f"{female_skew:.2f} {male_skew:.2f} Skewness")
BMI Statistics:
Female Male
30.10 29.14 Mean
28.89 28.27 Median
14.20 14.91 Min
67.04 66.50 Max
7.76 6.31 Std
10.01 7.73 IQR
0.92 0.97 Skewness
Based on the results obtained from the above we can describe the distributions of BMI for adult males and females as follows:
- Location (Measures of Central Tendency):
- The mean BMI for males is slightly lower than that for females (29.14 vs. 30.10), indicating that, on average, females have a slightly higher BMI compared to males.
- Similarly, the median BMI for females is slightly higher than that for males (28.89 vs. 28.27), suggesting that the distribution of BMI values for females is shifted slightly towards higher values compared to males.
- Dispersion (Measures of Spread):
- The standard deviation of BMI values for females is higher than that for males (7.76 vs. 6.31), indicating greater variability in BMI among females compared to males.
- The interquartile range (IQR) for females is also higher than that for males (10.01 vs. 7.73), further supporting the notion of greater dispersion in BMI values among females.
- Shape (Skewness):
- The skewness of the BMI distribution for females is positive (0.92), indicating a right-skewed distribution where the tail of the distribution extends towards higher BMI values.
- The skewness for males is also positive and slightly larger (0.97), so the male distribution is right-skewed as well, with a marginally more pronounced tail towards higher BMI values.
Overall, these results suggest that the distribution of BMI values for adult females has a slightly higher central tendency and greater variability than that for adult males, while both distributions are right-skewed to a similar degree. The higher mean, median, standard deviation, and interquartile range for females indicate a wider spread of BMI values and a tendency towards higher values. The positive skewness in both groups indicates a longer tail of participants with high BMI values.
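To make the sign convention of skewness concrete, the sketch below computes the moment-based coefficient (third central moment divided by the 1.5th power of the second) on small made-up samples. To my understanding this matches what scipy.stats.skew returns with its default settings; the helper name sample_skewness is hypothetical.

```python
import numpy as np

def sample_skewness(x):
    """Moment-based skewness: m3 / m2**1.5 (no bias correction)."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return np.mean(d**3) / np.mean(d**2) ** 1.5

# Made-up samples: a long right tail, its mirror image, and a symmetric one
right_tailed = np.array([1, 2, 2, 3, 3, 3, 10])
left_tailed = -right_tailed
symmetric = np.array([1, 2, 3, 4, 5])

print(sample_skewness(right_tailed),
      sample_skewness(left_tailed),
      sample_skewness(symmetric))
```

A positive value corresponds to a tail extending towards larger values, a negative value to a tail extending towards smaller values.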
Scatterplot Matrix for Male Biometric Measurements
This section shows the relationships between male heights, weights, waist circumferences, hip circumferences, and BMIs through a scatterplot matrix. The code defines a custom pairplot function built on matplotlib that works directly on NumPy matrices, without relying on pandas DataFrames or Seaborn.
def pairplot(X, labels, bins=21, alpha=0.1):
    """
    Draws a scatter plot matrix, given:
    * X - data matrix,
    * labels - list of column names
    """
    assert X.shape[1] == len(labels)
    k = X.shape[1]
    fig, axes = plt.subplots(nrows=k, ncols=k, sharex="col", sharey="row",
        figsize=(plt.rcParams["figure.figsize"][0], )*2)
    for i in range(k):
        for j in range(k):
            ax = axes[i, j]
            if i == j:  # diagonal
                ax.text(0.5, 0.5, labels[i], transform=ax.transAxes,
                    ha="center", va="center", size="x-small")
            else:
                ax.plot(X[:, j], X[:, i], ".", color="black", alpha=alpha)
# Define labels for the variables
labels = ['Height', 'Weight', 'Waist Circumference', 'Hip Circumference', 'BMI']
# Select columns from male data
male_heights = male[:, 1]
male_weights = male[:, 0]
male_waist_circumferences = male[:, 6]
male_hip_circumferences = male[:, 5]
male_bmis = male[:, 7]
# Create a NumPy array with selected columns
male_data = np.column_stack((male_heights, male_weights, male_waist_circumferences, male_hip_circumferences, male_bmis))
# Call the pairplot function with male_data and labels as arguments
pairplot(male_data, labels)
Generating Correlation Heatmaps
In this section, we calculate both Pearson’s and Spearman’s correlation coefficients for the male dataset and visualize them as correlation heatmaps. Because the male data contains NaN values, we first replace each NaN with the mean of its column, and then perform the correlation analysis on the preprocessed data.
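The column-mean replacement can be sketched on a toy matrix as follows. This version uses np.where, which is an equivalent formulation chosen here for illustration rather than the call used in this report; the matrix X is made up.

```python
import numpy as np

# Toy matrix with one missing value in each column (illustrative only)
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan]])

# Column means computed while ignoring NaN: [2.0, 15.0]
col_means = np.nanmean(X, axis=0)

# Where X is NaN take the column mean (broadcast across rows), else keep X
X_filled = np.where(np.isnan(X), col_means, X)
print(X_filled)
```

Mean imputation keeps every row but adds points at the column centre, which can pull correlations slightly towards zero; dropping incomplete rows is a common alternative.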
def corrheatmap(R, labels):
    """
    Draws a correlation heat map, given:
    * R - matrix of correlation coefficients for all variable pairs,
    * labels - list of column names
    """
    assert R.shape[0] == R.shape[1] and R.shape[0] == len(labels)
    k = R.shape[0]
    # plot the heat map using a custom colour palette
    # (correlations are in [-1, 1])
    plt.imshow(R, cmap=plt.colormaps.get_cmap("RdBu"), vmin=-1, vmax=1)
    # add text labels
    for i in range(k):
        for j in range(k):
            plt.text(i, j, f"{R[i, j]:.2f}", ha="center", va="center",
                color="black" if np.abs(R[i, j])<0.5 else "white")
    plt.xticks(np.arange(k), labels=labels, rotation=30)
    plt.tick_params(axis="x", which="both",
        labelbottom=True, labeltop=False, bottom=False, top=False)
    plt.yticks(np.arange(k), labels=labels)
    plt.tick_params(axis="y", which="both",
        labelleft=True, labelright=False, left=False, right=False)
    plt.grid(False)
def corrheatmapr(R, labels):
    """
    Draws a correlation heat map, given:
    * R - matrix of correlation coefficients for all variable pairs,
    * labels - list of column names
    """
    assert R.shape[0] == R.shape[1] and R.shape[0] == len(labels)
    k = R.shape[0]
    # plot the heat map using a custom colour palette
    # (correlations are in [-1, 1])
    plt.imshow(R, cmap=plt.colormaps.get_cmap("RdBu"), vmin=-1, vmax=1)
    # add text labels
    for i in range(k):
        for j in range(k):
            plt.text(i, j, f"{R[i, j]:.2f}", ha="center", va="center",
                color="black" if np.abs(R[i, j])<0.5 else "white")
    plt.xticks(np.arange(k), labels=labels, rotation=30)
    plt.tick_params(axis="x", which="both",
        labelbottom=True, labeltop=False, bottom=False, top=False)
    plt.yticks(np.arange(k), labels=labels)
    plt.tick_params(axis="y", which="both",
        labelleft=False, labelright=True, left=False, right=False)
    plt.grid(False)
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import spearmanr
# Define the labels variable
labels = ['Weight', 'Standing Height', 'Hip Circumference', 'Waist Circumference', 'BMI']
# Select the columns from male data based on labels
cols = [0, 1, 5, 6, 7]  # Adjust column indices if needed
# Replace NaN values with the mean of each column
male_cleaned = np.nan_to_num(male, nan=np.nanmean(male, axis=0))
# Ensure male_cleaned is a 2D array
male_cleaned = np.atleast_2d(male_cleaned)
# Calculate Pearson's correlation coefficients for selected columns of male_cleaned data
plt.subplot(1,2,1)
R = np.corrcoef(male_cleaned[:, cols], rowvar=False)
corrheatmap(R, labels)
plt.title("Pearson's r")
# Calculate Spearman's correlation coefficients for selected columns of male_cleaned data
plt.subplot(1,2,2)
rho, _ = spearmanr(male_cleaned[:, cols], axis=0)
corrheatmapr(rho, labels)
plt.title("Spearman's ρ")
plt.tight_layout()
plt.show()
Based on the correlation analysis, weight exhibits a high correlation with hip circumference, waist circumference, and BMI. This suggests that individuals with higher weights tend to have larger hip and waist circumferences, as well as higher BMI values.
On the other hand, standing height shows a low correlation with hip circumference and waist circumference. While there may be some association between standing height and hip/waist circumferences, it is not as strong or consistent as the relationship observed with weight.
Hip circumference and waist circumference are closely correlated with each other, and both are closely correlated with BMI.
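The difference between the two coefficients can be illustrated on synthetic data: Pearson's r measures linear association, while Spearman's ρ is Pearson's r applied to ranks and so captures any monotonic relationship. The sketch below is a minimal illustration that assumes no tied values (the simple argsort-based ranking does not average ties); x and y are made up.

```python
import numpy as np

# A monotonic but strongly non-linear relationship (synthetic)
x = np.linspace(0, 5, 50)
y = np.exp(x)

# Pearson's r on the raw values
pearson_r = np.corrcoef(x, y)[0, 1]

# Ranks via double argsort (valid only when there are no ties)
rank_x = np.argsort(np.argsort(x))
rank_y = np.argsort(np.argsort(y))

# Spearman's rho = Pearson's r of the ranks
spearman_rho = np.corrcoef(rank_x, rank_y)[0, 1]
print(pearson_r, spearman_rho)
```

Here ρ equals 1 up to rounding because y increases whenever x does, while r is visibly below 1 because the relationship is not linear.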
Conclusion
Through this analysis, we gained valuable insights into the relationship between anthropometric measurements and BMI in the NHANES adult male population. The correlation heatmaps provided a visual representation of the correlations, aiding in understanding the complex interplay between different variables. By handling missing values and conducting comprehensive correlation analysis, we have laid the foundation for further exploratory data analysis and hypothesis testing in future research studies.