Nassika Dabel | June 5, 2018
The topic of my project is natality. I want to know whether there is a correlation between the age and the education level of new mothers in the United States: do women who pursue higher education wait longer to have children? The dataset is sourced from "Natality by State, 2007-2016". It includes the education, age, and race of mothers who gave birth, as well as the birth weights of their newborns, separated by state.
I am interested in whether there are correlations between the following variables: a) the education level and the age of the mothers, and b) the age of the mothers and their newborns' birth weights.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
%matplotlib inline
df = pd.read_csv("natality_2007_2016.csv")
# Summary statistics for the numeric columns
print(df.describe())
births = df["Births"]
birth_weight = df["Average_Birth_Weight"]
age_mom = df["Average_Age_of_Mother"]
plt.style.use("classic")
fig1 = plt.figure()
plt.scatter(age_mom, birth_weight)
plt.title("Avg Age of Mom vs. Avg Birth Weight (US)")
plt.xlabel("Average Age of Mother")
plt.ylabel("Average Birth Weight")
plt.show()
fig1.savefig("fig1.png")
df = df[["Average_Age_of_Mother", "Average_Birth_Weight"]].dropna()
df.head()
model = smf.ols(formula="np.log(Average_Birth_Weight) ~ Average_Age_of_Mother", data=df)
est = model.fit()
est.summary()
df = pd.read_csv("natality_2007_2016.csv")
education_lev = df["Education"]
age_mom = df["Average_Age_of_Mother"]
#print(type(education_lev))
edu_age = df.loc[:,["Education", "Average_Age_of_Mother"]]
#print(edu_age.head())
#print(type(edu_age))
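# Draw the boxplot on an explicit figure so that savefig writes the plotted figure
fig3, ax3 = plt.subplots()
edu_age.boxplot(column="Average_Age_of_Mother", by="Education", vert=False, ax=ax3)
fig3.suptitle("")  # drop the automatic "Boxplot grouped by ..." super-title
ax3.set_title("Average Age of Mom vs. Education Level (US)")
ax3.set_xlabel("Average Age of Mother")
plt.show()
fig3.savefig("fig3.png")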
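The boxplot can be complemented with a per-group numeric summary. The sketch below uses a small made-up table in place of the real file; the column names match the dataset, but the rows and the education labels are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical rows standing in for natality_2007_2016.csv (values are made up)
toy = pd.DataFrame({
    "Education": [
        "High school graduate", "High school graduate",
        "Master's degree", "Master's degree",
    ],
    "Average_Age_of_Mother": [25.5, 26.5, 32.0, 34.0],
})

# Count, mean, and sample standard deviation of mother's age per education level
summary = toy.groupby("Education")["Average_Age_of_Mother"].agg(["count", "mean", "std"])
print(summary)
```

Running the same `groupby` on the real `edu_age` frame would give one row per education category, which makes it easier to compare group means than reading them off the plot.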
The first hypothesis I would like to test is whether mothers in the United States who have more education are, on average, older than mothers who have less. The null hypothesis is that there is little to no correlation between the age and the education level of the mothers. The alternative hypothesis is that there is a trend in which mothers with higher education levels give birth at older ages.
The second hypothesis I would like to test is whether the age of a mother has any effect on her baby's birth weight. The null hypothesis is that there is no correlation between mothers' ages and their babies' birth weights. The alternative hypothesis is that the newborns of older mothers have lower birth weights than those of younger mothers.
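One way to write the first hypothesis formally, assuming a one-sided alternative and a two-sample Z-test comparing the high school diploma (HS) and master's degree (MS) groups. The symbols $\mu$, $\bar{x}$, $s^2$, and $n$ are notation introduced here for illustration:

$$H_0:\ \mu_{MS} - \mu_{HS} = 0 \qquad H_1:\ \mu_{MS} - \mu_{HS} > 0$$

$$Z = \frac{(\bar{x}_{HS} - \bar{x}_{MS}) - 0}{\sqrt{\dfrac{s_{HS}^2}{n_{HS}} + \dfrac{s_{MS}^2}{n_{MS}}}}$$

With the difference taken as HS minus MS, a large negative Z favors the alternative.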
# Hypothesis testing - difference of means
# High school diploma completed
age_hs_mean = 26
age_hs_sd = np.sqrt(3.1)
age_hs_n = 284
# Master's degree completed
age_ms_mean = 33
age_ms_sd = np.sqrt(2.2)
age_ms_n = 222
sample_age_hs = np.random.normal(loc=age_hs_mean, scale=age_hs_sd, size=age_hs_n)
print("Age (High School Diploma)")
print("Observations:", len(sample_age_hs))
print("Mean:", np.mean(sample_age_hs))
print("Standard Deviation", np.std(sample_age_hs))
print("Variance", np.std(sample_age_hs)**2)
print("")
sample_age_ms = np.random.normal(loc=age_ms_mean, scale=age_ms_sd, size=age_ms_n)
print("Age (Master's Degree)")
print("Observations:", len(sample_age_ms))
print("Mean:", np.mean(sample_age_ms))
print("Standard Deviation", np.std(sample_age_ms))
print("Variance", np.std(sample_age_ms)**2)
print("")
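# Observed difference between the assumed group means (not the simulated samples)
obs_diff = age_hs_mean - age_ms_mean
# Difference expected under the null hypothesis
exp_diff = 0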
var_hs = (age_hs_sd**2) / age_hs_n
var_ms = (age_ms_sd**2) / age_ms_n
std_err = np.sqrt(var_hs + var_ms)
print("Sample difference:", obs_diff)
print("Expected population difference:", exp_diff)
print("Standard Error:", std_err)
Z = (obs_diff - exp_diff) / std_err
print("")
print("Z-score:", Z)
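A Z-score is easier to interpret once it is converted to a p-value. The following is a minimal sketch, assuming the same means, variances, and group sizes as above and a normal approximation; the helper names `two_sample_z` and `normal_cdf` are hypothetical and not part of the original analysis.

```python
import math

def two_sample_z(mean_a, var_a, n_a, mean_b, var_b, n_b, expected_diff=0.0):
    """Z statistic for the difference of two independent group means."""
    std_err = math.sqrt(var_a / n_a + var_b / n_b)
    return ((mean_a - mean_b) - expected_diff) / std_err

def normal_cdf(z):
    """Standard normal CDF, written with the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

# High school group (mean 26, variance 3.1, n 284) vs. master's group (mean 33, variance 2.2, n 222)
z = two_sample_z(26, 3.1, 284, 33, 2.2, 222)
p_lower = normal_cdf(z)  # probability of a Z this low or lower under the null
print("Z-score:", z)
print("Lower-tail p-value:", p_lower)
```

A lower-tail p-value far below 0.05 would mean that a difference this large is very unlikely under the null hypothesis of equal mean ages.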
After analyzing the data for my first hypothesis, I found that in this dataset the education level and the age of the mothers were correlated: mothers who pursued higher education were older than mothers with less education. The one exception was the "8th grade or less" category, which showed an older age range than the "9th through 12th grade with no diploma" category.
After analyzing the data for my second hypothesis, I found no correlation between the age of the mothers and the birth weights of their newborns in this dataset.
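A correlation claim like this can be quantified with a Pearson correlation coefficient. The sketch below runs on synthetic, independently generated values rather than the real dataset, so the numbers are hypothetical; with the real data, the two arrays would be the `Average_Age_of_Mother` and `Average_Birth_Weight` columns.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return np.corrcoef(x, y)[0, 1]

rng = np.random.default_rng(0)
# Synthetic stand-ins: ages and birth weights drawn independently of each other
age = rng.normal(loc=28, scale=3, size=500)
weight = rng.normal(loc=3300, scale=150, size=500)

r = pearson_r(age, weight)
print("Pearson r:", r)
```

A value of r close to 0 is consistent with no linear relationship, while values near -1 or 1 indicate a strong one.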
The most challenging aspect of working with this dataset was that many of its variables are non-numeric, which made it hard to pose meaningful hypotheses from the few numeric variables provided. Only one of my hypotheses produced a plot from which I could discern any real meaning.