Natality by State, 2007-2016

Nassika Dabel | June 5, 2018

Background

The topic for my project is natality. I am interested in learning whether there is a correlation between the age and the education level of new mothers in the United States. Do women who pursue higher education wait longer to have children? The dataset is sourced from "Natality by State, 2007-2016". It includes the education, age, and race of mothers who gave birth, as well as the birth weights of their newborns, separated by state.

Data

I am interested in seeing whether there are correlations between the following variables: a) the education level and the age of the mothers, and b) the age of the mothers and their newborns' birth weights.

In [122]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
%matplotlib inline
df = pd.read_csv("natality_2007_2016.csv")
print(df.describe())
        State_Code  Education_Code        Births  Average_Birth_Weight  \
count  2115.000000     1966.000000   2115.000000           2115.000000   
mean     28.496454        4.362665   1864.778251           3233.222799   
std      15.637702        2.164297   5642.721592            133.124443   
min       1.000000        1.000000     10.000000           2410.200000   
25%      16.000000        3.000000     45.000000           3155.995000   
50%      29.000000        4.000000    193.000000           3229.810000   
75%      41.000000        6.000000   1087.000000           3322.285000   
max      56.000000        8.000000  95429.000000           3876.000000   

       Average_Age_of_Mother  
count            2115.000000  
mean               29.088435  
std                 3.059544  
min                20.660000  
25%                26.880000  
50%                29.330000  
75%                31.360000  
max                36.120000  
In [123]:
# Columns used in the scatter plot
birth_weight = df["Average_Birth_Weight"]
age_mom = df["Average_Age_of_Mother"]

plt.style.use("classic")
fig1 = plt.figure()
plt.scatter(age_mom, birth_weight)
plt.title("Avg Age of Mom vs. Avg Birth Weight (US)")
plt.xlabel("Average Age of Mother")
plt.ylabel("Average Birth Weight")
fig1.savefig("fig1.png")  # save before plt.show(), while the figure is still open
plt.show()
In [124]:
df = df[["Average_Age_of_Mother", "Average_Birth_Weight"]].dropna()
df.head()
Out[124]:
   Average_Age_of_Mother  Average_Birth_Weight
0                  24.96               3049.36
1                  24.69               3144.98
2                  26.06               3209.97
3                  28.33               3431.08
4                  30.64               3366.45
In [125]:
model = smf.ols(formula="np.log(Average_Birth_Weight) ~ Average_Age_of_Mother", data=df)
est = model.fit()
est.summary()
Out[125]:
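                                 OLS Regression Results
========================================================================================
Dep. Variable:     np.log(Average_Birth_Weight)   R-squared:                       0.044
Model:                                      OLS   Adj. R-squared:                  0.044
Method:                           Least Squares   F-statistic:                     98.38
Date:                          Sat, 16 Jun 2018   Prob (F-statistic):           1.07e-22
Time:                                  15:21:56   Log-Likelihood:                 3777.6
No. Observations:                          2115   AIC:                            -7551.
Df Residuals:                              2113   BIC:                            -7540.
Df Model:                                     1
Covariance Type:                      nonrobust
========================================================================================
                             coef    std err          t      P>|t|     [0.025     0.975]
----------------------------------------------------------------------------------------
Intercept                  7.9972      0.008    947.902      0.000      7.981      8.014
Average_Age_of_Mother      0.0029      0.000      9.919      0.000      0.002      0.003
========================================================================================
Omnibus:                                201.672   Durbin-Watson:                   1.148
Prob(Omnibus):                            0.000   Jarque-Bera (JB):              628.473
Skew:                                    -0.479   Prob(JB):                    3.38e-137
Kurtosis:                                 5.493   Cond. No.                         280.
========================================================================================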
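
Because the dependent variable is the log of birth weight, the slope is easier to read as a percent change per year of age. The sketch below is only a back-of-the-envelope reading of the two coefficients printed above; the function name predicted_weight is my own, and the ages 25 and 35 are arbitrary example values.

```python
import math

# Coefficients copied from the OLS summary above.
intercept = 7.9972
slope = 0.0029  # change in log(birth weight) per extra year of average maternal age

# In a log-linear model, one extra unit of x multiplies y by exp(slope).
pct_per_year = (math.exp(slope) - 1) * 100


def predicted_weight(age):
    """Fitted average birth weight (in the dataset's units) at a given average age."""
    return math.exp(intercept + slope * age)


# Arbitrary example ages, chosen only to illustrate the size of the effect.
diff_25_35 = predicted_weight(35) - predicted_weight(25)

print("Percent change per year of age:", round(pct_per_year, 2))
print("Fitted weight at age 25:", round(predicted_weight(25), 1))
print("Fitted weight at age 35:", round(predicted_weight(35), 1))
print("Difference:", round(diff_25_35, 1))
```

Under this fit, each additional year of average maternal age goes with roughly a 0.3% higher average birth weight, which is small next to the spread already visible in the describe() output.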
In [126]:
# Reload the full dataset, because df was narrowed to two columns above
df = pd.read_csv("natality_2007_2016.csv")
edu_age = df.loc[:, ["Education", "Average_Age_of_Mother"]]

# Draw on an explicit Axes so that fig3 is the figure that holds the plot;
# without ax=..., boxplot(by=...) creates its own figure and fig3 stays empty
fig3, ax3 = plt.subplots()
edu_age.boxplot(column="Average_Age_of_Mother", by="Education", vert=False, ax=ax3)

fig3.suptitle("")  # drop the automatic "Boxplot grouped by Education" title
ax3.set_title("Average Age of Mom vs. Education Level (US)")
ax3.set_xlabel("Average Age of Mother")

fig3.savefig("fig3.png")
plt.show()

Analysis

The first hypothesis I would like to test is whether mothers in the United States who have more education are, on average, older than mothers who have less education. The null hypothesis is that there is no difference in average age between mothers of different education levels. The alternative hypothesis is that average age differs by education level, with more-educated mothers giving birth at older ages.
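
The group sizes, means, and variances that a difference-of-means test needs could be read directly off the data with a groupby. The sketch below uses a small made-up table rather than the real file: the column names match the dataset, but every value in toy is invented for illustration, and "Master's degree" is a placeholder label that may not match the dataset's exact wording.

```python
import pandas as pd

# Made-up rows for illustration only; the real values come from natality_2007_2016.csv.
toy = pd.DataFrame({
    "Education": [
        "8th grade or less", "8th grade or less",
        "9th through 12th grade with no diploma",
        "9th through 12th grade with no diploma",
        "Master's degree", "Master's degree",  # placeholder label
    ],
    "Average_Age_of_Mother": [28.0, 29.0, 24.0, 25.0, 32.0, 34.0],
})

# One row per education level: group size, mean age, and sample variance of age.
group_stats = (
    toy.groupby("Education")["Average_Age_of_Mother"]
    .agg(["count", "mean", "var"])
)
print(group_stats)
```

Running the same groupby on the full DataFrame would give one row per education level, and any two of those rows could then be compared in a difference-of-means test.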

The second hypothesis I would like to test is whether the age of a mother has any effect on her baby's birth weight. The null hypothesis is that there is no correlation between mothers' ages and their babies' birth weights. The alternative hypothesis is that older mothers' newborns have lower birth weights than those of younger mothers.
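
The strength of a correlation like this can be summarized with the Pearson coefficient r. As a minimal sketch, the helper below (pearson_r is my own name, not a library function) computes r for the five rows shown in the df.head() output above. Five rows are far too few to draw any conclusion from, so this only illustrates the calculation.

```python
import math

# First five rows of the cleaned DataFrame, copied from the df.head() output above.
ages = [24.96, 24.69, 26.06, 28.33, 30.64]
weights = [3049.36, 3144.98, 3209.97, 3431.08, 3366.45]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


r_head = pearson_r(ages, weights)
print("r on the five displayed rows:", round(r_head, 3))

# For a one-predictor OLS fit, R-squared equals r squared, so the reported
# R-squared of 0.044 implies |r| of about 0.21 between age and log birth weight.
r_from_r2 = math.sqrt(0.044)
print("|r| implied by R-squared = 0.044:", round(r_from_r2, 2))
```

The gap between the two numbers shows why a handful of rows can be misleading: across all 2115 observations the relationship is much weaker than these five rows suggest.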

In [127]:
# Hypothesis testing - difference of means (two-sample Z-test)

# High school diploma completed: hard-coded mean, variance (3.1), and group size
age_hs_mean = 26
age_hs_sd = np.sqrt(3.1)
age_hs_n = 284

# Master's degree completed: hard-coded mean, variance (2.2), and group size
age_ms_mean = 33
age_ms_sd = np.sqrt(2.2)
age_ms_n = 222

# Simulated normal samples drawn from those parameters, printed for reference only;
# the Z-score below uses the hard-coded parameters, not these samples
sample_age_hs = np.random.normal(loc=age_hs_mean, scale=age_hs_sd, size=age_hs_n)
print("Age (High School Diploma)")
print("Observations:", len(sample_age_hs))
print("Mean:", np.mean(sample_age_hs))
print("Standard Deviation", np.std(sample_age_hs))
print("Variance", np.std(sample_age_hs)**2)
print("")
sample_age_ms = np.random.normal(loc=age_ms_mean, scale=age_ms_sd, size=age_ms_n)
print("Age (Master's Degree)")
print("Observations:", len(sample_age_ms))
print("Mean:", np.mean(sample_age_ms))
print("")
print("Standard Deviation", np.std(sample_age_ms))
print("Variance", np.std(sample_age_ms)**2)

# Observed difference in means versus the difference expected under the null (0)
obs_diff = age_hs_mean - age_ms_mean
exp_diff = 0

# Standard error of the difference: sqrt(var_hs / n_hs + var_ms / n_ms)
var_hs = (age_hs_sd**2) / age_hs_n
var_ms = (age_ms_sd**2) / age_ms_n
std_err = np.sqrt(var_hs + var_ms)
print("Sample difference:", obs_diff)
print("Expected population difference:", exp_diff)
print("Standard Error:", std_err)

# Z-score: how many standard errors the observed difference lies from the expected one
Z = (obs_diff - exp_diff) / std_err
print("")
print("Z-score:", Z)
Age (High School Diploma)
Observations: 284
Mean: 26.1482978011
Standard Deviation 1.74095292643
Variance 3.03091709206

Age (Master's Degree)
Observations: 222
Mean: 32.8716839635

Standard Deviation 1.6573605576
Variance 2.7468440179
Sample difference: -7
Expected population difference: 0
Standard Error: 0.144310092744

Z-score: -48.5066558196
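
To turn the Z-score into a decision about the null hypothesis, it can be converted to a two-sided p-value. The sketch below repeats the calculation from the cell above using only the standard library; the function name z_test_from_summary is my own, and the inputs are the same hard-coded means, variances, and group sizes.

```python
import math


def z_test_from_summary(mean1, var1, n1, mean2, var2, n2, expected_diff=0.0):
    """Two-sample Z-test from summary statistics.

    Returns the Z-score, the standard error of the difference,
    and the two-sided p-value under a standard normal distribution.
    """
    std_err = math.sqrt(var1 / n1 + var2 / n2)
    z = ((mean1 - mean2) - expected_diff) / std_err
    # Two-sided tail probability of a standard normal: P(|N| > |z|) = erfc(|z| / sqrt(2))
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, std_err, p_two_sided


# Same hard-coded inputs as the cell above (high school diploma vs. master's degree).
z, se, p = z_test_from_summary(26, 3.1, 284, 33, 2.2, 222)
print("Z-score:", z)
print("Standard Error:", se)
print("Two-sided p-value:", p)
```

A Z-score this far from zero gives a p-value too small to represent in floating point, so the null hypothesis of equal mean ages would be rejected at any conventional significance level, given these inputs.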

Conclusion

For my first hypothesis, I found that education level and the age of the mothers were correlated in this dataset: mothers with more education were, on average, older than mothers with less education. The one exception was the "8th grade or less" category, which showed an older age range than the "9th through 12th grade with no diploma" category.

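For my second hypothesis, I found only a weak relationship between the age of the mothers and the birth weights of their newborns in this dataset. The regression slope is positive and statistically significant (p < 0.001), but with an R-squared of 0.044, the mother's average age explains only about 4% of the variation in log birth weight.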

The most challenging aspect of working with this dataset was that many of its variables are non-numeric, which made it difficult to pose meaningful hypotheses from the numeric variables provided. Only one of my hypotheses produced a plot from which I could discern any actual meaning.