This project examines the statistics of every NBA player in the 2017-2018 season. I hope to understand what attributes each player brings to their team. The dataset comes from MySportsFeeds and includes each player's name, age, height, weight, games played, and shot attempts and makes.
I am interested in the relationship between two variables: age and 2-point field goals made.
import pandas as pd
import numpy as np
df=pd.read_csv("nba_player_stats_2017_18.csv")
df.loc[:,["Age","Fg2PtMade"]]
df_sub = df[df["Age"] > 0]
df_sub.head()
Age=df['Age']
print('Age stats')
a=np.min(Age)
print('min=',a)
b=np.max(Age)
print('max=',b)
c=np.mean(Age)
print('mean=', c)
d=np.std(Age)
print('standard deviation=', d)
print('Fg2PtMade stats')
fg=df['Fg2PtMade']
e=np.min(fg)
print('min=',e)
f=np.max(fg)
print('max=',f)
g=np.mean(fg)
print('mean=', g)
h=np.std(fg)
print('standard deviation=', h)
print('FtMade stats')
ft=df['FtMade']
i=np.min(ft)
print('min=',i)
j=np.max(ft)
print('max=',j)
k=np.mean(ft)
print('mean=', k)
l=np.std(ft)
print('standard deviation=', l)
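A note on the standard deviations printed above: `np.std` divides by n by default (`ddof=0`, the population formula), while pandas' `Series.std()` divides by n - 1 by default (`ddof=1`, the sample formula). The following is a minimal sketch of the difference; the `ages` array is made up for illustration and is not taken from the dataset.

```python
import numpy as np

# Hypothetical ages, used only to illustrate the two conventions.
ages = np.array([19, 22, 25, 28, 31, 34], dtype=float)

pop_std = np.std(ages)             # ddof=0: divides by n
sample_std = np.std(ages, ddof=1)  # ddof=1: divides by n - 1

print("population std:", pop_std)
print("sample std:", sample_std)
```

With several hundred players the two values are nearly identical, so the choice matters little here, but it helps to know which one is being reported.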
import matplotlib as mpl
import matplotlib.pyplot as plt
age = df_sub["Age"]
age.head()
fig1 = plt.figure()
plt.hist(age)
plt.title("Age")
plt.show()
fig1.savefig('Age.png')
fg = df_sub["Fg2PtMade"]
fg.head()
fig2 = plt.figure()
plt.hist(fg)
plt.title("Fg2PtMade")
plt.show()
fig2.savefig('fg.png')
fig3 = plt.figure()
plt.scatter(age, fg)
plt.title("Scatter")
plt.xlabel("Age")
plt.ylabel("Fg2PtMade")
plt.show()
fig3.savefig('Scat.png')
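The scatter plot shows the relationship visually; a Pearson correlation coefficient would put a number on it. Below is a minimal sketch using `np.corrcoef`. The arrays `age_demo` and `fg_demo` are hypothetical stand-ins for `df_sub["Age"]` and `df_sub["Fg2PtMade"]`, not values from the dataset.

```python
import numpy as np

# Hypothetical values standing in for df_sub["Age"] and df_sub["Fg2PtMade"].
age_demo = np.array([20, 23, 26, 29, 32, 35], dtype=float)
fg_demo = np.array([310, 280, 295, 240, 200, 150], dtype=float)

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the Pearson correlation between the two variables.
r = np.corrcoef(age_demo, fg_demo)[0, 1]
print("Pearson r:", r)
```

A value near -1 or 1 indicates a strong linear relationship, and a value near 0 indicates a weak one.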
Among all NBA players, there are more free throws made than 2-pointers made.
FT = df['FtMade']
print("Observation FT:", sum(FT))
FG = df['Fg2PtMade']
print("Observation FG:", sum(FG))
import matplotlib.axes as ax
import matplotlib.pyplot as plt
%matplotlib inline
sd_fg = h  # h is already the standard deviation from np.std, so no square root is needed
n_fg = 40903
p_fg = np.random.normal(loc=g, scale=h, size=n_fg)
print("Field Goals Made")
print("Observation:", len(p_fg))
print("Mean:", g)
print("Standard Deviation:", sd_fg)
print("Variance:", np.std(p_fg)**2)
sd_ft = l  # l is already the standard deviation from np.std, so no square root is needed
n_ft = 71627
p_ft = np.random.normal(loc=k, scale=l, size=n_ft)
print("Free Throws Made")
print("Observation:", len(p_ft))
print("Mean:", k)
print("Standard Deviation:", l)
print("Variance:", np.std(p_ft)**2)
obs_diff = g-k
exp_diff = 0
vnfg = (h**2)/n_fg
vnft = (l**2)/n_ft
std_err = np.sqrt(vnfg + vnft)
print("Sample Difference:", obs_diff)
print("Expected Population Difference:", exp_diff)
print("Standard Error:", std_err)
z=(obs_diff - exp_diff)/std_err
print("z-score:", z)
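The z-score calculation above can be wrapped in a small helper that also reports a two-sided p-value. This is a sketch under stated assumptions, not the notebook's own method: the function name `two_sample_z` is hypothetical, `n1` and `n2` are assumed to be the number of observations in each sample, and the summary statistics passed in are made up rather than taken from the dataset.

```python
import math
from statistics import NormalDist

def two_sample_z(mean1, std1, n1, mean2, std2, n2, expected_diff=0.0):
    """Two-sample z statistic and two-sided p-value (hypothetical helper)."""
    # Standard error of the difference between two independent sample means.
    std_err = math.sqrt(std1**2 / n1 + std2**2 / n2)
    z = ((mean1 - mean2) - expected_diff) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical summary statistics, not taken from the dataset.
z_demo, p_demo = two_sample_z(130.0, 150.0, 500, 75.0, 100.0, 500)
print("z:", z_demo, "p:", p_demo)
print("reject at 5% level:", abs(z_demo) > 1.96)
```

Comparing `abs(z)` with 1.96 is the same decision rule as comparing the two-sided p-value with 0.05.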
stdn_data = np.random.randn(5000)
stdn = plt.figure()
plt.hist(stdn_data, bins=100)
plt.axvline(x=-1.96, color='gold')
plt.axvline(x=1.96, color='gold')
plt.title("Made Shots")
plt.show()
stdn.savefig("Age_Efficiency.png")
#Rejected the null hypothesis
import statsmodels.formula.api as smf
nba = pd.read_csv("nba_player_stats_2017_18.csv")
nba.head()
# Parenthesize each comparison: & binds more tightly than >
df_sub = df[(df["Age"] > 0) & (df["Weight"] > 0)]
#age = df_sub["Age"]
#age.head()
#weight = df_sub["Weight"]
#weight.head()
inc_edu = plt.figure()
plt.scatter(df_sub["Weight"], df_sub["Age"])
plt.title('Age vs Weight')
plt.xlabel('Weight')
plt.ylabel("Age")
plt.show()
df_sub['Weight'].head()
ln_inc_edu = plt.figure()
plt.scatter(df_sub['Age'], np.log(df_sub['Weight']))
np.log(df_sub['Weight']).head()
plt.title("Age vs log(Weight)")
plt.xlabel("Age")
plt.ylabel("log(Weight)")
plt.show()
model = smf.ols(formula="np.log(Weight) ~ Age",data=df_sub)
est = model.fit()
est.summary()
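To make the R-squared value in the regression summary more concrete, here is a minimal sketch of the same kind of fit, log(weight) regressed on age, done by hand with NumPy. The arrays `age_demo` and `weight_demo` are hypothetical and not taken from the dataset, and `np.polyfit` is swapped in for `smf.ols` only to keep the example self-contained.

```python
import numpy as np

# Hypothetical ages and weights (lb), not taken from the dataset.
age_demo = np.array([20, 22, 25, 27, 30, 33, 36], dtype=float)
weight_demo = np.array([210, 195, 240, 220, 205, 250, 215], dtype=float)

y = np.log(weight_demo)

# Least-squares line: log(weight) = intercept + slope * age
slope, intercept = np.polyfit(age_demo, y, 1)
fitted = intercept + slope * age_demo

# R-squared: share of the variance in log(weight) explained by age.
ss_res = np.sum((y - fitted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print("slope:", slope, "intercept:", intercept, "R^2:", r_squared)
```

Because the response is on a log scale, the slope is read as the approximate proportional change in weight per additional year of age. An R-squared close to 0 means age explains almost none of the variation in log(weight).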
The hypothesis test rejected the null hypothesis: among the players, more 2-pointers were made than free throws. For age and 2-point field goals made, the scatter plot suggested that older players made fewer shots. The linear regression of weight on age showed essentially no relationship between the two, as indicated by the low R-squared value. The most challenging part of working with the dataset was removing the zeros that had been entered in the weight and age columns.
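The zero-removal step mentioned above can be sketched with a boolean mask. The arrays below are hypothetical, with 0 standing in for a missing entry; the same pattern applies to DataFrame columns.

```python
import numpy as np

# Hypothetical columns; 0 stands for a missing entry.
age_demo = np.array([25, 0, 31, 22, 28])
weight_demo = np.array([220, 210, 0, 195, 240])

# Each comparison needs its own parentheses, because & binds
# more tightly than > in Python.
keep = (age_demo > 0) & (weight_demo > 0)

age_clean = age_demo[keep]
weight_clean = weight_demo[keep]
print(age_clean, weight_clean)
```

Only rows where both values are nonzero survive, so the two cleaned arrays stay aligned row for row.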