Computer and Internet Use, from 1994 to 2015

Dalton Grady
June 5, 2018

Background

My project examines computer and internet use in the United States from 1994 to 2015.

From studying this data I hope to learn how many people in the US have internet access in their homes today compared with previous years, and by how much that number has changed over time.

The data come from the National Telecommunications and Information Administration (NTIA). The variables include whether the respondent is a householder, whether they are over 15 years old, whether they use the internet at home, school, or work, whether they have wired high-speed internet service, and whether they work online. The survey also records reasons why a household may not be online, such as being able to use the internet elsewhere, not needing it at home, or finding it too expensive.

Data

From the data provided I am most interested in the relationship between the variables, because the number of individuals who have internet access scales with the total number of individuals in each state and in the country as a whole.

The variables I am focusing on are the total home internet users over the years in the whole US, as well as in California and Massachusetts. I am also comparing the total number of white individuals and black individuals who have internet access at home.

In [103]:
#Importing the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.axes as ax
import statsmodels.formula.api as smf
%matplotlib inline
In [104]:
#Slicing out the variable for only homeInternetUser & providing the statistics on it
df =  pd.read_csv( "ntia_computer_usage.csv" )

#Calculating the total number of households
house_holders = df[df['variable'] == 'isHouseholder']
data_house_holders = house_holders.loc[:, ['usCount']] 
total_house_holders = data_house_holders.iat[12,0]

#Calculating the dataset for homeInternetUsers
home_internet_users = df[df['variable'] == 'homeInternetUser']
data_home = home_internet_users.loc[:, ['dataset','usCount', 'CACount','MACount', 'raceWhiteCount', 'raceBlackCount']]

print( data_home )
print( data_home.describe( ) )
    dataset    usCount   CACount  MACount  raceWhiteCount  raceBlackCount
17   Dec-98   57702119   7363861  1552784        49332127         2947298
33   Aug-00   94362619  11576973  2435359        77471397         6075186
53   Sep-01  116352818  13801785  2943721        93023223         8633104
75   Oct-03  129643445  15852182  3279637       101510034        10011265
119  Oct-10  190020533  22860497  4706513       135822107        18434135
149  Jul-11  187950363  22144657  4468574       134120815        18042261
174  Oct-12  205499861  24609471  4809220       140908417        20738766
217  Jul-13  195520693  23759901  4695527       133331722        19263895
278  Jul-15  206763854  24783503  4662113       137281081        21261429
            usCount       CACount       MACount  raceWhiteCount  \
count  9.000000e+00  9.000000e+00  9.000000e+00    9.000000e+00   
mean   1.537574e+08  1.852809e+07  3.728161e+06    1.114223e+08   
std    5.528365e+07  6.497990e+06  1.209455e+06    3.273633e+07   
min    5.770212e+07  7.363861e+06  1.552784e+06    4.933213e+07   
25%    1.163528e+08  1.380178e+07  2.943721e+06    9.302322e+07   
50%    1.879504e+08  2.214466e+07  4.468574e+06    1.333317e+08   
75%    1.955207e+08  2.375990e+07  4.695527e+06    1.358221e+08   
max    2.067639e+08  2.478350e+07  4.809220e+06    1.409084e+08   

       raceBlackCount  
count    9.000000e+00  
mean     1.393415e+07  
std      6.995434e+06  
min      2.947298e+06  
25%      8.633104e+06  
50%      1.804226e+07  
75%      1.926390e+07  
max      2.126143e+07  
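
One way to read the table above is to compare the first and last surveys directly. The sketch below is a rough illustration rather than part of the original analysis: it copies the Dec-98 and Jul-15 rows from the printed data_home output into two hypothetical Series (first_survey and last_survey) and computes how many times larger each count became.

```python
import pandas as pd

# Counts copied from the Dec-98 and Jul-15 rows of the data_home output above.
# first_survey and last_survey are hypothetical names used only in this sketch.
first_survey = pd.Series({
    "usCount": 57702119,
    "CACount": 7363861,
    "MACount": 1552784,
    "raceWhiteCount": 49332127,
    "raceBlackCount": 2947298,
})
last_survey = pd.Series({
    "usCount": 206763854,
    "CACount": 24783503,
    "MACount": 4662113,
    "raceWhiteCount": 137281081,
    "raceBlackCount": 21261429,
})

# Ratio of the last count to the first count for each column.
growth_factor = last_survey / first_survey

# The same comparison expressed as a percent change.
percent_change = (growth_factor - 1) * 100

print(growth_factor.round(2))
print(percent_change.round(1))
```

Comparing ratios instead of raw counts puts groups of very different sizes on the same footing, which is the comparison the hypothesis later in this notebook is concerned with.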
In [105]:
#Histogram plot of total US households with Internet Access
figh_data_home = plt.figure( )
plt.hist( home_internet_users['usCount'], bins = 10 )
plt.title( "Frequency Distribution of Total Households with Internet Access")
plt.xlabel( "Amount of US Households (100 Million)" )
plt.ylabel( "Frequency" )
plt.show( )

#Histogram plot of total CA households with Internet Access
figh_data_home = plt.figure( )
plt.hist( home_internet_users['CACount'], bins = 10 )
plt.title( "Frequency Distribution of California Households with Internet Access")
plt.xlabel( "Amount of California Households" )
plt.ylabel( "Frequency" )
plt.show( )

#Histogram plot of total MA households with Internet Access
figh_data_home = plt.figure( )
plt.hist( home_internet_users['MACount'], bins = 10 )
plt.title( "Frequency Distribution of Massachusetts Households with Internet Access")
plt.xlabel( "Amount of Massachusetts Households" )
plt.ylabel( "Frequency" )
plt.show( )

#Histogram plot of total white individuals with Internet Access
figh_data_race = plt.figure( )
plt.hist( home_internet_users['raceWhiteCount'], bins = 9 )
plt.title( "Frequency Distribution of White Individuals with Internet Access")
plt.xlabel( "Amount of White Individuals (100 Million)" )
plt.ylabel( "Frequency" )
plt.show( )

#Histogram plot of total black individuals with Internet Access
figh_data_race = plt.figure( )
plt.hist( home_internet_users['raceBlackCount'], bins = 9 )
plt.title( "Frequency Distribution of Black Individuals with Internet Access")
plt.xlabel( "Amount of Black Individuals (10 Million)" )
plt.ylabel( "Frequency" )
plt.show( )
In [106]:
#Scatterplot for total US Count 
figs_data_home = plt.figure( )
plt.scatter( home_internet_users['dataset'], home_internet_users['usCount'] )
plt.title( 'Amount of Individuals with Internet Access (US)' )
plt.xlabel('Time' )
plt.ylabel( 'Total US Count (100 Million)' )
plt.show()

#Scatterplot for total CA Count vs total MA Count
figs_data_home = plt.figure( )
plt.scatter( home_internet_users['dataset'], home_internet_users['CACount'], color = 'blue', label = "California" )
plt.scatter( home_internet_users['dataset'], home_internet_users['MACount'], color = 'orange', label = "Massachusetts" )
plt.title( 'Amount of California Households with Internet Access vs Massachusetts Households' )
plt.xlabel('Time' )
plt.ylabel( 'Total Household Count (10 Million)' )
plt.legend( )
plt.show( )

#Scatterplot for total White Count
figs_data_home = plt.figure( )
plt.scatter( home_internet_users['dataset'], home_internet_users['raceWhiteCount'] )
plt.title( 'Amount of White Individuals with Internet Access' )
plt.xlabel('Time' )
plt.ylabel( 'Total Individual Count (100 Million)' )
plt.show( )

#Scatterplot for total Black Count
figs_data_home = plt.figure( )
plt.scatter( home_internet_users['dataset'], home_internet_users['raceBlackCount'] )
plt.title( 'Amount of Black Individuals with Internet Access' )
plt.xlabel('Time' )
plt.ylabel( 'Total Individual Count (10 Million)' )
plt.show()
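
The dataset column holds text labels such as Dec-98, which matplotlib will most likely place as evenly spaced categories, so the seven-year gap between Oct-03 and Oct-10 is drawn the same as a one-year gap. The sketch below shows one possible way to turn the labels into real dates. It assumes every label follows the month-abbreviation and two-digit-year pattern seen in the table, and that month names are parsed in an English locale. The names labels and survey_dates are hypothetical.

```python
import pandas as pd

# Survey labels copied from the dataset column printed earlier.
labels = pd.Series([
    "Dec-98", "Aug-00", "Sep-01", "Oct-03", "Oct-10",
    "Jul-11", "Oct-12", "Jul-13", "Jul-15",
])

# %b is the abbreviated month name and %y is the two-digit year.
# Each label becomes the first day of that month.
survey_dates = pd.to_datetime(labels, format="%b-%y")

# Gap in days between consecutive surveys, to show the uneven spacing.
gaps_in_days = survey_dates.diff().dt.days

print(survey_dates)
print(gaps_in_days)
```

Passing survey_dates as the x values of plt.scatter instead of the raw dataset column should space the points according to the actual time between surveys.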

Analysis

Hypothesis Testing

The question I would like to test is whether, over time, the growth in the number of white individuals who have internet access differs from the growth in the number of black individuals who have internet access.

The null hypothesis is that, over time, there is no significant difference between the increase in the percentage of black individuals who have internet access and the increase in the percentage of white individuals who have internet access. The alternative hypothesis is that, over time, the percentage of black individuals who have internet access rose more than the percentage of white individuals who have internet access.

Time trend formula

y = percent_households_internet, x = minority + time
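
A time trend of this kind could be sketched as follows. This is only an illustration under stated assumptions, not the method used in the cells below: it uses the white and black counts from the data_home table (not percentages of households, which are not in the table), takes logs so that the slope is a growth rate, and swaps in np.polyfit for a formula-based regression. The survey months and the names t, slope_white, and slope_black are hypothetical choices for this sketch.

```python
import numpy as np

# (year, month) of each survey, read off the dataset labels (Dec-98 ... Jul-15).
surveys = [(1998, 12), (2000, 8), (2001, 9), (2003, 10), (2010, 10),
           (2011, 7), (2012, 10), (2013, 7), (2015, 7)]

# Time in years since the first survey.
years = np.array([year + (month - 1) / 12 for year, month in surveys])
t = years - years[0]

# Counts copied from the data_home table.
white = np.array([49332127, 77471397, 93023223, 101510034, 135822107,
                  134120815, 140908417, 133331722, 137281081], dtype=float)
black = np.array([2947298, 6075186, 8633104, 10011265, 18434135,
                  18042261, 20738766, 19263895, 21261429], dtype=float)

# Fit log(count) = intercept + slope * t for each group.
# For degree 1, np.polyfit returns the slope first, then the intercept.
slope_white, intercept_white = np.polyfit(t, np.log(white), 1)
slope_black, intercept_black = np.polyfit(t, np.log(black), 1)

# Each slope approximates the average yearly growth rate in log points.
print("White log-growth per year:", slope_white)
print("Black log-growth per year:", slope_black)
```

Fitting each group separately keeps the sketch short. A single regression with a group indicator and a time interaction would be the closer match to the formula above, and would also give a standard error for the difference in slopes.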

In [107]:
#Hypothesis Testing Values for White
mean_white = data_home['raceWhiteCount'].mean( )
std_white = data_home['raceWhiteCount'].std( )
size_white = data_home['raceWhiteCount'].count( )


#Hypothesis Testing Values for Black
mean_black = data_home['raceBlackCount'].mean( )
std_black = data_home['raceBlackCount'].std( )
size_black = data_home['raceBlackCount'].count( )
In [108]:
#Creating the random sample
sample_ht_white = np.random.normal( loc = mean_white, scale = std_white, size = size_white )
sample_ht_black = np.random.normal( loc = mean_black, scale = std_black, size = size_black )

#Verifying sample
print( "White Individuals" )
print( "Observations:", len( sample_ht_white ) )
print( "Mean:", np.mean( sample_ht_white ) )
print( "Standard Deviation", np.std( sample_ht_white ) )
print( "Variance:", np.var( sample_ht_white ), '\n' )

print( "Black Individuals" )
print( "Observations:", len( sample_ht_black ) )
print( "Mean:", np.mean( sample_ht_black ) )
print( "Standard Deviation", np.std( sample_ht_black ) )
print( "Variance:", np.var( sample_ht_black ) )
White Individuals
Observations: 9
Mean: 110751262.7266687
Standard Deviation 37562627.120074965
Variance: 1410950956161791.0 

Black Individuals
Observations: 9
Mean: 14187907.062489817
Standard Deviation 6488659.476352369
Variance: 42102701800057.41
In [109]:
#Calculating Difference
obs_diff = mean_white - mean_black
exp_diff = 0
var_white = ( std_white ** 2 ) / size_white
var_black = ( std_black ** 2 ) / size_black
std_err = np.sqrt( var_white + var_black )
z = ( obs_diff - exp_diff ) / std_err

print( "Sample difference:", obs_diff )
print( "Expected population difference:", exp_diff )
print( "Standard Error:", std_err )
print( "Z-score:", z )

#The Z-score is greater than 1.96, so we reject the null hypothesis of equal mean counts at the 5% significance level

plt.scatter( home_internet_users['dataset'], home_internet_users['raceWhiteCount'], color = 'blue', label = "White"  )
plt.scatter( home_internet_users['dataset'], home_internet_users['raceBlackCount'], color = 'orange', label = "Black" )
plt.title( 'White vs Black Individuals who Have Internet Access' )
plt.xlabel('Time' )
plt.ylabel( 'Amount of Individuals (100 Million)' )
plt.legend( )
plt.show( )
Sample difference: 97488176.0
Expected population difference: 0
Standard Error: 11158470.140928881
Z-score: 8.736697304267254
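
The 1.96 cutoff corresponds to a two-sided test at the 5% level under a standard normal distribution. As a small sketch, the Z-score printed above can be converted to a p-value with the standard library alone. The helper name two_sided_p_value is hypothetical, and the normal approximation is itself an assumption: with only 9 observations per group, a t distribution would give a somewhat larger p-value.

```python
import math

def two_sided_p_value(z):
    """Two-sided p-value for a z statistic under a standard normal distribution."""
    # P(|Z| > |z|) equals erfc(|z| / sqrt(2)) for a standard normal Z.
    return math.erfc(abs(z) / math.sqrt(2))

# Z-score copied from the output above.
z = 8.736697304267254
p = two_sided_p_value(z)

print("Two-sided p-value:", p)
print("p-value at the 1.96 cutoff:", two_sided_p_value(1.96))
```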
In [110]:
ln_internet_access = plt.figure( )
plt.scatter( home_internet_users['dataset'], np.log( home_internet_users['raceWhiteCount'] ), color = 'blue', label = "White"  ) 
plt.scatter( home_internet_users['dataset'], np.log( home_internet_users['raceBlackCount'] ), color = 'orange', label = "Black"  ) 
plt.title( 'White vs Black Individuals who Have Internet Access' )
plt.xlabel('Time' )
plt.ylabel( 'LN Amount of Individuals' )
plt.legend( )
plt.show( )

Regression

I am interested in estimating a linear regression of the total number of white individuals who have internet access on the total number of black individuals who have internet access, with both counts in logs.

In [111]:
model = smf.ols( formula = 'np.log( raceWhiteCount ) ~ np.log( raceBlackCount )', data = data_home )
est = model.fit( )
est.summary ( ) 
C:\Users\dalto\Anaconda3\lib\site-packages\scipy\stats\stats.py:1394: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=9
  "anyway, n=%i" % int(n))
Out[111]:
OLS Regression Results

Dep. Variable:     np.log(raceWhiteCount)   R-squared:                0.991
Model:                                OLS   Adj. R-squared:           0.990
Method:                     Least Squares   F-statistic:              776.6
Date:                    Thu, 07 Jun 2018   Prob (F-statistic):    1.97e-08
Time:                            11:09:51   Log-Likelihood:          18.260
No. Observations:                       9   AIC:                     -32.52
Df Residuals:                           7   BIC:                     -32.13
Df Model:                               1
Covariance Type:                nonrobust

                            coef   std err        t    P>|t|   [0.025   0.975]
Intercept                10.0542     0.303   33.229    0.000    9.339   10.770
np.log(raceBlackCount)    0.5174     0.019   27.867    0.000    0.474    0.561

Omnibus:          1.850   Durbin-Watson:       0.873
Prob(Omnibus):    0.396   Jarque-Bera (JB):    0.911
Skew:            -0.385   Prob(JB):            0.634
Kurtosis:         1.645   Cond. No.             412.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
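
Because both sides of the regression are in logs, the slope can be read as an elasticity. The sketch below is an illustration that uses the rounded coefficients from the summary above. It predicts the white count from the Jul-15 black count and works out the percent change in the white count associated with a 10% rise in the black count. The function name predict_white is hypothetical.

```python
import math

# Rounded coefficients copied from the OLS summary above.
intercept = 10.0542
slope = 0.5174

def predict_white(black_count):
    """Fitted white count for a given black count under the log-log model."""
    return math.exp(intercept + slope * math.log(black_count))

# Jul-15 counts copied from the data_home table.
black_jul15 = 21261429
actual_white_jul15 = 137281081

predicted = predict_white(black_jul15)
relative_error = predicted / actual_white_jul15 - 1

# In a log-log model, multiplying x by 1.10 multiplies fitted y by 1.10 ** slope.
pct_change_white = (1.10 ** slope - 1) * 100

print("Predicted white count:", round(predicted))
print("Relative error vs actual:", relative_error)
print("Percent change in white count for a 10% rise in black count:", pct_change_white)
```

A slope below 1 means the white count rose less than proportionally as the black count rose, which is consistent with the black count growing faster in percentage terms over this period.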

Conclusion

From analyzing this topic I found a large difference between the number of white individuals and the number of black individuals who have had internet access over the years, and a similarly large difference between the totals for California and Massachusetts. The most interesting aspect of my analysis was that, even though more white individuals than black individuals have had internet access throughout the years, the number of black individuals with home internet access appears to have grown by a larger percentage than the number of white individuals with home internet access. The most challenging aspect of working with my dataset was that it contains a large amount of information across many variables, while I wanted to focus on only one specific variable, and slicing the data to obtain the correct variable was difficult.