Dalton Grady
June 5, 2018
My project examines computer and internet use in the United States from 1994 to 2015.
By studying this data I hope to learn how many people in the US have internet access at home today compared with previous years, and by how much that number has changed over time.
The data comes from the National Telecommunications and Information Administration. Its variables include whether the respondent is a householder, whether they are over 15 years old, whether they use the internet at home, school, or work, whether they have wired high-speed internet service, and whether they work online. It also records reasons why a household may not be online, such as being able to use the internet elsewhere, not needing it at home, or finding it too expensive, among others.
From the data provided I am most interested in the relationship between the variables, because the number of individuals who have internet access scales with the total number of individuals in each state and in the country as a whole.
The variables I am focusing on are the total home internet users over the years in the whole US, as well as in California and Massachusetts. I am also looking at white and black individuals and comparing the total number of individuals with internet access at home between the two groups.
#Importing the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.axes as ax
import statsmodels.formula.api as smf
%matplotlib inline
#Loading the dataset; the homeInternetUser variable is sliced out and summarized below
df = pd.read_csv( "ntia_computer_usage.csv" )
#Calculating the total number of households
house_holders = df[df['variable'] == 'isHouseholder']
data_house_holders = house_holders.loc[:, ['usCount']]
total_house_holders = data_house_holders.iat[12,0]
#Calculating the dataset for homeInternetUsers
home_internet_users = df[df['variable'] == 'homeInternetUser']
data_home = home_internet_users.loc[:, ['dataset','usCount', 'CACount','MACount', 'raceWhiteCount', 'raceBlackCount']]
print( data_home )
print( data_home.describe( ) )
#Histogram plot of total US households with Internet Access
figh_data_home = plt.figure( )
plt.hist( home_internet_users['usCount'], bins = 10 )
plt.title( "Frequency Distribution of Total Households with Internet Access")
plt.xlabel( "Amount of US Households (100 Million)" )
plt.ylabel( "Frequency" )
plt.show( )
#Histogram plot of total CA households with Internet Access
figh_data_home = plt.figure( )
plt.hist( home_internet_users['CACount'], bins = 10 )
plt.title( "Frequency Distribution of California Households with Internet Access")
plt.xlabel( "Amount of California Households" )
plt.ylabel( "Frequency" )
plt.show( )
#Histogram plot of total MA households with Internet Access
figh_data_home = plt.figure( )
plt.hist( home_internet_users['MACount'], bins = 10 )
plt.title( "Frequency Distribution of Massachusetts Households with Internet Access")
plt.xlabel( "Amount of Massachusetts Households" )
plt.ylabel( "Frequency" )
plt.show( )
#Histogram plot of total white individuals with Internet Access
figh_data_race = plt.figure( )
plt.hist( home_internet_users['raceWhiteCount'], bins = 9 )
plt.title( "Frequency Distribution of White Individuals with Internet Access")
plt.xlabel( "Amount of White Individuals (100 Million)" )
plt.ylabel( "Frequency" )
plt.show( )
#Histogram plot of total black individuals with Internet Access
figh_data_race = plt.figure( )
plt.hist( home_internet_users['raceBlackCount'], bins = 9 )
plt.title( "Frequency Distribution of Black Individuals with Internet Access")
plt.xlabel( "Amount of Black Individuals (10 Million)" )
plt.ylabel( "Frequency" )
plt.show( )
#Scatterplot for total US Count
figs_data_home = plt.figure( )
plt.scatter( home_internet_users['dataset'], home_internet_users['usCount'] )
plt.title( 'Amount of Individuals with Internet Access (US)' )
plt.xlabel('Time' )
plt.ylabel( 'Total US Count (100 Million)' )
plt.show()
#Scatterplot for total CA Count vs total MA Count
figs_data_home = plt.figure( )
plt.scatter( home_internet_users['dataset'], home_internet_users['CACount'], color = 'blue', label = "California" )
plt.scatter( home_internet_users['dataset'], home_internet_users['MACount'], color = 'orange', label = "Massachusetts" )
plt.title( 'Amount of California Households with Internet Access vs Massachusetts Households' )
plt.xlabel('Time' )
plt.ylabel( 'Total Household Count (10 Million)' )
plt.legend( )
plt.show( )
#Scatterplot for total White Count
figs_data_home = plt.figure( )
plt.scatter( home_internet_users['dataset'], home_internet_users['raceWhiteCount'] )
plt.title( 'Amount of White Individuals with Internet Access' )
plt.xlabel('Time' )
plt.ylabel( 'Total Individual Count (100 Million)' )
plt.show( )
#Scatterplot for total Black Count
figs_data_home = plt.figure( )
plt.scatter( home_internet_users['dataset'], home_internet_users['raceBlackCount'] )
plt.title( 'Amount of Black Individuals with Internet Access' )
plt.xlabel('Time' )
plt.ylabel( 'Total Individual Count (10 Million)' )
plt.show()
The hypothesis I would like to test is whether, over time, the number of white individuals with internet access changed differently from the number of black individuals with internet access.
The null hypothesis is that, over time, there is no significant difference between the increase in the percentage of black individuals who had internet access and the increase in the percentage of white individuals. The alternative hypothesis is that, over time, the percentage of black individuals with internet access rose more than the percentage of white individuals with internet access.
y = percent_households_internet, x = minority + time
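Because the hypotheses above are stated in terms of percentage growth rather than raw counts, it may help to see how a growth rate can be computed for each group. The following is only a minimal sketch: the function name percent_growth and the numbers in it are made up for illustration and are not taken from the NTIA data.

```python
def percent_growth(counts):
    """Return the percentage change from the first value to the last value."""
    first, last = counts[0], counts[-1]
    return (last - first) / first * 100.0


# Hypothetical counts (made-up numbers, not taken from the NTIA data)
white_counts = [40.0, 60.0, 80.0]
black_counts = [4.0, 8.0, 12.0]

white_growth = percent_growth(white_counts)
black_growth = percent_growth(black_counts)

print("White growth (%):", white_growth)
print("Black growth (%):", black_growth)
print("Difference in growth (percentage points):", black_growth - white_growth)
```

In this made-up example the smaller group grows faster in percentage terms even though its raw counts stay lower, which is the kind of pattern the alternative hypothesis describes.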
#Hypothesis Testing Values for White
mean_white = data_home['raceWhiteCount'].mean( )
std_white = data_home['raceWhiteCount'].std( )
size_white = data_home['raceWhiteCount'].count( )
#Hypothesis Testing Values for Black
mean_black = data_home['raceBlackCount'].mean( )
std_black = data_home['raceBlackCount'].std( )
size_black = data_home['raceBlackCount'].count( )
#Creating the random sample
sample_ht_white = np.random.normal( loc = mean_white, scale = std_white, size = size_white )
sample_ht_black = np.random.normal( loc = mean_black, scale = std_black, size = size_black )
#Verifying sample
print( "White Individuals" )
print( "Observations:", len( sample_ht_white ) )
print( "Mean:", np.mean( sample_ht_white ) )
print( "Standard Deviation:", np.std( sample_ht_white ) )
print( "Variance:", np.var( sample_ht_white ), '\n' )
print( "Black Individuals" )
print( "Observations:", len( sample_ht_black ) )
print( "Mean:", np.mean( sample_ht_black ) )
print( "Standard Deviation:", np.std( sample_ht_black ) )
print( "Variance:", np.var( sample_ht_black ) )
#Calculating Difference
obs_diff = mean_white - mean_black
exp_diff = 0
var_white = ( std_white ** 2 ) / size_white
var_black = ( std_black ** 2 ) / size_black
std_err = np.sqrt( var_white + var_black )
z = ( obs_diff - exp_diff ) / std_err
print( "Sample difference:", obs_diff )
print( "Expected population difference:", exp_diff )
print( "Standard Error:", std_err )
print( "Z-score:", z )
#The Z-score is greater than 1.96, so we can reject the null hypothesis at the 95% confidence level
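The calculation above can also be wrapped in a small helper that returns a two-sided p-value alongside the Z-score. This is a sketch under stated assumptions, not the method used in this notebook: the function name two_sample_z and the summary statistics passed to it are hypothetical, and the p-value relies on the normal approximation.

```python
import math


def two_sample_z(mean_a, std_a, n_a, mean_b, std_b, n_b, expected_diff=0.0):
    """Return (z, two_sided_p) for the difference between two sample means."""
    # Standard error of the difference between the two means
    std_err = math.sqrt(std_a ** 2 / n_a + std_b ** 2 / n_b)
    z = ((mean_a - mean_b) - expected_diff) / std_err
    # Two-sided p-value under the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical summary statistics, chosen only to illustrate the call
z_demo, p_demo = two_sample_z(mean_a=10.0, std_a=3.0, n_a=9,
                              mean_b=6.0, std_b=4.0, n_b=16)
print("Z-score:", z_demo)
print("Two-sided p-value:", p_demo)
```

With only a handful of survey years per group, a t-based test may be a safer choice than the normal approximation used here.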
plt.scatter( home_internet_users['dataset'], home_internet_users['raceWhiteCount'], color = 'blue', label = "White" )
plt.scatter( home_internet_users['dataset'], home_internet_users['raceBlackCount'], color = 'orange', label = "Black" )
plt.title( 'White vs Black Individuals who Have Internet Access' )
plt.xlabel('Time' )
plt.ylabel( 'Amount of Individuals (100 Million)' )
plt.legend( )
plt.show( )
ln_internet_access = plt.figure( )
plt.scatter( home_internet_users['dataset'], np.log( home_internet_users['raceWhiteCount'] ), color = 'blue', label = "White" )
plt.scatter( home_internet_users['dataset'], np.log( home_internet_users['raceBlackCount'] ), color = 'orange', label = "Black" )
plt.title( 'White vs Black Individuals who Have Internet Access' )
plt.xlabel('Time' )
plt.ylabel( 'LN Amount of Individuals' )
plt.legend( )
plt.show( )
I am interested in estimating a linear regression of the total number of white individuals who have internet access on the total number of black individuals who have internet access.
model = smf.ols( formula = 'np.log( raceWhiteCount ) ~ np.log( raceBlackCount )', data = data_home )
est = model.fit( )
est.summary( )
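Since both sides of the regression above are logged, the slope can be read as an elasticity: roughly the percentage change in the white count associated with a one percent change in the black count. The sketch below illustrates this with a hand-written least squares fit on made-up data; the helper ols_slope_intercept and the numbers are assumptions for illustration and do not come from the dataset.

```python
import math


def ols_slope_intercept(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical counts constructed so that white = 3 * black ** 0.5
black_demo = [1.0, 4.0, 9.0, 16.0, 25.0]
white_demo = [3.0 * b ** 0.5 for b in black_demo]

slope, intercept = ols_slope_intercept([math.log(b) for b in black_demo],
                                       [math.log(w) for w in white_demo])
print("Slope (elasticity):", slope)
print("Intercept:", intercept)
```

Here the fitted slope recovers the exponent built into the made-up data, and the intercept recovers the log of the multiplier.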
From analyzing this topic I found that there was a large difference between the number of white individuals and the number of black individuals who had internet access over the years, and likewise between the total number of households in California and in Massachusetts. The most interesting aspect of my analysis was that, even though more white individuals have had internet access throughout the years, the percentage increase in black individuals who gained internet access appears to be greater than the percentage increase in white individuals with home internet access. The most challenging aspect of working with my dataset was that it contained a large amount of information, but I only wanted to focus on one specific variable out of the many provided, and slicing the data to obtain the correct variable was difficult.