Probability & Visualization

Jun 5, 2018, 9:00 AM

Topics

working with series and data frames
probability axioms, random variables
distribution functions
visualizing data and relationships

Materials

Datasets

Bachelors’ Degrees by Major and Sex, Massachusetts 2015-2016, National Center for Education Statistics Integrated Postsecondary Education Data System
Computer and Internet Use 1994-2015, National Telecommunications and Information Administration Digital Nation Data Explorer
County Demographics 2016, US Census Bureau American Community Survey 5-Year Estimates
National Survey on Drug Use and Health 2014-2015, Substance Abuse and Mental Health Services Administration Data Tables
Natality 2007-2016, Centers for Disease Control and Prevention Natality Information: Live Births
NBA Player Statistics 2017-2018 Regular Season, MySportsFeeds Sports Data API
NBA Team Rosters 2017-2018 Regular Season, MySportsFeeds Sports Data API

Activity Instructions

Download the dataset (above) that you would like to use for your project
Confirm that the dataset is saved in the same directory as your notebook (in the bootcamp folder)
Review the pandas documentation for .isnull(), .notnull(), and .fillna()
Create a data frame object from your dataset
Create a subset with three quantitative variables
Answer the following in your notebook: Are your variables missing any observations? If so, how many?
Compute the following descriptive statistics for each variable: mean, variance, and standard deviation
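
The missing-data check and descriptive statistics in the steps above might look roughly like the following sketch. The data frame, its column names, and its values are hypothetical stand-ins, not taken from any of the listed datasets; in the activity you would load your own file instead (for example with pd.read_csv()). Note that pandas computes the sample variance and standard deviation (dividing by n - 1) by default and skips missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a project dataset; replace with your own data,
# e.g. df = pd.read_csv("your_dataset.csv") for a CSV in the notebook folder.
df = pd.DataFrame({
    "county": ["A", "B", "C", "D", "E"],
    "population": [1200.0, 3400.0, np.nan, 2800.0, 1900.0],
    "median_age": [34.5, 41.2, 38.0, np.nan, 29.9],
    "median_income": [52000.0, 61000.0, 48000.0, 57000.0, 45000.0],
})

# Subset with three quantitative variables (column names are assumptions)
subset = df[["population", "median_age", "median_income"]]

# .isnull() marks missing cells as True; summing counts them per column
missing_counts = subset.isnull().sum()
print(missing_counts)

# Mean, variance, and standard deviation for each variable
stats = subset.agg(["mean", "var", "std"])
print(stats)
```

If a column has missing values, .fillna() can replace them, but whether to fill, drop, or keep them depends on the variable and is a judgment call for your project.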

Bootcamp Project Part I

For the final project, you will create a Jupyter notebook and facilitate a short discussion with the group. The notebook should have the following sections:

Background
Data
Analysis
Conclusion

For today, you will complete the Background and Data sections. These can be brief or in outline form.

Background

What is the topic for your project?
What do you hope to learn?
Who or what is the source of your dataset?
Broadly, what kind of information is included in the dataset?

Data

Describe whether you are interested in a particular variable, or in a relationship between variables
Choose two or three variables of interest
Complete any data manipulation necessary so that you have a data frame with variables as columns and observations as rows
Complete the following:
- A table of summary statistics, including number of observations, min, max, mean, and standard deviation
- A frequency distribution plot (see plt.hist()) for each variable
- A scatterplot for two variables (see plt.scatter())
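
As one possible starting point for the three items above, here is a minimal sketch using plt.hist() and plt.scatter(). The data frame, column names, and values are hypothetical and only for illustration; the bin count is an arbitrary choice you would adjust for your own data.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data; replace with the variables from your own dataset
df = pd.DataFrame({
    "points_per_game": [8.1, 12.4, 25.3, 5.6, 17.9, 10.2],
    "minutes_per_game": [18.0, 26.5, 36.1, 12.3, 31.0, 22.4],
})

# Summary table: keep only the requested rows from .describe()
summary = df.describe().loc[["count", "min", "max", "mean", "std"]]
print(summary)

# One frequency distribution plot per variable
for column in df.columns:
    plt.figure()
    plt.hist(df[column], bins=5)
    plt.xlabel(column)
    plt.ylabel("frequency")
    plt.title("Distribution of " + column)

# Scatterplot of the relationship between the two variables
plt.figure()
plt.scatter(df["minutes_per_game"], df["points_per_game"])
plt.xlabel("minutes_per_game")
plt.ylabel("points_per_game")
plt.title("points_per_game vs. minutes_per_game")
```

In a Jupyter notebook the figures typically render below the cell; in a plain script you would add plt.show() at the end.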

NaLette Brodnax

Python & Statistics Bootcamp - Summer 2018