Probability & Visualization
Topics
- working with series and data frames
- probability axioms, random variables
- distribution functions
- visualizing data and relationships
Materials
Datasets
- Bachelors’ Degrees by Major and Sex, Massachusetts 2015-2016, National Center for Education Statistics Integrated Postsecondary Education Data System
- Computer and Internet Use 1994-2015, National Telecommunications and Information Administration Digital Nation Data Explorer
- County Demographics 2016, US Census Bureau American Community Survey 5-Year Estimates
- National Survey on Drug Use and Health 2014-2015, Substance Abuse and Mental Health Services Administration Data Tables
- Natality 2007-2016, Centers for Disease Control and Prevention Natality Information: Live Births
- NBA Player Statistics 2017-2018 Regular Season, MySportsFeeds Sports Data API
- NBA Team Rosters 2017-2018 Regular Season, MySportsFeeds Sports Data API
Activity Instructions
- Download the dataset (above) that you would like to use for your project
- Confirm that the dataset is saved in the same directory as your notebook (in the bootcamp folder)
- Review the pandas documentation for .isnull(), .notnull(), and .fillna()
- Create a data frame object from your dataset
- Create a subset with three quantitative variables
- Answer the following in your notebook: Are your variables missing any observations? If so, how many?
- Compute the following descriptive statistics for each variable: mean, variance, and standard deviation
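The activity steps above can be sketched as follows. This is a minimal illustration, not a solution for any particular dataset: the data frame and its column names (population, median_age, median_income) are made up for the example, and in your notebook you would load your own file with pd.read_csv() instead.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a downloaded dataset. In the activity you would
# instead read your own file, e.g. with pd.read_csv().
df = pd.DataFrame({
    "county": ["A", "B", "C", "D", "E"],
    "population": [1200.0, 850.0, np.nan, 4300.0, 2100.0],
    "median_age": [34.5, 41.2, 38.0, np.nan, 29.9],
    "median_income": [52000.0, 47500.0, 61000.0, 58250.0, np.nan],
})

# Subset with three quantitative variables.
subset = df[["population", "median_age", "median_income"]]

# Number of missing observations per variable:
# .isnull() marks missing cells as True, and .sum() counts them by column.
missing_counts = subset.isnull().sum()

# Descriptive statistics for each variable. By default pandas skips missing
# values and computes the sample variance and standard deviation (ddof=1).
stats = pd.DataFrame({
    "mean": subset.mean(),
    "variance": subset.var(),
    "std": subset.std(),
})

print(missing_counts)
print(stats)
```

Whether to fill missing values with .fillna() or leave them out depends on the dataset. The sketch leaves them in place and relies on pandas skipping them.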
Bootcamp Project Part I
For the final project, you will create a Jupyter notebook and facilitate a short discussion with the group. The notebook should have the following sections:
- Background
- Data
- Analysis
- Conclusion
For today, you will complete the Background and Data sections. These should be brief, or in outline form.
Background
- What is the topic for your project?
- What do you hope to learn?
- Who or what is the source of your dataset?
- Broadly, what kind of information is included in the dataset?
Data
- Describe whether you are interested in a particular variable, or in a relationship between variables
- Choose two or three variables of interest
- Complete any data manipulation necessary so that you have a data frame with variables as columns and observations as rows
- Complete the following:
  - A table of summary statistics, including number of observations, min, max, mean, and standard deviation
  - A frequency distribution plot (see plt.hist()) for each variable
  - A scatterplot for two variables (see plt.scatter())
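One possible shape for the Data section deliverables is sketched below. The data are randomly generated and the variable names (height_cm, weight_kg) are hypothetical, so treat this as a template under those assumptions and not as results from any of the datasets listed above.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data generated for illustration only:
# variables as columns, observations as rows.
rng = np.random.default_rng(0)
height = rng.normal(200, 8, size=50)
weight = 0.9 * height - 80 + rng.normal(0, 5, size=50)
df = pd.DataFrame({"height_cm": height, "weight_kg": weight})

# Table of summary statistics, one row per variable.
summary = df.agg(["count", "min", "max", "mean", "std"]).T
print(summary)

# Frequency distribution plot for one variable (repeat for each variable).
plt.figure()
counts, bin_edges, _ = plt.hist(df["height_cm"], bins=10)
plt.xlabel("height_cm")
plt.ylabel("Frequency")
plt.title("Distribution of height_cm")

# Scatterplot for two variables.
plt.figure()
plt.scatter(df["height_cm"], df["weight_kg"])
plt.xlabel("height_cm")
plt.ylabel("weight_kg")
plt.title("weight_kg vs. height_cm")

# In a Jupyter notebook the figures render inline;
# in a plain script, call plt.show() to display them.
```

df.describe() produces a similar table with quartiles added. The explicit list passed to agg() is used here only so that the table matches the statistics requested above.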
Resources
Python
- Python Software Foundation
- PEP 8 Python Style Guide
- Python Data Science Handbook
- NumPy Documentation
- Pandas Documentation
- Matplotlib Documentation
Tools