JulienBeaulieu

Descriptive Statistics

Describe the data we've collected using measures of center, measures of spread, shape of our distribution, and outliers. We can also use plots of our data to gain a better understanding.


Last updated 6 years ago


Types of data

Reminder:

Five-number summary: minimum, Q1, Q2, Q3, maximum.

Q2 = median.

Q1 = median of the first half of the data.

Q3 = median of the 2nd half of the data.

Range = max - min

Interquartile range: Q3 - Q1.
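The definitions above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `five_number_summary` and the sample values are made up for the example, and it assumes the convention that the median itself is left out of both halves when the number of data points is odd (other conventions exist and give slightly different quartiles).

```python
from statistics import median

def five_number_summary(values):
    """Return min, Q1, Q2, Q3, max using the 'median of each half' rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    # Assumption: with an odd count, the middle value belongs to neither half.
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return {
        "min": data[0],
        "q1": median(lower),
        "q2": median(data),
        "q3": median(upper),
        "max": data[-1],
    }

# Hypothetical sample data, not taken from the notes.
summary = five_number_summary([1, 3, 4, 6, 7, 8, 10, 12, 15])
data_range = summary["max"] - summary["min"]   # Range = max - min
iqr = summary["q3"] - summary["q1"]            # Interquartile range = Q3 - Q1
print(summary, data_range, iqr)
```
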

Drawn on a graph, these numbers form a boxplot.

Spread of a dataset

The most common measures of spread are the standard deviation and the variance.

Standard deviation: roughly how far, on average, each point lies from the mean of the points.

Variance: the average squared distance of each observation from the mean. How to calculate both:

  1. Find the mean, x-bar.

  2. Find the distance of each point from this mean: xi - x-bar.

  3. Square each of these distances: (xi - x-bar)^2.

  4. Take the average of the squared distances. This gives us the variance.

  5. Take the square root of the variance. This gives us the standard deviation.

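We initially squared the differences xi - x-bar so that they are all positive and do not cancel each other out when averaged. To "cancel" the squaring, we take the square root, which brings the result back to the original units.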
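As a rough sketch, the calculation can be written out step by step in Python, including the averaging of the squared differences. The function name `variance_and_std` and the sample values are invented for illustration, and the sketch assumes the population form of the variance (dividing by the number of points).

```python
import math

def variance_and_std(values):
    """Compute the population variance and standard deviation step by step."""
    n = len(values)
    mean = sum(values) / n                      # find the mean, x-bar
    deviations = [x - mean for x in values]     # xi - x-bar
    squared = [d ** 2 for d in deviations]      # (xi - x-bar)^2
    variance = sum(squared) / n                 # average of the squares
    std_dev = math.sqrt(variance)               # square root of the variance
    return variance, std_dev

# Hypothetical sample data.
variance, std_dev = variance_and_std([2, 4, 4, 4, 5, 5, 7, 9])
print(variance, std_dev)
```
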

Units

It's important when comparing data that the units are the same. For example, when measuring dollars, we use the standard deviation rather than the variance, since the standard deviation has the same units as the data ($).

The standard deviation is a measurement that has the same units as our original data, while the units of the variance are the square of the units in our original data. For example, if the units in our original data were dollars, then units of the standard deviation would also be dollars, while the units of the variance would be dollars squared.

Common use case: compare the standard deviations of different groups to find out which ones are more spread out.
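A small sketch of that comparison, using `pstdev` from Python's standard `statistics` module. The group names and values are hypothetical, and both groups are assumed to be measured in the same units.

```python
from statistics import pstdev

# Hypothetical groups measured in the same units (e.g. dollars).
groups = {
    "group_a": [48, 50, 52, 49, 51],
    "group_b": [20, 80, 50, 35, 65],
}

# Population standard deviation of each group.
spread = {name: pstdev(values) for name, values in groups.items()}

# The group with the largest standard deviation is the most spread out.
most_spread = max(spread, key=spread.get)
print(spread, most_spread)
```
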

Variance and Std Dev in Excel

To calculate the variance of a set of 10 values in a spreadsheet application, with our 10 data points in column A:

  1. Create a new column B by typing something like =A1-AVERAGE(A$1:A$10) in cell B1 and copying it down for all 10 rows. This gives the difference between each data point and the mean of all the data.

  2. Create a new column C holding the square of these differences, using the formula =B1^2 in cell C1 and copying it down for all rows.

  3. In the cell below this new column, cell C11, type =SUM(C1:C10). This adds up all the values in column C.

  4. In cell C12, divide this sum by the number of data points, in this case ten: =C11/10. Cell C12 now contains the variance of our 10 data points.

More detailed guidance on using spreadsheets like this may be included in a future lesson in your program.

In the same spreadsheet as above, to find the standard deviation of our same set of 10 data values, we would use another cell like C13 to take the square root of our variance measure, by typing in =sqrt(C12).
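To sanity-check a spreadsheet built this way, the same columns can be mirrored in Python. This is only a sketch: the ten values are made up, and the variable names simply echo the cells described above. Because the procedure divides by the number of data points, it matches the population variance (`pvariance`) and population standard deviation (`pstdev`) from the standard `statistics` module, not the sample versions that divide by n - 1.

```python
import math
from statistics import pvariance, pstdev

# Hypothetical contents of cells A1:A10.
column_a = [12, 15, 11, 19, 14, 16, 13, 18, 17, 15]

mean = sum(column_a) / len(column_a)
column_b = [x - mean for x in column_a]   # =A1-AVERAGE(A$1:A$10), copied down
column_c = [d ** 2 for d in column_b]     # =B1^2, copied down
c11 = sum(column_c)                       # =SUM(C1:C10)
c12 = c11 / len(column_a)                 # =C11/10, the variance
c13 = math.sqrt(c12)                      # =sqrt(C12), the standard deviation

print(c12, pvariance(column_a))
print(c13, pstdev(column_a))
```
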

Shape

Mode = highest bar of the histogram.
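One way to picture this: group the values into equal-width bins and pick the bin with the highest count. The sketch below assumes a hypothetical helper `histogram_mode`, an arbitrary bin width, and made-up data; if two bins tie, it returns the one encountered first.

```python
from collections import Counter

def histogram_mode(values, bin_width):
    """Return (bin_start, bin_end, count) for the tallest histogram bar."""
    # Map each value to the left edge of its bin and count values per bin.
    bins = Counter((v // bin_width) * bin_width for v in values)
    start, count = max(bins.items(), key=lambda item: item[1])
    return start, start + bin_width, count

# Hypothetical data with bins of width 10.
low, high, count = histogram_mode([1, 2, 2, 3, 11, 12, 13, 14, 15, 27], 10)
print(low, high, count)
```
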

Most common:

Real-life examples:

Left-skewed examples: age of death, asset price changes, GPA.

Right-skewed examples: amount of drugs in your blood over time, distribution of wealth, human athletic abilities.

Outliers

Understand the impact they have on our summary statistics.

If outliers are typos or mistakes, remove them.

If not, ask why they exist.

The median is the middle number and is largely unaffected by outliers.
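A quick illustration of that difference, using made-up numbers: adding a single extreme value moves the mean a lot, while the median barely changes.

```python
from statistics import mean, median

# Hypothetical values, then the same values with one extreme outlier added.
values = [30, 32, 35, 36, 40]
with_outlier = values + [400]

mean_shift = mean(with_outlier) - mean(values)        # large change
median_shift = median(with_outlier) - median(values)  # small change

print(mean(values), mean(with_outlier), mean_shift)
print(median(values), median(with_outlier), median_shift)
```
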

Usually the best way to see what's happening is a visual.

We have to be careful about how we share our results.