Sampling Distributions

A sampling distribution is defined as the distribution of a statistic.

We found that for proportions (and also for means, since a proportion is just the mean of 1 and 0 values), the following characteristics hold.

  1. The sampling distribution is centered on the original parameter value.

  2. The variance of the sampling distribution shrinks as the sample size grows. Specifically, the variance of the sampling distribution of the mean equals the variance of the original data divided by the sample size: Var(x̄) = σ² / n. This is always true for the variance of a sample mean (see the simulation sketch after this list).
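
A minimal simulation sketch of property 2, reusing the gamma-distributed pop_data from the exercise further down (the sample size of 100 and the 10,000 repetitions are choices for illustration): the variance of the simulated sample means should land close to the population variance divided by the sample size.

import numpy as np

np.random.seed(42)
pop_data = np.random.gamma(1, 100, 3000)

n = 100
means = [np.random.choice(pop_data, n, replace=True).mean() for _ in range(10000)]

print(np.var(means))       # variance of the sampling distribution
print(pop_data.var() / n)  # population variance divided by the sample size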

Difference between parameters and statistics

When we look at sampling distributions, we work with statistics, not parameters. Parameters are values computed from the entire population, so they do not change.

Statistics, however, are computed from samples and therefore vary from sample to sample, which is exactly what the sampling distribution describes.

Law of large numbers

As the sample size increases, the sample mean gets closer and closer to the population mean; more generally, a sample statistic converges toward the parameter it estimates.
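
A minimal sketch of the law of large numbers, again assuming the gamma-distributed pop_data from the exercise below: as the sample size grows, the sample mean settles toward the population mean.

import numpy as np

np.random.seed(42)
pop_data = np.random.gamma(1, 100, 3000)

# Larger and larger samples drawn with replacement from the population
for n in [10, 100, 1000, 100000]:
    sample = np.random.choice(pop_data, n, replace=True)
    print(n, sample.mean())

print('population mean:', pop_data.mean())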

Central limit theorem

With a large enough sample size, the sampling distribution of the mean will be approximately normal. Since a proportion is just the mean of 0 and 1 data values, it also abides by the central limit theorem.

Exercise:

import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
np.random.seed(42)

pop_data = np.random.gamma(1,100,3000)
plt.hist(pop_data);

1. In order to create the sampling distribution of the mean of 100 draws from this distribution, follow these steps:

a. Use numpy's random.choice to simulate 100 draws from the pop_data array.
b. Compute the mean of these 100 draws.
c. Write a loop to simulate this process 10,000 times, and store each mean into an array called means_size_100.
d. Plot a histogram of your sample means.
e. Use means_size_100 and pop_data to answer the quiz questions below.

# Three individual samples of size 100; each sample mean comes out slightly different
sample1 = np.random.choice(pop_data, 100, replace=True)
sample2 = np.random.choice(pop_data, 100, replace=True)
sample3 = np.random.choice(pop_data, 100, replace=True)
print(sample1.mean())
print(sample2.mean())
print(sample3.mean())
print(pop_data.mean())  # population mean (the parameter)
print(pop_data.std())   # population standard deviation

# Simulate the sampling distribution: 10,000 sample means, each from 100 draws
means_size_100 = []
for _ in range(10000):
    sample = np.random.choice(pop_data, 100, replace=True)
    means_size_100.append(sample.mean())

plt.hist(means_size_100);

For the variance, by contrast, the sampling distribution is not normal: it resembles what is known as a chi-squared distribution, which is right-skewed.
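
A sketch of the same simulation applied to the sample variance instead of the sample mean (the sample size and repetition count are reused from the exercise above); the histogram comes out right-skewed rather than bell-shaped, consistent with a chi-squared-like shape.

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
pop_data = np.random.gamma(1, 100, 3000)

# Sampling distribution of the sample variance
vars_size_100 = []
for _ in range(10000):
    sample = np.random.choice(pop_data, 100, replace=True)
    vars_size_100.append(sample.var())

plt.hist(vars_size_100);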

This leaves two practical questions about the central limit theorem: what does it mean for a sample size to be large enough, and which statistics does it apply to?

Instead of relying on the theorem, we can simulate the sampling distribution directly. This technique is called bootstrapping.

Bootstrapping

Bootstrapping means sampling with replacement: each individual that is drawn is returned to the pool before the next draw, so it can be chosen again. It is therefore possible, though unlikely for large samples, to draw the same individual more than once.

# Example of bootstrap sampling
import numpy as np
np.random.seed(42)

die_vals = np.array([1, 2, 3, 4, 5, 6])

# np.random.choice samples with replacement by default (replace=True),
# so 20 draws from only 6 values will necessarily repeat some of them
np.random.choice(die_vals, size=20)

We can draw inferences about population parameters by performing repeated sampling from our existing sample alone.

Therefore, the benefit is that no additional data is needed to gain a better understanding of the parameter.
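
A minimal sketch of this idea, assuming the same gamma population as above (the 2.5/97.5 percentile cutoffs are a choice for illustration, giving a 95% interval): starting from a single sample of 100, we resample it 10,000 times and read off an interval for the population mean, anticipating the Confidence Intervals page that follows.

import numpy as np

np.random.seed(42)
pop_data = np.random.gamma(1, 100, 3000)
sample = np.random.choice(pop_data, 100, replace=True)  # our one observed sample

# Resample the sample itself, with replacement, many times
boot_means = []
for _ in range(10000):
    boot = np.random.choice(sample, sample.size, replace=True)
    boot_means.append(boot.mean())

lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(lower, upper)      # 95% interval built from the single sample alone
print(pop_data.mean())   # the parameter we hope the interval captures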

Bootstrapping is also used in machine learning, for example in bagging (bootstrap aggregating), where models such as random forests are trained on bootstrap samples of the data.