Sampling Distributions

A sampling distribution is the distribution of a statistic.

We found that for proportions (and also means, since a proportion is just the mean of 1 and 0 values), the following characteristics hold.

  1. The sampling distribution is centered on the original parameter value.

  2. The variance of the sampling distribution decreases as the sample size increases. Specifically, the variance of the sampling distribution equals the variance of the original data divided by the sample size. This is always true for the variance of a sample mean!
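The second characteristic can be checked with a quick simulation. This is a sketch only: the gamma population, sample size `n`, and number of repetitions are illustrative choices, not values from the notes.

```python
import numpy as np

np.random.seed(42)
population = np.random.gamma(1, 100, 3000)  # an illustrative right-skewed population
n = 50  # illustrative sample size

# Build the sampling distribution of the mean for samples of size n
sample_means = np.array([np.random.choice(population, n).mean()
                         for _ in range(10000)])

print(population.var() / n)   # theoretical variance of the sample mean
print(sample_means.var())     # simulated variance of the sample mean (close)
```

The two printed values should agree closely, matching the rule that the variance of the sample mean is the population variance divided by the sample size.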

Difference between parameters and statistics

When we build sampling distributions, we compute statistics, not parameters. Parameters are values based on the entire population, so they do not change.

However, statistics vary from sample to sample, and this variability is exactly what the sampling distribution describes.

Law of large numbers

As the sample size increases, the sample mean gets closer and closer to the population mean.
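The law of large numbers says the sample mean converges to the population mean as the sample size grows. A minimal simulation of this idea (the gamma population and sample sizes here are illustrative, not from the notes):

```python
import numpy as np

np.random.seed(42)
population = np.random.gamma(1, 100, 3000)  # illustrative population

# Larger samples give means closer to the population mean, on average
for n in (10, 100, 1000):
    sample = np.random.choice(population, n)
    print(n, sample.mean())

print('population mean:', population.mean())
```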

Central limit theorem

With a large enough sample size, the sampling distribution of the mean will be normally distributed. Since a proportion is just the mean of 0 and 1 data values, it also abides by the central limit theorem.

Exercise:

import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
np.random.seed(42)

pop_data = np.random.gamma(1, 100, 3000)  # a right-skewed population of 3,000 values
plt.hist(pop_data);

1. In order to create the sampling distribution for the average of 100 draws of this distribution, follow these steps:

a. Use numpy's random.choice to simulate 100 draws from the pop_data array.

b. Compute the mean of these 100 draws.

c. Write a loop to simulate this process 10,000 times, and store each mean into an array called means_size_100.

d. Plot a histogram of your sample means.

e. Use means_size_100 and pop_data to answer the quiz questions below.

# A few individual sample means, compared with the population values
sample1 = np.random.choice(pop_data, 100, replace=True)
sample2 = np.random.choice(pop_data, 100, replace=True)
sample3 = np.random.choice(pop_data, 100, replace=True)
print(sample1.mean())
print(sample2.mean())
print(sample3.mean())
print(pop_data.mean())  # population mean
print(pop_data.std())   # population standard deviation

means_size_100 = []
for _ in range(10000):
    sample = np.random.choice(pop_data, 100, replace=True)
    means_size_100.append(sample.mean())

plt.hist(means_size_100);
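We can check both characteristics from above against the simulated distribution. This sketch recreates the exercise setup so it runs on its own:

```python
import numpy as np

np.random.seed(42)
pop_data = np.random.gamma(1, 100, 3000)

means_size_100 = np.array([np.random.choice(pop_data, 100, replace=True).mean()
                           for _ in range(10000)])

print(means_size_100.mean())             # 1. centered on the population mean
print(pop_data.mean())

print(means_size_100.std())              # 2. spread shrinks with sample size:
print(pop_data.std() / np.sqrt(100))     #    close to pop std / sqrt(n)
```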

For the sample variance, the sampling distribution will resemble what is known as a chi-squared distribution, which is right-skewed rather than normal.
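The same simulation approach shows this: build the sampling distribution of the sample variance instead of the mean (a sketch, reusing the illustrative gamma population from the exercise):

```python
import numpy as np

np.random.seed(42)
pop_data = np.random.gamma(1, 100, 3000)

# Sampling distribution of the sample variance: right-skewed, chi-squared-like
vars_size_100 = np.array([np.random.choice(pop_data, 100, replace=True).var()
                          for _ in range(10000)])

print(vars_size_100.mean())  # centered near the population variance
print(pop_data.var())
```

Plotting `vars_size_100` with `plt.hist` would show the right skew; the mean sits close to the population variance.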

So we have 2 questions for the central limit theorem: What does it mean for a sample size to be large enough? Which statistics does it apply to?

But instead of relying on the theorem, we can simulate the sampling distribution directly. This technique is called bootstrapping.

Bootstrapping

Bootstrapping means sampling with replacement: each individual drawn is returned to the pool, so the same individual could potentially be chosen again within the same sample. This is possible, though often unlikely.

# Example of bootstrap sampling
import numpy as np
np.random.seed(42)

die_vals = np.array([1,2,3,4,5,6])

np.random.choice(die_vals, size=20)  # replace=True is the default, so values repeat

We can draw inferences about population parameters by repeatedly resampling from our existing sample alone.

Therefore, the benefit is that no additional data is needed to gain a better understanding of the parameter.
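As a sketch of this idea, we can bootstrap a confidence interval for the population mean from a single sample. The sample size, seed, and 95% level here are illustrative choices, not values from the notes:

```python
import numpy as np

np.random.seed(42)
sample = np.random.gamma(1, 100, 200)  # pretend this is our only sample

# Bootstrap: resample WITH replacement from the sample itself
boot_means = np.array([np.random.choice(sample, sample.size, replace=True).mean()
                       for _ in range(10000)])

# A 95% bootstrap confidence interval for the population mean
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(lower, upper)
```

Every bootstrap statistic comes from the original sample, so we estimated the uncertainty in the mean without collecting any new data.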

Bootstrapping is also used in machine learning, for example in bagging (bootstrap aggregating).
