Descriptive Statistics
Describe the data we've collected using measures of center, measures of spread, shape of our distribution, and outliers. We can also use plots of our data to gain a better understanding.
Reminder:
5 number summary
Q2 = median.
Q1 = median of the first half of the data.
Q3 = median of the 2nd half of the data.
Range = max - min
Interquartile range: Q3 - Q1.
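As a rough illustration, the definitions above can be sketched in Python. The function name `five_number_summary` and the sample values are made up for this example. The sketch also assumes that when the number of points is odd, the middle value is left out of both halves; the notes do not say which convention to use, and other tools may compute quartiles slightly differently.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, Q2, Q3, max) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]          # first half (middle value excluded when n is odd)
    upper = data[half + n % 2:]  # second half
    return data[0], median(lower), median(data), median(upper), data[-1]

sample = [1, 3, 4, 6, 7, 8, 10, 12, 15]  # made-up values
minimum, q1, q2, q3, maximum = five_number_summary(sample)

data_range = maximum - minimum  # Range = max - min
iqr = q3 - q1                   # Interquartile range = Q3 - Q1

print(minimum, q1, q2, q3, maximum)  # 1 3.5 7 11.0 15
print(data_range, iqr)               # 14 7.5
```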
A boxplot draws the five-number summary as a graph.
Most common measures of spread = standard deviation and variance.
Standard Deviation: On average, how much each point varies from the mean of the points. How to calculate?
Variance: Average squared distance of each observation from the mean
1. Find the mean: x-bar.
2. Find the distance of each point from this mean: xi - x-bar.
3. Square all the values, (xi - x-bar)^2, and take their average.
This gives us the variance.
4. Take the square root of the variance = standard deviation.
We initially squared so that every value of xi - x-bar becomes positive. To "cancel" the squaring and get back to the original units, we take the square root.
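The calculation described above might look like this in Python. This is a minimal sketch: the function names and the data points are invented for illustration, and it divides by the number of points (the population variance). Sample variance divides by n - 1 instead.

```python
import math

def variance(values):
    """Average squared distance of each observation from the mean."""
    mean = sum(values) / len(values)                       # find the mean
    squared_distances = [(x - mean) ** 2 for x in values]  # distance from the mean, squared
    return sum(squared_distances) / len(values)            # average of the squares

def std_dev(values):
    """Square root of the variance, in the same units as the data."""
    return math.sqrt(variance(values))

points = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values with mean 5
print(variance(points))  # 4.0
print(std_dev(points))   # 2.0
```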
It's important when comparing data that the units are the same. Ex: if measuring $, we'll use the std dev and not the variance, since the std dev has the same units as the data ($).
The standard deviation is a measurement that has the same units as our original data, while the units of the variance are the square of the units in our original data. For example, if the units in our original data were dollars, then units of the standard deviation would also be dollars, while the units of the variance would be dollars squared.
Common use case: find the std dev of different groups to find out which ones are more spread out.
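A small sketch of that comparison, using `statistics.pstdev` from the Python standard library (population standard deviation). The group names and values are hypothetical.

```python
from statistics import pstdev

# Two made-up groups with the same mean (50) but different spread.
groups = {
    "group_a": [48, 50, 52, 49, 51],
    "group_b": [20, 80, 50, 35, 65],
}

# Standard deviation per group.
spread = {name: pstdev(values) for name, values in groups.items()}

# The group with the largest standard deviation is the most spread out.
most_spread = max(spread, key=spread.get)

for name, sd in spread.items():
    print(f"{name}: std dev = {sd:.2f}")
print("Most spread out:", most_spread)
```

Both groups have the same center, so the mean alone would not tell them apart. The standard deviation does.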
Mode = highest bar of the histogram.
Most common shapes:
Real-life examples:
Left-skewed examples: age at death, asset price changes, GPA.
Right-skewed examples: amount of a drug in your blood over time, distribution of wealth, human athletic abilities.
Understand the impact outliers have on our summary statistics.
If outliers are typos or mistakes = remove.
If not, why do they exist?
The median is the middle number and is largely unaffected by outliers, unlike the mean.
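To see how differently the mean and the median react to an outlier, here is a short sketch with invented numbers. The variable names and the values (including the outlier of 500) are assumptions for illustration only.

```python
from statistics import mean, median

salaries = [40, 42, 45, 47, 50]  # made-up values
with_outlier = salaries + [500]  # same data plus one extreme value

# How far each summary statistic moves once the outlier is added.
mean_shift = mean(with_outlier) - mean(salaries)
median_shift = median(with_outlier) - median(salaries)

print(f"mean:   {mean(salaries):.1f} -> {mean(with_outlier):.1f}")
print(f"median: {median(salaries):.1f} -> {median(with_outlier):.1f}")
```

With these numbers the mean jumps by more than 70 while the median barely moves, which is why the median is often preferred as a measure of center when outliers are present.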
Usually the best way to see what's happening is a visual.
Have to be careful of how we share our results.