Statistical Experiments and Significance Testing

Data scientists are faced with the need to conduct continual experiments, particularly regarding user interface and product marketing. The term inference reflects the intention to apply the experiment results, which involve a limited set of data, to a larger process or population.

KEY TERMS FOR A/B TESTING

Treatment Something (drug, price, web headline) to which a subject is exposed.

Treatment group A group of subjects exposed to a specific treatment.

Control group A group of subjects exposed to no (or standard) treatment.

Randomization The process of randomly assigning subjects to treatments.

Subjects The items (web visitors, patients, etc.) that are exposed to treatments.

Test statistic The metric used to measure the effect of the treatment.

You also need to pay attention to the test statistic or metric you use to compare group A to group B. Perhaps the most common metric in data science is a binary variable: click or no-click, buy or don’t buy, fraud or no fraud, and so on. Those results would be summed up in a 2×2 table. Table 3-1 is a 2×2 table for an actual price test.

Resampling

Permutation tests are useful heuristic procedures for exploring the role of random variation. They are relatively easy to code, interpret, and explain, and they offer a useful detour around the formalism and “false determinism” of formula-based statistics. One virtue of resampling, in contrast to formula approaches, is that it comes much closer to a “one size fits all” approach to inference. Data can be numeric or binary. Sample sizes can be the same or different. Assumptions about normally distributed data are not needed.

There are two main types of resampling procedures: the bootstrap and permutation tests. The bootstrap is used to assess the reliability of an estimate. Permutation tests are used to test hypotheses, typically involving two or more groups.

KEY TERMS

Permutation test The procedure of combining two or more samples together and randomly (or exhaustively) reallocating the observations to resamples. Synonyms: randomization test, random permutation test, exact test. A permutation test is a good way to assess whether an observed difference between two group means is real or simply due to chance.

With or without replacement In sampling, whether or not an item is returned to the sample before the next draw.

Permutation Test

  1. Combine the results from the different groups into a single data set.

  2. Shuffle the combined data, then randomly draw (without replacement) a resample of the same size as group A.

  3. From the remaining data, randomly draw (without replacement) a resample of the same size as group B.

  4. Do the same for groups C, D, and so on.

  5. Whatever statistic or estimate was calculated for the original samples (e.g., difference in group proportions), calculate it now for the resamples, and record; this constitutes one permutation iteration.

  6. Repeat the previous steps R times to yield a permutation distribution of the test statistic.
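As a minimal sketch of this procedure (my own illustration, not from the book), here is a random permutation test in Python for two numeric groups, using the difference in means as the test statistic; the function name and defaults are assumptions:

```python
import numpy as np

def perm_mean_diff_test(x, y, n_iter=1000, rng=None):
    """Random permutation test for the difference in two group means."""
    rng = np.random.default_rng(rng)
    observed = np.mean(x) - np.mean(y)            # statistic on original samples
    combined = np.concatenate([x, y])             # step 1: combine the groups
    n_x = len(x)
    perm_diffs = np.empty(n_iter)
    for i in range(n_iter):                       # step 6: repeat R times
        shuffled = rng.permutation(combined)      # steps 2-3: shuffle and split
        perm_diffs[i] = shuffled[:n_x].mean() - shuffled[n_x:].mean()
    # p-value: share of permuted differences at least as extreme as observed
    p_value = np.mean(np.abs(perm_diffs) >= abs(observed))
    return observed, p_value
```

For example, `perm_mean_diff_test([23, 21, 19, 24], [18, 17, 22, 16])` returns the observed difference and the share of shuffles that produced a difference at least that extreme.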

Now go back to the observed difference between groups and compare it to the set of permuted differences. If the observed difference lies well within the set of permuted differences, then we have not proven anything — the observed difference is within the range of what chance might produce. However, if the observed difference lies outside most of the permutation distribution, then we conclude that chance is not responsible. In technical terms, the difference is statistically significant.

Given access to ample computing power, a permutation test is often preferable to a t-test because it does not rely on distributional assumptions.

Exhaustive permutation test: instead of just randomly shuffling and dividing the data, we enumerate all the possible ways the data could be divided. This is practical only for relatively small sample sizes.

In a bootstrap permutation test, the draws outlined in steps 2 and 3 of the random permutation test are made with replacement instead of without replacement. In this way the resampling procedure models not just the random element in the assignment of treatment to subject, but also the random element in the selection of subjects from a population.

Statistical Significance and P-Values

KEY TERMS

P-value Given a chance model that embodies the null hypothesis, the p-value is the probability of obtaining results as unusual or extreme as the observed results.

Alpha The probability threshold of “unusualness” that chance results must surpass for actual outcomes to be deemed statistically significant.

The probability question being answered is not “what is the probability that this happened by chance?” but rather “given a chance model, what is the probability of a result this extreme?” We then deduce backward about the appropriateness of the chance model, but that judgment does not carry a probability. This point has been the subject of much confusion.

Type 1 error Mistakenly concluding an effect is real (when it is due to chance).

Type 2 error Mistakenly concluding an effect is due to chance (when it is real).

t-tests

KEY IDEAS

  • Before the advent of computers, resampling tests were not practical, and statisticians used standard reference distributions.

  • A test statistic could then be standardized and compared to the reference distribution.

  • One such widely used standardized statistic is the t-statistic.
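For illustration (my own example; the group data are placeholders), a two-sample t-test in Python via SciPy:

```python
from scipy import stats
import numpy as np

# Placeholder data: session times (seconds) for two page variants
group_a = np.array([185, 220, 160, 205, 175])
group_b = np.array([150, 165, 145, 180, 155])

# Welch's t-test (does not assume equal variances)
res = stats.ttest_ind(group_a, group_b, equal_var=False)
print(res.statistic, res.pvalue)
```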

Multiple testing

Multiple testing relates to the problem of overfitting in data mining, or “fitting the model to the noise.” The more variables you add, or the more models you run, the greater the probability that something will emerge as “significant” just by chance. In short, multiplicity in a research study or data mining project (multiple comparisons, many variables, many models, etc.) increases the risk of concluding that something is significant just by chance.

  • For predictive modeling, the risk of getting an illusory model whose apparent efficacy is largely a product of random chance is mitigated by cross-validation (see “Cross-Validation”) and use of a holdout sample.

  • For other procedures without a labeled holdout set to check the model, you must rely on:

    • Awareness that the more you query and manipulate the data, the greater the role that chance might play; and

    • Resampling and simulation heuristics to provide random chance benchmarks against which observed results can be compared.

Degrees of freedom

The concept is applied to statistics calculated from sample data, and refers to the number of values free to vary. For example, if you know the mean for a sample of 10 values, and you also know 9 of the values, you also know the 10th value. Only 9 are free to vary.

When you use a sample to estimate the variance for a population, you will end up with an estimate that is slightly biased downward if you use n in the denominator. If you use n – 1 in the denominator, the estimate will be free of that bias.
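A quick numeric illustration (my own): NumPy's `var` divides by n by default; passing `ddof=1` applies the n – 1 denominator:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(np.var(x))          # divides by n (biased for a sample): 2.0
print(np.var(x, ddof=1))  # divides by n - 1 (unbiased): 2.5
```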

ANOVA

Suppose that, instead of an A/B test, we had a comparison of multiple groups, say A-B-C-D, each with numeric data. The statistical procedure that tests for a statistically significant difference among the groups is called analysis of variance, or ANOVA. In its simplest form, ANOVA provides a statistical test of whether the population means of several groups are equal, and therefore generalizes the t-test to more than two groups. ANOVA is useful for comparing (testing) three or more group means for statistical significance. It is conceptually similar to multiple two-sample t-tests, but is more conservative, resulting in fewer type I errors,[1] and is therefore suited to a wide range of practical problems.

KEY TERMS FOR ANOVA

Pairwise comparison A hypothesis test (e.g., of means) between two groups among multiple groups.

Omnibus test A single hypothesis test of the overall variance among multiple group means.

Decomposition of variance Separation of components contributing to an individual value (e.g., from the overall average, from a treatment mean, and from a residual error).

F-statistic A standardized statistic that measures the extent to which differences among group means exceed what might be expected in a chance model.

SS “Sum of squares,” referring to deviations from some average value.

Suppose that instead of an A/B test we have, say, an A-B-C-D test of four web pages. The more pairwise comparisons we make, the greater the potential for being fooled by random chance (see “Multiple Testing”). Instead of worrying about all the different comparisons between individual pages we could possibly make, we can do a single overall omnibus test that addresses the question, “Could all the pages have the same underlying stickiness, and the differences among them be due to the random way in which a common set of session times got allocated among the four pages?”

The procedure used to test this is ANOVA. The basis for it can be seen in the following resampling procedure (specified here for the A-B-C-D test of web page stickiness):

1. Combine all the data together in a single box.

2. Shuffle and draw out four resamples of five values each.

3. Record the mean of each of the four groups.

4. Record the variance among the four group means.

5. Repeat steps 2–4 many times (say, 1,000).

What proportion of the time did the resampled variance exceed the observed variance? This is the p-value. This type of permutation test is a bit more involved than the type used in “Permutation Test”.
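A minimal sketch of this resampling procedure in Python (my own illustration; the function name is an assumption, and groups can be any sequences of numeric data):

```python
import numpy as np

def anova_perm_pvalue(groups, n_iter=1000, rng=None):
    """Permutation version of ANOVA: variance among group means vs. chance."""
    rng = np.random.default_rng(rng)
    observed = np.var([np.mean(g) for g in groups])  # variance among group means
    combined = np.concatenate(groups)                # step 1: a single "box"
    sizes = [len(g) for g in groups]
    count = 0
    for _ in range(n_iter):                          # step 5: repeat many times
        shuffled = rng.permutation(combined)         # step 2: shuffle
        means, start = [], 0
        for n in sizes:                              # resamples of the same sizes
            means.append(shuffled[start:start + n].mean())
            start += n
        if np.var(means) >= observed:                # steps 3-4 plus comparison
            count += 1
    return count / n_iter                            # the p-value
```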

F-Statistic

Just like the t-test can be used instead of a permutation test for comparing the mean of two groups, there is a statistical test for ANOVA based on the F-statistic. The F-statistic is based on the ratio of the variance across group means (i.e., the treatment effect) to the variance due to residual error. The higher this ratio, the more statistically significant the result.
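As a hedged illustration, SciPy provides the classic one-way ANOVA F-test; the four groups below are placeholder session-time arrays, not data from the book:

```python
from scipy import stats

# Placeholder session times (seconds) for four page variants
page_a = [164, 172, 177, 156, 195]
page_b = [178, 191, 182, 185, 177]
page_c = [175, 193, 171, 163, 176]
page_d = [155, 166, 164, 170, 168]

f_stat, p_value = stats.f_oneway(page_a, page_b, page_c, page_d)
print(f_stat, p_value)
```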

Chi-Square Test

KEY TERMS

Chi-square statistic A measure of the extent to which some observed data departs from expectation.

Expectation or expected How we would expect the data to turn out under some assumption, typically the null hypothesis.


Example: Suppose you are testing three different headlines — A, B, and C — and you run them each on 1,000 visitors, with the results shown in Table 3-4.

We need to have the “expected” distribution of clicks, and, in this case, that would be under the null hypothesis assumption that all three headlines share the same click rate, for an overall click rate of 34/3,000. Under this assumption, our contingency table would look like Table 3-5.

The Pearson residual is defined as:

$$R = \frac{\text{Observed} - \text{Expected}}{\sqrt{\text{Expected}}}$$

R measures the extent to which the actual counts differ from these expected counts (see Table 3-6).

The chi-squared statistic is defined as the sum of the squared Pearson residuals:

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} R_{ij}^2$$

where r and c are the number of rows and columns, respectively. The chi-squared statistic for this example is 1.666. Is that more than could reasonably occur in a chance model? We can test with this resampling algorithm:

1. Constitute a box with 34 ones (clicks) and 2,966 zeros (no clicks).

2. Shuffle, take three separate samples of 1,000, and count the clicks in each.

3. Find the squared differences between the shuffled counts and the expected counts, and sum them.

4. Repeat steps 2 and 3, say, 1,000 times.

5. How often does the resampled sum of squared deviations exceed the observed? That’s the p-value.
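A sketch of this algorithm in Python (my own; the per-headline click counts [14, 8, 12] are hypothetical values that sum to the stated 34 clicks over 3,000 visitors):

```python
import numpy as np

def clicks_resample_pvalue(clicks, visitors=1000, n_iter=1000, rng=None):
    """Resampling test: are per-group click counts consistent with chance?"""
    rng = np.random.default_rng(rng)
    clicks = np.asarray(clicks, dtype=float)
    n_groups = len(clicks)
    expected = clicks.sum() / n_groups                  # expected clicks per group
    observed_dev = ((clicks - expected) ** 2).sum()
    box = np.zeros(n_groups * visitors)                 # step 1: ones and zeros
    box[: int(clicks.sum())] = 1
    count = 0
    for _ in range(n_iter):                             # step 4: repeat
        shuffled = rng.permutation(box)                 # step 2: shuffle
        resampled = shuffled.reshape(n_groups, visitors).sum(axis=1)
        if ((resampled - expected) ** 2).sum() > observed_dev:  # steps 3 and 5
            count += 1
    return count / n_iter                               # the p-value

# Hypothetical counts for headlines A, B, C
print(clicks_resample_pvalue([14, 8, 12]))
```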

The distribution of the chi-squared statistic can be approximated by a chi-square distribution. The appropriate standard chi-square distribution is determined by the degrees of freedom.

The chi-square distribution is typically skewed, with a long tail to the right; see Figure 3-7 for the distribution with 1, 2, 5, and 10 degrees of freedom. The further out on the chi-square distribution the observed statistic is, the lower the p-value.

Fisher’s Exact Test

The chi-square distribution is a good approximation of the shuffled resampling test just described, except when counts are extremely low (single digits, especially five or fewer). In such cases, the resampling procedure will yield more accurate p-values. In fact, most statistical software has a procedure to actually enumerate all the possible rearrangements (permutations) that can occur, tabulate their frequencies, and determine exactly how extreme the observed result is. This is called Fisher’s exact test after the great statistician R. A. Fisher.
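For a 2×2 table, SciPy exposes this as a built-in; the counts below are placeholders of my own:

```python
from scipy import stats

# Hypothetical 2x2 table: rows are headlines, columns are click / no-click
table = [[8, 992],
         [2, 998]]
odds_ratio, p_value = stats.fisher_exact(table)
print(odds_ratio, p_value)
```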

Relevance for Data Science

One data science application of the chi-square test, especially Fisher’s exact version, is in determining appropriate sample sizes for web experiments. These experiments often have very low click rates and, despite thousands of exposures, count rates might be too small to yield definitive conclusions in an experiment. In such cases, Fisher’s exact test, the chi-square test, and other tests can be useful as a component of power and sample size calculations.

Conclusion: The chi-square distribution is the reference distribution (which embodies the assumption of independence) to which the observed calculated chi-square statistic must be compared.

Multi-Arm Bandit Algorithm

Multi-arm bandits offer an approach to testing, especially web testing, that allows explicit optimization and more rapid decision making than the traditional statistical approach to designing experiments.

How it works

A multi-arm bandit takes its name from a slot machine with multiple arms (a slot machine being a “one-armed bandit”). Your goal is to win as much money as possible and, more specifically, to identify and settle on the winning arm sooner rather than later. The challenge is that you don’t know at what rate the arms pay out — you only know the results of pulling the arm. Suppose each “win” is for the same amount, no matter which arm. What differs is the probability of a win. Suppose further that you initially try each arm 50 times and get the following results:


Arm A: 10 wins out of 50
Arm B: 2 wins out of 50
Arm C: 4 wins out of 50

We start pulling A more often, to take advantage of its apparent superiority, but we don’t abandon B and C. We just pull them less often. If A continues to outperform, we continue to shift resources (pulls) away from B and C and pull A more often. If, on the other hand, C starts to do better, and A starts to do worse, we can shift pulls from A back to C. If one of them turns out to be superior to A and this was hidden in the initial trial due to chance, it now has an opportunity to emerge with further testing.
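One simple algorithm that implements this kind of shifting is epsilon-greedy. Here is a minimal sketch (my own illustration; the function name and epsilon value are assumptions):

```python
import random

def epsilon_greedy_pull(wins, pulls, epsilon=0.1):
    """Choose an arm: explore at random with probability epsilon,
    otherwise exploit the arm with the best observed win rate."""
    if random.random() < epsilon:
        return random.randrange(len(wins))                    # explore
    rates = [w / p if p else 0.0 for w, p in zip(wins, pulls)]
    return rates.index(max(rates))                            # exploit

# After the initial 50 pulls per arm described above:
arm = epsilon_greedy_pull(wins=[10, 2, 4], pulls=[50, 50, 50])
```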

A more sophisticated algorithm uses “Thompson’s sampling.” This procedure “samples” (pulls a bandit arm) at each stage to maximize the probability of choosing the best arm. Of course you don’t know which is the best arm — that’s the whole problem! — but as you observe the payoff with each successive draw, you gain more information. Thompson’s sampling uses a Bayesian approach: some prior distribution of rewards is assumed initially, using what is called a beta distribution (this is a common mechanism for specifying prior information in a Bayesian problem). As information accumulates from each draw, this information can be updated, allowing the selection of the next draw to be better optimized as far as choosing the right arm.
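A minimal sketch of Thompson sampling for win/lose arms (my own illustration, assuming a uniform Beta(1, 1) prior for each arm):

```python
import numpy as np

def thompson_pick(wins, losses, rng=None):
    """Sample each arm's Beta posterior and pull the arm with the
    largest sampled win probability."""
    rng = np.random.default_rng(rng)
    samples = [rng.beta(w + 1, l + 1) for w, l in zip(wins, losses)]
    return int(np.argmax(samples))

# Posterior after the initial trials: wins and losses per arm
arm = thompson_pick(wins=[10, 2, 4], losses=[40, 48, 46])
```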

Power and Sample Size

KEY TERMS

Effect size The minimum size of the effect that you hope to be able to detect in a statistical test, such as “a 20% improvement in click rates”.

Power The probability of detecting a given effect size with a given sample size.

Significance level The statistical significance level at which the test will be conducted.

Power in detail: Power is the probability of detecting a specified effect size with specified sample characteristics (size and variability). For example, we might say (hypothetically) that the probability of distinguishing between a .330 hitter and a .200 hitter in 25 at-bats is 0.75. The effect size here is a difference of .130. And “detecting” means that a hypothesis test will reject the null hypothesis of “no difference” and conclude there is a real effect. So the experiment of 25 at-bats (n = 25) for two hitters, with an effect size of 0.130, has (hypothetical) power of 0.75 or 75%.

Most data scientists will not need to go through all the formal steps needed to report power, for example, in a published paper.

Sample Size

The most common use of power calculations is to estimate how big a sample you will need.

In summary, for calculating power or required sample size, there are four moving parts:

  • Sample size

  • Effect size you want to detect

  • Significance level (alpha) at which the test will be conducted

  • Power

Specify any three of them, and the fourth can be calculated.
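As a hedged illustration, statsmodels can solve for any one of these four given the other three; note that its effect_size is in standardized units (Cohen's d), not a raw percentage improvement:

```python
import math
from statsmodels.stats.power import TTestIndPower

# Leave one argument unset to solve for it; here we solve for sample size
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.2, alpha=0.05, power=0.8)
print(math.ceil(n_per_group))  # required sample size per group (394)
```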
