Statistics and probability are related fields in mathematics, but the nature of the relationship is a little tricky. The word “statistics” is used to mean several things, including the collection and summary of data, which is the province of descriptive statistics. But the part of statistics that is related to probability is inferential statistics, which is the art and craft of separating signal from noise; inferential statistics may also be described as a set of methods for saying something about a population when you only have data on a sample from that population. Probability, on the other hand, is about finding out how probable certain events are, given certain conditions.
The relationship between statistics and probability is best illustrated by example. A fairly simple problem in inferential statistics is that of comparing two groups on some numeric variable. For example, we might want to find out if American boys and girls in 3rd grade report having the same number of friends. We are interested in a huge population – all the 3rd grade boys and girls in America – and we cannot collect data on all of them. Therefore, we draw a random sample from the population, ask each child in the sample how many friends they have, and then compare the numbers. We set up a null hypothesis that there is no difference between boys and girls in the population, and then we ask how likely data like ours would be if the null hypothesis were true. One way of performing this test is with a t-test. We can then get, from a table or a computer output, a p-value, which answers the following question: “If the null hypothesis is true, how likely is a t statistic at least as extreme as the one we obtained?”
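(It is important to note that the p-value does NOT answer the question “How likely is it that the null hypothesis is true?” It is the probability of data like ours given the null hypothesis, not the probability of the null hypothesis given the data.)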
But where do we get that p-value? That’s where probability comes in. Remember that probability uses various mathematical tools to determine how likely certain events are, given certain conditions. Here, the condition is “There is no difference in the number of friends reported by 3rd grade boys and girls in America” and the event is a t statistic at least as far from zero as the one we obtained when we did the t-test.
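To make the probability step concrete, here is a minimal sketch in Python. It assumes the pooled-variance (Student's) form of the two-sample t-test, which the text does not specify, and the friend counts are invented purely for illustration. The function names pooled_t_statistic, t_density and two_sided_p_value are hypothetical helpers written for this example; the p-value is obtained by numerically integrating the t distribution that applies when the null hypothesis is true.

```python
import math
import statistics


def pooled_t_statistic(a, b):
    """Return (t, df) for Student's two-sample t-test with pooled variance."""
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * statistics.variance(a)
                  + (n_b - 1) * statistics.variance(b)) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (statistics.mean(a) - statistics.mean(b)) / std_err, df


def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_coef = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_coef - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p_value(t, df, steps=20_000):
    """P(|T| >= |t|) when the null hypothesis is true (Simpson's rule)."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central = 2 * (h / 3) * total  # P(|T| < |t|), using symmetry about 0
    return max(0.0, 1.0 - central)


# Invented data for illustration only: friends reported by 8 boys and 8 girls.
boys = [3, 5, 4, 6, 2, 5, 4, 3]
girls = [5, 6, 4, 7, 5, 6, 8, 5]

t, df = pooled_t_statistic(boys, girls)
print(f"t = {t:.3f}, df = {df}, two-sided p = {two_sided_p_value(t, df):.4f}")
```

In practice a statistics library would normally do this calculation (for example, scipy.stats.ttest_ind in SciPy); the hand-rolled integration is shown only to make visible that the p-value is a probability computed under the null hypothesis.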
Any time you test a null hypothesis, whether with regression, analysis of variance, or some other statistical technique, the above logic applies.
One interesting note is that all of the theory behind this was developed before the invention of computers. Therefore, it relies on approximations and on assumptions that are not always reasonable. Simulation is one way of avoiding these approximations and assumptions. If we wanted to simulate our example, we would program a computer to do the following:
1) Create a large population of boys and girls, with the number of friends reported by each varying only by chance.
2) Draw many samples from that population (at least 1000 samples).
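3) Calculate a t statistic for each sample.
4) Determine how many of the results in 3) are at least as far from 0 as the t statistic from our actual sample was. The proportion of simulated results that are this extreme is a simulated p-value.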
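The four steps above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not a definitive implementation: the friend counts are assumed to be uniform integers from 0 to 10 for both boys and girls (the text does not say what distribution to use), the pooled-variance t statistic is assumed, and the names t_statistic and simulate_p_value, the sample sizes, and the observed value of 2.0 are all hypothetical.

```python
import math
import random
import statistics


def t_statistic(a, b):
    """Pooled-variance two-sample t statistic (0.0 if there is no spread)."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * statistics.variance(a)
                  + (n_b - 1) * statistics.variance(b)) / (n_a + n_b - 2)
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 0.0
    return (statistics.mean(a) - statistics.mean(b)) / std_err


def simulate_p_value(observed_t, n_boys, n_girls,
                     n_samples=2000, population_size=50_000, seed=0):
    """Share of simulated t statistics at least as far from 0 as observed_t."""
    rng = random.Random(seed)
    # Step 1: boys and girls come from the SAME distribution (assumed here
    # to be uniform on 0..10), so any difference between them is chance.
    boys_pop = [rng.randint(0, 10) for _ in range(population_size)]
    girls_pop = [rng.randint(0, 10) for _ in range(population_size)]
    extreme = 0
    for _ in range(n_samples):
        # Step 2: draw one random sample of boys and one of girls.
        boys = rng.sample(boys_pop, n_boys)
        girls = rng.sample(girls_pop, n_girls)
        # Step 3: compute the t statistic for this sample.
        t = t_statistic(boys, girls)
        # Step 4: count results at least as far from 0 as the observed one.
        if abs(t) >= abs(observed_t):
            extreme += 1
    return extreme / n_samples


# Hypothetical observed t statistic and sample sizes, for illustration only.
print(simulate_p_value(observed_t=2.0, n_boys=30, n_girls=30))
```

The returned proportion plays the role of the p-value, but it comes from counting simulated outcomes rather than from a formula, so it depends only on the population we chose to simulate and not on the approximations behind the t distribution.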