Distributions

Note: This section depends on a solid understanding of continuous random variables, so if you're not up to speed on those, go to that section first.

The concept of a probability distribution is very important in statistics and probability. You might say it's the very foundation of statistics. By the way, you might get stuck on the word "distribution." It's an old word we've inherited from studies of things that involve random chance. Don't get hung up on the word; in time, it will probably make a lot of sense to you.

The graph below illustrates the concept of a probability distribution. It shows data from a Census Bureau study of the heights of American women ages 30-39. The data is represented by the purple bars: the higher the bar, the more women (as a percent of all women 30-39) of that height. That is, the higher the bar, the greater the probability that a woman you meet aged 30-39 is that height.

You can see from the graph that the mean height is about 64 inches (5'-4"). You can see that there are about as many women taller than the average as shorter, and that the numbers of women taller and shorter than the average fall away from it fairly smoothly. There are more women of heights close to the average than of heights much shorter or taller than it.

The underlying curve (gray-shaded area) is a special curve we call the normal distribution or the Gaussian distribution, and it shows what the data would likely look like if we had a great many data points (see law of large numbers) – and indeed this was quite a large study. We would also head toward a more continuous curve if we divided our height "bins" more finely, say in increments of 1/4-inch units or smaller.

 

The normal or Gaussian distribution

The curve under the bar graph above has a familiar "bell" shape. It's often referred to as a bell curve, but more often as the normal distribution or the Gaussian distribution, after Carl Friedrich Gauss.

The curve is a probability distribution. You can always read its meaning by imagining that the vertical axis is a measure of the relative probability or likelihood of something happening and that all of the somethings are arrayed in order along the x-axis.

The Gaussian curve is always symmetric on either side of its maximum, and the maximum is the mean or average value. Whatever value or event is in the middle is the most likely. That "event" in our women's height example would be the "event" of being 5'-4" tall. Out in the "wings," probability is the lowest: There are far fewer very short and very tall women, and the probability of being very short or very tall is lower than being of more average height.

If we add up all of the probability under a Gaussian curve, we should get one (or 100%), the probability that something (anything at all) happened. Often we scale a Gaussian curve so that its total area – the area under the curve – is one. That's called "normalizing" the distribution.

Here's another example before we move on. The graph below shows the results of 5000 simulated throws of two dice. The sum of both dice is shown. Notice that because there are more ways to roll a total of 7 than any other total (1+6, 2+5, 3+4, 4+3, 5+2 and 6+1, six ways in all), it's the most probable throw. After 5000 throws, the dice-total distribution looks pretty "normal."
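A simulation like this is easy to sketch in Python. The 5000-throw count and the six-ways-to-roll-7 counting argument come from the discussion above; the variable names here are mine:

```python
import random
from collections import Counter

random.seed(1)  # fixed seed so the run is repeatable

# Simulate 5000 throws of two dice and tally the totals
throws = 5000
counts = Counter(random.randint(1, 6) + random.randint(1, 6)
                 for _ in range(throws))

# Exact probabilities: count the ways each total can occur out of 36
ways = Counter(a + b for a in range(1, 7) for b in range(1, 7))
exact = {total: n / 36 for total, n in ways.items()}

# Normalize the simulated counts so they're comparable to probabilities
simulated = {total: n / throws for total, n in counts.items()}

print(exact[7])      # 6/36 ≈ 0.167 – the most likely total
print(simulated[7])  # should come out close to 0.167
```

Notice that 7 really does have the highest exact probability (6/36), and 2 and 12 the lowest (1/36 each).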

Notice that in this example we're graphing not probability but the number of occurrences of each total; the two should have the same shape. The sum of the heights of the green bars should be 5000, the total number of throws. And as expected, throwing a 2 or a 12 is much less likely than throwing a 7.

We could normalize this distribution by dividing each column value by the sum of all columns. This would give us the percent chance (if we multiplied by 100) of each throw, and it would scale the graph but maintain its shape. In the graph below, the green bars are the normalized simulated curve, and the purple bars are the exact expectations (see law of large numbers) we'd expect for a very large number of throws.

So where does that curve come from?

That's a tricky question. It comes from modeling random chance, but the functional form of the curve has to be derived using calculus – it arises, for example, as the limiting shape of distributions like our dice totals as the number of trials grows. We don't need to go there just yet, though; the result will serve our needs just fine. Here's what the Gaussian (or normal) function looks like, with some explanation of its parameters.

The Gaussian function
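In case the image doesn't display, here is the function written out, using h for the center (the mean) and σ for the standard deviation – the same parameters used in the simulation below:

```latex
f(x) \;=\; \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - h)^2}{2\sigma^2}}
```

The prefactor 1/(σ√(2π)) is the normalization factor that makes the total area under the curve equal to one.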

That formula looks scary, I know. But just think of it as the prefactor in front (dashed oval) multiplied by an exponential function with a slightly complicated exponent, then trust your knowledge of transformations to understand it. Notice that everything after the negative sign in the exponent is squared, so the exponent is always negative (or zero), and the function always gets smaller as x moves away from the mean toward ± ∞.

 

Simulation

Use the sliders on the graph of the Gaussian function below to adjust the parameters h and σ, and watch the effect on the graph of the function. The parameter h, as in all functions, performs horizontal translation of the Gaussian curve along the x-axis. The parameter σ, the standard deviation, adjusts the width of the function. In other words, σ is a horizontal scaling parameter. Because of the presence of the normalization factor, the area under the curve will remain the same as you scale it – that's why as the distribution gets wider, it also gets shorter.
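You can check numerically that the normalization factor works this way. Here's a small sketch (the function and helper names are mine) that approximates the area under the curve for two different widths:

```python
import math

def gaussian(x, h, sigma):
    """Normalized Gaussian: h shifts the center, sigma sets the width."""
    return math.exp(-(x - h)**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

def area(h, sigma, lo=-50.0, hi=50.0, n=20000):
    """Approximate the area under the curve with a simple Riemann sum."""
    dx = (hi - lo) / n
    return sum(gaussian(lo + i * dx, h, sigma) for i in range(n)) * dx

# The area stays (essentially) 1 no matter how we stretch the curve ...
print(round(area(0, 1), 4))   # ≈ 1.0
print(round(area(0, 3), 4))   # ≈ 1.0

# ... so a wider curve must also be a shorter one: the peak height is 1/(σ√(2π))
print(gaussian(0, 0, 1) > gaussian(0, 0, 3))  # True
```

Because the total area is pinned at one, stretching the curve horizontally (larger σ) forces the peak down – exactly the behavior the sliders show.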

 

 

Width of a distribution: What does it mean?

The width of a Gaussian distribution is controlled algebraically by the parameter σ — we saw that above. But how does the data in a set of measurements (like our heights in the first example above) affect that width?

The example below might help. Imagine that you and a friend are throwing darts – aiming for the center. Your friend throws 12 darts in the pattern on the left. That's not so good, and the purple distribution reflects it. The distribution is wide, reflecting that many of the darts stuck relatively far from the center.

Now it's your turn. You throw darts much better and you throw the spread on the right, with most darts much closer to the center. If we were to graph the result of a great many of your throws (law of large numbers), we'd obtain the green Gaussian curve. That curve is much narrower, reflecting the fact that most shots are closer to the bullseye, with only a small number far away.

The width of a distribution reflects the precision of whatever it represents. The dart spread on the right is clustered more tightly, so its precision is higher – and its distribution curve is narrower.

When it comes to data, the narrower the distribution, the better the quality of the data. A narrow distribution means more data closer to the mean, or more precise data. Remember, however, that precision and accuracy are different. It is possible to have a nice narrow distribution that is centered in the wrong place because of some other error.

More about the standard deviation (σ)

Sigma (σ, the standard deviation) is one of the most important characteristics that we can calculate from a set of data. It tells us a lot about the quality of a data set, in particular, its precision.

Sigma is called the standard deviation of a data set or of the distribution that underlies it. It is a specific measure of the width of a distribution, one that is, once again, mathematically defined using calculus. Still, it's not hard to understand what the result means.

Take a look at the figure below. If we calculate the mean of data that conforms to a Gaussian distribution, then that calculated mean plus or minus σ (x̄ ± σ) will comprise 68% of the total area under the distribution. That means 68% of the total probability or 68% of the data, depending on which we're talking about.

Take a look at the graph below. It shows a Gaussian distribution with x̄ ± σ, x̄ ± 2σ and x̄ ± 3σ marked out.

It's worth remembering those numbers: ±σ captures about 68% of the total data set or distribution, and ±2σ and ±3σ capture about 95% and 99.7%, respectively.
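Those percentages can be checked directly: for a normal distribution, the fraction of the data lying within ±kσ of the mean is erf(k/√2), where erf is the standard error function. A quick check in Python (the function name is mine):

```python
import math

def fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(fraction_within(k) * 100, 1))  # ≈ 68.3, 95.4 and 99.7 percent
```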

 

It is customary in most cases to report an average taken from data that is normally distributed (that is, if we took enough data, its spread would look like a normal distribution) with an error of ± σ.

Well, so far, σ is just an abstraction – just some lines and colors on a graph. We need to figure out how to calculate it for any set of data. Let's go . . .

How to calculate σ

The formula for calculating σ from a set of N values in a data set is:
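Written out (reconstructed here in case the image doesn't display), the formula is:

```latex
\sigma^2 \;=\; \frac{1}{N} \sum_{i=1}^{N} \left( x_i - \bar{x} \right)^2
```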

Notice first that we're actually calculating σ², not just σ – we'll get to that later. The set { xi } is our data, and x̄ is the mean of all N points of the data.

Now let's think about how that formula works. When we subtract x̄ from xi, we're getting the "distance" of each point from the mean. We square those differences to make them all positive, then sum them up and divide by their number, N – i.e., we average them.

That means that σ² is the mean of the squared distances of each data point from the center of the distribution (which we assume to be the mean of the data).

Now, σ² is called the variance. We find the standard deviation of the data simply by taking its square root.
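Here's the formula spelled out step by step in Python, using a small made-up data set (the numbers are mine, chosen so the arithmetic comes out evenly):

```python
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up example data
N = len(data)

mean = sum(data) / N                          # x-bar, the mean
squared_devs = [(x - mean)**2 for x in data]  # squared distance of each point from the mean
variance = sum(squared_devs) / N              # sigma squared
sigma = math.sqrt(variance)                   # the standard deviation

print(mean, variance, sigma)   # 5.0 4.0 2.0
```

Each step mirrors a column of the spreadsheet calculation below: differences from the mean, their squares, the average of the squares, and finally the square root.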

A sample calculation

I've made up some data (with errors distributed normally) and calculated the standard deviation using a spreadsheet below. See if you can follow the logic.

The data (50 values) was entered into the sheet in the blue column, and those data were averaged at the bottom of the column using the built-in AVERAGE() function of the program (I used NeoOffice Calc).

Then, in columns C and D, the difference between each data point and the mean, and its square, were calculated. The sum of the differences-squared was found in cell D53 using the built-in SUM() function.

In cell G3, the variance was calculated using the formula just to the right (normally these functions aren't displayed, I just showed it for your benefit). Notice that the MAX() function, which chooses the highest number from a group, allows us to put in fewer than 50 values and still do a proper calculation.

And finally we just take the square root of the variance in cell G5 to get sigma.

Now, all of this could have been done just by applying the built-in function STDEV() to the data in blue. The command would have been =STDEV(B3:B52), and it would have yielded nearly the same result (STDEV actually divides by N−1 rather than N – more on that below – but with 50 points the difference is tiny). I wanted to go through it the long way – and you should, too – so that you could understand better what the formula does.

We would report this mean as 19.9 ± 2.5. This means that about 68% (roughly two-thirds) of our data were within ±2.5 units of our mean, and it's a widely accepted way to report data and communicate to a reader that any mean has some associated error.

 

 

If you'd like to download this dataset and try to reproduce this calculation yourself (which I strongly encourage), just click below to get a .csv (comma-separated values) file that you can load into any spreadsheet program.


A hitch in calculating σ

There is a slight problem, usually not too big a deal, with our calculation of σ from the formula

The fix is a small change, and to be correct, we ought to use it. To get a feel for it, let's work a simple example.

Let's calculate the average and standard deviation (σ) of three test scores: 85%, 80% and 90%. The average is

(85 + 80 + 90) / 3 = 85

Now the variance (σ²) is

[(85 − 85)² + (80 − 85)² + (90 − 85)²] / 3 = 16.7

and the standard deviation is the square root of the variance

σ = 4.1,

and that's fine – it captures 68% of the underlying distribution as expected, if we can assume that it's normal. The trouble is that this is an awfully small sample, and we've assumed that these measurements contribute equally and independently to the calculation of σ – but they don't.

Take another look. In each of the squared terms of the σ calculation, we see the mean,

which contains contributions from all three of our measurements, 85, 80 and 90, so each of these terms depends, through the mean, on the other two. The data is not entirely independent in this calculation, but in dividing by N = 3 scores, we're treating it like it is.

Another way to think about it is that the data, at least when it comes to calculating σ, doesn't have as many degrees of freedom as we thought. There is some dependency of one measurement on the others. In this example, we've "used up" one of our degrees of freedom in the calculation of the mean, so we really only have 2 left, and in our calculation of σ, we really should divide by N-1 = 2 rather than N = 3.

It turns out that what we should really divide by when calculating σ² is not N, per se, but N minus the number of quantities we've already computed from the data – in this case one, the mean. Each such quantity uses up one degree of freedom. It's a bit like needing n equations to find n unknowns.

For this reason, in simple calculations of σ, we usually modify our formula to
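The modified formula (reconstructed here in case the image doesn't display) just swaps N for N − 1 in the denominator:

```latex
\sigma^2 \;=\; \frac{1}{N - 1} \sum_{i=1}^{N} \left( x_i - \bar{x} \right)^2
```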

Notice that for very large data sets, the difference between N and N-1 is very small, and these two versions of σ converge – they get closer to each other.
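Python's statistics module has both versions built in: pstdev() divides by N (the "population" version) and stdev() divides by N − 1 (the "sample" version). Applying both to the three test scores above, and then to a larger made-up data set, shows the difference shrinking:

```python
import statistics

scores = [85, 80, 90]

# Divide by N = 3 ...
print(round(statistics.pstdev(scores), 1))   # 4.1, as in the example above

# ... versus divide by N - 1 = 2
print(round(statistics.stdev(scores), 1))    # 5.0

# For a large data set, the two versions nearly agree:
# their ratio is sqrt(N / (N - 1)), which approaches 1 as N grows
big = list(range(1000))   # any large made-up data set will do
print(round(statistics.stdev(big) / statistics.pstdev(big), 3))  # ≈ 1.001
```

With only three points the two answers differ noticeably (4.1 vs. 5.0); with a thousand points they agree to about a tenth of a percent.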

 

 

Creative Commons License
xaktly.com by Dr. Jeff Cruzan is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. © 2012, Jeff Cruzan. All text and images on this website not specifically attributed to another source were created by me and I reserve all rights as to their use. Any opinions expressed on this website are entirely mine, and do not necessarily reflect the views of any of my employers. Please feel free to send any questions or comments to jeff.cruzan@verizon.net.