Getting More Out of The Normal Distribution

https://commons.wikimedia.org/wiki/File:Planche_de_Galton.jpg

ISE Magazine, Volume 49, Number 10
By Merwan Mehta

So what is normal?

In essence, the normal distribution is a histogram of data readings, which becomes a smooth curve as the bin sizes for the histogram get smaller and smaller, provided the number of readings is at least 30.

The percentage of data that lies within a given number of standard deviations of the mean is the same for every normal distribution, and this relationship is referred to as the empirical property of the normal distribution. For example, 99.73 percent of any normal distribution will lie between mean minus three times the SD and mean plus three times the SD. Similarly, 95.44 percent of the normal distribution will lie between mean minus two times the SD and mean plus two times the SD, and 68.26 percent of the normal distribution will lie between mean minus one SD and mean plus one SD.
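These coverage percentages can be checked numerically. Here is a minimal sketch in Python using the standard library's NormalDist class (since the percentages are the same for every normal distribution, a mean of 0 and SD of 1 are used purely for convenience):

```python
from statistics import NormalDist

# Any normal distribution gives the same coverage percentages;
# mean 0 and SD 1 are used here for convenience.
nd = NormalDist(mu=0, sigma=1)

for k in (1, 2, 3):
    # Probability of a reading falling within k standard deviations of the mean
    coverage = nd.cdf(k) - nd.cdf(-k)
    print(f"within ±{k} SD: {coverage * 100:.2f}%")
```

The loop prints the familiar 68/95/99.7 percentages that the empirical property describes.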

Most data, like intelligence, heights and weights of individuals, follow a normal distribution. Many other things in nature also vary based on the normal distribution. As a general rule, data will be normally distributed unless an external factor acts on the process that creates the data. An example of a non-normal data set can be weight data collected from a sample where some people are on a diet and others are not.

There’s a probability for that

Hence, it is essential to verify that the sample or population data that we have is normally distributed before using the properties of the normal distribution. In most cases, plotting a histogram of the data and superimposing it with a bell curve can provide an adequate means to make a judgment about whether the data can be claimed to be normally distributed or not.

Creating a probability plot, something available in most statistical software, or using the chi-square distribution to verify the goodness of fit between the expected number of data points and the actual data points for a distribution are some other ways of checking for normality of data.

Going back to our discussion of comparing the scores for father and son, let us assume that we have checked that the data for both examination results are normally distributed. We can then find in what lower (or upper) percentile the father and son were in their respective distributions. Naturally, whoever is in the higher lower percentile (equivalently, the smaller upper percentile) did better.

In the long, long ago before computers, we had to convert all normal distributions into the standard normal distribution to make comparisons between two sets of data. With Excel, we have a function called NORM.DIST that we can use for finding the lower percentile for a normal distribution. Given the mean, SD and the reading, the function provides us the lower percentile in which the reading occurs within the entire distribution. For the father, we can use the formula =NORM.DIST(786,750,45,TRUE) to find that the lower percentile comes out to be 78.8 percent. Similarly, for the son the result comes out to be 70.9 percent. To get the upper percentile, we subtract the lower percentile from 100 percent. Hence, we can deduce that the father was in the top 21.2 percent of his class, and the son was in the top 29.1 percent of his class. In this case at least, father knows best, or is at least the better test taker.
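The same calculation can be reproduced outside Excel. A sketch in Python, using the standard library's NormalDist as a stand-in for NORM.DIST, with the father's figures (mean 750, SD 45, score 786) taken from the formula above:

```python
from statistics import NormalDist

# Father's class: mean 750, SD 45; his score was 786
father = NormalDist(mu=750, sigma=45)

lower_percentile = father.cdf(786)       # counterpart of NORM.DIST(786,750,45,TRUE)
upper_percentile = 1 - lower_percentile  # top fraction of the class

print(f"lower percentile: {lower_percentile:.1%}")  # about 78.8%
print(f"top of class:     {upper_percentile:.1%}")  # about 21.2%
```

Repeating the two lines with the son's mean, SD and score gives his 70.9 percent lower percentile the same way.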

Packing on the predictions

Another use of the normal distribution can be for predicting an output from a normally distributed process. The output can be weight, dimension or time as a process metric, something that would signify an acceptable or unacceptable level. To illustrate this, imagine that we have a packing process that needs to fill boxes with an average weight of 1,200 grams and a tolerance spread of plus or minus 30 grams. Hence, the weight of the product filled should be between 1,170 grams (1,200 – 30) and 1,230 grams (1,200 + 30). The two limits within which we would like the process data to fall are called the specification limits. Hence, for our process 1,170 is the lower specification limit (LSL) and 1,230 is the upper specification limit (USL).

Now say on a specific day we take 30 readings and calculate the mean to be 1,200 grams and the SD to be 15 grams. With these initial results, say we want to know what percent of the product will fall outside the specification limits and whether we should continue with the process. From the empirical rule, we know that 99.73 percent of the data will fall within mean plus three times the SD and mean minus three times the SD. These two limits for the process are called control limits.

Hence, for our process, the lower control limit (LCL) = mean – 3 × SD = 1,200 – 45 = 1,155, and the upper control limit (UCL) = mean + 3 × SD = 1,200 + 45 = 1,245.

To determine what percentage of the product will be outside the specification limits, we use the normal distribution to find out what percentage of the bell curve will be below 1,170 and above 1,230. The formula for finding the percentage on the left side is =NORM.DIST(1170,1200,15,TRUE), which gives us a value of 0.02275 or 2.3 percent. The formula for the right side is =(1 – NORM.DIST(1230,1200,15,TRUE)), which also gives us a value of 0.02275 or 2.3 percent. Hence, the total percentage of boxes that will be outside the specification limits will be 4.6 percent.
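The two tail areas can also be sketched in Python with the standard library's NormalDist, using the process statistics and specification limits above:

```python
from statistics import NormalDist

# Packing process: mean 1,200 grams, SD 15 grams
process = NormalDist(mu=1200, sigma=15)
lsl, usl = 1170, 1230

below_lsl = process.cdf(lsl)      # left tail, about 2.3 percent
above_usl = 1 - process.cdf(usl)  # right tail, about 2.3 percent
total_reject = below_lsl + above_usl

print(f"below LSL: {below_lsl:.2%}")
print(f"above USL: {above_usl:.2%}")
print(f"total outside spec: {total_reject:.2%}")
```

Because the mean here sits exactly between the two limits, the tails are equal; with an off-center mean the same two lines would simply give unequal tails.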

Note that at times the mean may not be centered between the specification limits; in that case we simply use the actual mean in the formulas for the left and right sides to get the percentage of the distribution that falls outside the specification limits.

Finally, we can use the normal distribution to find new specification limits based on the maximum reject rate that we can withstand. To illustrate this, say for the above process, where we are getting a reject rate of 4.6 percent, we can only tolerate a reject rate of 2 percent. We would now like to find out what the specification limits should be if we want only 2 percent of the distribution to be outside the specification limits, which translates into 1 percent outside on the lower side of the distribution and 1 percent on the upper side.

To do this, we use the inverse of the NORM.DIST function, which is called the NORM.INV function. Given the percentage of the distribution from the left side (1 percent in our case), the mean and the standard deviation, NORM.INV gives us the value of the reading for the lower specification. Substituting the numbers, we get LSL = NORM.INV(1%,1200,15) = 1,165.1.

For the USL, we need to specify 100 percent minus the percentage of the distribution from the right side, which comes out to be 99 percent for our scenario. Hence, we get the USL = NORM.INV(99%,1200,15) = 1,234.9. We can therefore say that for the above process, only 2 percent of the product will be outside the specification limits if the LSL is 1,165.1 and the USL is 1,234.9.
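Both limits can be computed the same way in Python with the standard library's inverse CDF, the counterpart of NORM.INV:

```python
from statistics import NormalDist

# Packing process: mean 1,200 grams, SD 15 grams
process = NormalDist(mu=1200, sigma=15)

# Allow 1 percent outside on each side of the distribution
lsl = process.inv_cdf(0.01)  # counterpart of NORM.INV(1%,1200,15)
usl = process.inv_cdf(0.99)  # counterpart of NORM.INV(99%,1200,15)

print(f"LSL: {lsl:.1f}")  # about 1,165.1 grams
print(f"USL: {usl:.1f}")  # about 1,234.9 grams
```

For a different tolerable reject rate, only the two probabilities change: half the reject rate goes to each tail.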

The concepts discussed above also can be used to predict unacceptable output percentages if the capability index for a process is known. The capability index is the ratio of the specification spread for a process (USL – LSL) to the process spread, which, using the normal distribution, we take to be six times the standard deviation (or UCL – LCL).
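As a brief sketch of that idea in Python, assuming the process is centered between the specification limits (so each limit sits 3 × Cp standard deviations from the mean), the packing process above gives:

```python
from statistics import NormalDist

usl, lsl = 1230, 1170
sd = 15

# Capability index: specification spread over six standard deviations
cp = (usl - lsl) / (6 * sd)  # 60 / 90, about 0.667

# For a centered process, each specification limit is 3*Cp SDs from the mean,
# so the predicted reject fraction is the two equal tails combined.
std_normal = NormalDist()
reject_rate = 2 * std_normal.cdf(-3 * cp)

print(f"Cp: {cp:.3f}")
print(f"predicted reject rate: {reject_rate:.2%}")
```

The predicted rate of about 4.6 percent agrees with the tail calculation done earlier from the raw limits, as it should, since Cp is just a rescaling of the same information.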