Many of the statistics detailed on the Inferential statistics page rely on the assumption that continuous data approximates a normal distribution. It is therefore important to know what the normal distribution is and how to test for it, so both are covered on this page first.
The normal distribution is a special kind of distribution that large amounts of naturally occurring continuous data (and hence also smaller samples of such data) often approximates. As a result, properties of the normal distribution are the underlying basis of calculations for many inferential statistical tests (called parametric tests). These key properties are as follows:
The distribution is symmetric about the mean, and the mean, median and mode are all equal.
Approximately \(68\%\) of values lie within one standard deviation of the mean.
Approximately \(95\%\) of values lie within two standard deviations of the mean.
Approximately \(99.7\%\) of values lie within three standard deviations of the mean.
Hence the histogram for a normally distributed variable has a bell shape, as shown in the figure below. Note that the percentages displayed in the figure are given to two decimal places, while the percentages above are rounded values. The \(\mu\) and \(\sigma\) symbols represent the population mean and population standard deviation respectively.
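The two-decimal percentages for a normal distribution can be reproduced with a short calculation. The following is a minimal sketch using Python's standard `statistics.NormalDist` class; the helper name `share_within` is made up for this illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Proportion of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s) of the mean: {share_within(k) * 100:.2f}%")
# prints 68.27%, 95.45% and 99.73% for k = 1, 2 and 3
```

Because every normal distribution can be rescaled to the standard normal, the same proportions apply whatever the values of \(\mu\) and \(\sigma\).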
Since many inferential statistical tests rely on the assumption that a sample of continuous data approximates a normal distribution, it is important to be able to test for this. Unfortunately there is no single yes-or-no test for normality. Instead, you assess up to eight different graphs and figures to decide whether the data approximates the normal distribution 'closely enough' (the data will never be perfectly normally distributed, and a fair amount of deviation is often acceptable). While some of these checks are more commonly used than others, it is a good idea to evaluate as many as possible, particularly when you are first getting started: the more information you have, the more complete your picture of the data and the better informed your conclusion will be.
The eight graphs and figures you can interpret, in no particular order, are as follows:
1. Histogram
If the data is normally distributed, the histogram should be symmetric and centred around the mean (Figure 1). If instead there is a long tail to the left only, the data is skewed to the left (negatively skewed) (Figure 2); if there is a long tail to the right only, it is skewed to the right (positively skewed) (Figure 3).
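To see what such a histogram is built from, the sketch below bins a sample into equal-width intervals and prints a rough text histogram. The sample is hypothetical (200 simulated values), and the function name `bin_counts` and the choice of eight bins are assumptions made for illustration only.

```python
import random

random.seed(1)  # fixed seed so the illustration is repeatable
# Hypothetical sample: 200 values simulated from a normal distribution.
sample = [random.gauss(70, 5) for _ in range(200)]

def bin_counts(values, n_bins=8):
    """Count how many values fall into each of n_bins equal-width bins."""
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / n_bins
    counts = [0] * n_bins
    for value in values:
        # The maximum value would land one past the last bin, so clamp it.
        index = min(int((value - lowest) / width), n_bins - 1)
        counts[index] += 1
    return lowest, width, counts

lowest, width, counts = bin_counts(sample)
for i, count in enumerate(counts):
    left = lowest + i * width
    print(f"{left:6.1f} to {left + width:6.1f} | {'#' * count}")
```

For data that approximates a normal distribution, the longest bars should sit near the middle, with the bars shrinking at a similar rate on both sides.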
2. Boxplot
If the data is normally distributed, the median should be positioned approximately in the centre of the box, both whiskers should be of similar length, and ideally there should be no outliers (Figure 4). Otherwise the variable may be negatively skewed (Figure 5) or positively skewed (Figure 6).
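The quantities a boxplot displays can also be computed directly. The sketch below uses a small hypothetical data set and assumes the common convention that a point is an outlier if it lies more than 1.5 times the interquartile range beyond the box; statistical software may use slightly different quartile definitions.

```python
from statistics import quantiles

# Hypothetical data set with one unusually large value.
sample = [63, 65, 65, 66, 66, 67, 67, 67, 67, 68, 69, 70, 71, 72, 74, 90]

# Quartiles: the box runs from q1 to q3, with the median drawn inside it.
q1, median, q3 = quantiles(sample, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in sample if x < lower_fence or x > upper_fence]

# Whiskers extend to the most extreme values that are not outliers.
inside = [x for x in sample if lower_fence <= x <= upper_fence]
lower_whisker = q1 - min(inside)
upper_whisker = max(inside) - q3

print(f"Q1 = {q1}, median = {median}, Q3 = {q3}")
print(f"whisker lengths: lower = {lower_whisker}, upper = {upper_whisker}")
print(f"outliers: {outliers}")
```

Here the median sits closer to the bottom of the box and there is an outlier at the high end, both of which hint at positive skew.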
3. Normal Q-Q plot
If the data is normally distributed, the points on a normal Q-Q plot will fall on or close to the straight diagonal line (Figure 7). Otherwise, the points will deviate systematically from the line (Figures 8 and 9).
Another version of this plot, the detrended Q-Q plot, is sometimes also analysed; in the detrended plot there should be a roughly equal number of points above and below the line, with no obvious trend.
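The points on both plots can be computed by pairing each sorted observation with the value expected at that rank under a normal distribution. The following is a minimal sketch on a hypothetical simulated sample; the \((i - 0.5)/n\) plotting positions are one common convention and are an assumption here, so software output may differ slightly.

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(42)  # fixed seed so the illustration is repeatable
# Hypothetical sample of 50 values, sorted from smallest to largest.
sample = sorted(random.gauss(70, 5) for _ in range(50))
n = len(sample)

# Normal distribution with the same mean and standard deviation as the sample.
fitted = NormalDist(mean(sample), stdev(sample))

# Expected value at each rank, using (i - 0.5) / n plotting positions (assumed convention).
expected = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# Normal Q-Q plot: observed value against expected value.
qq_points = list(zip(sample, expected))

# Detrended Q-Q plot: deviation of each observed value from its expected value.
deviations = [observed - exp for observed, exp in qq_points]
above = sum(d > 0 for d in deviations)
below = sum(d < 0 for d in deviations)
print(f"points above the line: {above}, points below the line: {below}")
```

If the data approximates a normal distribution, the observed and expected values should be close to each other, and the deviations should be small with no obvious pattern.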
4. Stem and leaf plot
A stem and leaf plot displays the frequency of each value in the data set, organised into 'stems' and 'leaves'. For example, the first plot below shows that there is one value of \(63\), two values of \(65\), six values of either \(66\) or \(67\), etc. While this plot is less frequently analysed, if you do choose to use it note that it can be interpreted in the same way as a histogram, only rotated on its side.
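The construction of a stem and leaf plot can be sketched in a few lines. The data below is hypothetical, chosen so that it contains one value of \(63\), two values of \(65\) and six values of either \(66\) or \(67\); the function name `stem_and_leaf` is made up for this illustration, and the sketch assumes non-negative whole numbers with the last digit used as the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group whole numbers by their tens ('stem') and list the units ('leaves')."""
    stems = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        stems[stem].append(leaf)
    return dict(stems)

# Hypothetical data set.
heights = [63, 65, 65, 66, 66, 67, 67, 67, 67, 68, 71, 72, 74]

plot = stem_and_leaf(heights)
for stem in sorted(plot):
    print(f"{stem} | {''.join(str(leaf) for leaf in plot[stem])}")
# prints:
# 6 | 3556677778
# 7 | 124
```

Each row's length shows how many values share that stem, which is why the plot reads like a histogram turned on its side.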
5. Skewness
If the data is normally distributed, the skewness should be close to \(0\), or at least within the range \(-1\) to \(1\) (negative values indicate negative skew; positive values indicate positive skew). Additionally, the z-score for the skewness (calculated by dividing the skewness by its standard error) should be within the range \(-1.96\) to \(1.96\).
6. Kurtosis
If the data is normally distributed, the kurtosis should be close to \(0\), or at least within the range \(-1\) to \(1\) (positive kurtosis indicates a high peak around the mean and fatter tails; negative kurtosis indicates a lower peak around the mean and thinner tails). Additionally, the z-score for the kurtosis (calculated by dividing the kurtosis by its standard error) should be within the range \(-1.96\) to \(1.96\).
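The skewness and kurtosis figures, together with their z-scores, can be calculated by hand. The sketch below assumes the bias-corrected sample formulas and standard errors commonly reported by statistical packages (with kurtosis expressed as excess kurtosis, so a normal distribution scores \(0\)); your software may use slightly different definitions. The function name `shape_statistics` and the data are hypothetical.

```python
import math
from statistics import mean, stdev

def shape_statistics(values):
    """Skewness, excess kurtosis and their z-scores (needs at least 4 values)."""
    n = len(values)
    m, s = mean(values), stdev(values)
    z = [(x - m) / s for x in values]  # standardised values

    # Bias-corrected sample skewness and excess kurtosis (assumed definitions).
    skew = n / ((n - 1) * (n - 2)) * sum(v ** 3 for v in z)
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(v ** 4 for v in z)
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

    # Standard errors, which depend only on the sample size.
    se_skew = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = 2 * se_skew * math.sqrt((n ** 2 - 1) / ((n - 3) * (n + 5)))

    return {
        "skewness": skew,
        "z_skewness": skew / se_skew,  # skewness divided by its standard error
        "kurtosis": kurt,
        "z_kurtosis": kurt / se_kurt,  # kurtosis divided by its standard error
    }

# Hypothetical data set.
heights = [63, 65, 65, 66, 66, 67, 67, 67, 67, 68, 71, 72, 74]
for name, value in shape_statistics(heights).items():
    print(f"{name}: {value:.3f}")
```

Each z-score can then be compared with the \(-1.96\) to \(1.96\) range described above.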
7. Mean, median and mode
If the data is normally distributed, the mean, median and mode should all be similar.
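This comparison is quick to carry out with Python's standard `statistics` module. The sketch below uses a small hypothetical data set; `multimode` is used rather than `mode` so that ties between equally common values are all reported.

```python
from statistics import mean, median, multimode

# Hypothetical data set.
heights = [63, 65, 65, 66, 66, 67, 67, 67, 67, 68, 71, 72, 74]

sample_mean = mean(heights)
sample_median = median(heights)
sample_modes = multimode(heights)  # list of the most common value(s)

print(f"mean = {sample_mean:.2f}, median = {sample_median}, mode(s) = {sample_modes}")
# prints: mean = 67.54, median = 67, mode(s) = [67]
```

Here the three measures are close together, which is consistent with (but does not on its own establish) an approximately normal distribution.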
8. Normality test (e.g. Shapiro-Wilk)
This test is generally only used for sample sizes less than \(100\), as it can be too sensitive for larger samples. If you do use it, note that it tests the null hypothesis that the distribution approximates a normal distribution, so a significance (\(p\)) value greater than \(0.05\) is typically required to treat the data as approximately normal (more on hypothesis testing on the Inferential statistics page).
If tests for normality indicate that the variable is not normally distributed, you can try transforming the variable so that it conforms more closely to the normal distribution.
To transform a skewed continuous variable, you can apply:
Once the data has been transformed, it should be tested again for normality. If the transformation has ‘worked’, any further inferential analysis should be conducted on the transformed data. If it hasn’t, you will need to use non-parametric tests instead.
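The transform-then-retest workflow can be sketched as follows. This example assumes a natural log transformation, which is one common choice for positively skewed data with strictly positive values; it is used here for illustration only and is not necessarily the option you would choose. The sample is simulated, and the helper `skewness` is a simple moment-based version written for this sketch.

```python
import math
import random
from statistics import mean, pstdev

def skewness(values):
    """Simple moment-based skewness (no small-sample correction)."""
    m, s = mean(values), pstdev(values)
    return mean([((x - m) / s) ** 3 for x in values])

random.seed(7)  # fixed seed so the illustration is repeatable
# Hypothetical positively skewed sample (all values are greater than zero).
raw = [random.lognormvariate(0, 1) for _ in range(500)]

# Assumed transformation: natural log, which pulls in a long right tail.
transformed = [math.log(x) for x in raw]

print(f"skewness before transformation: {skewness(raw):.2f}")
print(f"skewness after transformation:  {skewness(transformed):.2f}")
```

After transforming, the full set of normality checks should be repeated on the transformed variable, not only the skewness.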