So far we have been using descriptive statistics to describe a sample of data, by calculating sample statistics such as \(\bar{x}\) (the sample mean) and \(s\) (the sample standard deviation).

However, research is often conducted with the aim of using these sample statistics to estimate (and compare) true values for populations. The latter are known as population parameters, and are denoted by Greek letters such as \(\mu\) (population mean) and \(\sigma\) (population standard deviation). Inferential statistics allow us to make statements about unknown population parameters based on sample statistics obtained for a random sample of the population.

There are two key types of inferential statistics, estimation and hypothesis testing, and these will both be covered on this page.

Estimation

When sample statistics are used to estimate population parameters, either using a single value (known as a point estimate) or a range of values (known as a confidence interval), it is referred to as estimation. These two different kinds of estimation are covered in more detail in the following sections.

Point estimates

Suppose a researcher is interested in cholesterol levels in a population. If they recruit a sample randomly from the population, they can estimate the cholesterol level of the whole population using the actual cholesterol level they calculated directly from the sample. In this case:

  • \(\bar{x}\) (the sample mean cholesterol level) is a point estimate of \(\mu\) (the population mean cholesterol level)
  • \(s\) (the sample standard deviation for cholesterol level) is a point estimate of \(\sigma\) (the population standard deviation).
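
As a minimal sketch of the two point estimates above, the following Python snippet calculates \(\bar{x}\) and \(s\) for a small sample. The cholesterol readings are made-up values used purely for illustration, and the variable names are our own.

```python
import statistics

# Hypothetical cholesterol readings (mmol/L) from a random sample; made-up values.
cholesterol = [5.2, 4.8, 6.1, 5.5, 4.9, 5.8, 5.0, 5.3]

# Sample mean: point estimate of the population mean (mu).
x_bar = statistics.mean(cholesterol)

# Sample standard deviation (n - 1 denominator): point estimate of
# the population standard deviation (sigma).
s = statistics.stdev(cholesterol)

print(f"sample mean = {x_bar:.3f} mmol/L")
print(f"sample SD   = {s:.3f} mmol/L")
```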

While point estimates are useful, it is usually preferable to estimate a population parameter using a range of values, so that the likely discrepancy between the sample statistic and the population parameter is taken into account. This is where confidence intervals come in.

Confidence intervals

A confidence interval gives a range of values as an estimate for a population parameter, along with an accompanying confidence coefficient. This is the level of certainty that the interval includes the population parameter, and is typically \(95\%\). For example, a \(95\%\) confidence interval for population mean blood glucose level of (\(4\) mmol/L, \(6\) mmol/L) indicates that we are \(95\%\) certain that the population mean blood glucose level lies between \(4\) mmol/L and \(6\) mmol/L.

For another way of interpreting confidence intervals, think back to the sampling distribution of a sample statistic and consider that you could calculate a confidence interval for each possible sample of the same size. A \(95\%\) (for example) confidence interval means that you would expect 95 out of every 100 of these intervals to contain the population mean.
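
This repeated-sampling interpretation can be checked with a small simulation. The sketch below assumes a hypothetical normally distributed population with a known mean (something we would not know in practice), draws many samples of the same size, and counts how often the interval "sample mean plus or minus 1.96 standard errors" contains the population mean. The population values, sample size and number of repeats are arbitrary choices.

```python
import random
import statistics
from math import sqrt

def ci_95(sample):
    """Return (lower, upper) for a 95% confidence interval for the mean."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    se = statistics.stdev(sample) / sqrt(n)  # standard error of the mean
    return x_bar - 1.96 * se, x_bar + 1.96 * se

random.seed(1)              # fixed seed so the run is repeatable
MU, SIGMA = 5.0, 1.0        # hypothetical population parameters
N, REPEATS = 100, 2000      # sample size and number of samples drawn

hits = 0
for _ in range(REPEATS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    lower, upper = ci_95(sample)
    if lower <= MU <= upper:
        hits += 1

coverage = hits / REPEATS
print(f"{coverage:.1%} of intervals contained the population mean")
```

The proportion printed should be close to \(95\%\), although it will vary a little from run to run if the seed is changed.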

While confidence intervals can be calculated for a range of statistics, a common example is a confidence interval for the population mean. This can be calculated based on the sampling distribution of the sample mean according to the following formula:

\[\textrm{confidence interval for population mean} = \textrm{sample mean} \pm \textrm{a multiple of the standard error of the mean}\]

There are two things to note when using this formula. The first is that the multiple to use depends on the confidence coefficient. For example, a \(95\%\) confidence interval requires a multiple of \(1.96\) (this relates back to the normal distribution, and the fact that \(95\%\) of the area under a normal curve lies within \(1.96\) standard deviations of the mean). The second is that the standard error of the mean for a sample of size \(n\) with standard deviation \(s\) is equal to \(\frac{s}{\sqrt{n}}\).

Therefore, the formula to calculate the \(95\%\) confidence interval for the population mean using a sample of size \(n\) with mean \(\bar{x}\) and standard deviation \(s\) becomes:

\[95\% \textrm{ confidence interval for population mean} = \bar{x} \pm 1.96 \times \frac{s}{\sqrt{n}}\]
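
The formula above can be applied directly, as in the following sketch. The function name and the summary statistics (a sample of \(100\) blood glucose readings with mean \(5.0\) mmol/L and standard deviation \(1.0\) mmol/L) are hypothetical.

```python
from math import sqrt

def mean_ci_95(x_bar, s, n):
    """95% confidence interval for the population mean: x_bar +/- 1.96 * s / sqrt(n)."""
    se = s / sqrt(n)        # standard error of the mean
    margin = 1.96 * se
    return x_bar - margin, x_bar + margin

# Hypothetical summary statistics for blood glucose (mmol/L).
lower, upper = mean_ci_95(x_bar=5.0, s=1.0, n=100)
print(f"95% CI: ({lower:.3f}, {upper:.3f}) mmol/L")
# prints: 95% CI: (4.804, 5.196) mmol/L
```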

Before calculating a confidence interval for a population mean, however, you should check that the following assumptions are valid:

Assumption 1: The sample is a random sample that is representative of the population.

Assumption 2: The observations are independent, meaning that measurements for one subject have no bearing on any other subject’s measurements.

Assumption 3: The variable is normally distributed, or the sample size is large enough that the sampling distribution of the mean approximates a normal distribution.

Some important points to note about confidence intervals are as follows:

  • The confidence interval for the population mean is symmetric about the sample mean.
  • The length of a confidence interval increases for higher levels of confidence.
  • The length of a confidence interval is shorter for larger samples than for smaller samples.
  • Wider confidence intervals are obtained from variables with larger standard deviations, since more variation in the variable implies less accuracy in estimation.
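
The last three points can be illustrated numerically. In the sketch below the multiple is taken from the standard normal distribution using Python's `statistics.NormalDist`, which gives approximately \(1.96\) for \(95\%\) confidence. The function name and input values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def ci_width(s, n, confidence=0.95):
    """Total width of a confidence interval for the mean (normal multiple)."""
    # Multiple from the standard normal distribution, about 1.96 for 95%.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return 2 * z * s / sqrt(n)

base = ci_width(s=1.0, n=100)
print(f"baseline (s = 1, n = 100, 95%): {base:.3f}")
print(f"higher confidence (99%):        {ci_width(1.0, 100, 0.99):.3f}")  # wider
print(f"larger sample (n = 400):        {ci_width(1.0, 400):.3f}")        # narrower
print(f"larger SD (s = 2):              {ci_width(2.0, 100):.3f}")        # wider
```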

Finally, it is important to remember that a population parameter is fixed and that it is the sample statistic and confidence interval that change from sample to sample. Once the interval is calculated then the unknown population value is either inside or outside of the interval, and we can only state the certainty with which we believe the interval to contain the population value.

Hypothesis testing

Hypothesis testing involves formulating hypotheses about the population in general based on information observed in a sample. These hypotheses can then be tested to find out whether differences or relationships observed in the sample are statistically significant in terms of the population, or whether they have just occurred due to random chance in the sample.

In order to do this, two complementary but contradictory hypotheses need to be formulated, called the null hypothesis and the alternative hypothesis (or research hypothesis). These hypotheses will be formulated differently depending on whether you are conducting a two-tailed hypothesis test (which tests for an effect in either direction) or a one-tailed hypothesis test (which only tests for an effect in one direction). Choosing which test to perform will depend on your methodology; typically, however, two-tailed tests are used unless there is a reason not to (e.g. if the effect is only possible in one direction). For this reason, we will focus on two-tailed hypothesis tests in this module. These have the following hypotheses:

Null hypothesis (\(\textrm{H}_\textrm{0}\)): This hypothesis states that there is no difference or relationship between variables in a population. For example, no significant difference between two population means, no significant association between two categorical variables, no significant correlation between two continuous variables or no significant difference from the normal distribution (as for Shapiro-Wilk’s test).

Alternative hypothesis (\(\textrm{H}_\textrm{A}\)): Also known as the research hypothesis, this hypothesis states the opposite of the null hypothesis (i.e. it states that there is a difference or relationship between variables in a population). For example, that there is a significant difference between two population means, a significant association between two categorical variables, a significant correlation between two continuous variables or a significant difference from the normal distribution (as for Shapiro-Wilk’s test).

Both hypotheses can be written using either words or symbols, often in a few different ways. For example, if we want to test whether there is a significant change in the mean blood pressure of a population of patients after they have taken a new medication, some of the different ways we could write null and alternative hypotheses are:

\(\textrm{H}_\textrm{0}\): there is no significant difference in blood pressure before and after the medication
\(\textrm{H}_\textrm{0}: \mu_{\textrm{bp before}} = \mu_{\textrm{bp after}}\)
\(\textrm{H}_\textrm{0}: \mu_{\textrm{bp before}} - \mu_{\textrm{bp after}} = 0\)

\(\textrm{H}_\textrm{A}\): there is a significant difference in blood pressure before and after the medication
\(\textrm{H}_\textrm{A}: \mu_{\textrm{bp before}} \neq \mu_{\textrm{bp after}}\)
\(\textrm{H}_\textrm{A}: \mu_{\textrm{bp before}} - \mu_{\textrm{bp after}} \neq 0\)

If you would like to practise writing hypotheses, have a go at formulating null and alternative hypotheses for the following activity.

Activity

\(\textrm{H}_\textrm{0}\): There is no significant difference in heart rate before and after the fun run (\(\mu_{\textrm{hr before}} = \mu_{\textrm{hr after}}\), or \(\mu_{\textrm{hr before}} - \mu_{\textrm{hr after}} = 0\))

\(\textrm{H}_\textrm{A}\): There is a significant difference in heart rate before and after the fun run (\(\mu_{\textrm{hr before}} \neq \mu_{\textrm{hr after}}\), or \(\mu_{\textrm{hr before}} - \mu_{\textrm{hr after}} \neq 0\))

\(\textrm{H}_\textrm{0}\): There is no significant difference in mean grades for male and female students (\(\mu_{\textrm{male}} = \mu_{\textrm{female}}\), or \(\mu_{\textrm{male}} - \mu_{\textrm{female}} = 0\))

\(\textrm{H}_\textrm{A}\): There is a significant difference in mean grades for male and female students (\(\mu_{\textrm{male}} \neq \mu_{\textrm{female}}\), or \(\mu_{\textrm{male}} - \mu_{\textrm{female}} \neq 0\))

\(\textrm{H}_\textrm{0}\): There is no significant correlation between hours of study and exam marks (\(\rho = 0\), where \(\rho\) is the population correlation)

\(\textrm{H}_\textrm{A}\): There is a significant correlation between hours of study and exam marks (\(\rho \neq 0\))

Once the hypotheses have been formulated they can be tested to evaluate statistical significance, as explained in the following section. In addition, note that it is important to keep practical significance in mind at this point as well, and that this is explained in the subsequent section.

Statistical significance

An appropriate test needs to be conducted in order to evaluate statistical significance, with some common examples being one sample, paired samples and independent samples \(t\) tests, one-way ANOVA, the chi-square test of independence and Pearson’s correlation (all of which are covered in later pages of this module). Typically you will conduct such a test using statistical software or a programming language (e.g. SPSS, Stata, SAS, R or Python), although you can do it manually if you wish (not covered here).

Either way, a test statistic will be calculated which compares the value of the sample statistic (for example, the sample mean change in blood pressure in our blood pressure example) with the value specified by the null hypothesis for the population parameter (for example, a mean change in blood pressure of zero). The name of this test statistic, and how it is calculated, will vary depending on the test you are doing (for example a \(t\) test will calculate a \(t\) value, a one-way ANOVA will calculate an \(F\) value and a chi-square test of independence will calculate a chi-square value), but in each case a test statistic that is large in magnitude indicates that there is a large discrepancy between the hypothesised value and the sample statistic.
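
To make this concrete for the blood pressure example, the sketch below calculates a paired samples \(t\) value by hand: the mean of the before-minus-after differences, minus the hypothesised value of zero, divided by the standard error of those differences. The readings are made-up values, and in practice statistical software would do this calculation for you.

```python
import statistics
from math import sqrt

# Hypothetical systolic blood pressure (mmHg) for the same 8 patients; made-up values.
before = [142, 138, 150, 145, 139, 148, 151, 144]
after = [136, 135, 144, 141, 138, 140, 147, 139]

# Work with the within-patient differences.
diffs = [b - a for b, a in zip(before, after)]
mean_diff = statistics.mean(diffs)                     # sample statistic
se_diff = statistics.stdev(diffs) / sqrt(len(diffs))   # its standard error

hypothesised = 0   # value specified by the null hypothesis
t_value = (mean_diff - hypothesised) / se_diff

print(f"mean difference = {mean_diff:.3f} mmHg")
print(f"t value         = {t_value:.2f}")
```

The further the sample mean difference is from zero relative to its standard error, the larger the \(t\) value becomes in magnitude.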

Furthermore, a \(p\) value (or probability value) will also be calculated for the test (or two \(p\) values may be calculated, a one-sided \(p\) value and a two-sided \(p\) value; in this case the two-sided \(p\) value should be used unless you have previously determined that you will conduct a one-sided hypothesis test). This gives the probability of obtaining a test statistic at least as extreme as the one observed if the null hypothesis is true, and it is this value that is interpreted when deciding whether or not to reject the null hypothesis (as opposed to the test statistic itself). In particular, a small \(p\) value indicates that there is a low probability of obtaining a result this extreme if the null hypothesis is true.

How low is too low though? In order to decide when to reject the null hypothesis we need to choose a level of significance which tells us exactly how small our \(p\) value can be before we reject the null hypothesis. This is denoted by \(\alpha\) and is typically \(.05\) (\(5\%\)), but other values can also be used. Note that if:

  • \(p\) value \(\leqslant \alpha\): If the null hypothesis is true, the probability of obtaining a discrepancy at least this large between our sample statistic and the hypothesised population value through random chance in the sample is at most \(\alpha\) (e.g. \(5\%\)). So we reject the null hypothesis in favour of the alternative hypothesis, meaning that the difference or relationship we have hypothesised about is statistically significant.

  • \(p\) value \( > \alpha\): If the null hypothesis is true, the probability of obtaining a discrepancy at least this large between our sample statistic and the hypothesised population value through random chance in the sample is greater than \(\alpha\). So we cannot reject the null hypothesis, meaning that the difference or relationship we have hypothesised about is not statistically significant.
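
The decision rule can be sketched in a few lines of Python. As a simplifying assumption, the two-sided \(p\) value here is calculated from the standard normal distribution, which is reasonable only for large samples; statistical software would use the distribution appropriate to the test (for example a \(t\) distribution). The function names are hypothetical.

```python
from statistics import NormalDist

def two_sided_p(z):
    """Two-sided p value for a test statistic z, using a normal approximation."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 when the p value is at most alpha."""
    return "reject H0" if p_value <= alpha else "do not reject H0"

for z in (2.1, 1.5):
    p = two_sided_p(z)
    print(f"z = {z}: p = {p:.4f} -> {decide(p)}")
```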

At this point, it is important to note that confidence intervals can also be used to decide whether a difference or relationship is statistically significant or not. For example, based on data collected in the sample for our blood pressure example, a \(95\%\) confidence interval can be calculated giving the range of values we expect the difference in mean blood pressure to lie between for the population. If this confidence interval does not contain the value \(0\) it means we are \(95\%\) confident that the difference between the two values is not zero, which indicates that the difference is statistically significant at the \(5\%\) level. Confidence intervals are useful because not only do they tell us about statistical significance, they also tell us about the magnitude and direction of any difference (or relationship).
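
As a sketch of this check, the snippet below calculates a \(95\%\) confidence interval for a mean difference using the multiple of \(1.96\) from earlier on this page, then tests whether the interval excludes zero. The summary statistics (a mean reduction of \(4.6\) mmHg with standard deviation \(12.0\) mmHg in \(100\) patients) and the function names are hypothetical.

```python
from math import sqrt

def diff_ci_95(mean_diff, s_diff, n):
    """95% confidence interval for a population mean difference."""
    margin = 1.96 * s_diff / sqrt(n)
    return mean_diff - margin, mean_diff + margin

def excludes_zero(interval):
    """True if the whole interval lies above zero or below zero."""
    lower, upper = interval
    return lower > 0 or upper < 0

# Hypothetical summary statistics for the before-minus-after differences (mmHg).
ci = diff_ci_95(mean_diff=4.6, s_diff=12.0, n=100)
print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f}) mmHg")
print("statistically significant" if excludes_zero(ci) else "not statistically significant")
```

Here the whole interval lies above zero, so it also shows the direction of the difference (blood pressure was lower afterwards) and gives a plausible range for its size.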

Error types

Because hypothesis testing involves drawing conclusions about complete populations from incomplete information, it is always possible that an error might occur when deciding whether or not to reject a null hypothesis, regardless of how thorough we are with our calculations. In particular, there are two types of possible errors, which are as follows:

Type I error: This occurs when we reject a null hypothesis that is actually correct. The probability of this occurring when the null hypothesis is true is equal to our level of significance \(\alpha\), which is why we generally select a low value for it (e.g. \(0.05\)).

Type II error: This occurs when we do not reject a null hypothesis that is actually incorrect. The probability of this type of error is denoted by \(\beta\), and it is usually desirable for this to be \(0.2\) or below.
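
The link between \(\alpha\) and the Type I error rate can be seen in a simulation. The sketch below assumes a hypothetical population in which the null hypothesis (a population mean of zero) really is true, repeatedly draws samples, and counts how often a two-sided test at \(\alpha = 0.05\) wrongly rejects it. It uses the large-sample cut-off of \(1.96\) rather than an exact \(t\) distribution, so the rate is only approximately \(5\%\).

```python
import random
import statistics
from math import sqrt

random.seed(42)             # fixed seed so the run is repeatable
N, REPEATS = 100, 2000      # sample size and number of simulated studies

false_rejections = 0
for _ in range(REPEATS):
    # The null hypothesis is true here: the population mean really is 0.
    sample = [random.gauss(0.0, 1.0) for _ in range(N)]
    z = statistics.mean(sample) / (statistics.stdev(sample) / sqrt(N))
    if abs(z) > 1.96:       # rejecting H0 in this setting is a Type I error
        false_rejections += 1

type_i_rate = false_rejections / REPEATS
print(f"Type I error rate: {type_i_rate:.1%}")
```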

To minimise the risk of a Type II error, a power analysis is often used to determine an appropriate sample size. The power of a statistical test is the probability that the test will find an effect if one actually exists. Since this is the complement of the Type II error rate it can be expressed as \(1-\beta\), and hence to keep the Type II error rate \(\leq 0.2\) the power needs to be \(\geq 0.8\).

The power of a test depends on three factors:

  1. the effect size (how big the effect is; more on this shortly)
  2. how strict we are about deciding if the effect is significant (i.e. our \(\alpha\) level)
  3. the sample size.

You can use this information to calculate the power of a test using software, for example SPSS (Version 27 or above). Alternatively, and ideally, you can use such software to determine an appropriate sample size to achieve a power \(\geq 0.8\).
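
To show how the three factors combine, here is a rough sketch based on a normal approximation for a two-sided test of a single mean. It assumes the effect size is expressed as the difference from the hypothesised mean divided by the standard deviation; dedicated software uses more exact methods, so treat these figures as indicative only. The function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def approx_power(effect_size, n, alpha=0.05):
    """Approximate power of a two-sided test of one mean (normal approximation)."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(n)   # how far the test statistic is shifted, on average
    # Probability that the test statistic lands in either rejection region.
    return Z.cdf(shift - z_crit) + Z.cdf(-shift - z_crit)

def approx_sample_size(effect_size, power=0.8, alpha=0.05):
    """Approximate smallest n needed to reach the target power."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    z_power = Z.inv_cdf(power)
    return ceil(((z_crit + z_power) / effect_size) ** 2)

n_needed = approx_sample_size(effect_size=0.5)
print(f"n needed for power 0.8 at effect size 0.5: {n_needed}")
print(f"approximate power at that n: {approx_power(0.5, n_needed):.3f}")
```

Increasing the effect size, the sample size or \(\alpha\) in `approx_power` each raises the power, which mirrors the three factors listed above.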

Practical significance

Statistical significance is influenced by sample size, meaning that in a very large sample very small differences may be statistically significant, and in a very small sample very large differences may not be statistically significant. For this reason it is often a good idea to measure practical significance as well, which is determined by calculating an effect size. The effect size provides information about whether the difference or relationship is meaningful in a practical sense (i.e. in real life), and it is calculated differently for different tests. Details on how to calculate effect size are covered for each of the tests outlined in subsequent pages of this module.
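
The contrast between statistical and practical significance can be sketched with two hypothetical studies. As assumptions for illustration only, the \(p\) value uses a normal approximation (which is crude for the small sample) and the effect size is taken to be the mean difference divided by the standard deviation, which is just one of several possible measures.

```python
from math import sqrt
from statistics import NormalDist

def summarise(mean_diff, sd, n):
    """Return (two-sided p value, standardised effect size) from summary statistics."""
    z = mean_diff / (sd / sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # normal approximation
    effect_size = mean_diff / sd                   # assumed measure of effect size
    return p_value, effect_size

# Hypothetical study 1: tiny difference (0.5 mmHg), very large sample.
p_large, d_large = summarise(mean_diff=0.5, sd=10.0, n=10_000)
# Hypothetical study 2: large difference (8 mmHg), very small sample.
p_small, d_small = summarise(mean_diff=8.0, sd=10.0, n=5)

print(f"large sample: p = {p_large:.2g}, effect size = {d_large:.2f}")
print(f"small sample: p = {p_small:.2g}, effect size = {d_small:.2f}")
```

In this sketch the first study is statistically significant despite a very small effect size, while the second is not statistically significant despite a much larger one.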

Parametric and non-parametric tests

Different inferential statistical tests are used depending on the nature of the hypothesis to be tested, and the following pages detail some of the most common ones. First, though, it is important to understand that there are two different types of tests:

Parametric tests: These require at least one continuous variable, which must be normally distributed (or the sample size must be large enough that the sampling distribution of the mean approximates a normal distribution).

Non-parametric tests: These don’t require any continuous variables to be normally distributed, and in fact don’t require any continuous variables at all.

As a general rule, a parametric test is considered preferable when its assumptions are met, as parametric tests use the mean and standard deviation in their calculations whereas non-parametric tests use the ordinal position (rank) of the data. So just as the mean is typically the go-to measure of central tendency over the median, so too are parametric tests typically preferred over non-parametric tests.

The following pages detail five of the most commonly used parametric tests (with reference to the non-parametric versions), and one commonly used non-parametric test.