ISSN 2203-5249

RESEARCH PAPER SERIES, 2020-21 01 JUNE 2021

Commonly used statistical terms: a quick guide

Rory Haupt
Statistics and Mapping

Statistics refers to the collection and analysis of data in a way that helps us make decisions.

Without statistics it would be difficult to develop policies in government or business, and it would be near impossible to assess whether a program was working (such as a program to improve life expectancy in rural communities). As individuals, we would not know basic information about our community, such as the population, the number of births or deaths, median house prices or even school test scores. Statistics plays an important role in solving problems, understanding the world we live in and working out how to make improvements. Even without knowing it, we all use statistics daily.

‘Why do statistics matter? In simple terms, they are the evidence on which policies are built. They help to identify needs, set goals and monitor progress. Without good statistics, the development process is blind: policy-makers cannot learn from their mistakes, and the public cannot hold them accountable.’ (World Bank, 2000, vii)

This quick guide provides an introduction to statistical concepts and measures commonly used in political governance. It is designed as a point of reference for readers, providing a brief explanation of each statistical term and practical examples of how they are used.

Definitions

Administrative Data

Administrative data, or administrative by-product, is a type of data that is produced in the everyday workings of organisations. Examples include counts of births, deaths, marriages and divorces; hospital admissions; car sales; median house prices; or information relating to case management, such as Centrelink statistics (the number of age pension or JobSeeker recipients).

Base Year

The base year is the starting point in any time series.

If we had an index where 2015 was the base year, the value for 2015 would be 100, so that the value for any other year can be read as a percentage of the base year. The Australian Bureau of Statistics (ABS) Consumer Price Index has a base year that changes periodically; the last change occurred in September 2012, when the base year moved from 1989-90 to 2011-12.

Break in Series

A break in series occurs when a change is made to a data collection and the new data is no longer comparable to the previous data. However, a break may not necessarily jeopardise the reliability of a time series.

For example, the ABS Overseas Arrivals and Departures dataset has had a number of breaks in the series, resulting in data that is not comparable across the years. In 2017 a review was undertaken, leading to changes in the methodology, in where the data was sourced and in the way it was processed. These changes created a break in the series. The ABS re-released a 10-year time series based on the new methodology.

Census (population)

Whilst a sample survey only counts a proportion of the population, the purpose of a population census is to count everyone within the assessed population.

Population censuses are conducted periodically by many countries to assess the demographic makeup of the population residing in that country. Australia’s official Census (of Population and Housing) is run every five years and provides a snapshot of the entire population.

S Surbhi, ‘Difference between Census and sampling’, Key Differences, updated 19 August 2017, accessed 29 May 2021.

Confidence Intervals

Confidence intervals are a statistical measure of uncertainty, expressed as a range of likely or possible outcomes. This uncertainty derives from the fact that statistics about large populations, for practical reasons, are usually drawn from samples of that population. A confidence interval provides a range in which the ‘true’ population mean is expected to sit, based on the sample.

The mean age of a sample population, for example, is only an estimate of the average age of that population. The average age of a sample of people might be 35, and based on that sample statistic and the variance of ages within the sample, a confidence interval for the mean might indicate, with 95 per cent confidence, that the true average age of the population is between 30 and 40.
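As a rough sketch of how such an interval might be computed, the Python snippet below applies the common normal approximation (sample mean plus or minus 1.96 standard errors). The ages are hypothetical values chosen so that the sample mean is 35, and the function name `mean_confidence_interval` is illustrative only. For a sample this small, a multiplier from the t-distribution would normally be preferred, so the result should be read as an approximation.

```python
import math
import statistics


def mean_confidence_interval(sample, z=1.96):
    """Return (lower, upper) bounds of an approximate 95% interval for the mean.

    Assumes a normal approximation; z=1.96 corresponds to 95% confidence.
    """
    n = len(sample)
    sample_mean = statistics.mean(sample)
    # Standard error = sample standard deviation / square root of sample size
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    return sample_mean - z * standard_error, sample_mean + z * standard_error


# Hypothetical ages of 12 surveyed people (sample mean is 35)
ages = [22, 25, 29, 31, 34, 35, 36, 38, 41, 44, 47, 38]
lower, upper = mean_confidence_interval(ages)
print(f"Sample mean: {statistics.mean(ages)}")
print(f"Approximate 95% interval: {lower:.1f} to {upper:.1f}")
```

A larger sample, or one with less variation in ages, would produce a narrower interval.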

Confidentiality

Confidentiality refers to protecting the privacy of information collected from individuals and organisations. This means that when information is made available, it needs to be done in a way that is unlikely to allow individuals or organisations to be identified. Maintaining confidentiality is both a legal and ethical obligation, and a failure to maintain confidentiality is called a confidentiality breach, or disclosure.

The ABS applies confidentiality measures to its Causes of Death dataset to protect individuals. As a result, some totals will not equal the sum of their components. Where figures have been rounded, discrepancies may also occur between totals and the sums of the component items.

Constant and Current Prices

Constant and current prices are also known as real and nominal prices. Constant prices are prices that have been adjusted for inflation and, as such, express values in the dollars of a single reference period (often the present day). Current prices make no adjustment for inflation, reflecting the value of the price at the time it was measured.

For example, given a price valued at $130 in 2019 and an inflation rate of 3%, the constant price expressed in 2020 dollars would be $130 × 1.03 = $133.90. In this sense, $133.90 in 2020 has the same purchasing power as $130 in 2019. However, the current price would be the price as measured in 2019, which is $130.

Using constant prices allows a comparison between two points in time, for example a basket of food in 1980 and 2020. Another example is measuring the change in real wages over a specified period of time.

Constant prices can be used to assess changes in values over time in real terms, which can be tied to funding allocations.
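The $130 example above can be reproduced with a short Python sketch. The function name `adjust_for_inflation` and the `years` parameter are illustrative assumptions; the sketch also assumes a single constant annual inflation rate, whereas in practice a published price index such as the CPI would normally be used.

```python
def adjust_for_inflation(price, inflation_rate, years=1):
    """Express a price in the dollars of a later year.

    Assumes the same annual inflation rate applies in every year.
    """
    return price * (1 + inflation_rate) ** years


# A price of $130 in 2019, with 3% inflation, expressed in 2020 dollars
price_in_2020_dollars = adjust_for_inflation(130.00, 0.03)
print(f"${price_in_2020_dollars:.2f}")  # $133.90
```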

Correlation and Causation

Correlation is a measure of the relationship between two variables, describing how one variable moves with another. Correlation ranges from -1, where two variables are perfectly correlated with a negative relationship, to 1, where two variables are perfectly correlated with a positive relationship. Correlation does not mean that change in one variable causes the other to change.

When one variable causes another to change, this is called causation. There is no statistical measure that establishes causation; instead, it is established using experiments and inference. It is important not to mistake correlation for causation.
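To illustrate the -1 to 1 range, the sketch below computes the Pearson correlation coefficient, which is one widely used correlation measure (the guide does not name a specific one, so this choice is an assumption). The data sets and the function name `pearson_correlation` are hypothetical.

```python
import math


def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists.

    Assumes neither list is constant (otherwise the result is undefined).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables move together around their means
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(spread_x * spread_y)


# Hypothetical data: one series rises with x, the other falls
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]
falling = [10, 8, 6, 4, 2]

print(round(pearson_correlation(x, rising), 2))   # 1.0 (perfect positive)
print(round(pearson_correlation(x, falling), 2))  # -1.0 (perfect negative)
```

A coefficient close to 1 or -1 describes how closely the two series move together; it says nothing about whether one causes the other.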

Cross-Sectional Data

Cross-sectional data are the result of a data collection carried out at a single point in time. Cross-sectional data differ from time-series data, which observe changes over time.

Distributions

A statistical distribution describes, often in the form of a graph, the possible values of a random variable and the rate at which they are likely to occur. The most commonly used distribution is the ‘normal’ (or ‘bell curve’) distribution, which features higher occurrence closer to the mean (or average).

We can view the distributions of data samples to obtain important information about that data. For example, assessing the distribution of household income in Australia would highlight the number of Australians living on lower incomes, or the size of the middle class.

Equivalised

The ‘equivalisation’ process is typically used in relation to measuring household incomes, to enable comparisons of the economic well-being of different households.

If you simply looked at the household income of different households, you would assume that the household with the highest total income would be the most well off. However, this doesn’t take into account the fact that larger households require higher levels of income to maintain the same standard of living as smaller households. The equivalisation process divides disposable income by an ‘equivalence scale’, which counts the first adult as 1 and adds 0.5 for a second adult and 0.3 for a child less than 15 years old. These scales take into account the economies of scale associated with sharing dwellings, and the fact that children require fewer resources than adults. A couple household with disposable income of $1,500 per week, for example, would have equivalised disposable income of $1,000 ($1,500 divided by 1.5).
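The couple example above can be sketched in Python as follows. The function name `equivalised_income` is hypothetical, and the sketch assumes that the first adult counts as 1 (as implied by the divisor of 1.5 for a couple) and that the 0.5 weight applies to every additional adult, not only the second.

```python
def equivalised_income(disposable_income, adults, children_under_15=0):
    """Divide household disposable income by an equivalence scale.

    Assumed weights: 1 for the first adult, 0.5 for each additional adult,
    0.3 for each child under 15 years old.
    """
    if adults < 1:
        raise ValueError("a household needs at least one adult")
    scale = 1.0 + 0.5 * (adults - 1) + 0.3 * children_under_15
    return disposable_income / scale


# Couple with $1,500 per week: scale is 1.5
couple = equivalised_income(1500, adults=2)
print(f"${couple:,.2f}")  # $1,000.00

# Same income, but with two children under 15: scale is 2.1
family = equivalised_income(1500, adults=2, children_under_15=2)
print(f"${family:,.2f}")  # $714.29
```

The second call shows why total income alone can mislead: the same $1,500 supports a lower standard of living when it is shared among more people.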

Estimates

Statistical inference can be used to apply estimates to populations. For example, if we have a sample and hold the assumption that it is representative of the population, we can use that sample to estimate characteristics of the population.

Index

Indexes, or indices, are used to compare numbers as they develop over time. It is usual to fix the first observation to a base value of 100 and then link all following observations to this base, so that relative changes over time can be compared.

For example, if we had an index measuring the price of cars, the price of the bundle of cars used in our measure would be set to 100 in the initial year. If the price of cars rises 10% over the year, the next year’s index would be 110.
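The car price example can be sketched in Python as below. The prices and the function name `to_index` are hypothetical; the first observation is treated as the base period and set to 100.

```python
def to_index(values, base_value=100.0):
    """Rescale a series so that its first observation equals base_value."""
    base = values[0]
    return [round(value / base * base_value, 1) for value in values]


# Hypothetical average price of a fixed bundle of cars over three years
car_prices = [30000, 33000, 34650]
car_price_index = to_index(car_prices)
print(car_price_index)  # [100.0, 110.0, 115.5]
```

The second value of 110.0 reflects the 10% rise described above; the third shows that prices are 15.5% above the base year.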

Frequently used indexes include the Consumer Price Index (CPI) and Wage Price Index (WPI).

Longitudinal Data

Longitudinal data, also known as panel data, is data collected over time which tracks the same sample of participants at different points in time. Longitudinal data can be used to assess how different factors may influence opinions over time, amongst other things.

One such example is the Household, Income and Labour Dynamics in Australia (HILDA) Survey, which commenced in 2001. HILDA is a household-based panel study that collects information about economic and personal well-being, labour market dynamics and family life. The HILDA Survey allows researchers to track the employment and income outcomes of participants and whether they progress onto better outcomes.

Mean

The mean, also referred to as the average, is found by adding all data points and dividing by the number of data points.

For example: (10 + 10 + 20 + 40 + 70) / 5 = 30

Median

The median is the middle number, found by ordering all data points and picking out the one in the middle (or, if there are two middle numbers, taking the mean of those two numbers).

For example, in the data set 1, 2, 3, 4, 5, 6, 14, the median is 4.

The median is useful for measuring the midpoint of, for example, income distributions. This is because the average of such measures may be influenced by outliers. In the example above, 14 can be identified as an outlier, as it is not consistent with the rest of the data.
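The effect of the outlier can be checked with Python's standard `statistics` module. The sketch below uses the data set from the median example; the variable names are illustrative.

```python
import statistics

values = [1, 2, 3, 4, 5, 6, 14]
without_outlier = [1, 2, 3, 4, 5, 6]

# With the outlier, the mean is pulled up to 5 while the median stays at 4
mean_with = statistics.mean(values)
median_with = statistics.median(values)

# Without the outlier, the mean and the median are both 3.5
mean_without = statistics.mean(without_outlier)
median_without = statistics.median(without_outlier)

print(mean_with, median_with)
print(mean_without, median_without)
```

Removing the single outlier moves the mean by 1.5 but the median by only 0.5, which is why the median is often preferred for skewed measures such as income.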

Metadata

Metadata is data that provides information about other data. For example, data relating to a document on the internet is metadata, such as the type of file it is, the number of people who downloaded it, and when it was accessed.

Mode

The mode of a set of data values is the value that appears most often.

For example, in the data set 2, 5, 6, 2, 2, 2, 5, 6, 2, the mode is 2.

The mode can be used in initial assessments of data to obtain information. If we were assessing the ABS’s household income data, the mode allows us to identify which income group is most common in Australia. In this case, the most common group is $3,000-3,499 in 2017-18.
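As a small illustration, the mode of the list of values in the example above can be found in Python by counting how often each value occurs. The variable names are illustrative.

```python
import statistics
from collections import Counter

values = [2, 5, 6, 2, 2, 2, 5, 6, 2]

# Count how many times each value appears
counts = Counter(values)
most_common_value, occurrences = counts.most_common(1)[0]
print(most_common_value, occurrences)  # 2 5

# The standard library gives the same answer directly
mode_value = statistics.mode(values)
print(mode_value)  # 2
```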

Moving averages

A moving average uses a set number of data points to create the mean of the data points, moving over time as new data is added and older data removed. There are various types of moving averages, including simple, weighted, and exponential. The most frequently used moving averages are four-quarter and 12-month moving averages, which are used to smooth volatile data, such as regional labour force survey estimates.

For example, suppose a moving average which considers 5 data points has the initial set {3, 3, 4, 6, 8}. The initial moving average is (3 + 3 + 4 + 6 + 8) / 5 = 4.8. Now suppose we get a new data point, 10. Given our moving average only considers 5 data points, we add 10 to the set, drop the oldest value and consider the last 5 data points. Our new set is {3, 4, 6, 8, 10}, and our new moving average is (3 + 4 + 6 + 8 + 10) / 5 = 6.2.
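The worked example above can be expressed as a short Python sketch of a simple (unweighted) moving average. The function name `moving_averages` is illustrative; weighted and exponential variants are not shown.

```python
from collections import deque


def moving_averages(values, window):
    """Return the simple moving average for each full window of observations."""
    recent = deque(maxlen=window)  # oldest value is dropped automatically
    averages = []
    for value in values:
        recent.append(value)
        if len(recent) == window:
            averages.append(sum(recent) / window)
    return averages


# Initial set {3, 3, 4, 6, 8}, followed by the new data point 10
print(moving_averages([3, 3, 4, 6, 8, 10], window=5))  # [4.8, 6.2]
```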

Original, seasonally adjusted and trend estimates

Original, seasonally adjusted and trend estimates are different forms of time series estimates.

Original estimates best capture actual movements in the data.

Seasonally adjusted estimates take the original estimates and remove seasonal patterns (including holiday periods such as Easter and Christmas) to create more consistent data in which underlying movements are easier to see.

Trend estimates further smooth the seasonally adjusted estimates to create a view of the data that reflects a long-term trend. Trend data is best used to create a view of how the future may play out, but it does not capture monthly movements in the data. The COVID-19 pandemic has resulted in the suspension of the trend series in many ABS data collections, due to the importance of month-to-month changes.

Outlier

An outlier is a data value that is very different from most of the other values in a data set. Due to this difference, the outlier may have a significant impact on statistics drawn from the dataset. Outliers in datasets often require further examination to tell whether they are meaningful or not.

Percentage

A percentage (%) expresses a number as a fraction of one hundred; it compares one value in relation to another.

If we held an election, and party A received 53 out of 90 votes, with party B receiving the other 37 votes, then party A would have received 59% of the vote. We can calculate party A’s vote share by taking their votes divided by the total votes, multiplied by 100.

In this case, party A’s vote share is 53 / 90 × 100 ≈ 59% (58.9% before rounding to the nearest whole number).
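The same calculation can be written as a small Python sketch; the function name `percentage` is illustrative.

```python
def percentage(part, total):
    """Express part as a percentage of total."""
    return part / total * 100


share_a = percentage(53, 90)
share_b = percentage(37, 90)
print(f"Party A: {share_a:.1f}%")  # Party A: 58.9%
print(f"Party B: {share_b:.1f}%")  # Party B: 41.1%
```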

Projections

A projection uses trends and other inputs to project how a set of data may change in the future.

For example, if a trend shows an increase in the purchase of a product by 50%, a projection can show how it would impact on our economy should the trend persist. The Reserve Bank of Australia and The Treasury produce a number of projections assessing the Australian economy.

Quantitative research

Quantitative research is the process of collecting and analysing numerical data. Quantitative data collection methods are much more structured than qualitative data collection methods. Quantitative data collection methods include various forms of surveys: online surveys, paper surveys, face-to-face interviews, telephone interviews, longitudinal studies, and online polls.

Qualitative research

Qualitative research aims to gather an in-depth understanding of human behaviour via first-hand observation, face-to-face interviews, questionnaires, focus groups, participant observation and similar methods. The data are generally non-numerical.

Some other examples of qualitative data include:

• The reasons why people like eating at restaurants.
• The problems people face when moving house.

Range

The range represents the actual spread of data. It is the difference between the highest and lowest observed values.

For example, the lowest value of the following data set [2, 3, 7, 9, 1, 4] is 1 and its highest value is 9, so its range is 9 − 1 = 8. As with calculation of the median, it is helpful to order data observations to find the highest and lowest values.
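In Python, the same range can be found either by sorting the data first, as suggested above, or directly from the minimum and maximum. The variable names are illustrative.

```python
data = [2, 3, 7, 9, 1, 4]

# Ordering the observations puts the lowest value first and the highest last
ordered = sorted(data)
range_from_sorted = ordered[-1] - ordered[0]

# Equivalent shortcut without sorting
range_direct = max(data) - min(data)

print(ordered)       # [1, 2, 3, 4, 7, 9]
print(range_direct)  # 8
```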

Subject coach, Definition of Range (Statistics), 5 February 2018, accessed 30 April 2021

Rate

The rate simply refers to the frequency of the occurrence of an event.

For example, if an event is calculated to occur once every 100 opportunities, the rate would be 1 in 100. Rates are often used to represent statistics regarding mortality. If we were to assess deaths due to coronary heart disease in Australia, these can be expressed as a rate of deaths per 100,000 population. In Australia, men die from coronary heart disease at a rate of 119 deaths per 100,000 population and women at a rate of 33 deaths per 100,000 population.
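A rate per 100,000 population can be sketched in Python as below. The death count and population are hypothetical figures chosen only to reproduce a rate of 119 per 100,000; they are not ABS data, and the function name `rate_per_100k` is illustrative.

```python
def rate_per_100k(events, population):
    """Express a count of events as a rate per 100,000 population."""
    return events * 100_000 / population


# Hypothetical figures: 595 deaths in a population of 500,000
deaths = 595
population = 500_000
print(rate_per_100k(deaths, population))  # 119.0
```

Expressing counts as rates allows populations of different sizes to be compared on the same footing.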

Ratio

A ratio is used to compare two quantities, referencing one against the other.

For example, if a poll of 30 people is taken and 20 vote for party A and 10 vote for party B, then party A has more votes than party B by a ratio of 2:1. Similarly, if the poll included 70 people, with 40 voting for party A and 30 for party B, party A now leads party B by a ratio of 4:3.
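The two poll examples above can be reduced to their simplest form by dividing both counts by their greatest common divisor, as in the Python sketch below. The function name `simplify_ratio` is illustrative.

```python
import math


def simplify_ratio(a, b):
    """Reduce the ratio a:b to its simplest whole-number form."""
    divisor = math.gcd(a, b)
    return a // divisor, b // divisor


print(simplify_ratio(20, 10))  # (2, 1)
print(simplify_ratio(40, 30))  # (4, 3)
```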

Sample Size

When sample surveys are collected, the sample size indicates the number of participants within the survey. As the sample is a subsection of the population, the greater the sample size the more representative it will be of the population, assuming that bias is removed through random sampling.

Sample Survey

A sample is a subset of a population, often randomly selected for the purpose of studying the characteristics of the entire population. A sample survey is one of the most widely used methods of collecting data relating to a population of interest.

See: Census

Significance

Statistical significance is used to quantify whether a result is due to the relationship assessed within a study or due to random chance. When a result is statistically significant, it means that the relationship observed is unlikely to be due to chance. When undertaking a study, statistical significance provides important validation for any results that may have been obtained. When tests are unable to obtain this validation, the results are statistically non-significant.

Standard Deviation

Standard deviation is the measure of spread most commonly used in statistical practice when the mean is the measure of centre.

Thus it measures spread about the mean. Because of its close links with the mean, standard deviation can be seriously affected if the mean is a poor measure of location. The standard deviation is also influenced by outliers; it is a good indicator of the presence of outliers because it is so sensitive to them. Therefore, the standard deviation is most useful for symmetric distributions with no outliers (such as normal distributions).

In an asymmetrical distribution the two sides will not be mirror images of each other.

The key features of a normal distribution are:

• symmetrical shape
• mode, median and mean are the same and are together in the centre of the curve
• there can only be one mode (i.e. there is only one value which is most frequently observed)
• most of the data are clustered around the centre, while the more extreme values on either side of the centre become rarer as the distance from the centre increases (about 68% of values lie within one standard deviation (σ) of the mean; about 95% of the values lie within two standard deviations; and about 99.7% are within three standard deviations. This is known as the empirical rule or the 3-sigma rule.)

Standard Error and Relative Standard Error

Standard Error (SE) measures the variability of an estimate drawn from a given sample. It is found by taking the standard deviation divided by the square root of the sample size. As the size of the sample increases, the standard error will decrease. An increase in sample size leads to a more accurate estimate, and this is reflected in the measurement of the standard error. The standard error can be used to express confidence in data: there is about a 95% chance that the true value of a measure lies within two standard errors of a survey estimate.

For example, a sample with a standard deviation of 1.5 and a sample size of 15 will have a standard error of 0.39. However, if the standard deviation were 1.5 and the sample size were 60, the standard error would be 0.19.
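Both figures in the example above can be checked with a short Python sketch of the formula (standard deviation divided by the square root of the sample size). The function name `standard_error` is illustrative.

```python
import math


def standard_error(standard_deviation, sample_size):
    """Standard error = standard deviation / square root of the sample size."""
    return standard_deviation / math.sqrt(sample_size)


se_small_sample = standard_error(1.5, 15)
se_large_sample = standard_error(1.5, 60)
print(round(se_small_sample, 2))  # 0.39
print(round(se_large_sample, 2))  # 0.19
```

Because of the square root, quadrupling the sample size from 15 to 60 halves the standard error.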

The Relative Standard Error (RSE) provides an expression of the standard error in a simpler form, using percentages. As the ABS notes, an estimate with an RSE of 25% or greater is considered unreliable. The relative standard error is found by dividing the standard error of a measurement by the measurement itself, and then expressing the number as a percentage by multiplying it by 100. Because the standard error is expressed in the same units as the measurement, the relative standard error makes it easier to assess and compare the reliability of measures.

For example, if the standard error of a measure was 0.02, and the measurement itself was 10, the relative error would be 0.02 / 10 = 0.002. To represent the RSE, this is multiplied by 100 to obtain a percentage. In this case, the RSE would be 0.002 × 100 = 0.2%.
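The RSE calculation, together with the 25% reliability threshold mentioned above, can be sketched in Python as follows. The function name `relative_standard_error` and the second, less reliable estimate are hypothetical.

```python
def relative_standard_error(standard_error, estimate):
    """Express the standard error as a percentage of the estimate."""
    return standard_error / estimate * 100


UNRELIABLE_THRESHOLD = 25  # per cent; RSEs at or above this are treated as unreliable

rse = relative_standard_error(0.02, 10)
print(f"{rse:.1f}%")                # 0.2%
print(rse >= UNRELIABLE_THRESHOLD)  # False

# Hypothetical estimate of 8 with a standard error of 3
rse_noisy = relative_standard_error(3, 8)
print(f"{rse_noisy:.1f}%")                # 37.5%
print(rse_noisy >= UNRELIABLE_THRESHOLD)  # True
```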

Time Series

A time series is a set of data which can show changes in a variable over time.

For example, the unemployment rate can be viewed as a time series, highlighting fluctuations over time. The ABS presents time series data in three formats: original, seasonally adjusted and trend estimates.

Variable

A variable is a characteristic that is measured and that can take different values. For example, household income data provides measures of a number of variables, such as gross household income per week, equivalised disposable household income per week, or labour force status. Each of these variables provides a different measure.

Variance

The variance of a set of data is a measure of how spread out the data is from the mean value. When measuring variance, a higher number indicates the data is more spread out.
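As an illustration, the Python sketch below compares the variance of two data sets that share a mean of 30. It uses the population variance (the average squared distance from the mean) from the standard `statistics` module; this choice is an assumption, since a sample variance would divide by one fewer than the number of data points. The first data set is the one from the mean example, and the second is hypothetical.

```python
import statistics

spread_out = [10, 10, 20, 40, 70]   # mean is 30
clustered = [28, 29, 30, 31, 32]    # mean is also 30

# Population variance: average of the squared distances from the mean
variance_spread_out = statistics.pvariance(spread_out)
variance_clustered = statistics.pvariance(clustered)
print(variance_spread_out)  # 520
print(variance_clustered)   # 2

# The standard deviation is the square root of the variance
print(round(statistics.pstdev(spread_out), 1))  # 22.8
```

Both data sets have the same mean, but the much larger variance of the first shows that its values sit further from that mean.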

© Commonwealth of Australia

Creative Commons

With the exception of the Commonwealth Coat of Arms, and to the extent that copyright subsists in a third party, this publication, its logo and front page design are licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Australia licence.