Biostatistics Series Module 1: Basics of Biostatistics

Address for correspondence: Dr. Avijit Hazra, Department of Pharmacology, Institute of Postgraduate Medical Education and Research, 244B, Acharya J. C. Bose Road, Kolkata - 700 020, West Bengal, India. E-mail: blowfans@yahoo.co.in

Received 2015 Dec; Accepted 2015 Dec. Copyright: © Indian Journal of Dermatology

This is an open access article distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License, which allows others to remix, tweak, and build upon the work non-commercially, as long as the author is credited and the new creations are licensed under the identical terms.

Abstract

Although the application of statistical methods to biomedical research began only some 150 years ago, statistics is now an integral part of medical research. A knowledge of statistics is also becoming mandatory to understand most medical literature. Data constitute the raw material for statistical work. They are records of measurement or observations or simply counts. A variable refers to a particular character on which a set of data are recorded. Data are thus the values of a variable. It is important to understand the different types of data and their mutual interconversion. Biostatistics begins with descriptive statistics, which implies summarizing a collection of data from a sample or population. Categorical data are described in terms of percentages or proportions. With numerical data, individual observations within a sample or population tend to cluster about a central location, with more extreme observations being less frequent. The extent to which observations cluster is summarized by measures of central tendency, while the spread can be described by measures of dispersion. The confidence interval (CI) is an increasingly important measure of precision. When we observe samples, there is no way of assessing true population parameters. We can, however, obtain a standard error and use it to define a range in which the true population value is likely to lie with a certain acceptable level of uncertainty. This range is the CI, while its two terminal values are the confidence limits. Conventionally, the 95% CI is used. Patterns in data sets or data distributions are an important, albeit not so obvious, component of descriptive statistics. The most common distribution is the normal distribution, which is depicted as the well-known symmetrical bell-shaped Gaussian curve. Familiarity with other distributions, such as the binomial and Poisson distributions, is also helpful. Various graphs and plots have been devised to summarize data and trends visually. Some plots, such as the box-and-whiskers plot and the stem-and-leaf plot, are used less often but provide useful summaries in select situations.

Keywords: Boxplot, confidence interval, data, descriptive statistics, measures of central tendency, measures of dispersion, normal distribution, stem-and-leaf plot, variable

Basics of Biostatistics

Application of statistical methods in biomedical research began more than 150 years ago. One of the early pioneers, Florence Nightingale, the icon of nursing, worked during the Crimean war of the 1850s to improve the methods of constructing mortality tables. The conclusions from her tables helped to change the practices in Army hospitals around the world. At the same time, John Snow in England applied simple statistical methods to support his theory that contaminated water from a single hand pump was the source of the London cholera epidemic in 1854. Today, statistics is an integral part of conducting biomedical research. In addition, knowledge of statistics is becoming mandatory to read and understand most biomedical literature.

But why is this so? Broadly speaking, statistics is the science of analyzing data and drawing conclusions from them in the face of variability and uncertainty. Biomedical researchers carry out studies in various settings: in the laboratory, in the clinic, in the field, or simply with data already archived in databases. Whatever the source, data tend to exhibit substantial variability. For instance, patients given the same antimicrobial drug may respond somewhat differently, laboratory rats maintained under identical conditions may develop behavioral variations, and individuals residing as neighbors in the same locality may differ greatly in their perception of the stigma associated with a common skin disease like vitiligo. Often the degree of variability is substantial even when observational or interventional conditions are held as uniform and constant as possible. The challenge for the biomedical researcher is to unearth the patterns that are being obscured by the variability of responses in living systems. Further, the researcher is often interested in small differences or changes. For instance, if we give you two antibiotics and say that drug A has a 10% cure rate in folliculitis with 7 days of treatment while drug B has a 90% cure rate in the same situation, and ask you to choose one for your patient, the choice would be obvious. However, if we were to say that the cure rates for drugs A and B are 95% and 97%, respectively, then your choice will not be so obvious. Very likely, you will be wondering whether the difference of 2% is worth changing practice if you are accustomed to using drug A, or maybe you will look at other factors such as the toxicity profile, cost, or ease of use. Statistics gives us the tools, albeit mathematical, to make an appropriate choice by judging the "significance" of such small observed differences or changes.

Furthermore, it is important to remember that statistics is the science of generalization. We are generally not in a position to carry out “census” type of studies that cover entire populations. Therefore, we usually study subsets or samples of a population and hope that the conclusions drawn from studying such a subset can be generalized to the population as a whole. This process is fraught with errors, and we require statistical techniques to make the generalizations tenable.

Before the advent of computers and statistical software, researchers and others dealing with statistics had to do most of their analysis by hand, taking recourse to books of statistical formulas and statistical tables. This required one to be proficient in the mathematics underlying statistics. This is no longer mandatory since increasingly user-friendly software takes the drudgery out of calculations and obviates the need for looking up statistical tables. Therefore, today, understanding the applied aspects of statistics suffices for the majority of researchers, and we seldom need to dig into the mathematical depths of statistics to make sense of the data that we generate or scrutinize.

The applications of biostatistics broadly cover three domains – description of patterns in data sets through various descriptive measures (descriptive statistics), drawing conclusions regarding populations through various statistical tests applied to sample data (inferential statistics), and application of modeling techniques to understand the relationship between variables (statistical modeling), sometimes with the goal of prediction. In this series, we will look at the applied uses of statistics without delving into mathematical depths. This is not to deny the mathematical underpinnings of statistics – these can be found in statistics textbooks. Our goal here is to present the concepts and look at the applications from the point of view of the applied user of biostatistics.

Data and Variables

Data constitute the raw material for statistical work. They are records of measurement or observations or simply counts. A variable refers to a particular character on which a set of data are recorded. Data are thus the values of a variable. Before a study is undertaken it is important to consider the nature of the variables that are to be recorded. This will influence the manner in which observations are undertaken, the way in which they are summarized and the choice of statistical tests that will be used.

At the most basic level, it is important to distinguish between two types of data or variables. The first type includes those measured on a suitable scale using an appropriate measuring device and is called a quantitative variable. Since quantitative variables always have values expressed as numbers, and the differences between values have numerical meaning, they are also referred to as numerical variables. The second type includes those which are defined by some characteristic, or quality, and is referred to as a qualitative variable. Because qualitative data are best summarized by grouping the observations into categories and counting the numbers in each, they are often referred to as categorical variables.

A quantitative variable can be continuous or discrete. A continuous variable can, in theory at least, take on any value within a given range, including fractional values. A discrete variable can take on only certain discrete values within a given range – often these values are integers. Sometimes variables (e.g., age of adults) are treated as discrete variables although strictly speaking they are continuous. A qualitative variable can be a nominal variable or an ordinal variable. A nominal variable covers categories that cannot be ranked, and no category is more important than another. The data are generated simply by naming, on the basis of a qualitative attribute, the appropriate category to which the observation belongs. An ordinal variable has categories that follow a logical hierarchy and hence can be ranked. We can assign numbers (scores) to nominal and ordinal categories, although the differences among those numbers do not have numerical meaning. However, category counts do have numerical significance. A special case exists for both categorical and numerical variables when the variable in question can take on only one of two numerical values or belong to only one of two categories; these are known as binary or dichotomous variables [Table 1].

Table 1

Examples of the basic variable types


Numerical data can be recorded on an interval scale or a ratio scale. On an interval scale, the differences between two consecutive numbers carry equal significance in any part of the scale, unlike the scoring of an ordinal variable ("ordinal scale"). For example, when measuring height, the difference between 100 and 102 cm is the same as the difference between 176 and 178 cm. The ratio scale is a special case of recording interval data. With interval scale data, the 0 value can be arbitrary, such as the position of 0 on some temperature scales – the Fahrenheit 0 is at a different position from that of the Celsius scale. With a ratio scale, 0 actually indicates the point where nothing is scored on the scale ("true 0"), such as 0 on the absolute or Kelvin scale of temperature. Thus, we can say that an interval scale of measurement has the properties of identity, magnitude, and equal intervals, while the ratio scale has the additional property of a true 0. Only on a ratio scale can differences be judged in the form of ratios: 0°C is not 0 heat, nor is 26°C twice as hot as 13°C, whereas these value judgments do hold on the Kelvin scale. In practice, this distinction is not tremendously important so far as the handling of numerical data in statistical tests is concerned.
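As a quick numerical check of this point, the short Python sketch below (the temperature values and the use of Python are purely illustrative) converts two Celsius readings to the Kelvin scale and compares the ratios on the two scales.

    # Illustrative only: ratios are meaningful on a ratio scale (Kelvin),
    # not on an interval scale with an arbitrary zero (Celsius).

    def celsius_to_kelvin(c):
        """Convert a Celsius temperature to the Kelvin (ratio) scale."""
        return c + 273.15

    a_c, b_c = 26.0, 13.0                                        # two readings in degrees Celsius
    a_k, b_k = celsius_to_kelvin(a_c), celsius_to_kelvin(b_c)

    print(a_c / b_c)   # 2.0, but 26 degrees C is not "twice as hot" as 13 degrees C
    print(a_k / b_k)   # ~1.05, the physically meaningful ratio on the Kelvin scale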

Changing data scales is possible, so that numerical data may become ordinal, and ordinal data may become nominal (even dichotomous). This may be done when the researcher is not confident about the accuracy of the measuring instrument, is unconcerned about the loss of fine detail, or where group numbers are not large enough to adequately represent a variable of interest. It may also make clinical interpretation easier. For example, the Dermatology Life Quality Index (DLQI) is used to assess how much an adult subject's skin problem is affecting his or her quality of life. A DLQI score above 20 indicates that the problem is severely degrading the quality of life. This categorization may be more relevant to the clinician than the actual DLQI score achieved. In contrast, converting from categorical to numerical data will not be feasible without having actual measurements.
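A minimal sketch of such a numerical-to-ordinal conversion is given below. The cut-off values and category labels are illustrative assumptions only, not the published DLQI banding, and would be replaced by the validated thresholds of whatever instrument is being used.

    def categorize_score(score, cutoffs=(5, 10, 20), labels=("mild", "moderate", "severe", "very severe")):
        """Convert a numerical score into an ordinal category.

        The cut-offs and labels here are illustrative assumptions only;
        substitute the validated thresholds of the instrument in use.
        """
        for cutoff, label in zip(cutoffs, labels):
            if score <= cutoff:
                return label
        return labels[-1]

    # Hypothetical scores converted to ordinal categories
    scores = [2, 8, 15, 24]
    print([categorize_score(s) for s in scores])   # ['mild', 'moderate', 'severe', 'very severe']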

When exploring the relationship between variables, some can be considered as dependent (dependent variables) on others (independent variables). For instance, when exploring the relationship between height and age, it is obvious that height depends on age, at least until a certain age. Thus, age is the independent variable, which influences the value of the dependent variable, height. When exploring the relationship between multiple variables, usually in a modeling situation, the value of the outcome (response) variable depends on the value of one or more predictor (explanatory) variables. In this situation, some variables may be identified that cannot be accurately measured or controlled and only serve to confuse the results. They are called confounding variables or confounders. Thus, in a study of the protective effect of a sunscreen in preventing skin cancer, the amount of time spent in outdoor activity could be a major confounder. The extent of skin pigmentation would be another confounder. There could even be confounders whose existence is unknown or whose effects are unsuspected, for instance, undeclared consumption of antioxidants by the subjects, which is quite possible because the study would go on for a long time. Such unsuspected confounders have been called lurking variables.

Numerical or categorical variables may sometimes need to be ranked, that is, arranged in ascending order with new values assigned to them serially. Values that tie are each assigned the average of the ranks they encompass. Thus, the data series 2, 3, 3, 10, 23, 35, 37, 39, 45 can be ranked as 1, 2.5, 2.5, 4, 5, 6, 7, 8, 9, since the two 3s encompass ranks 2 and 3, giving an average rank value of 2.5. Note that when a numerical variable is ranked, it gets converted to an ordinal variable. Ranking obviously does not apply to nominal variables because their values do not follow any order.
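A small pure-Python sketch of this ranking procedure (assigning tied observations the average of the ranks they span, the same mid-rank convention used by scipy.stats.rankdata by default) reproduces the example above.

    def rank_with_ties(values):
        """Rank values in ascending order, assigning tied values the
        average of the ranks they encompass (mid-ranks)."""
        order = sorted(range(len(values)), key=lambda i: values[i])
        ranks = [0.0] * len(values)
        i = 0
        while i < len(order):
            j = i
            # extend j over the run of tied values
            while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
                j += 1
            avg_rank = (i + j) / 2 + 1          # average of positions i..j (1-based)
            for k in range(i, j + 1):
                ranks[order[k]] = avg_rank
            i = j + 1
        return ranks

    data = [2, 3, 3, 10, 23, 35, 37, 39, 45]
    print(rank_with_ties(data))   # [1.0, 2.5, 2.5, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]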

Descriptive Statistics

Descriptive statistics implies summarizing a raw data set obtained from a sample or population. Traditionally, summaries of sample data ("statistics") have been denoted by Roman letters (e.g., x̄ for the mean, s for the standard deviation [SD], etc.), while summaries of population data ("parameters") have been denoted by Greek letters (e.g., μ for the mean, σ for the SD, etc.). The description serves to identify patterns or distributions in data sets from which important conclusions may be drawn.

Categorical data are described in terms of percentages or proportions. With numerical data, individual observations within a sample or population tend to cluster about a central location, with more extreme observations being less frequent. The extent to which observations cluster is summarized by measures of central tendency while the spread can be described by measures of dispersion.

Measures of Central Tendency

The mean (or more correctly, the arithmetic mean) is calculated as the sum of the individual values in a data series, divided by the number of observations. The mean is the most commonly used measure of central tendency to summarize a set of numerical observations. It is usually reliable unless there are extreme values (outliers) that can distort it. It should not ordinarily be used in describing categorical variables because of the arbitrary nature of category scoring. It may, however, be used to summarize category counts.

The geometric mean of a series of n observations is the nth root of the product of all the observations. It is always equal to or less than the arithmetic mean. It is not often used but is a more appropriate measure of central location when the data recorded span several orders of magnitude, e.g., bacterial colony counts from cultures of clinical specimens. Interestingly, the logarithm of the geometric mean is the arithmetic mean of the logarithms of the observations. As such, the geometric mean may be calculated by taking the antilog of the arithmetic mean of the log values of the observations. The harmonic mean of a set of non-zero positive numbers is obtained as the reciprocal of the arithmetic mean of the reciprocals of these numbers. It is seldom used in biostatistics. Unlike the arithmetic mean, neither the geometric nor the harmonic mean can be applied to negative numbers.
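As a brief illustration, Python's standard statistics module (version 3.8 or later) provides all three means directly; the colony-count figures below are invented values spanning several orders of magnitude.

    import math
    import statistics

    # Invented bacterial colony counts spanning several orders of magnitude
    counts = [20, 150, 1_200, 9_500, 85_000]

    arithmetic = statistics.mean(counts)
    geometric  = statistics.geometric_mean(counts)   # nth root of the product
    harmonic   = statistics.harmonic_mean(counts)    # reciprocal of the mean of reciprocals

    # The geometric mean is also the antilog of the mean of the logs
    geometric_via_logs = math.exp(statistics.mean(math.log(x) for x in counts))

    print(arithmetic, geometric, harmonic)                # harmonic <= geometric <= arithmetic
    print(math.isclose(geometric, geometric_via_logs))    # True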

Often data are presented as a frequency table. If the original data values are not available, a weighted average can be estimated from the frequency table by multiplying each data value by the number of cases in which that value occurs, summing up the products and dividing the sum by the total number of observations. A frequency table of numerical data may report the frequencies for class intervals (the entire range covered being broken up into a convenient number of classes) rather than for individual data values. In such cases, we can calculate the weighted average by using the mid-point of the class intervals. However, in this instance, the weighted mean may vary slightly from the arithmetic mean of all the raw observations. Apart from counts, there may be other ways of ascribing weights to observations before calculating a weighted average.
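A minimal sketch of a weighted average computed from a frequency table of class intervals follows; the class boundaries and counts are invented purely for illustration.

    # Hypothetical frequency table: (lower bound, upper bound, count) for each class interval
    freq_table = [
        (10, 20, 5),
        (20, 30, 12),
        (30, 40, 8),
        (40, 50, 3),
    ]

    # Weighted mean using the class mid-point as the representative value of each class
    total_count   = sum(count for _, _, count in freq_table)
    weighted_sum  = sum(((lo + hi) / 2) * count for lo, hi, count in freq_table)
    weighted_mean = weighted_sum / total_count

    print(round(weighted_mean, 2))   # approximates the mean of the raw (ungrouped) data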

For data sets with extreme values, the median is a more appropriate measure of central tendency. If the values in a data series are arranged in order, the median denotes the middle value (for an odd number of observations) or the average of the two middle values (for an even number of observations). The median denotes the point in a data series at which half the observations are larger and half are smaller. As such, it is identical to the 50th percentile value. If the distribution of the data is perfectly symmetrical (as in the case of a normal distribution that we discuss later), the values of the median and mean coincide. If the distribution has a long tail to the right (a positive skew), the mean exceeds the median; if the long tail is to the left (a negative skew), the median exceeds the mean. Thus, the relationship of the two gives an idea of the symmetry or asymmetry (skewness) of the distribution of data.
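The short sketch below, using an invented and positively skewed data series, illustrates how the mean is pulled toward the long tail while the median is not.

    import statistics

    # Invented, positively skewed data: two large outliers drag the tail to the right
    skewed = [2, 3, 3, 4, 4, 5, 5, 6, 40, 95]

    print(statistics.mean(skewed))     # 16.7, inflated by the two extreme values
    print(statistics.median(skewed))   # 4.5, the average of the two middle values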

The mode is the most frequently occurring value in a data series. It is not often used, for the simple reason that it is difficult to pinpoint a mode if no value occurs with a frequency markedly greater than the rest. Furthermore, two or more values may occur with equal frequency, making the data series bimodal or multimodal [Box 1].

Box 1

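For finding modes in practice, Python's statistics module (version 3.8 or later) offers mode and, for bimodal or multimodal series, multimode; the data below are invented.

    import statistics

    unimodal = [1, 2, 2, 3, 4]
    bimodal  = [1, 1, 2, 3, 3, 4]

    print(statistics.mode(unimodal))        # 2: the single most frequent value
    print(statistics.multimode(bimodal))    # [1, 3]: every value tied for the highest frequency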

Measures of Dispersion

The spread, or variability, of a data series can be readily described by the range, that is, the interval between the minimum and maximum values. However, the range does not provide much information about the overall distribution of observations and is obviously affected by extreme values.

A more useful estimate of the spread can be obtained by arranging the values in ascending order and then grouping them into 100 equal parts (in terms of the number of values) that are called centiles or percentiles. It is then possible to state the value at any given percentile, such as the 5th or the 95th percentile, and to calculate the range of values between any two percentiles, such as the 10th and 90th or the 25th and the 75th percentiles. The median represents the 50th percentile. Quartiles divide an ordered data set into four equal parts, with the upper boundaries of the first, second, and third quartiles often denoted as Q1, Q2, and Q3, respectively. Note the relationship between quartiles and percentiles: Q1 corresponds to the 25th percentile, Q3 corresponds to the 75th percentile, and Q2 is the median value in the set. If we estimate the range of the middle 50% of the observations about the median (i.e., Q1–Q3), we have the interquartile range. Alternatively, the range from the 10th to the 90th percentile can be used to denote the spread.
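A brief sketch using NumPy (assumed to be available) shows how percentiles, quartiles, and the interquartile range can be obtained from an invented data series.

    import numpy as np

    # Invented data series
    data = np.array([2, 3, 3, 10, 23, 35, 37, 39, 45, 60, 72, 88])

    q1, q2, q3 = np.percentile(data, [25, 50, 75])   # quartile boundaries
    iqr = q3 - q1                                    # interquartile range (middle 50%)
    p10, p90 = np.percentile(data, [10, 90])         # 10th to 90th percentile spread

    print(q1, q2, q3, iqr)
    print(p10, p90)
    print(np.median(data), q2)    # the median equals Q2, the 50th percentile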

A still better method of measuring variability about the central location is to estimate how closely the individual observations cluster about it. This leads to the mean square deviation or variance, which is calculated as the sum of the squares of individual deviations from the mean, divided by one less than the number of observations. The SD of a data series is simply the square root of the variance. Note that the variance is expressed in squared units, which is difficult to comprehend, but the SD retains the basic unit of observation.

The formula for the variance (and SD) of a population has the value "n" as the denominator. However, the expression (n − 1) is used when calculating the variance (and SD) of a sample. The quantity (n − 1) denotes the degrees of freedom, which is the number of independent observations or choices available. For instance, if a series of four numbers is to add up to 100, we can assign different values to the first three, but the value of the last is fixed by the first three choices and the condition imposed that the total must be 100. Thus, in this example, the degrees of freedom can be stated to be 3. The degrees of freedom are used when calculating the variance (and SD) of a sample because the sample mean is a predetermined estimate of the population mean, and, in the sample, each observation is free to vary except the last one, which must take a defined value.
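The sketch below, with an invented sample, contrasts the population formula (denominator n) with the sample formula (denominator n − 1) using the ddof argument of NumPy (assumed to be available).

    import numpy as np

    sample = np.array([4.0, 7.0, 9.0, 10.0, 12.0, 15.0])   # invented observations

    pop_var    = np.var(sample, ddof=0)    # divides by n (population formula)
    sample_var = np.var(sample, ddof=1)    # divides by n - 1 (sample formula, df = n - 1)

    pop_sd    = np.std(sample, ddof=0)
    sample_sd = np.std(sample, ddof=1)     # SD is the square root of the variance

    print(pop_var, sample_var)             # the sample variance is slightly larger
    print(pop_sd, sample_sd)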

The coefficient of variation (CV) of a data series denotes the SD expressed as a percentage of the mean. Thus, it denotes the relative size of the SD with respect to the mean. CV can be conveniently used to compare variability between studies, since, unlike SD, its magnitude is independent of the units employed.
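Because the CV is the SD expressed as a percentage of the mean, it is unit-free; the sketch below compares two invented data series recorded in different units.

    import statistics

    def coefficient_of_variation(values):
        """CV: sample SD expressed as a percentage of the mean (unit-free)."""
        return 100 * statistics.stdev(values) / statistics.mean(values)

    weights_kg = [58, 62, 64, 70, 75]          # invented weights in kilograms
    heights_cm = [152, 160, 165, 171, 178]     # invented heights in centimetres

    print(round(coefficient_of_variation(weights_kg), 1))
    print(round(coefficient_of_variation(heights_cm), 1))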

Measures of Precision

An important source of variability in biological observations is measurement imprecision, and the CV is often used to quantify this imprecision. It is thus commonly used to describe the variability of measuring instruments and laboratory assays, and it is generally taken that a smaller CV indicates better reproducibility of the measurement.

Another measure of precision for a data series is the standard error of the mean (SEM), which is simply calculated as the SD divided by the square root of the number of observations. Since the SEM is a much smaller numerical value than the SD, it is often presented in place of the SD as a measure of the spread of data. However, this is erroneous since the SD is meant to summarize the spread of the data, while the SEM is a measure of precision and is meant to provide an estimate of a population parameter from a sample statistic in terms of the confidence interval (CI).

It is self-evident that when we make observations on a sample and calculate the sample mean, this will not be identical to the population ("true") mean. However, if our sample is sufficiently large and representative of the population, and we have made our observations or measurements carefully, then the sample mean would be close to the true mean. If we keep taking repeated samples and calculate a sample mean in each case, the different sample means would have their own distribution, and this would be expected to have less dispersion than that of all the individual observations in the samples. In fact, it can be shown that the different sample means would have a symmetrical distribution, with the true population mean at its central location, and the SD of this distribution would be nearly identical to the SEM calculated from individual samples.

In general, however, we are not interested in drawing multiple samples, but rather in how reliable our one sample is in describing the population. We use the standard error to define a range in which the true population value is likely to lie; this range is the CI, while its two terminal values are the confidence limits. The width of the CI depends on the standard error and the degree of confidence required. Conventionally, the 95% CI is most commonly used. From the properties of the normal distribution curve (see below), it can be shown that the 95% CI of the mean would cover a range 1.96 standard errors on either side of the sample mean and will have a 95% probability of including the population mean, while the 99% CI will span 2.58 standard errors on either side of the sample mean and will have a 99% probability of including the population mean. Thus, a fundamental relation that needs to be remembered is:

95% CI of mean = Sample mean ± 1.96 × SEM.
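A minimal sketch of this relation in Python (with invented measurements; 1.96 and 2.58 are the normal-distribution multipliers mentioned above) follows.

    import math
    import statistics

    data = [5.1, 4.8, 6.0, 5.5, 5.9, 4.6, 5.3, 5.7, 6.2, 4.9]   # invented measurements

    n    = len(data)
    mean = statistics.mean(data)
    sd   = statistics.stdev(data)          # sample SD (n - 1 in the denominator)
    sem  = sd / math.sqrt(n)               # standard error of the mean

    ci95 = (mean - 1.96 * sem, mean + 1.96 * sem)   # 95% CI of the mean
    ci99 = (mean - 2.58 * sem, mean + 2.58 * sem)   # 99% CI (wider, more confident)

    print(mean, sem)
    print(ci95)
    print(ci99)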

It is evident that the CI would be narrower if the SEM is smaller. Thus, if a sample is larger, the SEM would be smaller and the CI correspondingly narrower and thus more "focused" on the true mean. Large samples therefore increase precision. It is interesting to note that although increasing the sample size improves precision, it is a somewhat costly approach, since the SEM varies inversely with the square root of the sample size, and halving the SEM therefore requires a 4-fold increase in sample size.

CIs can be used to estimate most population parameters from sample statistics (means, proportions, correlation coefficients, regression coefficients, odds ratios, relative risks, etc.). In all cases, the principles and the general pattern of estimating the CI remain the same, that is: 95% CI of a parameter = Sample statistic ± 1.96 × standard error of that statistic.
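As one example beyond the mean, the sketch below estimates a 95% CI for a cure-rate proportion using the usual normal-approximation standard error, sqrt(p(1 − p)/n); the counts are invented.

    import math

    cured, n = 38, 50                     # invented counts: 38 of 50 patients cured
    p = cured / n                         # sample proportion (cure rate)

    se_p = math.sqrt(p * (1 - p) / n)     # standard error of a proportion (normal approximation)
    ci95 = (p - 1.96 * se_p, p + 1.96 * se_p)

    print(p, se_p)
    print(ci95)                           # 95% CI for the true cure rate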

The formula for estimating the standard error, however, varies for different statistics and in some instances is quite elaborate. Fortunately, we generally rely on computer software to do the calculations.

Frequency Distributions

It is useful to summarize a set of raw numbers with a frequency distribution. The summary may be in the form of a table or a graph (plot). Many frequency distributions are encountered in medical literature [Figure 1], and it is important to be familiar with the common ones.