Accuracy of measurement:
In past centuries, most medical knowledge was based on clinical observations – typically attributing the causes of a disease to a particular exposure, or its cure to a particular intervention. These observations may have been insightful, or they may have been misleading: the disease or its cure could have been due to something else the observer did not record. The only way to know is to repeat the observation under controlled circumstances in which the observer can limit the exposures to the single one that was claimed to be the cause (or the cure): experimental research.
To a growing extent medical practice is now based on scientific evidence of the effectiveness of therapies, or of the risk of an exposure. This is the field of Evidence-Based Medicine (EBM), which forms a future chapter in our curriculum. The current pages lay the foundation for this by introducing you to research methods. We open with an introduction to the topic being studied: concepts and measurements of health (i.e., "What is being studied?"). This leads to a discussion of methods ("How do we study health?") in terms of research designs and statistics.
What are we Studying? Conceptions of Health
Health is elusive to define and ways of thinking about it have evolved over the years. Three leading approaches include the "medical model", the "holistic model", and the "wellness model" of health:
The medical model was dominant in North America through the early 20th century.
- In its most extreme form, the medical model views the body as a machine, to be fixed when broken.
- It emphasizes treating specific physical diseases, does not accommodate mental or social problems well and, being concerned with resolving health problems, de-emphasizes prevention.
- This perspective led to measuring health by its absence, e.g., by disease or death rates. Therefore health is defined as the absence of disease and the presence of high levels of function.
- Applied to population health, the medical model might define a healthy population as one in which the members have no physical health problems. Alternatively, the mechanical metaphor could be applied to the society itself: a healthy society is one in which the various systems (economic, legal, governmental, etc.) function smoothly.
The holistic model of health is exemplified by the 1946 WHO definition, "a state of complete physical, mental and social well-being and not merely the absence of disease or infirmity". The holistic model broadened the medical perspective, and also introduced the idea of positive health (although the WHO did not originally use that term).
- The WHO definition was long considered unmeasurable; the terms were vague. This was less because no one could invent ways to measure "well-being" (indeed, psychologists had done so) and more because doing so required subjective assessments that contrasted sharply with the objective indicators favored by the medical model.
The wellness model was developed through the WHO health promotion initiative. In 1984, a WHO discussion document proposed moving away from viewing health as a state, toward a dynamic model that presented health as a process or a force.
- The 1986 Ottawa Charter for Health Promotion stated that health is "The extent to which an individual or group is able to realize aspirations and satisfy needs, and to change or cope with the environment. Health is a resource for everyday life, not the objective of living; it is a positive concept, emphasizing social and personal resources, as well as physical capacities." (Health promotion: a discussion document. Copenhagen, WHO, 1984.)
- Related definitions include some that view health in terms of resiliency, e.g., "the capability of individuals, families, groups and communities to cope successfully in the face of significant adversity or risk" (Vingilis & Sarkella, Social Indicators Research 1997; 40: 159).
- Applied to population health, the definition might include elements such as the success with which the population adapts to change such as shifting economic realities or natural disasters.
Link here to page “Disability Core Concepts and Definitions” for further discussion of health and quality of life.
Comparing the Different Models of Health
Each of these models has something to contribute, though none seems ideal.
The medical model highlights disease, a crucial issue facing society, and disease states are readily diagnosed and counted. But this approach is narrow, viewing health principally in terms of physical disease and function. In extreme form it implies that people with disabilities are "unhealthy". A further potential limitation is the omission of a time dimension. Should we consider as equally healthy two people in equal functional status, one of whom is carrying a fatal gene that will lead to early death? Further, if prognosis is not included, there is no virtue in prevention.
The holistic and wellness models have the advantages of discriminating among people at the higher end of functioning; they focus on mental as well as physical health and on broader issues of quality of life. They also allow for more subtle discrimination of people who succeed in living productive lives despite a physical impairment: blind people or amputees may still be able to satisfy aspirations, be productive, happy and so be viewed as healthy. The disadvantage is that these conceptions run the risk of excessive breadth, of incorporating all of life within the health system. Thus, they do not distinguish clearly between the state of being healthy and the consequences of being healthy. A further challenge is that by espousing a dynamic model of health (e.g., the capacity to rally from insults), healthiness predicts itself. Hence, we must also move from a strictly linear model of cause and effect toward a systems model in which health is a force, both input and output, and not merely an output of a linear process.
Measurement involves assigning numbers to an observation or attribute (weight, length) so as to represent the quantity of the attribute. But health is an abstract concept and cannot be measured directly using a mechanical scale as weight or length are measured. Instead, indicators of health have to be selected, and some form of numerical summary applied to quantify these.
Historically, the first approach to measuring health was to use counts of mortality and morbidity, so we will begin with these indicators. Thus, the health status of countries may be compared in terms of their death rates per thousand, or their average life expectancy, or any of a range of indicators of morbidity such as rates of reportable disease, hospital discharges, etc.
These indicators are presented numerically using concepts such as rates, ratios and proportions, so these must first be explained.
Rates, Proportions, Ratios
A Rate is a measure of the frequency with which an event occurs in a defined population in a defined time (e.g., number of deaths per hundred thousand Canadians in one year). It has a time dimension, whereas a Proportion (e.g., number of Canadians with cancer divided by the total population) does not.
A Ratio is the value obtained by dividing one quantity by another, for example the male to female ratio in your class. A ratio often compares two rates (the 'rate ratio'), for example comparing death rates for women and men at a given age.
The important difference between a rate and a ratio is that for a rate, the numerator is included in the denominator (e.g., number of new cases of a disease divided by the total population). In a ratio, the numerator and denominator are usually separate and distinct quantities, neither being included in the other (e.g., the rural to urban mortality ratio).
Incidence = Number of new cases in a fixed time period ÷ Number of people at risk.
Two types of incidence are commonly used — incidence proportion (also called cumulative incidence) and incidence rate.
Incidence proportion is the proportion (i.e. fraction or percentage) of an initially disease-free population that develops disease during a specified period of time.
Unfortunately there are a couple of practical difficulties in calculating the incidence proportion. First, everyone being studied has to be followed for the complete period (say, one year), but some may die from another cause or be lost to follow-up, which makes the resulting calculation uncertain (can you be sure they would not have got the disease if they had lived?). Second, many diseases can occur more than once and we have to decide how to handle recurrences. If you include them, the incidence proportion could exceed 1.0, which feels uncomfortable. But if you accept only first diagnoses, you may underestimate the true burden of disease.
Therefore, a more practical alternative is to calculate an Incidence rate: the number of new cases per length of observation; typically the number of cases per person-year of observation. If the period of study is one year, we speak of the annual incidence.
Incidence rates are equivalent to recording speed in km per hour and, like speed, the incidence rate gives an instantaneous reading of the frequency with which the disease will occur, showing the expected time-delay until the next case. You may also see the term hazard rate.
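Prevalence is the proportion of a population that has a condition at a given time: the number of existing cases ÷ the total population. Prevalence is thus a proportion, rather than a rate, although you may see it called a rate: confusing! Prevalence provides a way to indicate the burden of a chronic disease in the population: how many cases are there to be treated?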
Prevalence is influenced by the incidence and duration of the condition: under most circumstances, prevalence = incidence × disease duration.
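As a rough illustration of how these quantities differ, the sketch below uses a small invented cohort. The follow-up times, case counts, incidence rate and disease duration are all assumptions for illustration, not data from this page. It computes an incidence proportion, an incidence rate per person-year, and the approximation prevalence = incidence × duration, which is only reasonable when the condition is fairly rare and the population is stable.

```python
# Hypothetical cohort: (years_followed, developed_disease).
# People who become cases or are lost to follow-up contribute less than a full year.
cohort = [
    (1.0, False), (1.0, False), (0.5, True), (1.0, False), (0.25, False),
    (1.0, True), (1.0, False), (0.75, True), (1.0, False), (1.0, False),
]

new_cases = sum(1 for _, diseased in cohort if diseased)
people_at_risk = len(cohort)
person_years = sum(years for years, _ in cohort)

# Incidence proportion: new cases divided by people initially at risk.
incidence_proportion = new_cases / people_at_risk

# Incidence rate: new cases divided by the total person-time observed.
incidence_rate = new_cases / person_years

# Prevalence approximation for a separate, assumed chronic condition:
# 5 new cases per 1,000 person-years, lasting 4 years on average.
chronic_incidence_rate = 0.005
mean_duration_years = 4.0
approx_prevalence = chronic_incidence_rate * mean_duration_years

print(f"Incidence proportion: {incidence_proportion:.2f}")
print(f"Incidence rate: {incidence_rate:.3f} cases per person-year")
print(f"Approximate prevalence: {approx_prevalence:.1%}")
```

Note that the incidence rate is higher than the incidence proportion here because incomplete follow-up shrinks the person-time denominator.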
Mortality Rate: The number of deaths per thousand population per year – in effect, the incidence rate of death in a population. It can refer to all causes of death, or can be a cause-specific mortality rate. In comparing mortality rates in different populations, standardization is often used to correct for demographic differences between the populations (see below).
Case Fatality: The proportion of people with a specified condition (‘cases’) who die within a specified time. The time frame is typically the period during which the patient is sick from the disease.
- This works for an infectious disease but can be problematic for a chronic disease like a cancer that may remit for a period and then prove fatal after a recurrence (is this one or two 'cases' of the disease?) In such instances we tend to speak of mortality or survival rates rather than case fatality.
Attack Rate: The cumulative incidence (or incidence proportion) of infection over a period of time (in other words, the proportion of a population that gets infected in a specified time). Also called the case rate, a term typically used during an epidemic. The time period may not be indicated, but would typically refer to the period of the outbreak: "During the influenza outbreak the attack rate was 12%".
Crude Death Rate: An estimate of the rate at which members of a population die during a specified period. The numerator is the number of people dying during the period; the denominator is the size of the population, usually at the middle of the period (mid-year population).
Crude death rate = (Number of deaths during a specified period ÷ Number of persons at risk of dying during the period) × 10^n
(Note: the "10^n" simply means that the rate may be multiplied by a thousand, or even a hundred thousand for rare diseases, to bring the rate to a whole number.) The death rate is about 8 per 1000 population in Canada, but is likely to increase as the population ages.
Why "crude"? The term warns us that directly comparing rates of disease or death between different populations may be uninterpretable because so many factors will differ between the populations — see Standardization below.
Life expectancy: Life expectancy at birth is an estimate of the expected number of years to be lived by a newborn, based on current age-specific mortality rates. It is used as a summary indicator of current health and mortality conditions. Canadian life expectancy is about 81 for females and 75 for males. One problem is that where infant mortality is high, as in developing countries, this exerts a predominant influence on life expectancy statistics, and may not give a good indication of life expectancy at other ages. This leads us to the idea of specific rates.
Specific Rates: In contrast to crude rates, a specific rate refers to a particular segment of the population. It focuses attention on a more homogeneous group within the population and is expressed on the basis of any characteristic such as age, sex, marital status, race, etc. Rates may also be made specific for more than one characteristic of the population, such as age-, sex-, and race-specific death rates.
Infant Mortality Rate (IMR): The infant mortality rate is the total number of deaths in a given year of children less than one year old, divided by the total number of live births in the same year, multiplied by 1,000. It estimates how many of every 1,000 children born alive will die within one year of birth.
IMR = (Number of deaths among children less than 1 year of age ÷ Number of live births in the same year) × 1,000
The IMR is often quoted as a useful indicator of the level of health development in a community.
Perinatal Mortality Rate:
(Fetal deaths (>28 weeks of gestation) + deaths occurring within 1 week postnatally) ÷ (Fetal deaths (>28 weeks of gestation) + live births)
The World Health Organization's definition, more appropriate in nations with less well-established vital records, is
Late fetal deaths (>28 weeks of gestation) + deaths in first week
Neonatal Mortality Rate (NMR):
Deaths in infants under 28 days of age in a year ÷ Live births in the same period
Child Mortality Rate: Child mortality rate refers to the annual number of deaths in the age group 1-4 years per 1,000 population in the age group.
We often compare health statistics between different populations – to show which is healthier, or to get clues on the causes of a disease, for example. But populations commonly differ in demographic structure (age, socioeconomic status, race, etc.), and these influence health. How do we then separate the effect of demographic differences from that of the factor (e.g., air pollution) we are really interested in?
For example, Victoria, B.C. has more elderly people than Whitehorse in the Yukon, so a crude comparison of overall mortality rates per thousand would not be helpful because we would expect higher death rates in Victoria simply because of its elderly population.
Standardization is used to remove the effect of a variable that you are not interested in studying – very often age differences between the two populations, since we already know that age is related to health.
Standardized mortality rates are calculated when comparing mortality in two populations that have different demographic structures. Standardization removes the effect of differences in age (or other confounding variables that affect the mortality rate) between the populations.
Standardization can be either direct (leading to an Age-Standardized Mortality Rate [ASMR]) or indirect (producing a Standardized Mortality Ratio [SMR]).
Direct Standardization: The ASMR is calculated in 4 steps:
1. Select a reference population (usually the country as a whole) and find out from the census how many people there are in each age group.
2. Calculate age-specific death rates in Victoria and Whitehorse.
3. For each city and for each age-group within the city, multiply these rates (e.g., 5 per thousand) by the number of people in that age-group in the standard population. This gives the number of people who would be expected to die in that age-group if the city's population had the same size and age structure as the reference population.
4. To get the overall ASMR, add up the number of expected deaths for each age-group for Victoria, then for Whitehorse, and then divide each by the total number of people in the standard population to get the age-standardized mortality rate.
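The four steps can be sketched in a few lines of Python. The age groups, the standard population and the age-specific death rates below are invented for illustration and are not real figures for Victoria or Whitehorse; the function name is likewise hypothetical.

```python
# Step 1: a hypothetical standard (reference) population by age group.
standard_population = {"0-39": 500_000, "40-64": 300_000, "65+": 200_000}

# Step 2: hypothetical age-specific death rates (deaths per person per year).
city_rates = {
    "Victoria":   {"0-39": 0.001, "40-64": 0.005, "65+": 0.040},
    "Whitehorse": {"0-39": 0.001, "40-64": 0.006, "65+": 0.045},
}


def age_standardized_rate(rates, standard):
    """Directly standardized rate: expected deaths in the standard population / its size."""
    # Step 3: apply each age-specific rate to the standard population's age group.
    expected_deaths = sum(rates[group] * standard[group] for group in standard)
    # Step 4: divide total expected deaths by the total standard population.
    return expected_deaths / sum(standard.values())


asmr = {
    city: age_standardized_rate(rates, standard_population)
    for city, rates in city_rates.items()
}

for city, rate in asmr.items():
    print(f"{city}: {rate * 1000:.1f} deaths per 1,000 (age-standardized)")
```

Because both cities are weighted by the same standard population, any remaining difference between the two rates cannot be explained by their different age structures.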
Indirect Standardization: This time we take the age-specific mortality rates from the standard population;
1. Multiply them by the numbers of people in Victoria and Whitehorse in each age group,
2. Add them up to obtain the total expected deaths in each city, and
3. Divide the number of observed deaths by the expected deaths to obtain the SMR for each city.
An SMR value of 1 indicates that the city is experiencing the same age-specific mortality as the standard population; a value above 1 indicates more deaths than expected, and a value below 1 fewer.
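A minimal sketch of the indirect method follows, for a single hypothetical city. The standard rates, the city's age distribution and the observed death count are all made-up numbers chosen to keep the arithmetic easy to follow.

```python
# Hypothetical age-specific death rates in the standard population
# (deaths per person per year).
standard_rates = {"0-39": 0.001, "40-64": 0.005, "65+": 0.040}

# Hypothetical number of people in each age group in the city being studied.
city_population = {"0-39": 10_000, "40-64": 8_000, "65+": 2_000}

# Hypothetical number of deaths actually observed in the city in one year.
observed_deaths = 156

# Deaths expected if the city experienced the standard population's rates.
expected_deaths = sum(
    standard_rates[group] * city_population[group] for group in standard_rates
)

# Standardized Mortality Ratio: observed divided by expected.
smr = observed_deaths / expected_deaths

print(f"Expected deaths: {expected_deaths:.0f}")
print(f"SMR: {smr:.2f}")
```

With these invented figures the city has 20% more deaths than its age structure alone would predict.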
Survival analysis: records the time to an event according to categories of a predictor variable. For example, how long do people manage to live independently in the community following a diagnosis of cognitive impairment, according to the level of their cognitive impairment? The analysis produces graphs that begin at 100% and show lines sloping downward across time, representing the fraction of each diagnostic category still living in the community after set time intervals.
Where a study compares survival following an experimental therapy, survival analysis can illustrate the pattern of prognosis over time in the experimental and control groups and statistical methods such as the Cox Proportional Hazards model can be used to calculate the significance of the effect of the intervention (or of any other influence).
Life tables: A life table typically shows the probability of dying across different age groups. From the incidence proportions the table shows the survival probability for each age-group, and in a final column, the cumulative survival probability across age groups; this can be graphed.
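To show how the cumulative survival column is built up, here is a small sketch with invented probabilities of dying in each age interval (these are assumptions, not real life-table values). Each interval's survival probability is one minus the probability of dying, and the cumulative survival is the running product of the interval survival probabilities.

```python
age_groups = ["0-19", "20-39", "40-59", "60-79"]

# Hypothetical probability of dying during each interval,
# among people alive at the start of that interval.
prob_dying = [0.01, 0.02, 0.08, 0.35]

rows = []
cumulative_survival = 1.0
for group, q in zip(age_groups, prob_dying):
    interval_survival = 1 - q
    # Surviving to the end of this interval requires surviving every earlier one.
    cumulative_survival *= interval_survival
    rows.append((group, q, interval_survival, cumulative_survival))

print("Age group  P(die)  P(survive)  Cumulative survival")
for group, q, survival, cumulative in rows:
    print(f"{group:<9}  {q:.2f}    {survival:.2f}        {cumulative:.3f}")
```

Plotting the last column against age gives the downward-sloping survival curve described above.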
While death and disease are central to the work of a physician, measurements of this type have limitations: they only apply to people who are seriously ill; they say nothing about a person's actual level of function; and they omit mention of positive aspects of health.
These limitations led to the development of a range of subjective indicators of health which are termed "health measurement scales". These involve assembling indicators of health and applying a numerical judgment to quantify or "scale" them.
- For example, if health is defined in terms of physical, mental and social well-being, indicators of each of these themes could be selected and a scoring system for rating a person on each indicator devised. If needed, a second scoring system could weight the relative importance of the physical, mental and social areas in an overall score.
Health measurement scales can be disease specific (e.g. blood sugar levels; depression score) or generic, which can be applied to any disease or syndrome (e.g. emotional well-being or functional ability).
There are three main categories of health measurements:
- Diagnostic scales: These gather a wide variety of information from self-report and clinical ratings, and process these using algorithms that suggest differential diagnoses
- Prognostic measures: measurements from screening tests and examination of disease risk factors that predict future health
- Evaluative measurements: these record changes in health or disease status of the patient in order to measure the impact or outcome of care
Health measures may be recorded mechanically as in a treadmill test, or they may derive from expert judgment as in a physician’s assessment of a symptom. Alternatively, they may be recorded via self-ratings, as in a patient’s replies to a disability questionnaire. [Link to ADL scales]
Because of the complexity of developing a reliable and valid health measurement, there has been a steady growth over the past half-century in the range of standardized health measurement scales that are available for general use. Using the same instrument in separate studies enables direct comparisons to be drawn between them. The current repertory of health measurements is numbered in the hundreds, and these have been described in several books (see below).
Spilker B, ed. Quality of life assessment in clinical trials. New York: Raven Press, 1990.
Bowling A. Measuring disease: a review of disease-specific quality of life measurement scales. Buckingham, England: Open University Press, 1995.
McDowell I. Measuring health: a guide to rating scales and questionnaires. New York: Oxford University Press, 2006.
When we collect numerical data, numbers can be used in different ways. The way a number is being used affects the statistical analysis that can be applied to it, how you interpret results of tests, etc.
Ratio scale: Many variables in science are measured using a ratio scale. Here, zero represents the complete absence of the quantity being measured, and each unit change represents a constant increase or decrease in the quantity. Examples used in medicine are height, weight, blood pressure, etc. You can add, subtract, multiply and divide with these scales.
Interval scale: Here, the numbers represent a regular increase in the quantity, but the zero point is chosen arbitrarily: degrees Celsius or Fahrenheit are examples. Because 0°C does not imply a complete absence of heat, we cannot say that -10 degrees is twice as cold as -5 degrees. However, the change in temperature from -10° to -5° is the same increase as from 50° to 55°. With interval scales you can add or subtract, and so calculate averages. But you cannot legitimately multiply or divide.
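A quick way to see the difference between interval and ratio scales is to change units and check what survives. The sketch below (the temperatures and weights are arbitrary illustrative values) shows that equal differences in Celsius stay equal in Fahrenheit, but a "twice as hot" ratio does not survive the conversion, whereas a ratio of weights is the same in kilograms and pounds.

```python
def celsius_to_fahrenheit(celsius):
    return celsius * 9 / 5 + 32


KG_TO_LB = 2.2046226218  # pounds per kilogram

# Interval scale: differences are meaningful.
diff_cold = celsius_to_fahrenheit(-5) - celsius_to_fahrenheit(-10)
diff_warm = celsius_to_fahrenheit(55) - celsius_to_fahrenheit(50)

# Interval scale: ratios depend on the arbitrary zero point.
ratio_celsius = 20 / 10
ratio_fahrenheit = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)

# Ratio scale: ratios are preserved when units change.
ratio_kg = 80 / 40
ratio_lb = (80 * KG_TO_LB) / (40 * KG_TO_LB)

print(f"5 degree C steps in Fahrenheit: {diff_cold:.0f} and {diff_warm:.0f}")
print(f"20 C vs 10 C as a ratio: {ratio_celsius:.2f} (Celsius), {ratio_fahrenheit:.2f} (Fahrenheit)")
print(f"80 kg vs 40 kg as a ratio: {ratio_kg:.2f} (kg), {ratio_lb:.2f} (lb)")
```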
Ordinal scale: Here, the sequence of numbers is meaningful, but not the distance between them. Consider house numbers on a street – as you go up a street, the numbers increase (or decrease) but they do not represent an accurate measurement of distance. Thus it is not really legitimate to add and subtract numbers in an ordinal scale. Another example is if you ask a patient to rate their pain on a 0 to 10 scale. We cannot really claim that an increase from 2 to 5 on the scale is the same change as a shift from 5 to 8. We don't know if the person is really using the pain scale in a truly linear manner.
Nominal or Categorical: Finally, numbers can be used as names (Nominal), as in your student number, or to represent categories (Categorical), as in pharmaceutical codes for drugs. These are not scales of measurement, and the actual choice of number is arbitrary (although they may have some logical structure as in grouping similar types of drug together in pharmaceutical codes). Hence, they cannot be used in any form of calculation: you can only count them (how many different antibiotics are there?)
Why Care About Scales of Measurement?
For each of these scales of measurement, there are different statistical tests. While it doesn't seem important for a physician to know the details, knowing about scales of measurement may help you avoid misinterpreting numbers like a pain measurement.
Sources of Error
All measurements may contain some element of error; validity and reliability statistics refer to different types of error that typically occur, and they estimate the extent of error in a measurement.
There are three chief sources of error:
- In the thing being measured (my blood pressure may fluctuate so it's difficult to get an accurate picture of it);
- In the observer (different nurses at the clinic take your blood pressure each time);
- In the recording device (our blood pressure cuff in room 3 has been acting up; we really should get it recalibrated).
There are two types of error:
Random errors are not attributable to a specific cause. If sufficiently large numbers of observations are made, random errors average to zero, because some readings over-estimate and some under-estimate. So repeating BP measurements may give a more accurate picture. Random errors affect the reliability of a measurement.
Systematic errors tend to fall in a particular direction and are likely due to a specific cause. Because systematic errors fall in one direction (e.g., I always exaggerate my athletic abilities) they bias a measurement. Systematic errors affect the validity of a measurement.
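The contrast between the two types of error can be illustrated with a small simulation. Everything here is assumed for the sake of the example: a "true" systolic pressure of 120, random error with a standard deviation of 8, and a miscalibrated cuff that reads 5 too high. Averaging many readings removes the random error but leaves the systematic error untouched.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

TRUE_SYSTOLIC = 120.0  # assumed true value
RANDOM_SD = 8.0        # assumed spread of random error
CUFF_BIAS = 5.0        # assumed systematic error of a miscalibrated cuff


def take_reading(bias=0.0):
    """One simulated blood pressure reading: truth + systematic bias + random noise."""
    return TRUE_SYSTOLIC + bias + random.gauss(0, RANDOM_SD)


good_cuff = [take_reading() for _ in range(10_000)]
bad_cuff = [take_reading(CUFF_BIAS) for _ in range(10_000)]

mean_good = statistics.mean(good_cuff)
mean_bad = statistics.mean(bad_cuff)
spread_good = statistics.stdev(good_cuff)

print(f"Good cuff: mean {mean_good:.1f}, SD {spread_good:.1f}")
print(f"Miscalibrated cuff: mean {mean_bad:.1f}")
```

The spread of individual readings reflects reliability; the gap between the long-run mean and the true value reflects validity.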
Validity refers to how closely the results of a measurement correspond to the true value of the quantity being measured; validity is sometimes called “accuracy”. It answers questions like: “Is this an accurate screening test?” or “Is this measurement giving me the information that I want (or expect) it to be giving me?”
In medicine, the validity of a measure is often judged by comparing it with a more definitive assessment, such as a full diagnostic work-up. This is called criterion validation, comparing the measure against a 'gold standard' criterion. Statistically, this may be assessed with measures such as sensitivity, specificity, predictive values and likelihood ratios.
Reliability refers to consistency or dependability – if you were to repeat the measurement, would you get the same results? More formally, it refers to the amount of random error that occurs in making a measurement.
You should be aware of two main forms of reliability: test-retest reliability and inter-rater agreement. The first is assessed by administering a measurement twice and comparing results. The second involves comparing measurements made by two (or more) people. For example, two radiologists may both read an x-ray and their diagnoses can be compared. In both cases, reliability is summarized using a statistic that measures agreement (these are discussed on the next page).
Diagnostic and screening tests should neither miss cases of disease (thereby giving false reassurance and missing a chance to treat the condition), nor falsely classify healthy people as diseased. (Sometimes one type of error is more important than the other; you can ponder their relative importance under different circumstances...)
Sensitivity refers to what fraction of all the actual cases of disease a test detects. If the test is not very good, it may miss cases it should detect. Its sensitivity is low and it generates "false negatives" (i.e., people score negatively on the test when they should have scored positive). This can be extremely serious if early treatment would have saved the person's life.
Specificity refers to what fraction of the people without the disease a test correctly classifies as negative. Does the test identify only those with the disease, or does it mistakenly classify some people without the condition as having it? Errors of this type are called "false positives." This can lead to worry, expensive further investigations and the possibility of treating the wrong condition.
Many measurements used to screen or diagnose provide a continuous score (as with blood pressure). To classify the person as hypertensive or not, a threshold score must be defined, forming a "cut point" on the scale. As always happens with life, if you adjust the threshold value of the test to reduce one type of error, you will find that the other type of error increases.
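The trade-off can be seen by moving the cut point over a small set of made-up readings. The systolic values below are invented, and the grouping into people with and without the condition is assumed to come from some gold standard. A reading at or above the cut point counts as a positive test.

```python
# Hypothetical systolic readings, grouped by true status.
with_condition = [138, 142, 145, 150, 155, 160, 148, 152, 135, 165]
without_condition = [110, 115, 120, 125, 128, 132, 136, 141, 118, 122]


def sensitivity_specificity(cut_point):
    """Return (sensitivity, specificity) when readings >= cut_point are called positive."""
    true_positives = sum(value >= cut_point for value in with_condition)
    true_negatives = sum(value < cut_point for value in without_condition)
    return (
        true_positives / len(with_condition),
        true_negatives / len(without_condition),
    )


results = {cut: sensitivity_specificity(cut) for cut in (130, 140, 150)}

for cut, (sens, spec) in results.items():
    print(f"Cut point {cut}: sensitivity {sens:.0%}, specificity {spec:.0%}")
```

Raising the cut point from 130 to 150 removes the false positives in this toy data set, but only by missing half of the true cases.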
Note: the idea of two types of error in a screening test has a parallel in study designs. A study may falsely indicate that the two groups were different (e.g. patients treated with a new drug compared to the old drug), or it may fail to discover a real difference between two groups. The first is called a Type I (or alpha) error, and the second is a Type II or beta error.
Sensitivity and specificity for a test are calculated beginning with the actual diagnosis (“how many of the people who really have the disease does the test identify?"). However, as the physician using the test, you don't know who really has the disease: you only have the test result (which may or may not represent reality). In this situation, sensitivity and specificity do not really apply: we need to know the likelihood that someone with a positive score on the test really does have the disease. Look again at the diagram: this is called the "predictive value of a positive test result," or "positive predictive value" (PPV) for short. The equivalent measure for a negative result is the “negative predictive value” (NPV).
Another way to think of this is that the positive and negative predictive values are the proportions of positive and negative results that are true positive and true negative results.
Sensitivity and specificity can be considered fixed properties of a diagnostic test. [This is a slight simplification, but it's good enough for our purposes.] By contrast, predictive values are affected by the prevalence of the disease in a population. This is a crucial point: the same diagnostic test will have a different predictive accuracy according to the clinical setting in which you are applying it! Let us walk through a descriptive example (a slightly modified version of an example created by Parikh et al.):
A new test has been applied to 1,000 patients who all had Disease X (disease positive) on gold standard and 1,000 normal persons as controls. The authors found that 900 were correctly classified as having Disease X by the new test and 950 were correctly labeled as non-diseased. The authors would report the sensitivity and specificity of a test as 90% and 95% respectively. With these results the new test appears to be excellent.
But let's apply this test to a million people where only 1% are affected with Disease X. Of the million people, 10,000 would truly be affected with Disease X. Since our new test is 90% sensitive, the test will detect 9,000 of these (True Positives) and will miss 1,000 (False Negatives). Looking at those numbers, we would think that our test is good because we have detected 9,000 out of the 10,000 affected people.
However, of the original 1 million, 990,000 are not affected. If we look at the test results on this normal population (recall the test specificity is 95%), we find that 940,500 (95% of 990,000) are correctly classified as not having X (True Negatives). But we have 49,500 (5% of 990,000) who score positive by the test (False Positives).
This means that of all 58,500 positive results, 9,000 are true positives but 49,500 are false positives: only about 15% of the people who test positive (9,000 ÷ 58,500) actually have Disease X.
So, a patient may get a positive test result but if the prevalence in that population (e.g., in general practice) is very low, because of the small number of true cases mixed in with all those false positives, the test result may not mean very much.
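The arithmetic of this example can be retraced step by step. The sketch below uses only the figures given in the text (one million people, 1% prevalence, 90% sensitivity, 95% specificity); the variable names are just illustrative.

```python
population = 1_000_000
prevalence = 0.01
sensitivity = 0.90
specificity = 0.95

# Split the population by true disease status.
diseased = round(population * prevalence)
not_diseased = population - diseased

# Apply the test to each group.
true_positives = round(diseased * sensitivity)
false_negatives = diseased - true_positives
true_negatives = round(not_diseased * specificity)
false_positives = not_diseased - true_negatives

# Predictive values start from the test result, not from the true status.
ppv = true_positives / (true_positives + false_positives)
npv = true_negatives / (true_negatives + false_negatives)

print(f"True positives: {true_positives:,}  False positives: {false_positives:,}")
print(f"True negatives: {true_negatives:,}  False negatives: {false_negatives:,}")
print(f"Positive predictive value: {ppv:.1%}")
print(f"Negative predictive value: {npv:.1%}")
```

A negative result is very reassuring in this setting, while a positive result is wrong far more often than it is right.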
If you prefer to see calculations, the following table illustrates this phenomenon.
It holds sensitivity and specificity constant, at 99% and 95% (this is a REALLY good test…) but the population prevalence of diabetes changes from 1% (e.g., diabetes among 30 year-olds) to 20% (e.g., among 70 year-olds).
The Impact on Positive Predictive Value (PPV) as Prevalence Changes, for a test with 99% Sensitivity and 95% Specificity

|   | Prevalence | 1% | 10% | 20% |
|---|---|---|---|---|
| a | # in population | 1,000 | 1,000 | 1,000 |
| b | Diseased | 10 | 100 | 200 |
| c | Not diseased | 990 | 900 | 800 |
| d | True positives on the test (b x 0.99) | 10 | 99 | 198 |
| e | False positives on the test (c x (1-0.95)) | 50 | 45 | 40 |
| f | Total # positive on test (d + e) | 60 | 144 | 238 |
|   | PPV (d / f) | 17% | 69% | 83% |
(Source: Dr. Chan Shah: Public health and preventive medicine in Canada. Elsevier, Canada, 2003)
As prevalence rises from 1% to 20%, PPV will rise from 17% to 83%: a huge difference in the clinical interpretation of the same test result.
Hence, as prevalence goes up (or down), so does the predictive value of a positive score on the test, and this is a crucial finding for you, the clinician. Imagine you are a general practitioner and the disease is relatively rare among your patients. The pre-test probability of your patient having a disease will be low, and this will bring down the predictive value of a positive test result, even if the test itself is quite good.
Summary: Using a test in a population with a high prevalence increases the positive predictive value and decreases negative predictive value. Using a test in a population with a low prevalence increases the negative predictive value and decreases positive predictive value.
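The same pattern can be checked for any combination of values with a short function. This is a sketch, not a clinical tool: the function name is hypothetical, and it simply applies the definitions of the predictive values to a test with 99% sensitivity and 95% specificity at the three prevalences discussed above.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied in a population with the given prevalence."""
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


table = {
    prevalence: predictive_values(0.99, 0.95, prevalence)
    for prevalence in (0.01, 0.10, 0.20)
}

for prevalence, (ppv, npv) in table.items():
    print(f"Prevalence {prevalence:.0%}: PPV {ppv:.0%}, NPV {npv:.2%}")
```

As prevalence rises the PPV climbs steeply while the NPV drifts down slightly, which is the behaviour described in the summary.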
But juggling all these statistics is difficult. We need a way to combine sensitivity, specificity, and prevalence in interpreting test scores. This introduces likelihood ratios.
These show how much knowing the test result will improve on a diagnostic guess based simply on the pre-test probability of having the disease (i.e. prevalence). In other words, likelihood ratios (LRs) answer: "How much more likely is a person with the disease to score positive on the test than a person without the disease?"
The formula for the positive likelihood ratio ("LR+") considers both sensitivity and specificity: it is sensitivity divided by (1-specificity), that is, the true-positive rate divided by the false-positive rate. An LR+ higher than about 5 can be useful in ruling in a disease.
There is also a likelihood ratio for a negative test result ("LR-"). The formula is (1-sensitivity) ÷ specificity, that is, the false-negative rate over the true-negative rate. It should give a result below 1; values below about 0.2 are useful in ruling out a disease.
To bring in the prevalence piece there's a neat little nomogram (diagram below). You need to know the likelihood ratio for this particular test, and also the pre-test probability or prevalence of the condition. Draw a line through the pre-test probability on the left of the diagram, through the likelihood ratio in the central column, and then read off the post-test probability on the right-hand column. As an example, with an LR+ of 5 and an initial estimated probability of 20%, a positive test would imply a patient has roughly a 56% probability of the disease.
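The nomogram is a graphical shortcut for a short piece of arithmetic: convert the pre-test probability to odds, multiply by the likelihood ratio, and convert back to a probability. The sketch below shows this in Python; the function names are ours, for illustration only. Because a nomogram is read by eye, values taken from it are approximate.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) from a test's sensitivity and specificity."""
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg


def post_test_probability(pre_test_probability, likelihood_ratio):
    """Update a pre-test probability using a likelihood ratio."""
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


# The worked example from the text: LR+ of 5, pre-test probability of 20%
print(f"{post_test_probability(0.20, 5):.1%}")
```

With a pre-test probability of 20% and an LR+ of 5, the pre-test odds of 0.25 become post-test odds of 1.25, a probability of about 56%.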
Here is a display that you can manipulate yourself. Click here to explore what happens to test performance when prevalence changes, and when you alter the cut-point on the test. Note: you will need Excel 2007 for this. You may need to re-position the display on your screen to allow you to move the two slider bars.
Warm thanks to Paul Lee, PhD, University of Hong Kong, for programming this.
There are two broad categories of study design:
- Observational studies: the researcher studies, but does not (or cannot) alter, what occurs. Observational studies are used in describing the health status of a population, recording indicators of morbidity and health. This information is used in planning health services, developing health policies, etc. Observational studies are also widely used in research on the causes of human disease when it would be unethical to undertake an experiment.
- Experimental studies: the researcher intervenes to change reality, then records the results. Experiments form the mainstay of clinical research on the effectiveness of therapies, using the randomized clinical trial.
Different study designs provide different types and qualities of information. Of course, we always try to use the best possible design, but sometimes this is not practical or ethically acceptable (you cannot do an experiment to expose some people to a harmful substance to see what effect it has).
As a clinician judging the quality of published information, or if you are going to become a researcher, you need to understand the strengths and limitations of each type of study design.
This is another application of the notion of validity, which can be applied to both measurements and study designs. The validity of a study design concerns whether the results and your conclusion may be biased by characteristics of the study design or its execution (for example, did the sickest people refuse to participate?). These "threats to validity" are addressed by improving the study design.
A Study Evaluating a Therapy

| Threat to validity | How to address it |
| --- | --- |
| Participants tend to get better on their own | Add a no-treatment comparison group |
| The groups being compared differed in ways that affected their response to the treatments | Randomly allocate people to the two groups, i.e. a randomized trial |
| The improvement was due to some other intervention than the one you were testing | Keep groups secluded (e.g. in hospital); record other exposures: if both groups also received another intervention then any differences between them would still be due to the intervention being studied |
| People did not take the correct dose, or stopped taking the therapy | Monitor closely; include everyone in the final analysis using intention-to-treat analysis |
| The therapist evaluating the patients believed in the treatment and unconsciously biased his or her measurements | Blind the person evaluating patient progress to the group the patient is in |
| The fact of being intensively studied had an influence on their recovery | Ensure the control group receives the same observation protocol; try to reduce disturbance due to measurements |
| The people chosen for the study were unusual and we cannot draw general conclusions from them | Repeat the study using a representative sample |
The last threat refers to external validity – the extent to which we can generalize from these results to other patients elsewhere. The other threats are considered part of internal validity which refers to the rigor of the study design.
Unfortunately, efforts to make the study more rigorous and improve internal validity (e.g. by limiting other therapies the patients may also take) can reduce its generalizability or external validity.
Case reports: description of an individual patient; this may provide valuable insights but is not necessarily representative. Case reports always have to be followed by more representative and larger studies.
Case series: a description of a series of unusual cases.
Cross-sectional surveys (aka prevalence studies): One-time assessment of a group, recording information about their health at that moment in a systematic way. May use a broad, population sample of people or a representative sample of a given group (e.g. patients with diabetes). Surveys may suggest links between risk factors and disease but they cannot determine causality. For example, given the snapshot nature of the cross-sectional survey, it is often not possible to know whether the risk factor or the disease came first.
Most medical study designs seek to identify causal relationships – between a risk factor and a disease, or between administering a therapy and subsequent health outcomes. The study designs vary in their ability to identify a truly causal relationship, so before describing them we need to make a detour to introduce two key concepts: confounding and the criteria for judging a causal relationship.
Confounding occurs when a factor other than the one you are studying is associated both with the disease and the factor you are studying. This can make it seem as though the factor you are studying causes the disease even if, in reality, it does not.
For example, we know that smoking is associated with cardiovascular disease. Now imagine you wish to test whether drinking coffee also increases the risk of cardiovascular disease. Imagine also that drinking coffee is associated with smoking (perhaps people who drink coffee also tend to smoke). Now the smoking forms a confounding variable that can make it falsely appear as though coffee drinking leads to heart disease - see the diagram:
The result is that there appears to be an association between coffee drinking and heart disease when, in reality, this may be occurring just because the people who drink coffee also smoke, and it is their smoking (not the coffee drinking) which actually causes the heart disease.
Why is this important? We have heard the phrase, "correlation does not prove causation", but could you explain to a patient exactly why not? You may be called on to do just this: a patient may ask you about "a newly discovered cause" of her ailment, waving a newspaper article under your nose, and will ask your opinion. Very commonly such findings later turn out to be spurious and due to confounding by known risk factors.
Addressing Confounding Variables
To identify the actual effect of coffee drinking in our example, the study design would need to include measures of smoking (and of other possible confounding factors). You would have to do an analysis that statistically adjusts (or controls) for the smoking. For example, you could examine the association between coffee drinking and heart disease at each level of smoking: first, among non-smokers, is there an association between coffee and heart disease? Next, among moderate smokers? And among heavy smokers? If the original association between coffee and heart disease was due to confounding by smoking, you would no longer see an association when you sub-divide people according to their smoking level.
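The stratified analysis described above can be sketched in a few lines of Python. All the counts below are invented purely for illustration (they are not from any study): they are chosen so that coffee drinkers are mostly smokers, and so that smoking, not coffee, raises the risk of heart disease.

```python
# Hypothetical counts: (number of people, number with heart disease)
counts = {
    "non-smokers": {"coffee": (100, 5), "no coffee": (400, 20)},
    "smokers": {"coffee": (400, 80), "no coffee": (100, 20)},
}


def risk_ratio(exposed, unexposed):
    """Ratio of the proportion diseased among exposed vs unexposed people."""
    (n_exposed, d_exposed), (n_unexposed, d_unexposed) = exposed, unexposed
    return (d_exposed / n_exposed) / (d_unexposed / n_unexposed)


def pooled(group):
    """Add up people and cases for one coffee group, ignoring smoking."""
    people = sum(counts[stratum][group][0] for stratum in counts)
    cases = sum(counts[stratum][group][1] for stratum in counts)
    return people, cases


# Crude comparison: smoking is ignored, so coffee looks harmful
crude_rr = risk_ratio(pooled("coffee"), pooled("no coffee"))

# Stratified comparison: coffee vs no coffee within each smoking level
stratum_rr = {
    stratum: risk_ratio(cells["coffee"], cells["no coffee"])
    for stratum, cells in counts.items()
}

print(f"crude risk ratio: {crude_rr:.2f}")
for stratum, rr in stratum_rr.items():
    print(f"{stratum}: risk ratio {rr:.2f}")
```

In these made-up data the crude comparison suggests that coffee drinkers have about twice the risk, yet within each smoking level the risk is identical for coffee drinkers and non-drinkers: the apparent association was produced entirely by smoking.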
Sadly, there is no definitive way to prove that an association between a factor and a disease is a causal relationship. But there are many indications, such as those identified by the philosopher John Stuart Mill in "A System of Logic" in 1843. In 1965 Austin Bradford Hill adapted these (with input from other sources) into a set of criteria for assessing epidemiological relationships (Bradford-Hill A. The environment and disease: association or causation? Proc R Soc Med 1965;58:295-300). These are widely quoted, but have also been disputed, so to illustrate I have added comments after each one. Please note that many authors have revised these criteria, so you will find different versions; sometimes 9 are listed and sometimes 7; please do not stress unduly over details.
1. Chronological relationship: Exposure to the presumed cause must predate the onset of the disease.
This is widely accepted. But beware of the difficulty in knowing when some diseases actually began, if they have long latent periods.
2. Strength of association: If all those with the disease were exposed to the presumed causal agent, but very few in the comparison group were exposed, the association is a strong one. In quantitative terms, the larger the relative risk, the more likely the association is causal.
This criterion can be disputed: the strength depends very much on how many other factors are also considered, and how these are controlled in a study. A strong relationship may also still be a confounded result. An example is the strong link between birth order and risk of Down's syndrome. This is actually due to maternal age at the child's birth.
3. Intensity or duration of exposure (also called biological gradient, or dose-response relationship): If those with the most intense, or longest, exposure to the agent have the greatest frequency or severity of illness, while those with less exposure are not as sick, then it is more likely that the association is causal.
A reasonable criterion if present, but may not apply if there are threshold relationships. Hence the absence of a dose response does not disprove causality.
4. Specificity of association: If an agent or risk factor is found that consistently relates only to this disease, then it appears more likely that it plays a causal role.
This is a weak criterion, and was derived from thinking about infectious diseases. Factors such as smoking or obesity are causally associated with several diseases; the absence of specificity does not undermine a causal interpretation.
5. Consistency of findings. An association is consistent if it is confirmed by different studies; it is even more persuasive if these are in different populations.
A good criterion, although it may lead us to miss causal relationships that apply to only a minority of people.
6. Coherent, or plausible findings: Do we have a biological (or behavioral, etc.) explanation for the observed association? Evidence from experimental animals, or analogous effects created by analogous agents are among the kinds of evidence to be considered.
A good criterion, but can be subjective: post hoc one can often supply an explanation for an unexpected result.
7. Cessation of exposure. If the causal factor is removed from a population, then the incidence of disease should decline.
This may work for a population, but for an individual in a study the pathology is not always reversible.
The more of these criteria are met in a given instance, the stronger the presumption that the association is causal.
Etiological Study Designs
Cohort study (aka longitudinal): a type of prospective observational study in which people without the disease of interest are selected according to the presence or absence of exposure (e.g. exposure to radiation) and are followed over a period of time to investigate outcomes. An indicator of risk compares the incidence of disease among exposed and unexposed people at the end of the study (perhaps years later). Cohort studies can be used to calculate the Relative Risk (RR) of disease according to exposure.
The advantages of this design are that it can establish temporal sequence — that exposure (e.g. radiation) predates the outcome (e.g. cancer), and it allows for accurate collection of exposure information. However, there are some problems: if the outcome (e.g. cancer) is rare, you will need a very large cohort; you will also need to keep in contact with them for a very long time and you will probably get very bored waiting for the results.
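As a small illustration of the Relative Risk, the sketch below compares the incidence of disease in exposed and unexposed members of a cohort. The counts are hypothetical, and the function name is ours.

```python
def relative_risk(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Incidence among the exposed divided by incidence among the unexposed."""
    incidence_exposed = exposed_cases / exposed_total
    incidence_unexposed = unexposed_cases / unexposed_total
    return incidence_exposed / incidence_unexposed


# Hypothetical cohort: 1,000 exposed people (30 develop the disease)
# and 1,000 unexposed people (10 develop the disease)
rr = relative_risk(30, 1000, 10, 1000)
print(f"Relative risk: {rr:.1f}")
```

Here the exposed group has three times the incidence of the unexposed group, so the RR is 3.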
Q: Can you estimate prevalence from a cohort study?
A: No. To make sure you get the temporal sequence correct, a cohort study begins by selecting a sample of people who do not have the disease (so, prevalence = zero). You would then follow them over time. Incident cases would arise, and as these accumulate you would get an estimate of prevalence. However, because you omitted the existing cases at the beginning of the study, your estimate would be biased downwards. This would be less serious if you followed people for a long time, and if people die from the disease quickly, so none of the original cases would have survived anyway.
Case-control study: A retrospective design. This means that you begin at the end, studying people with the disease, and then work backwards, like a detective, to hunt for possible causes. People are selected according to whether they have the disease of interest (cases) or not (controls). The groups are then compared with respect to the presence or absence of exposures in the past. The indicator of risk is the Odds Ratio (OR).
The advantages are that a case-control study can be done faster and more cheaply than a cohort study. However, it may be difficult to collect the information you require on past exposures, and there may be other ways in which the cases and controls differ, which could also be influencing the outcome. Sometimes you also have difficulty in being sure which came first: the disease or the exposure (the Law of Retrospection: "You cannot tell which way the train went by looking at the track").
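A minimal sketch of the Odds Ratio calculation follows, again with invented counts and a function name of our own. It compares the odds of past exposure among cases with the odds of past exposure among controls.

```python
def odds_ratio(cases_exposed, cases_unexposed, controls_exposed, controls_unexposed):
    """Odds of exposure among cases divided by odds of exposure among controls."""
    odds_cases = cases_exposed / cases_unexposed
    odds_controls = controls_exposed / controls_unexposed
    return odds_cases / odds_controls


# Hypothetical study: of 100 cases, 40 were exposed; of 100 controls, 20 were exposed
or_estimate = odds_ratio(40, 60, 20, 80)
print(f"Odds ratio: {or_estimate:.2f}")
```

Note that the calculation never uses the proportion of the study sample who have the disease: the investigator fixed that proportion by choosing how many cases and controls to recruit.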
Q: Can you calculate incidence or prevalence from a case-control study?
A: No. You cannot calculate incidence of the disease because the cases already have the disease, and the controls will not be followed over time to record possible new cases. Nor can you calculate prevalence, because it was you who decided how many cases and how many controls to choose, and this determined the apparent prevalence in the study.
Note: You may hear about "matched" and "unmatched" case-control studies. A problem with case-control studies is that the cases and controls may differ on a number of factors, including characteristics (such as age, or sex, or wealth) that you may not be interested in studying as potential causes. To ensure greater comparability between the two groups, and thereby avoid confounding, the controls could be matched for sex and age to the cases.
In experimental research, the investigator intervenes to manipulate the exposure (e.g. the dose of a drug) and observe the result or ‘outcome’. The goal is to establish whether there is a causal relationship between exposure and outcome. This requires controlling other factors that could have influenced the result. This is generally achieved using a comparison or control group whose only difference was that it did not receive the experimental exposure.
Our Western scientific tradition accepts an experiment as the highest form of proof. And we hold that it is ethically necessary to base therapies on scientific proof, rather than on personal opinion. We therefore use experiments to answer practical questions such as “If I give this 50 y.o. Filipino-Canadian male this new drug, will it reduce his blood pressure?” Experiments are designed to discover objective truths, free from personal bias, drug company hype, political or historical influence.
- A sample of patients with the condition, and who meet other selection criteria, are randomly allocated to receive either the experimental treatment, or the control treatment (commonly the standard treatment for the condition).
- Occasionally, a placebo or sham treatment will be used in the control group, but where there already is a standard treatment, it is unlikely to be ethical to use a placebo.
- The experimental and control groups are then followed for a set time, and relevant measurements are taken to indicate the results (or 'outcomes') in each group.
Note: The "random" of RCT refers to random allocation to either experimental or control group; it does not refer to random selection or sampling of the patients to include in the trial. Do not confuse these two concepts. Random selection of a sample ensures that the sample is representative of the broader population; it is typically used in a survey (an observational study). Random allocation ensures the experimental and control groups are equivalent, but does not ensure they are representative of all patients with the condition.
Why Use Random Allocation?
Random allocation is used mainly to avoid confounding. To make sure that any differences in the final outcome measurements were due to your experimental treatment and not to a confounding factor, you want the two groups to be comparable on all other factors (in other jargon, you want to control all other factors). In theory, if randomly allocated groups are sufficiently large, they will be equivalent (so, directly comparable) on all variables, including ones you do not even know about (like genetic characteristics).
If you know about a confounder before beginning the experiment, you could match the two groups on it (e.g., ensure equal numbers of males and females in each group). However, matching would not remove the effects of any confounder that you do not know about, such as a biochemical parameter that modifies the action of the drug. Herein lies the genius of random allocation: randomization protects against all potential confounding factors, both known and unknown. This is very convenient: you don't have to measure and control for each factor individually!
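To see why random allocation balances even factors you have not measured, here is a small simulation sketch. The "hidden factor" is an invented stand-in for any unknown characteristic; the sample size and the random seed are arbitrary choices made for illustration.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# An unmeasured characteristic for each of 10,000 hypothetical participants
hidden_factor = [rng.gauss(0, 1) for _ in range(10_000)]

# Random allocation: each participant is assigned by the equivalent of a coin toss
allocation = [rng.choice(("treatment", "control")) for _ in hidden_factor]

treatment = [x for x, group in zip(hidden_factor, allocation) if group == "treatment"]
control = [x for x, group in zip(hidden_factor, allocation) if group == "control"]

difference = mean(treatment) - mean(control)
print(f"treatment n={len(treatment)}, control n={len(control)}")
print(f"difference in mean hidden factor: {difference:.3f}")
```

The allocation never looks at the hidden factor, yet with groups this large the two means end up very close. With small groups the balance is less reliable, which is why the text specifies "sufficiently large" groups.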
Intention to Treat Analyses
Sometimes participants in RCTs drop out of the study and do not complete the course of therapy. Discarding them from the final analysis of results may cause a bias. Perhaps people for whom the therapy was not working dropped out of the study, so basing the analysis only on those who completed the trial would produce a falsely positive impression of efficacy. Hence, a basic rule in running a trial is to include everyone in the analysis, including those who drop out. If they lack a final outcome measurement the last available measurement is taken; this may give a conservative impression of efficacy.
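The "last available measurement" rule can be sketched as follows. The blood pressure readings and participant labels are hypothetical, and `None` marks a visit that was missed after a participant dropped out.

```python
def last_observation_carried_forward(measurements):
    """Return the last non-missing measurement, or None if nothing was recorded."""
    observed = [m for m in measurements if m is not None]
    return observed[-1] if observed else None


# Hypothetical systolic blood pressure at four visits; None = missed visit
participants = {
    "completer": [160, 150, 145, 140],
    "dropout": [162, 158, None, None],
}

final_values = {
    name: last_observation_carried_forward(readings)
    for name, readings in participants.items()
}
print(final_values)
```

The participant who dropped out stays in the analysis, using the reading from their last attended visit instead of being discarded.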
Evidence-based health care is founded on evaluative studies. Evaluation is the systematic process of determining the effectiveness and safety of an intervention that seeks to prevent or cure a health problem.
- Efficacy means "How well does the intervention work under ideal conditions?"
"Ideal" usually refers to a carefully controlled experimental study where, for example, the patients actually take the medication and receive optimal care.
- Effectiveness means "How well does the intervention work when it is applied in the community?"
This refers to the more typical setting of real life: whether the intervention will work when deployed in a normal practice setting.
- Efficiency considers the results achieved in relation to the effort expended in terms of money, resources, and time. Efficiency is "a measure of the economy (or cost in resources) with which a procedure of known efficacy and effectiveness is carried out".
With limited resources we must choose between services to fund; health economics offers systematic ways to guide these choices.
A basic economic principle is that a health service should deliver the greatest benefit per unit of cost. ("All effective health care should be free" - Archie Cochrane). There are different ways to estimate benefits, leading to four main types of economic evaluation. These use similar methods to assess costs but differ in the way they assess benefit:
Cost minimization analysis
This is the simplest form of economic analysis, and applies when the benefits of two interventions are the same, so the cheaper intervention ought to be chosen. An example of this would be the choice between a ‘name-brand’ and a generic drug.
Cost benefit analysis
This evaluation assesses benefit in units of monetary value (dollars). If a disease causes a worker to take time off work (an economic cost), preventing or curing the condition would confer direct economic benefit.
You will sometimes see estimates of the economic cost of diseases: for example, a 1999 estimate showed that obesity costs Canada around 1.8 billion dollars (Birmingham CL, et al. The cost of obesity in Canada. CMAJ 1999; 160: 483-8).
Cost effectiveness analysis
This assesses benefit in terms of health outcomes, such as improved symptoms or survival. An example is a study that compared standard in-vitro fertilization with a ‘mild’ approach, considering the different costs associated with these approaches and using the number of cumulative pregnancies resulting in live birth, as the outcome (Polinder S, et al. Cost-effectiveness of a mild compared with a standard strategy for IVF. Human Reproduction 2008 23(2):316-323).
Cost utility analysis
This form of economic analysis is a form of cost-effectiveness, where the measure of benefit is adjusted to include the utility of the benefit, via QALYs or DALYs or the health-adjusted life expectancy (HALE). A significant advantage of the cost-utility analysis is that it allows for the comparison of different procedures and their related outcomes. Note that many published papers purporting to be cost-effectiveness analyses are actually cost-utility analyses.
Costs. The first step in any economic analysis is to estimate programme costs. Provincial governments can reasonably accurately estimate the total cost of their health care system, but estimating the cost for particular types of care is more difficult. Consider the following:
- Distinguish between direct costs (medical and nursing salaries, medical supplies, etc.) and indirect costs (administration, research, training, construction & maintenance of the hospital, etc.).
- Direct costs are the most relevant as they are more likely to be influenced by health care policies and processes. But they only represent part of the overall bill.
- A common question is to compare the costs of providing care (for example, obstetrical care) in different types of hospital. This is complicated because different hospitals treat different types of medical condition, and also different levels of severity within each condition. Costs would be expected to be higher for more complex or severe cases, so it's not useful to point out that "it costs more to deliver a baby in the Ottawa Hospital than in the community hospital in Moose Baby Falls".
- Judging whether costs for a given procedure are reasonable requires creating a case severity adjustment (e.g., breech versus normal delivery).
- Hospitals, of course, treat a range of patients with varying levels of complexity. The 'case mix' may vary from hospital to hospital, typically involving more severe cases in a tertiary care referral hospital than in a small rural hospital. Just as we can estimate what cost is appropriate for an individual case, we can estimate cost weights for each case mix.
- Comparing costs across types of disease is made possible by a Resource Intensity Weight (RIW). These are assigned by the Canadian Institute for Health Information (CIHI).
A clinical trial is a research experiment that administers a new regimen to humans to test its safety and efficacy. It generally uses the Randomized Controlled Trial (RCT) format. There is a generally accepted sequence of undertaking research to bring a new pharmaceutical product to market.
Phase I studies follow animal experimentation, and primarily determine how the drug works in humans: range of dosage and safety. Generally undertaken in healthy people.
Phase II studies test efficacy and safety in small groups of patients with the condition (around 100).
Phase III trials are much larger (thousands of patients with the condition) and are randomized trials comparing the medication with established treatments to show if it is safe and effective. The results of several randomized trials are often combined, sometimes using a systematic review and sometimes a meta-analysis.
Phase IV studies come after a license has been issued, and provide more information on longer-term safety and side effects, and also on how well it works with other conditions.
Meta-analyses. Note that when several studies have been done on a topic, the results can be combined in a meta-analysis. The overall result is an average of the results from each study (actually, a complicated formula is used), and the confidence intervals become much narrower than they were for any single study: we can now make a much more precise estimate of the true outcome.
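One common way of averaging study results is inverse-variance weighting, in which more precise studies (those with smaller standard errors) count for more. The sketch below uses this fixed-effect approach as an assumption; it is only one of several methods in use and may not be the formula a given meta-analysis applies. The three study results are invented.

```python
import math


def pooled_estimate(estimates, standard_errors):
    """Fixed-effect (inverse-variance) pooling of several study estimates."""
    weights = [1 / se ** 2 for se in standard_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    # Approximate 95% confidence interval
    ci = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)
    return pooled, pooled_se, ci


# Hypothetical trials: mean reduction in blood pressure (mmHg) and its standard error
estimates = [5.0, 7.0, 6.0]
standard_errors = [2.0, 2.0, 1.0]

pooled, pooled_se, ci = pooled_estimate(estimates, standard_errors)
print(f"pooled estimate {pooled:.2f}, SE {pooled_se:.2f}, "
      f"95% CI {ci[0]:.2f} to {ci[1]:.2f}")
```

The pooled standard error is smaller than that of any single study, which is why the combined confidence interval is narrower.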
Challenges Facing Clinical Trials
Here is a web site that contains guidelines for critically appraising the quality of studies
A primer on Multivariable Analysis for readers of medical articles
Link to UK cancer research site on stages of clinical trials.
An example would be a study of the effects of removing ophthalmic services from the OHIP billing schedule: is there a decrease in eye tests after the change? A quasi-experimental study might record the number of eye exams per thousand population over the years up to the policy change, and compare this pattern with the pattern afterwards. This is an observational study, but there was also an intervention, although it was not the experimenter who decided when and how the change would occur and to whom it would be applied, so this is a "quasi-experiment." Typically, random allocation is not involved.
A "natural experiment" is similar, but refers to naturally occurring events (e.g., a study of mental health following an earthquake).
Summary: Advantages and Disadvantages of Study Designs
HELP! How Do I Remember Which Design is Which?
Here are some ways of keeping the case report, case-control, cohort, cross-sectional and RCT studies straight.
When someone says…
Case report: One person, one case. Like a police report of an incident.
Case-control: Cases AND controls – the “case” people have the disease, the “control” people don’t. We are going to look back in time to figure out what made them different.
Cohort: A group of people. Perhaps your class has been referred to as the MD20xx cohort because you started at the same time and continue through medical school together. A cohort study is the same way – recruited at the same time and observed over a long time.
Cross-sectional: Cross-section of a tree (or anything else of your choosing). It’s a one-time slice to get a better view.
RCT: The major type of experimental study; people are randomly sorted into control and experimental groups to try something out.
Why Do We Need Statistics?
- Statistics refers to the art and science of collecting, summarizing, and analyzing data that are subject to random variation.
- Medicine does not deal with deterministic phenomena in which a cause inevitably produces a consistent effect. Exposure to a harmful virus does not always mean a person will get sick, just as a given drug will not work on every patient with the same disease.
- These variations between patients are often very hard to understand. As a simplifying assumption we may treat them as random events, and use a probabilistic model of explanation: with a certain level of confidence, we predict statistically that such-and-such a proportion of patients will respond in the following way.
- The evidence in evidence-based medicine all came from published studies that were conducted on samples of patients. The patients were selected for various reasons and in different ways so may or may not have been representative of the broader population from which they were selected. A central concern of statistics is to estimate how accurately the results from a study sample may represent reality in the broader population from which the sample was drawn: how confidently can we generalize from the results? This is indicated by probabilities, represented by the letter p in statistical tests.
Six Main Uses For Statistics
Descriptive statistics cover ways of summarizing characteristics of study samples: how do we summarize data? Means, medians, standard deviations and variance fall in this category.
Inferential statistics describe how well conclusions based on studying a sample of people will apply to the broader population. We can never study everybody, so we base our scientific understanding on samples of people (or animals, or reagents) and then generalize the findings to the broader world. The true value in the population is called a “parameter” and this is what a study is trying to estimate.
A common application of inferential statistics is to determine (in terms of probability) whether differences observed between two samples also apply to the broader populations that were sampled. In other terms, might a difference found in a study sample have been a chance result, due to characteristics of that particular sample? By extension, if the study were repeated on other, similar patients, would the results be substantially different? Statistical tests estimate the probability that results from a sample will differ from the true value in the population; they tell us how large an observed difference needs to be before we can be confident that it represents a real difference in the population.
Statistics of agreement concern how strongly two (or more) variables are related.
Validity statistics consider how accurate a measurement is: does it measure what it is supposed to?
Analytical statistics provide us with ways to disentangle the influence of several variables. We all know that smoking, being obese, not exercising, eating the wrong foods (and many other factors) influence the risk of heart disease. But these factors themselves interrelate, so how do we separate the independent effect of each? Methods such as multiple regression and logistic regression lie here.
Link: How do we summarize the very large data sets that come from diagnostic imaging? TED talk by Anders Ynnerman.
Note: There are two ways to use this section:
- If you are already comfortable with statistics, you can use the “Choosing A Statistical Method” box below to find the appropriate statistical test for the question you are asking and the measurement scale of the variable involved.
- If you prefer a more textbook approach or if you are a beginner in statistics and want a complete overview, you can use the “Overview of Statistical Tests” section. By clicking through each of the “Measures of…” links, you will get a full overview of common statistical measures.
Choosing a Statistical Method
Dozens of statistical tests have been designed for different purposes. The starting point is a simple question: What are You Trying to Assess? The table below offers some options, but please realize there are dozens more. We are just trying to protect your sanity by presenting the ones you will see most often.
Measurement scale of the variable(s) involved: interval or ratio; ordinal; categorical.
Question you are asking:
- How can I summarize the data? (proportions)
- How accurate is this sample estimate?
- How strongly are these variables associated?
- Are these two values significantly different?
- Do these risk factors really affect health?
- Does this test measure what I hope it measures?
Overview of Statistical Tests
- Used for Interval or Ratio scale numbers
- Used chiefly for Ordinal numbers, but also for Ratio and Interval scales.
- Used for Ratio, Interval or Ordinal numbers
Variance: the average of the squared distance between each value in a distribution and the mean of all values. This improves on the range by including information from every observation, so is less influenced by extreme values.
- Used for Ratio or Interval numbers
Standard Deviation (SD): a statistical measure of the degree of scattering in a set of data; the square root of the variance. It has the property that, in a normal distribution, roughly two-thirds of all observations fall within 1 SD of the mean (one-third above and one-third below). In addition, roughly 95% of all observations will fall within plus or minus 2 SD of the mean. The SD itself is expressed in the same units as the original measurement, but a distance from the mean expressed in SD units is dimensionless, so that a deviation of 2.0 SD has a similar meaning whatever the original measurement scale.
- Used for Ratio or Interval numbers
If you prefer equations, those below may help make “variance” (symbol = SD²) and “standard deviation” (SD) clearer. These equations won’t be tested on SIM exams.
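The definitions of variance and standard deviation given above can also be sketched in a few lines of Python. This is only an illustration: the scores are made-up numbers, and the variance divides by N because the text defines it as an average. Many statistical packages divide by N - 1 when working with a sample, so their results will differ slightly.

```python
import math

def variance(values):
    """Average squared distance between each value and the mean (divides by N)."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

def standard_deviation(values):
    """Square root of the variance."""
    return math.sqrt(variance(values))

# Hypothetical scores, chosen only so that the arithmetic is easy to follow
scores = [2, 4, 4, 4, 5, 5, 7, 9]

print(variance(scores))            # 4.0
print(standard_deviation(scores))  # 2.0
```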
Inter-quartile range: the distance between the observation at the top of the bottom quarter of values and the one at the bottom of the top quarter. This is used when scores are measured on an ordinal scale (e.g., pain scores on a 0 to 10 scale). It is also used where a variable has a very skewed distribution, so that extreme values at the end of the distribution could distort calculation of the range. The inter-quartile range side-steps this by calculating the range not of the overall scale, but the scale distance between the person on the 25th centile and the one on the 75th.
Standard error of the mean (SEM) - the SD divided by the square root of N.
Rationale: the mean calculated from a sample may not perfectly match the mean in the whole population. Hence, if you repeated a study lots of times using different samples from the same overall population, you would obtain a number of estimates of the true population mean. These estimates would form a distribution, centered on the true population mean, and with a variation called the 'standard error of the mean'. It is the equivalent of a standard deviation for individual measurements, but now it applies to mean values. The SEM is always smaller than the SD (for any sample of more than one observation).
Beware! If you see a result like "2.6 +/- 0.9" you must read the article carefully to see whether the 0.9 is the SD or the SEM.
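To see why the distinction between SD and SEM matters, here is a minimal sketch of the SEM definition (SD divided by the square root of N). The SD of 10 and the sample sizes are hypothetical values chosen for illustration.

```python
import math

def standard_error_of_mean(sd, n):
    """SEM = SD divided by the square root of the sample size N."""
    return sd / math.sqrt(n)

sd = 10.0  # hypothetical standard deviation of the individual measurements

# The SD stays the same, but the SEM shrinks as the sample grows
for n in (4, 25, 100):
    print(n, standard_error_of_mean(sd, n))
```

With the same SD, a report based on 100 people would show a "+/-" value five times smaller than one based on 4 people if the authors quoted the SEM rather than the SD.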
Estimating a Parameter
The idea is that you have calculated a value from studying a sample, and want to show how accurate this estimate is for the whole population. Research reports often contain statements such as “the prevalence of inebriation among law students was 13.5% (95% CI 10.1%, 16.9%)”. This means that your study found 13.5% of students were drunk, and you’re 95% confident that the true prevalence lies somewhere between 10.1 and 16.9%. The bigger the sample size (n), the narrower the CI, because with a bigger sample we have more confidence in the result.
- Used for Interval or Ratio numbers
By analyzing the variation within this sample, we can estimate the range within which the true parameter is likely to lie.
Improving an Estimate: Narrowing the Confidence Interval. The 95% CI around a mean value is derived from the standard error of the mean: it runs from the mean minus 1.96 SEM to the mean plus 1.96 SEM. The formula for SEM includes sample size in the denominator, so the bigger the sample size, the narrower the confidence interval. This is intuitive: with a bigger sample we should have more confidence in the result.
So, to get a narrower CI you could increase the sample size, or else find a way to reduce the variance in the sample, perhaps by selecting a more homogeneous sample (although this could limit the generalizability of the study). Alternatively, you could improve the measurements made in the study to reduce random measurement errors. All in all, the simplest way to get a more precise estimate is to increase sample size.
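The effect of sample size on the confidence interval can be illustrated with a short sketch. It applies the rule described above (mean plus or minus 1.96 SEM); the mean of 120, the SD of 15 and the two sample sizes are hypothetical values, not data from any study.

```python
import math

def ci95_for_mean(mean, sd, n):
    """95% confidence interval for a mean: mean +/- 1.96 * SEM."""
    sem = sd / math.sqrt(n)
    return (mean - 1.96 * sem, mean + 1.96 * sem)

# Hypothetical systolic blood pressure summary: same mean and SD, two sample sizes
small_study = ci95_for_mean(120.0, 15.0, 25)
large_study = ci95_for_mean(120.0, 15.0, 400)

print([round(limit, 2) for limit in small_study])  # roughly 114.12 to 125.88
print([round(limit, 2) for limit in large_study])  # roughly 118.53 to 121.47
```

Increasing the sample sixteen-fold narrows the interval by a factor of four, because the SEM depends on the square root of N.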
Link: This article on interpretation of statistical Significance Tests has useful information on interpreting confidence intervals.
Studies of causal factors are concerned with measuring the association between variables – typically a risk factor and an outcome, or the association between risk factors.
Correlation coefficient: A number that describes the relationship, or co-relation, between two random variables or two sets of observed data values. It can also be thought of as a statistical measure that shows how closely two variables lie in a linear association: how accurately you could predict one value if you knew the other. The correlation coefficient can range from -1 to +1, the two extremes denoting a perfect linear relationship and 0 denoting a complete absence of linear relationship.
Rank Correlation: Imagine you arrange the children in a class in ascending order by their height, noting each child's position (first, second, third, ...) from shortest to tallest. Then, rearrange the same children in order of body weight, and again record their position. Is each child in the same rank position on both measurements? Perhaps in general they will be, although a short child could be heavy, or a tall child could be skinny and light, so they could be in different rank positions for each variable. The rank correlation shows how closely the rankings of a sample of observations, rank-ordered on two variables, correspond. Two common examples of rank-order correlations are the Kendall and Spearman formulas. Both give a value from -1.0 (perfect reverse rank-ordering) through 0.0 (no relationship between the two rank orders at all) to +1.0 (identical rank ordering on both variables).
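The height and weight example can be worked through in code. The sketch below computes a linear (Pearson) correlation and a Spearman rank correlation, obtained here by applying the Pearson formula to the rank positions. The five children's measurements are invented, and the simple ranking function assumes there are no tied values.

```python
import math

def pearson(xs, ys):
    """Linear correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariation / (spread_x * spread_y)

def ranks(values):
    """Rank positions (1 = smallest). Assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank positions."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical class of five children
heights_cm = [110, 115, 120, 125, 130]
weights_kg = [20, 19, 24, 26, 30]  # the second-shortest child is the lightest

print(round(pearson(heights_cm, weights_kg), 2))
print(round(spearman(heights_cm, weights_kg), 2))  # 0.9
```

Only two children swap rank positions, so the rank correlation is high (0.9) but not perfect.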
Regression analyses: A family of analytic methods that extend correlations and show how much change occurs in the dependent variable for each unit change in a predictor variable. In medicine, the dependent variable is typically an aspect of health, like blood pressure, while the independent (or predictor) variable could be number of cigarettes smoked daily.
There are several variants. Multiple regression shows the joint and separate effect of several independent variables (e.g., smoking, weight, amount of exercise) on a dependent variable measured at the interval level (e.g., BP). Logistic regression does the same when the dependent variable is a category, such as alive or dead.
Relative Risk (RR): A ratio that describes the increased (or decreased) risk of disease among people with a risk factor, compared to those without. This indicates the strength of a risk, or causal, factor. For example, a cohort study may compare the incidence of disease (i.e. the risk) among people exposed to a causal (or protective) factor, to the incidence among people not exposed; this is the relative risk.
An RR of 1.0 means that the two incidence rates are equal, so the factor has no effect. An RR of 2 would indicate that the exposed people were twice as likely to get the disease; an RR of 0.5 means they were half as likely, so the factor protected them from the disease.
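As a minimal sketch of the calculation, the relative risk is simply the incidence in the exposed group divided by the incidence in the unexposed group. The cohort counts below are hypothetical.

```python
def relative_risk(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Incidence among the exposed divided by incidence among the unexposed."""
    incidence_exposed = exposed_cases / exposed_total
    incidence_unexposed = unexposed_cases / unexposed_total
    return incidence_exposed / incidence_unexposed

# Hypothetical cohort: 30 of 1,000 exposed people and 15 of 1,000 unexposed people fall ill
print(round(relative_risk(30, 1000, 15, 1000), 2))  # 2.0: exposed people are twice as likely to get the disease

# Hypothetical protective factor: 10 of 1,000 exposed versus 20 of 1,000 unexposed
print(round(relative_risk(10, 1000, 20, 1000), 2))  # 0.5: exposed people are half as likely
```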
Odds Ratio (OR): The odds of an event are the ratio of the probability of its occurrence to that of its non-occurrence; the Odds Ratio compares two such odds. In a medical context, it can be thought of as “the odds that an outcome will occur given a particular exposure, compared to the odds of the outcome occurring in the absence of that exposure” (reference). When calculating OR for an intervention, the numerator is the odds in the intervention arm and the denominator is the odds in the control or placebo group.
In a case-control study, the OR provides an estimate of the relative risk. This is required because you cannot calculate incidence in a case-control study. However, in some cases, interpretation of the Odds Ratio as equivalent to the Relative Risk can lead to underestimation or overestimation of the Relative Risk. If you are curious, read “When can odds ratios mislead?” by Davies et al.
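The following sketch illustrates when the odds ratio is a good stand-in for the relative risk and when it is not. Both 2 x 2 tables are hypothetical and are built so that the relative risk is exactly 2; only the frequency of the outcome differs.

```python
def odds_ratio(exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases):
    """Odds of the outcome in the exposed divided by the odds in the unexposed."""
    return (exposed_cases / exposed_noncases) / (unexposed_cases / unexposed_noncases)

def relative_risk(exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases):
    """Risk in the exposed divided by risk in the unexposed (same four cells)."""
    risk_exposed = exposed_cases / (exposed_cases + exposed_noncases)
    risk_unexposed = unexposed_cases / (unexposed_cases + unexposed_noncases)
    return risk_exposed / risk_unexposed

rare_outcome = (30, 970, 15, 985)      # hypothetical: 3% versus 1.5% get the disease
common_outcome = (600, 400, 300, 700)  # hypothetical: 60% versus 30% get the disease

print(round(relative_risk(*rare_outcome), 2), round(odds_ratio(*rare_outcome), 2))      # 2.0 and 2.03
print(round(relative_risk(*common_outcome), 2), round(odds_ratio(*common_outcome), 2))  # 2.0 and 3.5
```

When the outcome is rare the two measures almost coincide; when it is common, the odds ratio (3.5) clearly overstates the relative risk (2.0).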
Relative risk reduction: The reduction in risk for individuals taking a treatment vs. a control or placebo group, expressed as a percentage reduction compared to the control risk. If people taking a treatment had a risk of 10% for a negative event and people who hadn’t taken the treatment (control group) had a risk of 20%, there was a 50% relative risk reduction.
- RRR = (Control Event Rate – Experimental Event Rate) ÷ Control Event Rate
Absolute Risk Reduction: The absolute difference between control event rate and experimental event rate. Relative risk reduction can be misleading: a reduction from 2% to 1% is a 50% RRR but an absolute risk reduction of only 1 percentage point.
- ARR = Control Event Rate – Experimental Event Rate
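The two formulas can be compared side by side using the event rates quoted in the text (20% versus 10%, and 2% versus 1%). This is a small illustrative sketch; the function names are our own.

```python
def relative_risk_reduction(control_event_rate, experimental_event_rate):
    """RRR = (CER - EER) / CER."""
    return (control_event_rate - experimental_event_rate) / control_event_rate

def absolute_risk_reduction(control_event_rate, experimental_event_rate):
    """ARR = CER - EER."""
    return control_event_rate - experimental_event_rate

# Event rates taken from the two examples in the text
for cer, eer in [(0.20, 0.10), (0.02, 0.01)]:
    rrr = relative_risk_reduction(cer, eer)
    arr = absolute_risk_reduction(cer, eer)
    print(f"CER {cer:.0%}, EER {eer:.0%}: RRR {rrr:.0%}, ARR {arr:.0%}")
```

Both comparisons give the same 50% relative risk reduction, but the absolute reductions are 10 and 1 percentage points respectively.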
Attributable Risk (AR): Indicates the excess rate of a disease among exposed individuals that can be attributed to that exposure. It is calculated by subtracting the incidence of the disease in an unexposed population from the incidence in the exposed population.
- AR = Incidence in exposed - incidence in unexposed
- Or as a fraction: (Incidence in exposed - incidence in unexposed) / incidence in exposed
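A brief sketch of both versions of the attributable risk follows. The incidences of 3% in the exposed and 1% in the unexposed are hypothetical.

```python
def attributable_risk(incidence_exposed, incidence_unexposed):
    """Excess incidence among the exposed."""
    return incidence_exposed - incidence_unexposed

def attributable_fraction(incidence_exposed, incidence_unexposed):
    """Share of the incidence among the exposed that is due to the exposure."""
    return (incidence_exposed - incidence_unexposed) / incidence_exposed

# Hypothetical incidences: 3% among the exposed, 1% among the unexposed
print(round(attributable_risk(0.03, 0.01), 4))      # 0.02, i.e. 2 extra cases per 100 exposed people
print(round(attributable_fraction(0.03, 0.01), 4))  # 0.6667, i.e. two-thirds of cases among the exposed
```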
Population Attributable Risk (PAR): similar to AR except it is concerned not with the excess rate of disease in those exposed but the excess disease caused by the factor in the population. PAR indicates the number (or proportion) of cases that would not occur in a population if the factor were eliminated (e.g. how many lives would be saved if people no longer smoked?)
The attributable risk in a population depends on the prevalence of the risk factor and the strength of its association (the relative risk) with the disease. The formula is
PAR = Pe (RRe-1) / [1 + Pe (RRe-1)],
where Pe is the prevalence of the exposure (e.g., proportion who are overweight) and RRe is the relative risk of disease due to that exposure.
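The PAR formula above can be tried out with a short sketch. The prevalence and relative risk values are hypothetical; they are chosen to show how a common but weak risk factor can account for as much disease in a population as a rare but strong one.

```python
def population_attributable_risk(prevalence_exposure, relative_risk):
    """PAR = Pe (RRe - 1) / [1 + Pe (RRe - 1)], returned as a proportion."""
    excess = prevalence_exposure * (relative_risk - 1)
    return excess / (1 + excess)

# Hypothetical common exposure with a modest relative risk
print(round(population_attributable_risk(0.50, 2.0), 3))  # 0.333

# Hypothetical rare exposure with a much larger relative risk
print(round(population_attributable_risk(0.05, 11.0), 3))  # 0.333
```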
The Population Prevented Fraction refers to situations where exposure to a factor is protective. The prevented fraction is the proportion of the hypothetical total load of disease that has been prevented by exposure to the protective factor. The formula is
Prevented fraction = Pe (1 - RRe),
where Pe is the prevalence of the protective exposure and RRe is the relative risk of disease associated with that exposure (here less than 1).
The Attributable Number refers to the number of cases attributable to an exposure. The formula is
AN = Ne (Ie - Iu),
where Ne is the number exposed, Ie is the incidence among those exposed, and Iu is the incidence among those unexposed to that factor.
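As a quick numerical illustration of the Attributable Number formula, consider the sketch below. The number exposed and the two incidences are hypothetical.

```python
def attributable_number(number_exposed, incidence_exposed, incidence_unexposed):
    """AN = Ne (Ie - Iu): cases among the exposed that are due to the exposure."""
    return number_exposed * (incidence_exposed - incidence_unexposed)

# Hypothetical: 10,000 exposed people, incidence 3% if exposed and 1% if unexposed
print(round(attributable_number(10_000, 0.03, 0.01)))  # 200 cases attributable to the exposure
```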
Inferential statistics estimate the accuracy of conclusions concerning relationships between variables in a population, based on the results observed in a sample.
Before describing statistical tests, two key ideas must be understood: testing hypotheses and statistical significance.
Hypothesis testing is almost always concerned with estimating differences – differences between patients treated with two drugs, or differences between people exposed to a possible causal factor and those not exposed, etc. The purpose of the statistical test is again to estimate whether the difference observed is large enough to be considered “real” (called “statistically significant”) or whether it could just be a chance finding.
All research should be as rigorous and definitive as possible, so we must begin with a very clear statement of the research question. For etiological and experimental studies, this usually takes the form of a hypothesis. A hypothesis states the anticipated findings of the study; it is typically based on the results of previous studies and, ideally, on theory.
When you start an experiment, you begin with two hypotheses:
- Study Hypothesis: A statement that reflects the expected study outcome, e.g. “The new drug will lower blood pressure in 50 y.o. men by 4 mm Hg more than conventional treatment.”
- There is then a Null Hypothesis, a conflicting statement that declares no difference between the treatments (e.g. “There will be no difference in the blood pressures of 50 y.o. men taking this new drug, as compared to conventional treatment.”)
The null hypothesis is required because of a quirk of logic: we can never definitively prove a hypothesis, because there may always be some better explanation that we haven't thought of. But we can disprove a hypothesis. Therefore, statisticians (a smart bunch, but devious) establish a pretend null hypothesis and then try to disprove that. The null hypothesis generally takes the form “Just (for a moment) pretend that there is no difference between the treatment and control groups... How confident can we be that the results we obtained from our study disprove this null hypothesis, so making the study hypothesis more plausible?”
Intuitively, your confidence in the results of any study will depend on the size of the sample used, and on the strength of the results observed. The bigger the study and the larger the difference between the experimental and control drugs, the more impressed you will be. Your intuition is correct, and statistical significance testing formalizes this common sense.
The Five Steps in Testing Hypotheses:
- Set up a study hypothesis
- Set up an alternative, null hypothesis
- Administer experiment and record results (e.g. record blood pressures for an experimental group on the new drug vs. a control group on conventional treatment)
- Select an appropriate statistical test that compares the results in the experimental and control groups; the test considers the size of the difference and also the range of variation in each group. The statistical test will produce a numerical value.
- You (or your computer) then compare this value with reference values that consider the number of people studied. The result indicates statistical significance of the difference you found; this tells you whether you can be confident in rejecting the null hypothesis, and in selecting the study hypothesis instead.
Statistical significance estimates the possibility that the results observed in a study may have been merely a chance finding, unique to that particular sample, and would likely not be repeated if the study were re-done.
Statistical tests estimate 'p' (for probability): the chance (hopefully, very small) that the difference you observed would occur if the null hypothesis were actually true. Put another way, it shows the likelihood that you will make an error in concluding that a difference exists in the broader population when in fact it does not. This is called a “Type I error” and is similar to a false positive result on a screening test. The “p” value depends on the sample size (the bigger the sample, the more confident you will be that it produces trustworthy results) and the size of the difference observed.
Results are said to be "statistically significant" if the probability of obtaining such a result when the null hypothesis is true is very small. You can choose the cut-off for this probability, but a convention is to use 5%, or "p < 0.05", for declaring the result of a study as statistically significant. p < 0.05 implies that there is less than a 5% chance that a difference of the size found in the study (or a greater difference) would occur by chance, if there was actually no difference in the whole population.
Crucial Point: Testing statistical significance is all about the likelihood of a chance finding that will not hold up in future replications. Significance does not tell us directly how big the difference was or whether this result will be useful for your patient. It simply says “It is not very likely this is a chance finding”. See below, "Statistical Significance and Clinical Importance".
t-test - a test for comparing the mean values from two samples. It shows how confident we can be that the two mean values differ in the populations from which the samples were drawn. Using statistical jargon, it shows how confident we can be in rejecting the null hypothesis that the two samples come from populations with identical mean values. The t-test first takes the difference between the means of the two groups. This difference is divided by the standard error of the difference in the means. Large values (positive or negative) imply that the means differ by more than would be expected by chance.
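The calculation described above can be sketched as follows. The blood pressure readings are invented, and the standard error of the difference is computed in the unequal-variance (Welch) form, which is one common choice and an assumption on our part. The sketch stops at the t value; turning it into a p value requires the t distribution, which statistical software provides.

```python
import math
import statistics

def t_statistic(group_a, group_b):
    """Difference in means divided by the standard error of that difference."""
    difference = statistics.mean(group_a) - statistics.mean(group_b)
    # Standard error of the difference (Welch form, using sample variances)
    standard_error = math.sqrt(
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    )
    return difference / standard_error

# Hypothetical systolic blood pressures (mm Hg) in two small groups
new_drug = [120, 122, 118, 121, 119]
conventional = [125, 127, 123, 126, 124]

print(t_statistic(new_drug, conventional))  # -5.0: the means differ by 5 standard errors
```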
ANOVA - "Analysis of Variance". An extension of the t-test used (for example) to compare the mean values from more than two samples. It compares the variation within each sample to the variation between the samples. For example, if you want to show that Med students are smarter than other students, you could record IQ for students in several faculties. There will naturally be some variation among students in each faculty (variance within), but the ANOVA tests whether there is greater variation between the faculties. [Kind of a dumb example, isn't it? I guess you don't need any statistics to know that Meds are far smarter...Sorry.]
Chi-square test - a statistical test of the association between two categorical variables: it compares the frequency distribution observed in a sample with what would be expected under the null hypothesis.
Using Chi-Square Test
Chi square is often used for evaluating the association of two variables presented in tables such as the familiar "2 x 2 table". For example, you could classify students into Medical students versus Other students, and also according to whether they were Beautiful People or not:
Our hypothesis is (obviously) that med students will be more beautiful than others, so the null hypothesis would be that medical students are no more likely to be beautiful.
Now comes the crucial idea: If the null hypothesis were true, and there is no association between the type of student and their looks, you would expect 50 in each cell in the table above. ("Why?", you ask. Well, half of the 200 people studied are medical students and half are beautiful, so if there was no association between beauty and type of student they would be equally distributed into the 4 cells of the table, giving 50 in each.)
But in fact we observe that 60 of the 100 medical students were beautiful, compared to only 40 of the others.
Statistical question: maybe this difference between expected and observed is just a chance finding... How likely is this, given a sample size of 200? The chi-squared test will tell you the probability that this 60:40 finding might arise merely due to chance.
Note that the data here are not scores on a measurement, but just numbers of people: that's what chi-square is good at.
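For readers who want to see the arithmetic, here is a sketch of the chi-square calculation for the counts described above (60 of 100 medical students and 40 of 100 other students classed as beautiful). It uses the plain chi-square formula without a continuity correction, and the p value shortcut is valid only for a 2 x 2 table (one degree of freedom); both choices are our assumptions.

```python
import math

# Rows: medical students, other students. Columns: beautiful, not beautiful.
observed = [[60, 40],
            [40, 60]]

row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
grand_total = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, observed_count in enumerate(row):
        # Count expected in this cell if there were no association
        expected_count = row_totals[i] * column_totals[j] / grand_total
        chi_square += (observed_count - expected_count) ** 2 / expected_count

# With 1 degree of freedom, the p value can be obtained from the normal distribution
p_value = math.erfc(math.sqrt(chi_square / 2))

print(chi_square)         # 8.0
print(round(p_value, 4))  # about 0.0047
```

A p value of about 0.005 means a 60:40 split this lopsided would rarely arise by chance alone in a sample of 200.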
The discussion of statistical tests and significance levels introduces the notion of the statistical power of a study. This refers to its ability to detect a difference, if one exists.
There are two main reasons why a study may not show a significant difference between groups being studied (e.g. in a randomized trial of a new drug, or a case-control study testing the effect of an exposure on a disease).
- There really was no significant difference (hence a true negative result)
- There was a difference but the study failed to detect it (false negative result). This may arise because the study was poorly designed (e.g. used imprecise measurements) or because the study was too small (in statistical jargon, it "lacked power").
The power of a study is its ability to detect a difference, if the difference in reality exists.
Additional Information Regarding the Power of a Study
One way to think about power is in terms of false negative results: power refers to the likelihood of avoiding a false negative. Real statisticians use more complex terms, of course, and speak of power as the probability of not making a beta, or a "Type II" error, which refers to falsely concluding that there was no difference (e.g., between experimental and control groups) when in fact there was a difference, but the study failed to show it.
Any study involves only a sample of people from the population of interest, and there are several reasons why the study may fail to detect the real difference that exists in the population. What factors influence whether or not a study will be able to detect a real difference?
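One way to explore this question is with a rough power calculation. The sketch below uses a normal approximation for comparing two group means, which is an assumption of ours and not a method given in this text; the 4 mm Hg difference echoes the earlier blood pressure hypothesis, and the standard deviations and sample sizes are hypothetical.

```python
from statistics import NormalDist

def approximate_power(difference, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided comparison of two means.

    Assumes equal group sizes, a known common SD and a normal approximation.
    """
    standard_normal = NormalDist()
    standard_error = sd * (2 / n_per_group) ** 0.5
    critical_value = standard_normal.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return standard_normal.cdf(abs(difference) / standard_error - critical_value)

# Hypothetical trial aiming to detect a 4 mm Hg difference in blood pressure
print(round(approximate_power(4, 10, 50), 2))   # about 0.52
print(round(approximate_power(4, 10, 200), 2))  # about 0.98: a bigger sample gives more power
print(round(approximate_power(8, 10, 50), 2))   # about 0.98: a bigger true difference is easier to detect
print(round(approximate_power(4, 5, 50), 2))    # about 0.98: less variable measurements also help
```

Under these assumptions, power rises with sample size and with the size of the true difference, and falls as the measurements become more variable.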
The whole edifice of statistical significance testing only tells us how accurately the results obtained in a study may apply to the broader population from which the sample was drawn. However, your task as a clinician goes one step further: you also have to judge how likely the results may apply to your patients, who probably live in a different place and time and may never have been part of the population in which the study was undertaken. That is not a purely statistical question because the researcher cannot know where you practice and how your patients differ. But the p values, along with your skilled judgment, can nonetheless give some clues as to the likely relevance of the study findings to your patients. This will be explored in the EBM component of the curriculum.
A key idea to grasp in interpreting significance tests is that if a study is very large, its result may be statistically significant (= unlikely to be due to chance), and yet the deviation from the null hypothesis may be too small to be of clinical interest. Perhaps the new drug tested in a very large study is just a tiny bit better than the old one, and you may feel that it is trivial or clinically unimportant. Conversely, the result may not be statistically significant because the study was small (or "under powered"), but the difference is large and would seem potentially important from a clinical point of view.
Although you would almost certainly not apply a new treatment unless studies had shown it to be statistically superior, statistics alone cannot fully answer the question of whether a result is clinically significant or important. It is a question of clinical judgment, considering the magnitude of benefit of each treatment, the respective profiles of side effects of the two treatments, their relative costs, your comfort with prescribing a new therapy, the patient's preferences, and so on.
Clinical importance/clinical significance: This refers to the clinical relevance of an observed difference: will it alter your practice?
There are several ways of illustrating the benefit of treatments that may be useful to a clinician, beginning with the use of confidence intervals instead of significance levels. Confidence intervals show the likely range of results within which the true value is likely to lie.
• An example: a study showed a statistically significant impact (p < 0.03) of meditation on reducing systolic BP compared to controls. The mean reduction was 7 mm Hg (95% CI 4, 10). Instead of significance testing telling us that this study result could have occurred 3% of the time by chance alone, confidence intervals tell us what our best guess is for the size of the population effect, 95% of the time. This seems more informative for the clinician.
The next statistic that may assist the clinician in deciding whether to switch to a new treatment is the Number Needed to Treat.
The NNT is the number of patients with a condition who must follow a treatment regimen over a specified time in order to achieve the desired outcome for one person.
The NNT summarizes the effectiveness of a therapy, or a preventive measure, in achieving a desired outcome. It is one way to indicate the clinical significance of an intervention. The simple idea is that no treatment works for everybody, so how many would you need to treat to benefit one case?
- The NNT can be presented in negative terms, where the goal is to avoid a negative outcome such as taking medication to prevent a stroke; but it can also apply to achieving a cure.
How Is It Calculated? Let's work from an example:
Researchers tested a new drug that aims to decrease the chance of stroke in men who experience atrial fibrillation. The study included 1,000 men who took the new drug for 5 years, and 1,000 who were given the standard therapy. At the end of the trial, 6% of the men in the standard therapy group experienced a stroke, compared to only 2% in the group taking the new drug.
A simple way to express the benefit of the new drug would be the relative risk reduction (RRR), which compares the reduction in strokes with the rate in the standard therapy group: (6% − 2%) divided by 6%, giving 0.67, or a two-thirds risk reduction.
• Recall: RRR = (Control Event Rate − Experimental Event Rate) ÷ Control Event Rate
This sounds impressive, but relative calculations can be misleading because they omit the risk of no treatment (the 'absolute risk'). An alternative statistic, the absolute risk reduction (ARR), is calculated by subtracting 2% from 6%, which gives a more modest (and less misleading) description of the benefit. In other words, 4% implies that only 4 men in 100 actually benefit from the new drug.
• ARR = Control Event Rate – Experimental Event Rate
The NNT is the reciprocal of the absolute risk reduction, or 1 / ARR. This gives 25, meaning that 25 men had to receive the new drug for 5 years in order for one man to benefit (i.e. one less stroke to occur).
Now consider a study with the same parameters at the outset, but 20% of men suffered a stroke on the conventional therapy compared with 10% on the new drug. The RRR is 50% (less good than before), but the NNT is now just 10, suggesting a much greater advantage of the new drug.
This presentation of the absolute benefit of therapy is probably what patients are most interested in, so it is more useful than the relative risk reduction.
Cast in positive terms, if a medication cures 35% of people who take it, while 20% improve using a placebo, the absolute improvement is 15%. So the NNT is 1 / 0.15 = 6.7, which is rounded up to 7. On average you would need to treat 7 people to achieve 1 cure.
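The three NNT examples in this section can be reproduced with a short sketch. The function name is our own, and the result is rounded up to a whole person, as in the cure example. The intermediate rounding to six decimal places is only there to absorb tiny floating-point errors.

```python
import math

def number_needed_to_treat(control_rate, treatment_rate):
    """NNT = 1 / absolute difference in event (or cure) rates, rounded up."""
    absolute_difference = abs(control_rate - treatment_rate)
    return math.ceil(round(1 / absolute_difference, 6))

print(number_needed_to_treat(0.06, 0.02))  # 25: strokes, 6% on standard therapy vs 2% on the new drug
print(number_needed_to_treat(0.20, 0.10))  # 10: strokes, 20% vs 10%
print(number_needed_to_treat(0.20, 0.35))  # 7: cures, 20% on placebo vs 35% on the medication
```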
Some Cautions In Interpreting NNT
- You cannot directly compare NNT figures from different studies to choose between two or more therapies. For the comparison to be meaningful, the therapies must have been tested in similar population samples with the same condition, using the same comparator, time period, and outcomes.
- The NNT always has to include a time frame (5 years, 10 years, etc).
- The NNT should also be interpreted very carefully with patients who experience the same event numerous times, such as repeated asthma attacks, as it can lead doctors to over-estimate the benefit of therapy and even over-prescribe.
- Relying solely on NNT ignores other helpful sources of information. Information on side-effects, costs, cost-effectiveness, and patient preferences is also important for making informed health and health care decisions. Using NNT alone also doesn’t give patients an idea of their baseline risks.
Why Do We Need an NNT? Why Does a Treatment not Benefit Everyone?
Take the example of prescribing Coumadin as a blood thinner to prevent stroke in people with atrial fibrillation. First, because of biological variability, not all people who take the drug will benefit from it. But we don't know which ones will benefit, so we have to take a 'shotgun' approach.
Second, and more important, only a small fraction of people who don’t take the drug will actually ever develop a stroke. While we know that atrial fibrillation statistically increases the risk of a stroke, we don't yet know how to identify those who will actually get a stroke (it might even be a matter of pure chance – at present we don't know). So, among those who did take the Coumadin only a small number were actually at risk and so could potentially benefit from it.
Updated September 20, 2017