Introduction
• What is Statistics? • Empiricism and "The Scientific Method"
Measurement and Variables
• Measurement • Types of Variables • Independent and Dependent Variables
Data
• Collection and Storage • Information Quality
Sampling
• Population & Sample • Probability Samples • Simple Random Samples • Sampling With and Without Replacement
Notation & Vocabulary
Statistics is the discipline concerned with the collection, organization, and interpretation of numerical data, especially as it relates to the analysis of population characteristics by inference from sampling. The discipline of statistics addresses all elements of analysis, from study planning to the final presentation of results. Statistics is more than a compilation of computational techniques; it is a means of learning from data; it is "the servant of all sciences" (Neyman, 1955).
So what exactly do statisticians do? In brief, the job of the statistician is a combination of data detective and judge (Tukey, 1969, 1991). The detective explores data for the purpose of finding clues and patterns. The judge adjudicates and tests patterns for the purpose of verification. To concentrate on exploration without adjudication would be an obvious mistake, for facts must be objectively evaluated and confirmed. On the other hand, to relegate detection to an inferior role would be equally erroneous, for where does new knowledge come from if not from detection? Therefore, both detection and adjudication are important!
We often speak of two types of statistics: descriptive statistics and inferential statistics. Descriptive statistics include procedures for summarizing, organizing, graphing, and otherwise describing data. Such statistics are particularly helpful during the initial stages of detection and discovery. In contrast, inferential statistics are used to generalize from a sample to a population and to confirm specific hypotheses. Such work is especially important when adjudicating facts. In practice, it should be noted that descriptive statistics and inferential statistics tend to overlap.
Although the methods and principles presented in this primer are applicable to all fields of statistics, most of its examples come from the field of biostatistics. Biostatistics (also called biometry, literally meaning "biological measurement") is the application of statistics to biological and biomedical problems. There are many fields of biostatistics (e.g., theoretical biostatistics, laboratory biostatistics, epidemiologic biostatistics), each having the goal of making sense of data while communicating complex ideas with accuracy, efficiency, and truthfulness. Einstein's three rules of work apply: "Out of clutter, find simplicity. From discord, find harmony. In the middle of difficulty lies opportunity."
Empiricism and "The Scientific Method"
Our reliance on statistics can be examined against the backdrop of empiricism and "the scientific method." Empiricism (from the Greek empirikos, "experience") means "based on observation." The scientific method is not an actual method -- at least not in the normal sense -- for there are no orderly rules of progress and no set procedures to follow. Nevertheless, it is based on a combination of empiricism and theory and proceeds through several overlapping stages of reasoning.
Statistics seeks to make each of these stages more objective (so that things are observed as they are, without falsifying observations to accord with some preconceived world view) and reproducible (so that we judge things in terms of the degree to which observations can be repeated).
Measurement and Variables

Measurement is the assignment of numbers or codes according to prior-set rules. It is how we get the numbers upon which we perform statistical operations.
Measurements that can vary or be expressed as more than one value throughout a study are called variables. For example, we may speak of the variable age, blood pressure, or height. In other words, variables represent the "thing" being measured.
In statistical formulae, variables are represented with capital letters (e.g., X, Y, Z). In computerized databases, variables are denoted with short descriptive names (e.g., AGE, BP, HEIGHT), which are usually 8 characters or fewer.
Although there are many ways to speak about variables, no standard taxonomy exists (Velleman & Wilkinson, 1993). For the current discussion, we need to consider only whether a variable is continuous, ordinal, or categorical.
Continuous variables represent quantitative measurements. For example, AGE (in years) is continuous. Continuous variables are also called quantitative variables or scale variables.
Ordinal variables represent rank-ordered categories. For example, an OPINION scale in which responses are graded 5 = strongly agree, 4 = agree, 3 = neutral, and so on, is an ordinal variable.
Categorical variables represent named attributes. For example, SEX (male or female) is a categorical variable. Categorical variables are also called qualitative variables or nominal ("named") variables.
Categorical data are often not directly translatable into numerical data. For many statistical procedures, therefore, categorical variables are limited to dichotomous (yes/no) classifications indicating the presence or absence of an attribute or condition. When only two categories are present, such as with gender, we might translate the categorical variable to a scale indicating the presence or absence of, say, being male (1 = male, 0 = not male). For categorical data consisting of more than two categories, more than one variable is needed to represent the data. For example, let us consider race with four categories: black, Asian, white, and other. In this case, we could create three variables (one fewer than the number of categories) to represent race. The first variable could indicate the presence or absence of black race (1 = black, 0 = not black), the second variable could indicate the presence or absence of being Asian (1 = Asian, 0 = not Asian), and the third variable could indicate the presence or absence of being white (1 = white, 0 = not white). The fourth category need not be encoded, for an absence of the other three attributes translates into other. In general, if there are k categories that need encoding, then k - 1 indicator variables will be used to translate the original categorical variable into numeric terms.
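The k - 1 indicator scheme can be sketched in a few lines of Python. The variable names (BLACK, ASIAN, WHITE) are illustrative only, chosen to match the race example above:

```python
# Encode a four-category race variable (black, Asian, white, other)
# as k - 1 = 3 indicator variables; "other" is the reference category.

def encode_race(race):
    """Return three 0/1 indicators for a single response."""
    return {
        "BLACK": 1 if race == "black" else 0,
        "ASIAN": 1 if race == "Asian" else 0,
        "WHITE": 1 if race == "white" else 0,
    }

print(encode_race("Asian"))  # {'BLACK': 0, 'ASIAN': 1, 'WHITE': 0}
print(encode_race("other"))  # all zeros: "other" is implied by absence
```

Note that "other" never receives its own indicator; a row of all zeros identifies it unambiguously.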
We may also classify variables as being either independent or dependent. The independent variable of an analysis is the factor, intervention, or attribute that either defines groups or is thought to predict an outcome. The dependent variable is the measurement, outcome, or endpoint of the study. All variables other than the independent variable and dependent variable in a particular analysis are referred to as extraneous variables.
Many statistical analyses seek to determine the extent to which the dependent variable "depends" on the independent variable. If the level of the dependent variable depends on the level of the independent variable, a statistical association is said to exist. For example, in studying the effects of smoking on forced expiratory volume (a measure of respiratory function), smoking represents the independent variable and forced expiratory volume represents the dependent variable. All other variables (e.g., age, sex, height, etc.) would be considered extraneous.
Data

Data may be collected experimentally or observationally. In experimental studies, the investigator maintains some control of the conditions under which data are collected. Most notably, the investigator may allocate a treatment or some other intervention to the experimental subjects. In contrast, observational studies investigate subjects as they are, without intervention. Clearly, the former (experimental studies) are preferable when the effect of a treatment is being evaluated. However, experimental studies are often impossible for ethical and pragmatic reasons (e.g., it would not be ethical to expose human subjects to treatments in which the risks clearly outweigh the benefits). Also, experimental studies are often expensive and time-consuming to complete. Therefore, many epidemiologic and social science investigations are done observationally.
Regardless of whether the study is observational or experimental, data are usually collected on a data collection form before being entered and stored on a computer. Data may come from abstracting existing records, from a survey questionnaire, by a direct exam, by collecting biospecimens, by environmental sampling, or by some other means. In the jargon of research, the data collection form is called an instrument, even if it is not an instrument in the normal sense of the word.
In collecting data, each sampled unit represents an observation, and each item on the data collection form represents a variable. For example, a data collection form that looks like this:
What is your age? [__] (years)
What is your gender: [__] (M/F)
Are you HIV positive: [__] (Y/N)
Have you been diagnosed with Kaposi's Sarcoma? [__] (Y/N)
Today's date: [___/___/___]
Have you ever had an opportunistic infection? [__] (Y/N)
Each question on the form translates to a variable, and each completed form represents an observation.
Data from the forms are compiled to form a data table with observations arranged along rows and variables forming columns. A data table based on the above form may look something like this:
AGE | SEX | HIV | KAPOSISARC | REPORTDATE | OPPORTUNIS |
27 | F | Y | Y | 04/25/89 | N |
30 | F | N | N | 09/11/89 | Y |
21 | F | Y | Y | 01/12/89 | N |
30 | N | Y | Y | 10/08/89 | Y |
showing 4 observations (n = 4) and 6 variables. The variables AGE and REPORTDATE are continuous variables. The variables SEX, HIV, KAPOSISARC, and OPPORTUNIS are categorical.
Notice that specific values representing realized measurements are stored in table cells. It is important to differentiate between variables and values. Variables represent the measurement in general. Values represent specific findings. For example, in the above table, the value of variable AGE for observation number 1 is 27.
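As a minimal sketch of this structure, the first two rows of the table above can be stored in Python as one record per observation, with variables as keys. The value of a variable for a given observation is then looked up by name:

```python
# Two observations (rows) from the data table above; each dict is one
# observation, and each key is a variable (column).
records = [
    {"AGE": 27, "SEX": "F", "HIV": "Y", "KAPOSISARC": "Y",
     "REPORTDATE": "04/25/89", "OPPORTUNIS": "N"},
    {"AGE": 30, "SEX": "F", "HIV": "N", "KAPOSISARC": "N",
     "REPORTDATE": "09/11/89", "OPPORTUNIS": "Y"},
]

# The variable (AGE) names the measurement in general; the value (27)
# is the realized measurement for observation 1.
print(records[0]["AGE"])  # 27
```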
Information Quality

"Garbage in and garbage out" (GIGO), or so the old saying goes. Since an analysis is only as good as its data, we must place great emphasis on collecting valid information and taking care of the data once they are collected. Researchers must continually search for more accurate assessments of what they wish to measure.
When data problems do occur, they can be classified as either measurement errors or processing errors.
Measurement errors can result from device defects, incomplete and erroneous data sources, improper diagnostic procedures, and problems in questionnaire design and administration. Since the quality of a study's measurements will determine the validity of the study, a great deal of effort is invested in developing and testing data collection instruments, such as valid questionnaires.
When collecting survey data, one must be careful to use simple, understandable, carefully worded, non-leading questions (see Payne, 1951). Very little should be taken for granted and nothing should be implied or assumed. Consider how subtle word choices can influence a respondent:
Suppose I ask you to remember the word 'jam.' I can bias the way in which you encode and remember the word by preceding it with the word 'traffic' or 'strawberry.' If I have initially biased your interpretation of the word in the direction of traffic jam, you are much less likely to recognize the word subsequently if it is accompanied by the word 'raspberry,' which biases you toward the other meaning of jam. This effect occurs even though the subject knows full well that he is only supposed to remember the word 'jam' and not the contextual or biasing words. . . . We do not perceive or remember in a vacuum. (Baddeley cited in Gourevitch, 1999, p. 66)
Measurement errors may also take a more subtle, insidious form -- that of measuring something similar to what is intended and pretending it is the actual thing of interest. An illustration of this phenomenon is provided by Huff (1954, pp. 74 - 75):
You can't prove that your nostrum cures colds, but you can publish (in large type) a sworn laboratory report that half an ounce of the stuff killed 31,108 germs in a test tube in eleven seconds. While you are about it, make sure that the laboratory is reputable or has an impressive name. Reproduce the report in full. Photograph a doctor-type model in white clothes and put his picture alongside. But don't mention the several gimmicks in your story. It is not up to you - is it? - to point out that an antiseptic that works well in test tube may not perform in the human throat, especially after it has been diluted according to instructions to keep it from burning throat tissue. Don't confuse the issue by telling what kind of germ you killed. Who knows what germ causes colds, particularly since it probably isn't a germ at all?
I mention this problem because of its prevalence in today's poll-driven, focus-group oriented society. Perhaps our obsession with making our point or our desire to maximize "profits" has occasionally gotten in the way of objectivity. Perhaps it is just our desire to gain an answer as economically as possible. (You get what you pay for.) When a reliable answer is needed, however, it is usually best not to cut corners.
Let us now briefly consider the other general source of erroneous data: processing errors, which occur during data handling. Examples of processing errors are transpositions (e.g., 19 becomes 91 during data entry), copying errors (e.g., the number 0 becomes the letter O during data entry), coding errors (e.g., racial groups get improperly coded), routing errors (e.g., the interviewer asks the wrong question or asks questions in the wrong order), consistency errors (contradictory and nonsensical responses; e.g., male hysterectomies), and range errors (responses outside of the range of plausible answers; e.g., an age of 200) (Bennett et al., 1996). The most effective way to deal with processing errors is to identify the stage at which they occur and attack the problem at that point. This may involve manual checks for completeness (e.g., checks for legible handwriting, coding errors, and routing errors), computerized checks during data entry (e.g., computer programs that make certain that responses are within reasonable ranges), checks through statistical analysis (e.g., checking for "outliers" and data incompatibilities), and double entry and validation procedures during data entry.
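A minimal sketch of such computerized checks at data entry, using hypothetical field names (AGE, SEX, HYSTERECTOMY) and illustrative limits that are not from the text:

```python
# Sketch of computerized range and consistency checks at data entry.

def check_record(rec):
    """Return a list of error messages for one data record."""
    errors = []
    if not (0 <= rec["AGE"] <= 120):            # range check
        errors.append("range error: AGE = %s" % rec["AGE"])
    if rec["SEX"] not in ("M", "F"):            # range (code) check
        errors.append("range error: SEX = %s" % rec["SEX"])
    if rec["SEX"] == "M" and rec.get("HYSTERECTOMY") == "Y":
        errors.append("consistency error: male hysterectomy")
    return errors

# A record exhibiting a range error and a consistency error:
print(check_record({"AGE": 200, "SEX": "M", "HYSTERECTOMY": "Y"}))
```

In practice such rules live in the data-entry program itself, so that implausible values are flagged before they ever reach the stored data table.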
Sampling

In most statistical studies, we wish to quantify something about a population. For example, we may wish to know the prevalence of diabetes in a population, the typical age at which teenagers begin to smoke, or the average birthweight of babies born in a particular community. When the population is small, it is sometimes possible to obtain information from the entire population. A study of the entire population is called a census. However, performing a census is usually impractical, expensive, and time-consuming, if not downright impossible. Therefore, nearly all statistical studies are based on a subset of the population, which we will call the sample.
When selecting a sample, we need to know how many people to study and which people from the population to select. A study's sample size depends on many factors, and will be the topic of future study. Presently, let us consider how to select a valid sample.
A valid sample is one that represents the population to which inferences will be made. Although there is no fail-safe way to ensure sample representativeness, much has been learned over the past half century about sampling to maximize a sample's usefulness. One thing that has been learned is that, whenever possible, a probability sample should be used. A probability sample is a sample in which every population member has a known probability of being included in the sample.
The most basic type of probability sample is the simple random sample. A simple random sample is a sample in which each member of the population has an equal probability of entering the sample. This ensures that the sample will be independent (the selection of one unit has no influence over the selection of any other unit) and unbiased (each unit in the population has the same probability of entering the sample).
These are two extremely important sampling features.
In order to select a simple random sample, it is best to start with a sampling frame of all potential sampling units, in which each population member is assigned an identification number between 1 and N. A random number generator is then used to determine which n individuals will be sampled. (Random number generators can be found at www.random.org/nform.html and www.randomizer.org/form.htm.) Here, for example, is a list of 10 random numbers between 1 and 600: 35, 37, 43, 143, 321, 329, 337, 492, 494, 546. Let us use these random numbers to select 10 individuals from the population (sampling frame) listed at www.sjsu.edu/faculty/gerstman/StatPrimer/populati.htm. This population (sampling frame) has 600 individuals (N = 600). Data for the variables AGE, SEX, HIV status, KAPOSISARComa status, REPORTDATE, and OPPORTUNIStic infection are listed. The above 10 random numbers determine the IDs of the people who are sampled. Our sample, therefore, is:
ID | AGE | SEX | HIV | KAPOSISARC | REPORTDATE | OPPORTUNIS |
35 | 21 | F | Y | N | 01/09/89 | Y |
37 | 42 | M | Y | Y | 10/21/89 | Y |
43 | 5 | M | N | Y | 01/12/90 | Y |
143 | 11 | F | Y | N | 02/17/89 | Y |
321 | 30 | M | Y | Y | 12/28/89 | Y |
329 | 50 | M | Y | Y | 12/29/89 | N |
337 | 28 | M | N | N | 08/19/89 | Y |
492 | 27 | . | N | N | 08/31/89 | N |
494 | 24 | M | Y | Y | 08/19/89 | Y |
546 | 52 | . | Y | Y | 10/13/89 | Y |
(Dots represent missing values.)
Let us review the procedure for selecting a simple random sample from a well-defined sampling frame:
(A) A sampling frame of all population members is compiled.
(B) Population members are identified with unique identification numbers between 1 and N.
(C) The researcher decides on an appropriate sample size for study.
(D) The researcher selects n random numbers between 1 and N.
(E) Persons with identification numbers determined by the random number generator are included in the sample.
Of course, in practice, selection of a simple random sample is seldom as "clean" as this. Still, this procedure helps to conceptualize an ideal sample.
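Steps (A) through (E) can be sketched with Python's standard library, assuming the frame above with N = 600 and a desired sample of n = 10. The seed is fixed only so the draw is reproducible:

```python
import random

N = 600  # population (sampling frame) size
n = 10   # desired sample size

random.seed(1)  # fixed seed only so this particular draw is reproducible

# Draw n distinct identification numbers between 1 and N; persons with
# these IDs enter the sample.
ids = sorted(random.sample(range(1, N + 1), n))
print(ids)
```

`random.sample` guarantees distinct IDs, matching the fact that a given person can enter the sample only once.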
Sampling can be done with replacement or without replacement. Sampling with replacement is done by "tossing" population members back into the population pool after they have been selected. This way, all N members of the population are given an equal chance of being selected at each draw, even if they have already been drawn. In contrast, sampling without replacement is done so that once a population member has been drawn, that subject is removed from the population pool for all subsequent draws. This way, once a population member has been drawn, his or her subsequent probability of selection is zero. Most introductory statistical texts assume that sampling is done with replacement or from a very large population, so that the distinction between sampling with and without replacement is inconsequential.
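The distinction can be illustrated with Python's standard library: drawing with `random.choice` at each step models sampling with replacement, while `random.sample` models sampling without replacement. The population of 10 members is illustrative:

```python
import random

pool = list(range(1, 11))  # a small population with N = 10 members
random.seed(7)             # fixed seed for reproducibility

# With replacement: each draw is from the full pool, so repeats can occur.
with_repl = [random.choice(pool) for _ in range(5)]

# Without replacement: a drawn member leaves the pool for later draws,
# so repeats are impossible.
without_repl = random.sample(pool, 5)

print(with_repl)     # may contain duplicates
print(without_repl)  # never contains duplicates
```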
The ratio of the sample size (n) to population size (N) is called the sampling fraction. Let f represent the sampling fraction:
f = n / N
For example, if we select a sample of n = 10 from a population in which N = 600, f = 10 / 600 = 0.0167. When f < 0.05, the distinction between sampling with replacement and sampling without replacement is inconsequential.
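The worked example can be checked directly:

```python
# Sampling fraction for the example: n = 10 drawn from N = 600.
n, N = 10, 600
f = n / N
print(round(f, 4))  # 0.0167
print(f < 0.05)     # True: the replacement distinction is inconsequential
```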
Notation & Vocabulary

Biostatistics: the application of statistics to biological and biomedical problems (syn: biometry).
Categorical variable: a variable used to store discrete, classified attributes (syn: nominal variable, qualitative variable, discrete variable).
Continuous variable: a variable used to store numerical-scale measurements (syn: quantitative variable, scale variable, interval variable).
Dependent variable: the outcome or endpoint variable of a study.
Descriptive statistics: statistics that are used to describe, summarize, and explore data.
Hypothesis: an educated hunch or explanation for an observed finding.
Independent variable: the factor, intervention, or attribute that is thought to influence an outcome.
Inferential statistics: statistics that are used to generalize from a sample to a population.
Measurement: the assignment of numbers or codes according to prior-set rules.
Measurement error: differences between "true" answers and what appears on data collection instruments.
Objective: the extent to which things are observed as they are, without falsifying observations to accord with some preconceived world view.
Observation: data from an individual study subject or sampled unit.
Ordinal variable: a variable used to store discrete measurements that can be ordered from low to high, but do not have equal spacing among values; rank-ordered data.
Processing error: data errors that occur during data handling.
Reproducible: the degree to which observations can be repeated.
Statistics: the discipline concerned with the collection, organization, and interpretation of numerical data, especially as it relates to the analysis of population characteristics by inference from sampling. Statistics is a way of learning from data.
Value: a realized measurement.
Variable: the generic designation for a measurement that can be expressed as more than one value.
Census: a survey of the entire population, so that the sampling fraction is 100%.
Experimental study: a study undertaken in which the researcher has control over the conditions in which the study takes place, allowing the investigator to allocate a treatment or some other experimental intervention.
Independence: sampling such that the selection of one unit into the sample has no influence over the selection of any other unit.
Observational study: a study undertaken in which the researcher has no control over the factors being studied.
Population: The universe of potential values from which a sample is drawn.
Probability sample: a sample in which every population member has a known probability of being included in the sample.
Sample: a subset of the population.
Sampling frame: a list of the population from which a sample is drawn.
Sampling fraction: the ratio of the sample size (n) to population size (N).
Sampling with replacement: a sample in which one can replace subjects into the sampling frame after each draw.
Sampling without replacement: a sample in which one cannot replace subjects into the sampling frame after each draw.
Simple random sample: a sample in which each member of the population has an equal, nonzero probability of entering the sample; simple random samples are characterized by independence and unbiasedness.
Unbiasedness: sampling so that each unit in the population has the same probability of entering the sample.
Baddeley, A., as cited in Gourevitch, P. (1999, June 14). The Memory Thief. The New Yorker.
Bennett, S., Myatt, M., Jolley, D., & Radalowicz, A. (1996). Data Management for Surveys and Trials: A Practical Primer Using Epi Info. Llanidloes, Powys, Great Britain: Brixton Books.
Payne, S. L. (1951). The Art of Asking Questions. Princeton, NJ: Princeton University Press.
Huff, D. (1954). How to Lie with Statistics. New York: Norton.
Tukey, J. W. (1969). Analyzing data: Sanctification or detective work? American Psychologist, 24, 83 - 91.
Tukey, J. W. (1991). The philosophy of multiple comparisons. Statistical Science, 6, 100 - 116.
Wallis, W. A. & Roberts, H. V. (1962). The Nature of Statistics. New York: The Free Press.