CHAPTER 17 Statistics II: Examining Relationships between Two Variables
In this chapter you will learn:
How to define association.
What chi-square represents.
Which measure of association is most appropriate when comparing variables with different levels of measurement.
In political science research, we are generally less concerned with describing distributions on single variables than we are with determining whether, how, and to what extent two or more variables may be related to one another. It is these bivariate (two-variable) and multivariate (more-than-two-variable) relationships that usually cast light on the more interesting research questions.
When examining the relationship between two variables, we typically ask three important questions. The first is whether and to what extent changes or differences in the values of one variable—generally the independent variable—are associated with changes or differences in the values of the second, or dependent, variable. The second question examines the direction and form of any association that might exist. The third considers the likelihood that any association observed among cases sampled from a larger population is in fact a characteristic of that population and not merely an artifact of the smaller and potentially unrepresentative sample. In this chapter we introduce some of the statistics that are most commonly used to answer these questions, and we explain when it is appropriate to use them and what they tell us about relationships.
MEASURES OF ASSOCIATION AND STATISTICAL SIGNIFICANCE
An association is said to exist between two variables when knowing the value of one for a given case improves the odds of our guessing correctly the corresponding value of the second. If, for example, we examine the relationship between the size of a country’s population and the proportion of its adults who are college educated, we may variously find (1) that larger countries generally have a greater proportion of college-educated adults than smaller ones, (2) that smaller countries generally have a greater proportion of college-educated adults than larger ones, or (3) that there is no systematic difference between the two: that some countries from each group have relatively high proportions of such people but that some from each group have low proportions as well. If our research shows that either case 1 or case 2 holds, we can use our knowledge of values on the independent variable, size of population, to guess or predict values on the dependent variable, proportion of adults who are college educated, for any given country. In the first instance, for any heavily populated country, we predict a relatively high proportion of college-educated adults, and for a less populous nation, we predict a lower proportion. In the second, our prediction is precisely reversed. In either event, although we may not guess every case correctly, we will be right fairly often because of the underlying association between the two variables. Indeed, the stronger the association between the two variables (the more consistently the individual countries align on the two variables in the same order), the more likely we are to guess correctly in any particular instance. If there is total correspondence in the alignments on the two variables, high scores with high scores or, alternatively, high scores on one with low on the other, we can predict one from the other with perfect accuracy.
This contrasts sharply with the third possibility, which permits no improved prediction of values on the education variable based on our knowledge of populations. In such instances, when cases are, in effect, randomly distributed on the two variables, there is said to be no association.
To get a mental picture of what a strong association might look like, consider the two maps presented in Figure 17.1, which relate to the murder rate in Washington, DC, during the “crack wars” of the 1980s. Figure 17.1(a) shows the location of known drug markets in the nation’s capital; Figure 17.1(b) shows the location of homicides. Both are based on information provided by the DC Metropolitan Police Department. The apparent similarity in the locations of clusters of drug dealing and murders suggests an association between the two phenomena.
Measuring Association
Clearly there can be more or less association between any two variables. The question in each instance then becomes, Just how much association is there? The answer is provided by a set of statistics known as coefficients of association. A coefficient of association is a number that summarizes the amount of improvement in guessing values on one variable for any case based on knowledge of the values of a second. In the example, for instance, such a measure would tell us how much our knowledge of a country’s population size helps us in guessing its proportion of college-educated adults. The higher the coefficient, the stronger the association and, by extension, the better our predictive or explanatory ability. In general, coefficients of association range from 0 to 1 or from –1 to 1, with the values closest to unity (1 or –1) indicating a relatively strong association and those closest to 0 a relatively weak one.
In addition to the magnitude of association, it is also useful to know the direction or form of the relationship between two variables. Take another look at the earlier example about level of education of a nation’s adults, and most particularly at options 1 and 2. We have already suggested that the closer we get to either case, the higher will be our coefficient of association and the better our chances of guessing a particular country’s proportion of college-educated adults based on our knowledge of its population size. It should be obvious, however, that our predictions in the cases are precisely opposite. In the first instance, higher values of one variable tend to be associated with higher values of the other, and in the latter instance, higher values of one tend to be associated with lower values of the other. Such relationships are said to display differences in direction. Those like the first, in which both variables rise and fall together, are termed direct, or positive, associations. Those like the second, in which scores move systematically in opposing directions, are termed inverse, or negative, associations. This additional bit of information, which is represented by a plus or a minus sign accompanying the coefficient of association, makes our guessing even more effective. Thus, a coefficient of –.87 (negative and relatively close to negative 1) might describe a relatively strong relationship in which the values on the two variables in question are inversely related (moving in opposite directions), whereas a coefficient of .20 (positive—the plus sign is usually omitted—and rather close to zero) might describe a weak direct association.
FIGURE 17.1 Drug markets and homicide locations, Washington, DC, 1988.
Source: Reprinted from the Washington Post, January 13, 1989, p. E1, with permission of the publisher.
Defining Statistical Significance
Finally, we should say a word about tests of statistical significance, though our discussion of the topic will be purposely limited.1 You will recall from our discussion of levels of confidence and sampling error in Chapter 7 that when we draw a presumably representative sample and use that sample to develop conclusions about the larger population from which it is drawn, we run some risk of coming to incorrect conclusions. This is true because there is a chance that the sample is not in fact representative and that the actual error in our measurement exceeds that specified for a given sample size (Tables A.2 and A.3 in Appendix A). The chance of such improper generalizing is known, but we cannot tell whether or not it has occurred in any particular instance. For a level of confidence of .95, that chance is .05, or 1 – .95. For a level of confidence of .99, it is .01. These values represent the likelihood that any generalization from our sample to the larger population, even allowing for the estimated range of sampling error, is simply wrong.
Tests of statistical significance perform the same function in evaluating measures of association. They tell us just how likely it is that the association we have measured between two variables in a sample might or might not exist in the whole population. Let us see if we can clarify this point.
1 A full explanation of statistical significance is beyond the scope of this text; to pursue a deeper understanding of significance testing, you are encouraged to consult one of the statistics texts listed at the end of Chapter 18.
An Example
Suppose, to continue our example, we have a population of 200 nations for which we know for a fact that the coefficient of association between population size and the proportion of adults with a college education is 0. There is, in reality, no relationship between the two variables. But suppose further that we take a sample of only 30 of these countries and calculate the association between these two variables. It might come out as 0, but this is actually unlikely, because the strength of association is now based not on all the countries but on only 30 and will probably reflect their particular idiosyncrasies. In other words, the coefficient itself is determined by which 30 countries we pick. If, by chance, we pick 30 countries that are truly representative of all 200, we will in fact find no association. But chance might also lead us to pick 30 countries for which the association between population size and education level is unusually high, say, .60. In that case, our coefficient of association measures a characteristic of the particular sample in question, but if we generalize to the larger population, our conclusions will be incorrect. Knowing this, of course, we reject our measure of association based on this particular sample.
The problem is that in the real world we seldom know the underlying population parameter, which is the true degree of association in the whole population (as defined in Chapter 7). Indeed, the reason to draw samples in the first place is exactly because we often simply cannot study whole populations. It follows, then, that more often than not the only tests of association we will have will be those based on our sampling. Moreover, these calculations will usually be based on only one sample. Thus, the question becomes one of how confident we can be that a test of association based on a single subgroup of a population accurately reflects an underlying population characteristic. The job of the test of statistical significance is to pin a number on that confidence—that is, to measure the probability or likelihood that we are making an appropriate, or, conversely, an inappropriate, generalization.
To see how this works, let us continue our example. Suppose that we draw not one sample of 30 nations from our population of 200, but 1,000 separate and independent samples of the same size and that for each we calculate the coefficient of association. Because the true coefficient for the entire population is in fact 0, most of the coefficients for our 1,000 samples will also be at or relatively near 0. Some particular combinations of 30 countries may yield relatively higher values (that is, we might by chance happen to pick only countries scoring either high-high or low-low on the two variables), but the majority will be nearer to the population parameter. Indeed, the closer one gets to the true value, the more samples one finds represented. These distributions, in fact, often resemble the normal curve mentioned earlier. This is illustrated in Figure 17.2, where the height of the curve at any given point represents the number of samples for which the coefficient of association noted along the baseline has been calculated. As you can see, most of the sample coefficients cluster around the true population parameter.
What, then, is the likelihood that any particular coefficient is simply a chance variation around a true parameter of 0? Or, in other words, if we take a sample from some population and find in that sample a strong association, what are the chances that we will be wrong in generalizing so strong a relationship from the sample to the population? The normal curve has certain properties which enable us to answer this question with considerable precision.
FIGURE 17.2 Normal distribution of coefficients of association for samples of 30 cases.
Suppose, for example, we draw from our 200 nations a sample of 30 for which the coefficient of association is –.75. How likely is it that the corresponding coefficient for the population as a whole is 0? From Figure 17.2, the answer must be a resounding Not very! The area under the curve represents all 1,000 (actually any number of) sample coefficients when the true parameter is 0. The much smaller shaded area at and to the left of –.75 represents the proportion of such coefficients that are negative in direction and .75 or stronger in magnitude. Such cases constitute a very small proportion of the many sample coefficients. For this reason, the odds of drawing such a sample in any given try are quite slim. If 5 percent of all samples lie in this area, for instance, then only one time in twenty will we be likely to encounter a sample from a population with a true coefficient of 0 for which we find a coefficient in our sample of –.75 or stronger. Yet that is precisely what we have found in this instance.
In other words, we have just drawn a sample whose coefficient would arise only 5 percent of the time from a population in which the two variables in question are not associated with each other. Thus, if we claim on the basis of our sample that the two variables are in fact associated in the larger population (that is, if we generalize our results from the sample to the population), we can expect to be wrong 5 percent of the time. That means, of course, that we will be right 95 percent of the time, and those are not bad odds. Indeed, levels of statistical significance of .05 (a 5 percent chance of erroneous generalization), .01 (a 1 percent chance of such error), and .001 (a 1/10 of 1 percent chance of such error) are commonly accepted standards in social science research.
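The thought experiment of drawing 1,000 samples of 30 nations can be tried directly. The following is a minimal sketch under stated assumptions: the population of 200 "nations" is invented, the two scores are generated independently so that the population coefficient is close to (though not exactly) 0, and Pearson's r stands in for the unspecified coefficient of association. The 5 percent figure in the text is illustrative; the simulated share of samples at –.75 or stronger will typically be far smaller.

```python
import random


def pearson_r(xs, ys):
    """Pearson's r, used here as a stand-in coefficient of association."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(17)  # fixed seed so the run is repeatable

# Hypothetical population: 200 nations with two unrelated scores each.
population = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
population_r = pearson_r([p[0] for p in population], [p[1] for p in population])

# Draw 1,000 independent samples of 30 nations and record each coefficient.
sample_rs = []
for _ in range(1000):
    sample = rng.sample(population, 30)
    sample_rs.append(pearson_r([s[0] for s in sample], [s[1] for s in sample]))

mean_r = sum(sample_rs) / len(sample_rs)

# Share of samples at least as extreme (in the negative direction) as -.75.
share_strong_negative = sum(r <= -0.75 for r in sample_rs) / len(sample_rs)

print(f"population coefficient: {population_r:.3f}")
print(f"mean of sample coefficients: {mean_r:.3f}")
print(f"share of samples at -.75 or stronger: {share_strong_negative:.3f}")
```

Most of the sample coefficients fall near the population value, which is the clustering that Figure 17.2 depicts.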
If we look again at Figure 17.2, it should be apparent that more extreme values such as –.75 are less likely to give rise to this kind of error in generalization than are those closer to the center (for example, a greater proportion of samples from such a population will, by chance, show coefficients of –.50 or stronger, and so forth). It seems, then, that we can never be very confident of the trustworthiness of weaker associations, since we can never eliminate the heavy odds that they are simply chance occurrences in a population with a true coefficient of 0.
We can increase our confidence in our sample simply by increasing our sample size. If instead of 30 cases per sample we draw 100 or 150, each will be more likely to cluster around 0. In effect, the normal curve will be progressively squeezed toward the middle, as illustrated in Figure 17.3, until ultimately there is only one possible outcome—the true parameter. In the process, with a set of sufficiently large samples, even a coefficient of association of .10 or .01 can be shown to have acceptable levels of statistical significance. We can conclude, then, that some combination of sufficiently extreme scores and sufficiently large samples allows us to reduce to tolerable levels the likelihood of incorrectly generalizing from our data.
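The squeezing of the curve as samples grow can also be checked by simulation. This sketch again assumes an invented population of 200 cases with two unrelated scores and uses Pearson's r as the coefficient; the function name `spread_of_sample_coefficients` is hypothetical. It compares the spread (standard deviation) of the sample coefficients for samples of 30, 100, and 150 cases.

```python
import random
import statistics


def pearson_r(xs, ys):
    """Pearson's r, used here as a stand-in coefficient of association."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spread_of_sample_coefficients(population, sample_size, draws, rng):
    """Standard deviation of the coefficient across repeated samples."""
    coefficients = []
    for _ in range(draws):
        sample = rng.sample(population, sample_size)
        coefficients.append(
            pearson_r([s[0] for s in sample], [s[1] for s in sample])
        )
    return statistics.pstdev(coefficients)


rng = random.Random(17)  # fixed seed so the run is repeatable
population = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]

spreads = {
    n: spread_of_sample_coefficients(population, n, 500, rng)
    for n in (30, 100, 150)
}
for n, spread in spreads.items():
    print(f"sample size {n:3d}: spread of coefficients = {spread:.3f}")
```

The spread shrinks as the sample size rises, so a given coefficient lies further out in the tail of a narrower curve.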
FIGURE 17.3 Sampling distribution for differing numbers of cases in a population of 200.
In the balance of this chapter we present a brief discussion of the most common measures of association and significance for each of the three levels of measurement. Although the procedures employed in calculating each of these measures differ, the purpose in each case, as well as the interpretation of the result, will remain relatively consistent, for each coefficient of association is designed to tell us to what extent our guessing of values on one variable is improved by knowledge of the corresponding values on another. Each test of significance tells us the probability that any observed relationships in a sample result from bias in the sample rather than from an underlying relationship in the base population.
The examples we use to illustrate these statistics involve comparisons of variables that are operationalized at the same level of measurement. However, researchers often want to look for relationships between variables that are at different levels of measurement (as in the case of an ordinal-level independent variable such as socioeconomic status and a nominal-level dependent variable such as party identification). To select the correct statistic in these situations, you need to be aware of a simple rule: You can use a statistic designed for a lower level of measurement with data at a higher level of measurement, but you may not do the reverse—doing so would produce statistically meaningless results. It would, for example, be legitimate to use a statistic designed for the nominal level with ordinal-level data, but illegitimate to use an ordinal-level statistic with nominal-level data. This means that when comparing variables that are measured at different levels of measurement, you must choose a statistic suitable to the lower of the two levels.
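The rule can be written as a small lookup. This is only an illustrative sketch: the pairing of each level with a statistic follows the section headings of this chapter (lambda, gamma, and Pearson's r), and the function name `choose_statistic` is hypothetical.

```python
# Levels of measurement, ordered from lowest to highest.
LEVELS = ["nominal", "ordinal", "interval/ratio"]

# Coefficient of association this chapter presents for each level.
STATISTIC_FOR_LEVEL = {
    "nominal": "lambda",
    "ordinal": "gamma",
    "interval/ratio": "Pearson's r",
}


def choose_statistic(level_x, level_y):
    """Pick the statistic suited to the lower of the two levels."""
    lower = min(level_x, level_y, key=LEVELS.index)
    return STATISTIC_FOR_LEVEL[lower]


# Ordinal socioeconomic status with nominal party identification.
print(choose_statistic("ordinal", "nominal"))
```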
MEASURES OF ASSOCIATION AND SIGNIFICANCE FOR NOMINAL VARIABLES: LAMBDA
A widely used coefficient of association for two nominal variables where one is treated as independent and the other dependent is λ (lambda).2 Lambda measures the percentage of improvement in guessing values on the dependent variable on the basis of knowledge of values on the independent variable when both variables consist of categories with neither rank, distance, nor direction.
An Example
Suppose we measure the party identification of 100 respondents and uncover the following frequency distribution:
Democrats       50
Republicans     30
Independents    20
Suppose further that we want to guess the party identification of each individual respondent, that we must make the same guess for all individuals, and that we want to make as few mistakes as possible. The obvious strategy is simply to guess the mode (the most populous category), or Democratic, every time. We will be correct 50 times (for the 50 Democrats) and incorrect 50 times (for the 30 Republicans and 20 Independents), not an especially noteworthy record but still the best we can do. For if we guess Republican each time, we will be wrong 70 times, and a guess of Independent will lead to 80 incorrect predictions. The mode, then, provides the best guess based on the available information.
2 Actually, the statistic we shall describe here is λa, or lambda asymmetrical, a measure that tests association in only one direction (from the independent to the dependent variable). A test of mutual association, the true λ, is also available.
TABLE 17.1 Paternal Basis for Party Identification
                                 Respondent’s Party Identification
Father’s Party Identification    Dem.    Rep.    Ind.    Totals
Democratic                       45      5       10      60
Republican                       2       23      5       30
Independent                      3       2       5       10
Total                            50      30      20      100
But suppose we have a second piece of information—the party identification of each respondent’s father—with the following frequency distribution:
Democrats       60
Republicans     30
Independents    10
If these two variables are related to each other—that is, if one is likely to have the same party identification as one’s father—then knowing the party preference of each respondent’s father should help us to improve our guessing of that respondent’s own preference. This will be the case if, by guessing for each respondent not the mode of the overall distribution, as we did before, but simply that person’s father’s party preference, we can reduce our incorrect predictions to fewer than the 50 cases we originally guessed wrongly.
To examine a possible association between these variables, we construct a crosstab summarizing the distribution of cases on these two variables. In Table 17.1, the independent, or predictor, variable (father’s party identification) is the row variable, and its overall distribution is summarized to the right of the table. The dependent variable (respondent’s party identification) is the column variable, and its overall distribution is summarized below the table. The numbers in the cells have been assigned arbitrarily, although in the real world they would, of course, be determined by the research itself.
With this table we can use parental preference to predict respondent’s preference. To do this, we use the mode just as before, but apply it within each category on the independent variable rather than to the whole set of cases. Thus, for those respondents whose father is identified as a Democrat, we guess a preference for the same party. We are correct 45 times and incorrect 15 (for the 5 Republicans and 10 Independents). For those whose father is identified as a Republican, we guess Republican. We are correct 23 times and incorrect 7. And for those whose father is identified as an Independent, we guess a similar preference and are correct 5 out of 10 times. Combining these results, we find that we are now able to guess correctly 73 times and are still wrong 27 times. Thus, knowledge of the second variable has clearly improved our guessing. To ascertain the precise percentage of that improvement, we use the general formula for a coefficient of association:
\[ \text{coefficient of association} = \frac{\text{errors without the second variable} - \text{errors with the second variable}}{\text{errors without the second variable}} \]
In the present instance, this is
\[ \frac{50 - 27}{50} = \frac{23}{50} = .46 \]
By using father’s party identification as a predictor of respondent’s party identification, we are able to improve (reduce the error in) our guessing by some 46 percent.
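The formula for calculating λ, which will bring us to the same result though by a slightly different route, is
\[ \lambda = \frac{\sum f_m - F_d}{N - F_d} \]
where f_m is the modal (largest) frequency within each category of the independent variable, F_d is the modal frequency in the totals of the dependent variable, and N is the number of cases. In the example, λ = (73 – 50)/(100 – 50) = .46.
Lambda ranges from 0 to 1, with higher values (those closer to 1) indicating a stronger association. Because nominal variables have no direction, λ will always be positive.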
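The error-reduction logic behind lambda can be reproduced in a few lines. This is a minimal sketch that uses the cell counts of Table 17.1; the function name `lambda_asymmetric` is hypothetical, and the code assumes the independent variable defines the rows.

```python
# Table 17.1: rows are father's party (independent variable),
# columns are respondent's party (dependent variable).
table_17_1 = [
    [45, 5, 10],  # father Democratic
    [2, 23, 5],   # father Republican
    [3, 2, 5],    # father Independent
]


def lambda_asymmetric(table):
    """Proportionate reduction in guessing errors on the column variable."""
    n = sum(sum(row) for row in table)
    column_totals = [sum(column) for column in zip(*table)]
    # Errors when guessing the overall mode of the dependent variable.
    errors_without = n - max(column_totals)
    # Errors when guessing the mode within each category of the
    # independent variable.
    errors_with = sum(sum(row) - max(row) for row in table)
    return (errors_without - errors_with) / errors_without


lam = lambda_asymmetric(table_17_1)
print(round(lam, 2))  # 0.46
```

The function would divide by zero if every case fell in a single category of the dependent variable, since there would then be no errors to reduce.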
Our next step is to decide whether the relationship summarized by λ arises from a true population parameter or from mere chance. That is, we must decide whether the relationship is statistically significant.
Chi-Square
The test of statistical significance for nominal variables is χ² (chi-square). This coefficient tells us whether an apparent nominal-level association between two variables, such as the one we have just observed, is likely to result from chance. It does so by comparing the results actually observed with those that would be expected if no real relationship existed. Calculating χ², too, begins from a crosstab. Consider Table 17.2, which has the same marginals for each variable as Table 17.1 but does not include any distribution of cases within the cells.
To begin the determination of χ², we ask ourselves what value is expected in each cell, given these overall totals, if there is no association between the two variables. Of the 60 cases whose father was a Democrat, for instance, we might expect half (50/100) to be Democrats, almost a third (30/100) to be Republicans, and one in five (20/100) to be Independents, or, in other words, 30 Democrats, 18 Republicans, and 12 Independents. Similarly, we might arrive at expected values for those with a Republican or Independent father. These expected values are summarized in Table 17.3.
TABLE 17.2 Paternal Basis for Party Identification: Marginal Values
                                 Respondent’s Party Identification
Father’s Party Identification    Dem.    Rep.    Ind.    Totals
Democratic                                               60
Republican                                               30
Independent                                              10
Total                            50      30      20      100
TABLE 17.3 Paternal Basis for Party Identification: Expected Values
                                 Respondent’s Party Identification
Father’s Party Identification    Dem.    Rep.    Ind.    Totals
Democratic                       30      18      12      60
Republican                       15      9       6       30
Independent                      5       3       2       10
Total                            50      30      20      100
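The expected values follow mechanically from the marginals: each cell is its row total multiplied by its column total, divided by the number of cases. A minimal sketch, with the hypothetical function name `expected_frequencies`:

```python
def expected_frequencies(row_totals, column_totals):
    """Cell counts expected if the two variables were unrelated."""
    n = sum(row_totals)
    return [
        [row_total * column_total / n for column_total in column_totals]
        for row_total in row_totals
    ]


# Marginals from Table 17.2: father's party (rows), respondent's party (columns).
expected = expected_frequencies([60, 30, 10], [50, 30, 20])
for row in expected:
    print(row)
```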
The question then becomes, are the values we have actually observed in Table 17.1 so different (so extreme) from those that Table 17.3 would lead us to expect if there were, in reality, no relationship between the two variables, that we can be reasonably confident of the validity of our result? Chi-square is a device for comparing the two tables to find an answer to this question. The equation for χ² is
\[ \chi^2 = \sum \frac{(f_o - f_e)^2}{f_e} \]
where f_o is the frequency observed in a cell and f_e is the frequency expected in that cell.
We calculate χ² by filling in the values in Table 17.4 for each cell in a given table. The ordering of the cells in the table is of no importance, but fo from Table 17.1 and fe from Table 17.3 for any particular line must refer to the same cell. The rationale for first squaring the differences between fo and fe and then dividing by fe is essentially the same as that for the treatment of variations around the mean in determining the standard deviation. Chi-square is determined by adding together all the numbers in the last column. In the example, this yields a value of 56.07.
TABLE 17.4 Values Used in Deriving χ²
fo    fe    fo – fe    (fo – fe)²    (fo – fe)²/fe
45    30    15         225           7.50
5     18    –13        169           9.39
10    12    –2         4             .33
2     15    –13        169           11.27
23    9     14         196           21.78
5     6     –1         1             .17
3     5     –2         4             .80
2     3     –1         1             .33
5     2     3          9             4.50
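The chi-square total can be checked by computing the expected value for each cell and summing the squared, scaled differences. This sketch uses the observed counts of Table 17.1; the function name `chi_square` is hypothetical.

```python
# Observed counts from Table 17.1 (rows: father's party; columns: respondent's).
observed = [
    [45, 5, 10],
    [2, 23, 5],
    [3, 2, 5],
]


def chi_square(table):
    """Sum over all cells of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, f_o in enumerate(row):
            # Expected count if the two variables were unrelated.
            f_e = row_totals[i] * column_totals[j] / n
            total += (f_o - f_e) ** 2 / f_e
    return total


chi2 = chi_square(observed)
print(round(chi2, 2))  # 56.07
```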
Degrees of Freedom
Before interpreting this number, we must make one further calculation, that of the so-called degrees of freedom. The degrees of freedom (df) in a table simply consist of the number of cells of that table that can be filled with numbers before the entries in all remaining cells are fixed and unchangeable. The formula for determining the degrees of freedom in any particular table is
df = (r – 1)(c – 1)
where r is the number of rows and c is the number of columns in the table. In the example, df = (3 – 1)(3 – 1) = 4.
We are now ready to evaluate the statistical significance of our data. Table A.4 in Appendix A summarizes the significant values of χ² for different degrees of freedom at the .001, .01, and .05 levels. If the value of χ² we have calculated (56.07) exceeds that listed in the table at any of these levels for a table with the specified degrees of freedom (4), the relationship we have observed is statistically significant at that level. In the present instance, for example, in order to be significant at the .001 level (that is, to run a risk of being wrong only one time in 1,000 when we accept the observed association as representative of the larger population), our observed χ² must exceed 18.467. Since it does so, we are quite confident in our result.
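The lookup step can be sketched as follows. The .001 critical value for 4 degrees of freedom (18.467) is the one quoted above; the .05 and .01 values (9.488 and 13.277) are standard chi-square table values supplied here as an assumption about what Table A.4 lists. The function names are hypothetical.

```python
def degrees_of_freedom(rows, columns):
    """df = (r - 1)(c - 1) for a crosstab with r rows and c columns."""
    return (rows - 1) * (columns - 1)


# Critical values of chi-square for df = 4. Only 18.467 is quoted in the
# text; 9.488 and 13.277 are standard table values assumed to match Table A.4.
CRITICAL_VALUES_DF4 = {0.05: 9.488, 0.01: 13.277, 0.001: 18.467}


def significance_levels_met(chi2, critical_values):
    """List the levels, least to most demanding, that chi2 exceeds."""
    ordered = sorted(critical_values.items(), reverse=True)
    return [level for level, cutoff in ordered if chi2 > cutoff]


df = degrees_of_freedom(3, 3)
levels = significance_levels_met(56.07, CRITICAL_VALUES_DF4)
print(df, levels)
```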
MEASURES OF ASSOCIATION AND SIGNIFICANCE FOR ORDINAL VARIABLES: GAMMA
A widely used coefficient of association for ordinal variables is G, or gamma, which works according to the same principle of error reduction as λ but focuses on predicting the ranking or relative position of cases rather than simply their membership in a particular class or category. The question treated by G is that of the degree to which the ranking of a case on one ordinal variable may be predicted if we know its ranking on a second ordinal variable.
When examining two such variables, there are two possible conditions of perfect predictability. The first, in which individual cases are ranked in exactly the same order on both variables (high scores with high scores, low scores with low), is termed perfect agreement. The second, in which cases are ranked in precisely the opposite order (highest scores on one variable with lowest on the other and the reverse), is termed perfect inversion. Therefore, predictability is a function of how close the rankings on these variables come to either perfect agreement (in which case G is positive and approaches 1) or perfect inversion (where G is negative and approaches –1). A value of G equal to 0 indicates the absence of association. The formula for calculating G is
\[ G = \frac{f_a - f_i}{f_a + f_i} \]
where f_a is the frequency of agreements and f_i is the frequency of inversions.
G is based on the relative positions of a set of cases on two variables. The cases are first arranged in ascending order on the independent variable. Their rankings on the dependent variable are then compared. Those for which the original ordering is preserved are said to be in agreement, and those for which the original order is altered are said to be in inversion. Limitations of space do not permit us to consider this procedure in detail or to discuss the calculations of G when the number of cases is relatively small and/or no ties are present in the rankings. Rather, we shall focus on the procedures for calculating G under the more common circumstances, when ties (more than one case with the same rank) are present and the number of cases is large.3 Here, as before, we work from a crosstab, as shown in Table 17.5.
TABLE 17.5 Centralized Crosstabulation
                        Dependent Variable
Independent Variable    Low    Medium    High
Low                     a      b         c
Medium                  d      e         f
High                    g      h         i
To measure the association between these two variables, we determine the number of agreements and inversions relative to each cell in the table. An agreement occurs in any cell below (higher in its score on the independent variable) and to the right (higher in its score on the dependent variable) of the particular cell in question. Thus, agreements with those cases in cell a include all cases in cells e, f, h, and i, since these cases rank higher than those in cell a on both variables. An inversion occurs in any cell below (higher in its score on the independent variable) and to the left (lower in its score on the dependent variable) of the particular cell in question. Thus, inversions with those cases in cell c include all cases in cells d, e, g, and h, since these cases rank higher on one variable than those in cell c, but lower on the other. The frequency of agreements (fa in the equation), then, is the sum for each cell of the number of cases in that cell multiplied by the number of cases in all cells below and to the right (a[e + f + h + i] + b[f + i] + d[h + i] + e[i]). The frequency of inversions (fi in the equation) is the sum for each cell of the number of cases in that cell multiplied by the number of cases in all cells below and to the left (b[d + g] + c[d + e + g + h] + e[g] + f[g + h]). The resulting totals are simply substituted into the equation.
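If, for example, the variables in Table 17.1 were ordinal, we could calculate G as follows:
\[ f_a = 45(23 + 5 + 2 + 5) + 5(5 + 5) + 2(2 + 5) + 23(5) = 1575 + 50 + 14 + 115 = 1754 \]
\[ f_i = 5(2 + 3) + 10(2 + 23 + 3 + 2) + 23(3) + 5(3 + 2) = 25 + 300 + 69 + 25 = 419 \]
\[ G = \frac{1754 - 419}{1754 + 419} = \frac{1335}{2173} = .61 \]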
3 In such applications, G may be unreliable, but it is included here to facilitate the discussion of association as a concept. A related statistic, Kendall’s tau, may be more reliable, but its determination may be less intuitive to the beginning political scientist.
This tells us that there is 61 percent more agreement than disagreement in the rankings of the cases on the two variables. If fi exceeded fa, the sign of G would be negative, indicating an inverse relationship.
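The counting of agreements and inversions can be automated for a crosstab of any size. This is a minimal sketch that treats the categories of Table 17.1 as if they were ordered, as the example does; the function name `gamma_from_crosstab` is hypothetical, and rows and columns are assumed to run from low to high.

```python
# Table 17.1 counts, treated as ordinal for the sake of the example.
table_17_1 = [
    [45, 5, 10],
    [2, 23, 5],
    [3, 2, 5],
]


def gamma_from_crosstab(table):
    """Return (agreements, inversions, G) for an ordered crosstab."""
    n_rows = len(table)
    n_cols = len(table[0])
    agreements = 0
    inversions = 0
    for r in range(n_rows):
        for c in range(n_cols):
            # Compare each cell only with cells in lower rows.
            for r2 in range(r + 1, n_rows):
                for c2 in range(n_cols):
                    pairs = table[r][c] * table[r2][c2]
                    if c2 > c:    # below and to the right: agreement
                        agreements += pairs
                    elif c2 < c:  # below and to the left: inversion
                        inversions += pairs
    g = (agreements - inversions) / (agreements + inversions)
    return agreements, inversions, g


f_a, f_i, g = gamma_from_crosstab(table_17_1)
print(f_a, f_i, round(g, 2))  # 1754 419 0.61
```

Pairs of cases tied on either variable (same row or same column) are ignored, which is how the cell-by-cell rule in the text treats them.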
The test of the statistical significance of G is based on the fact that the sampling distribution of G is approximately normal for a population with no true association, as was the sampling distribution of the hypothetical coefficient of association discussed earlier. Since this is so, we can determine the probability that any particular value of G has occurred by chance by calculating its standard score (z), locating its position under the normal curve, and assessing the probabilities. The actual calculation of ZG (standard score of gamma) will not be presented here, because the formula is complex and its understanding requires a more detailed knowledge of statistics than we have provided. Suffice it to say that when ZG exceeds ±1.645 (when G lies at least 1.645 standard deviation units above or below the mean), G is sufficiently extreme to merit a significance level of .05, and that when ZG exceeds ±2.326 (when G lies at least 2.326 standard deviation units above or below the mean), G achieves significance at the .01 level. The interpretation of these results is precisely the same as that in the earlier and more general example.
MEASURES OF ASSOCIATION AND SIGNIFICANCE FOR INTERVAL/RATIO VARIABLES: CORRELATION
The measure of association between two interval variables is the Pearson product-moment correlation (r), also known as the correlation coefficient. This coefficient summarizes the strength and direction of a relationship using the same notion we have already presented—about proportionate reduction in error in guessing values on one variable on the basis of known values of another—though the procedure, like the data for which it is designed, is more sophisticated than others we have discussed to this point. Here, rather than using the mean of the dependent variable (usually designated Y) to predict the values of individual cases, we use its geometric relationship with the independent variable (usually designated X) for this purpose. More particularly, we focus on the degree to which the equation of a particular straight line can help us to predict values of Y based on knowledge of corresponding values of X.
PRACTICAL RESEARCH ETHICS
What is the link between level of measurement and analytic techniques?
In doing initial bivariate analysis of your data, you need to always keep level of measurement in mind.
One of your most important duties as a researcher is to employ the proper analytic techniques (e.g., statistical measures), while maintaining close attention to each variable’s level of measurement as you evaluate whether a hypothesized relationship achieves statistical significance.
Graphing the Variables
The determination of r begins with the examination of a scatter plot, which is a graphic summary of the distribution of cases on two variables, in which the base line, or X-axis, is denoted in units of the independent variable; the vertical line, or Y-axis, is denoted in units of the dependent variable; and each dot represents observations of one case on both variables. Such a plot is presented in Figure 17.4, in which the independent variable is age, the dependent variable is years of schooling completed, and the number of cases is twenty-five. The circled dot thus represents one case—a person thirty years old with ten years of schooling. The values in the figure have been arbitrarily assigned but would in reality be ascertained by the research itself.
FIGURE 17.4 Scatter diagram showing relationship between age and years of schooling.
The next step is to draw a straight line, called a regression line, through this field of dots so that no other line comes closer to touching all of the dots. This line of best fit for the relationship between two variables is analogous to the mean in univariate descriptive statistics. Just as the mean represents a most typical case in a frequency distribution, the regression line represents a most typical association between two variables. Just as we might use the mean to guess values of a variable in the absence of additional information, we can use the regression line to guess values of one variable on the basis of our knowledge of the values of another. If, for example, we know the value of X for a given case, we can project a vertical line from that point on the X-axis to the regression line, then a horizontal line from there to the Y-axis. The point of contact on the Y-axis gives us a predicted value of Y.
But just as a mean may be the single most typical value yet not be a good summary of a particular distribution, a regression line may be the best possible summary of a relationship between two variables yet not be a very useful summary. Accordingly, just as we use the standard deviation (s) as a measure of dispersion or goodness of fit around the mean, we use the correlation coefficient (r), or, more correctly, for purposes of interpretation, the square of that coefficient (r²), as a measure of the goodness of fit of the various data points around the regression line. It is, in effect, a measure of how typical that line is of the joint distribution of values of the two variables.
Closeness of Association
Where all points actually fall directly on the line, as in Figure 17.5(a) and (e), the line provides a perfect description of the relationship between the two variables. Where the points are generally organized in the direction indicated by the line but do not all fall upon it, as in Figure 17.5(b) and (d), the line provides an approximation of the relationship between the two variables. And where, as in Figure 17.5(c), there are multiple possible lines that equally fit the data, no association exists between the two variables. The problem, then, is twofold: First, what does this line of best fit look like? Second, how good a fit to the data does it in fact provide?
FIGURE 17.5 Summary of regression lines and values of r.
You may recall from your study of algebra that any straight line takes the form
\[ Y_i = a + bX_i \]
The regression line is simply the one set of guessed values of this form that provides for the most accurate prediction of values of Y based on knowledge of values of X.
For reasons we shall not explore here, the slope b of that line will always take the form
\[ b = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} \]
where Xi and Yi are the corresponding values of the independent and dependent variables for case i, and X̄ and Ȳ are the respective means. Applying this formula and using a chart similar to the one we used in computing χ², we are able to ascertain the slope of any particular linear relationship between two interval variables. This process is illustrated in Table 17.6 for the data reported in Figure 17.4. For these data, X̄ = 37.08 and Ȳ = 12.88. Substituting the column totals from Table 17.6 in the equation, we find
\[ b = \frac{-136.79}{1{,}151.93} = -.12 \]
In a linear relationship (one described or summarized by a straight line), a particular change in the value of the independent variable X is always accompanied by a particular change in the value of the dependent variable Y. Moreover, in such a relationship the rate of change is constant; that is, no matter what the particular values of X and Y, each change of one unit in X will be accompanied by a change in Y of some fixed size determined by the slope of the regression line. Relationships in which slight changes in X are accompanied by relatively large changes in Y are summarized by lines that have a relatively steep slope (|b| > 1; that is, the absolute value of b is greater than 1). Relationships in which large changes in X are accompanied by smaller changes in Y are summarized by lines that have a relatively flat slope (|b| < 1). Relationships in which one unit of change in X is accompanied by one unit of change in Y are summarized by lines for which b is equal to 1. Lines that slope upward from left to right, such as those in Figure 17.5(a) and (b), have a positive slope and represent relationships in which increases in X are accompanied by increases in Y. Those sloping downward from left to right, such as the lines in Figure 17.5(d) and (e), have a negative slope and represent relationships in which increases in X are accompanied by decreases in Y. Indeed, the slope of the line is simply the rate of change in Y for each unit of change in X. In our example, then, where b is equal to –.12, we know that the regression line will slope downward from left to right and will, if the two variables are drawn to the same scale, be relatively flat.
TABLE 17.6 Values Used in Deriving the Equation of the Regression Line
Xi       (Xi – X̄)    (Xi – X̄)²    Yi    (Yi – Ȳ)    (Xi – X̄)(Yi – Ȳ)
30       –7.08       50.13        10    –2.88       20.39
30       –7.08       50.13        11    –1.88       13.31
30       –7.08       50.13        12    –.88        6.23
30       –7.08       50.13        14    1.12        –7.93
30       –7.08       50.13        16    3.12        –22.09
31       –6.08       36.97        14    1.12        –6.81
31       –6.08       36.97        15    2.12        –12.89
31       –6.08       36.97        16    3.12        –18.99
33       –4.08       16.65        15    2.12        –8.65
33       –4.08       16.65        16    3.12        –12.73
35       –2.08       4.33         12    –.88        1.83
35       –2.08       4.33         13    .12         –.25
35       –2.08       4.33         15    2.12        –4.41
36       –1.08       1.17         12    –.88        .95
36       –1.08       1.17         13    .12         –.13
37       –.08        .01          13    .12         –.01
40       2.92        8.53         10    –2.88       –8.41
40       2.92        8.53         12    –.88        –2.57
40       2.92        8.53         14    1.12        3.27
42       4.92        24.21        10    –2.88       –14.17
42       4.92        24.21        12    –.88        –4.33
50       12.92       166.93       9     –3.88       –50.13
50       12.92       166.93       10    –2.88       –37.21
50       12.92       166.93       12    –.88        –11.37
50       12.92       166.93       16    3.12        40.31
Totals   0           1,151.93           0           –136.79
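The column totals in Table 17.6 and the resulting slope can be recomputed with a short sketch like the one below. The variable names are illustrative. Because the script does not round each row to two decimals before summing, its totals differ from the printed ones in the second decimal place, but the slope still rounds to –.12.

```python
# Age (X) and years of schooling (Y) for the 25 cases listed in Table 17.6.
ages = [30, 30, 30, 30, 30, 31, 31, 31, 33, 33, 35, 35, 35,
        36, 36, 37, 40, 40, 40, 42, 42, 50, 50, 50, 50]
schooling = [10, 11, 12, 14, 16, 14, 15, 16, 15, 16, 12, 13, 15,
             12, 13, 13, 10, 12, 14, 10, 12, 9, 10, 12, 16]

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(schooling) / n

# Sum of squared deviations of X around its mean (third column of the table).
sum_sq_dev_x = sum((x - mean_x) ** 2 for x in ages)

# Sum of cross-products of the deviations (last column of the table).
sum_cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, schooling))

# Slope of the regression line: cross-products over squared deviations of X.
slope = sum_cross / sum_sq_dev_x

print(round(mean_x, 2), round(mean_y, 2))
print(round(sum_sq_dev_x, 2), round(sum_cross, 2))
print(round(slope, 2))
```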
To arrive at the formula we used to compute the slope of the regression line, we had to assume that the line passes through the intersection of X̄ and Ȳ (the means of the respective variables). This is a reasonable assumption, because the means represent the central tendencies of these variables and because we are, in effect, seeking a joint or combined central tendency. Because we know both means and have now determined the value of b, we can easily find the value of a (the point at which the regression line intercepts the Y-axis) and solve the equation. The general equation of the regression line is
\[ Y' = a + bX_i \]
and at the point where the line passes through the intersection of the two means
\[ \bar{Y} = a + b\bar{X} \]
It must then follow that
\[ a = \bar{Y} - b\bar{X} \]
Because all of these values are now known, we can determine that
\[ a = 12.88 - (-.12)(37.08) = 17.33 \]
Thus, the equation of the regression line, the single best-fitting line for the data reported in Figure 17.4, would be
\[ Y' = 17.33 - .12X \]
Using this equation, we can predict the value of Y for any given value of X.
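As a minimal sketch of such a prediction, the snippet below recomputes the intercept from the two means and the rounded slope of –.12, then applies the equation to one age. The function name predict_schooling and the choice of a thirty-year-old are illustrative assumptions, not part of the text.

```python
# Means of age (X) and years of schooling (Y), and the rounded slope from the text.
mean_x = 37.08
mean_y = 12.88
b = -0.12

# Intercept: the line must pass through the point (mean_x, mean_y).
a = mean_y - b * mean_x


def predict_schooling(age):
    """Predicted years of schooling (Y') for a given age (X)."""
    return a + b * age


print(round(a, 2))
print(round(predict_schooling(30), 2))
```

Using the rounded slope keeps the result in step with the hand calculation; a slope carried to more decimal places would shift the intercept slightly.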
Once this equation has been determined, we may use the correlation coefficient (r) to assess the utility of the regression line. The formula for rXY (the coefficient of correlation between X and Y) is
\[ r_{XY} = \frac{N\sum XY - (\sum X)(\sum Y)}{\sqrt{[N\sum X^2 - (\sum X)^2][N\sum Y^2 - (\sum Y)^2]}} \]
where N is the number of cases and each sum is taken over all N cases.
Although the assertion is certainly not obvious and although its algebraic proof lies beyond our present discussion, this working formula is derived from a comparison of the original error in guessing values of Y by using Ȳ (the mean of the frequency distribution) with the error remaining when one guesses values of Y using Y’ (the equation of the regression line). Thus, the procedure for computing r is analogous to that for computing both λ and G. It may best be accomplished by setting up a chart of the type with which we are now familiar in which the columns include X, Y, XY, X², and Y². The sums required by the equation are then provided by the column totals. Thus, for the data represented in Figure 17.4, whose regression line we have already determined, the chart is completed as in Table 17.7.
TABLE 17.7 Values Used in Deriving the Correlation Coefficient (r)
X        Y      XY       X²       Y²
30       10     300      900      100
30       11     330      900      121
30       12     360      900      144
30       14     420      900      196
30       16     480      900      256
31       14     434      961      196
31       15     465      961      225
31       16     496      961      256
33       15     495      1,089    225
33       16     528      1,089    256
35       12     420      1,225    144
35       13     455      1,225    169
35       15     525      1,225    225
36       12     432      1,296    144
36       13     468      1,296    169
37       13     481      1,369    169
40       10     400      1,600    100
40       12     480      1,600    144
40       14     560      1,600    196
42       10     420      1,764    100
42       12     504      1,764    144
50       9      450      2,500    81
50       10     500      2,500    100
50       12     600      2,500    144
50       16     800      2,500    256
Totals   927    322      11,803   35,525   4,260
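We substitute these totals in the equation
\[ r_{XY} = \frac{25(11{,}803) - (927)(322)}{\sqrt{[25(35{,}525) - (927)^2][25(4{,}260) - (322)^2]}} = \frac{-3{,}419}{\sqrt{(28{,}796)(2{,}816)}} = -.38 \]
This tells us that the slope of the regression line is negative and that the points cluster weakly to moderately around it (because r ranges from +1 to –1 with the weakest association at 0).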
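The same substitution can be sketched in a few lines of code. The script rebuilds the five column sums from the twenty-five cases and then applies the raw-score formula for r; the variable names are illustrative. Note that the entries in the X² column sum to 35,525.

```python
import math

# Age (X) and years of schooling (Y) for the 25 cases in Table 17.7.
ages = [30, 30, 30, 30, 30, 31, 31, 31, 33, 33, 35, 35, 35,
        36, 36, 37, 40, 40, 40, 42, 42, 50, 50, 50, 50]
schooling = [10, 11, 12, 14, 16, 14, 15, 16, 15, 16, 12, 13, 15,
             12, 13, 13, 10, 12, 14, 10, 12, 9, 10, 12, 16]

n = len(ages)
sum_x = sum(ages)
sum_y = sum(schooling)
sum_xy = sum(x * y for x, y in zip(ages, schooling))
sum_x2 = sum(x * x for x in ages)
sum_y2 = sum(y * y for y in schooling)

# Raw-score (computational) formula for the Pearson correlation coefficient.
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(sum_x, sum_y, sum_xy, sum_x2, sum_y2)
print(round(r, 2))
```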
Explained Variance
Although r itself is not easily interpreted, r² may be interpreted as the proportion of reduction in the variance of Y attributable to our knowledge of X. In other words, r² is the proportion of variation in Y that is predictable (or explainable) on the basis of X. The quantity r² is often referred to as the percentage of explained variance, and the quantity 1 – r² is often termed the percentage of unexplained variance. Thus, in our example, the r of –.38 means that differences on the independent variable age account for some 14 percent, or (–.38)², of the variance in the dependent variable years of schooling for the cases under analysis.
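The reduction-in-error reading of r² can be illustrated with a rough sketch. It compares the squared error made when every case is guessed at the mean of Y with the squared error left over when each case is guessed from the least-squares line. The names error_mean, error_line, and explained are illustrative, and the line is fitted without rounding the slope, so intermediate figures differ slightly from the hand calculation.

```python
# Age (X) and years of schooling (Y) for the 25 cases under analysis.
ages = [30, 30, 30, 30, 30, 31, 31, 31, 33, 33, 35, 35, 35,
        36, 36, 37, 40, 40, 40, 42, 42, 50, 50, 50, 50]
schooling = [10, 11, 12, 14, 16, 14, 15, 16, 15, 16, 12, 13, 15,
             12, 13, 13, 10, 12, 14, 10, 12, 9, 10, 12, 16]

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(schooling) / n

# Least-squares slope and intercept of the regression line.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, schooling))
         / sum((x - mean_x) ** 2 for x in ages))
intercept = mean_y - slope * mean_x

# Error when guessing the mean of Y for every case.
error_mean = sum((y - mean_y) ** 2 for y in schooling)

# Error remaining when guessing from the regression line instead.
error_line = sum((y - (intercept + slope * x)) ** 2
                 for x, y in zip(ages, schooling))

# Proportionate reduction in error, which equals r squared.
explained = (error_mean - error_line) / error_mean

print(round(explained, 2))
```

The proportion printed here rounds to the same 14 percent obtained by squaring –.38.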
Computing Statistical Significance
For reasons that lie beyond the scope of the present discussion, we are able to specify the statistical significance of r only when both the independent and dependent variables are normally distributed. This is accomplished by using Table A.5 in Appendix A, for which purpose two pieces of information are needed. The first is r itself, which, of course, is known. The second, analogous to the χ² test, is the number of degrees of freedom of the regression line. Because two points determine a line (in this case, the intersection of X̄ and Ȳ was the first and the intercept with the Y-axis the second), all other data points may fall freely, so df will always equal (N – 2), where N is the number of cases. To use the table, then, we locate the appropriate degrees of freedom (in the example, N – 2 = 25 – 2 = 23) and the desired level of significance (for example, .05), just as we did for χ², identify the threshold value of r necessary to achieve that level of significance, and evaluate our actual observation. In the present instance, this requires interpolating values in the table between df = 20 and df = 25. For df = 23 these values would be .3379, .3976, .5069, and .6194 for the .10, .05, .01, and .001 levels, respectively. Thus, our r of –.38 is statistically significant at the .10 level (its absolute value exceeds .3379), but not at the .05 level (it does not exceed .3976). The interpretation of this result is the same as those for other measures of statistical significance.
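The interpolation step can be sketched as follows. Table A.5 is not reproduced in this section, so the df = 20 and df = 25 entries below are an assumption: they are taken from a standard table of critical values of r and were chosen because they interpolate to the four thresholds quoted above. The helper name interpolate is likewise illustrative.

```python
def interpolate(df, df_low, value_low, df_high, value_high):
    """Linear interpolation between two tabled critical values."""
    fraction = (df - df_low) / (df_high - df_low)
    return value_low + fraction * (value_high - value_low)


# ASSUMED table entries: significance level -> (critical r at df=20, at df=25).
critical_r = {
    0.10: (0.3598, 0.3233),
    0.05: (0.4227, 0.3809),
    0.01: (0.5368, 0.4869),
    0.001: (0.6524, 0.5974),
}

observed_r = -0.38
thresholds = {}

for level, (r_df20, r_df25) in critical_r.items():
    thresholds[level] = interpolate(23, 20, r_df20, 25, r_df25)
    # The sign of r is ignored; only its magnitude is compared to the threshold.
    significant = abs(observed_r) > thresholds[level]
    print(level, round(thresholds[level], 4), significant)
```

Comparing the absolute value of r with each threshold reproduces the conclusion in the text: significant at the .10 level but not at the .05 level.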
Conclusion
In this chapter we have introduced some of the more common statistics that are used to summarize the relationship between two variables. As in Chapter 16, we found that differing measures of association and statistical significance were appropriate, depending on the level of measurement that characterized the data being analyzed. Together with the techniques presented earlier, these various coefficients provide the researcher with some very useful basic tools with which to summarize research results. In the next chapter we outline some more sophisticated statistical techniques, which can further enrich our ability to analyze and understand what we have discovered.
(Brians 290-310)
Brians, Craig Leonard, Lars Willnat, Jarol B. Manheim, and Richard C. Rich. Empirical Political Analysis for Ashford University, 8th Edition. Pearson Learning Solutions.