How can outliers affect correlation

The smaller the sample size, the greater the effect of an outlier on the correlation coefficient. Once the sample is large enough, a single outlier has little or no effect on the size of the coefficient. When a researcher encounters an outlier, he or she must decide whether to include it in the data set.
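The dependence on sample size can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "data" are synthetic points spaced evenly on a circle (so the correlation without the outlier is essentially zero), and the outlier at (5, 5) and the helper names are hypothetical choices, not values from any study.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def circle_points(n):
    """n points evenly spaced on the unit circle: an uncorrelated cloud."""
    xs = [math.cos(2 * math.pi * k / n) for k in range(n)]
    ys = [math.sin(2 * math.pi * k / n) for k in range(n)]
    return xs, ys

OUTLIER = (5.0, 5.0)  # hypothetical extreme observation

def r_without_and_with(n):
    """Return (r without the outlier, r with the outlier) for n base points."""
    xs, ys = circle_points(n)
    r_without = pearson_r(xs, ys)
    r_with = pearson_r(xs + [OUTLIER[0]], ys + [OUTLIER[1]])
    return r_without, r_with

small = r_without_and_with(8)    # small sample: the outlier dominates
large = r_without_and_with(800)  # large sample: the outlier barely matters

print("n = 8:   r without = %.3f, r with = %.3f" % small)
print("n = 800: r without = %.3f, r with = %.3f" % large)
```

With the same single outlier, the coefficient for the small sample moves much further from zero than the coefficient for the large sample.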

It may be that the respondent was deliberately malingering, giving wrong answers, or simply did not understand the question on the questionnaire. On the other hand, it may be that the outlier is real and simply different. The decision to include or exclude an outlier remains with the researcher, who must, however, justify any deletion of data to the reader of a technical report.

It is suggested that the correlation coefficient be computed and reported both with and without the outlier if there is any doubt about whether it is real data. In any case, the best way of spotting an outlier is by drawing the scatter plot.

No discussion of correlation would be complete without a discussion of causation. It is possible for two variables to be related (correlated) without one variable causing the other.

For example, suppose there exists a high correlation between the number of Popsicles sold and the number of drowning deaths on any day of the year. Does that mean that one should not eat Popsicles before one swims? Not necessarily. Both of the above variables are related to a common variable, the heat of the day. The hotter the temperature, the more Popsicles sold and also the more people swimming, thus the more drowning deaths.
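A small simulation can make the role of the common variable concrete. Everything here is an assumption for illustration: the numbers are synthetic, the coefficients and noise levels are arbitrary, "drownings" is a made-up index rather than real counts, and the first-order partial correlation is one common way (not the only one) to ask what remains once temperature is held fixed.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y with z held fixed."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Synthetic year: temperature drives both series; neither drives the other.
temps = [rng.uniform(10, 35) for _ in range(365)]
popsicles = [3.0 * t + rng.gauss(0, 10) for t in temps]
drownings = [0.2 * t + rng.gauss(0, 1) for t in temps]

r_raw = pearson_r(popsicles, drownings)
r_given_temp = partial_r(
    r_raw,
    pearson_r(popsicles, temps),
    pearson_r(drownings, temps),
)

print("correlation of popsicles and drownings: %.2f" % r_raw)
print("same, with temperature held fixed:      %.2f" % r_given_temp)
```

The raw correlation is substantial even though the simulation contains no causal link between the two series; once temperature is accounted for, the remaining correlation is close to zero.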

This is an example of correlation without causation. Much of the early evidence that cigarette smoking causes cancer was correlational. It may be that people who smoke are more nervous and nervous people are more susceptible to cancer. It may also be that smoking does indeed cause cancer.

The cigarette companies made the former argument, while some doctors made the latter. In this case I believe the relationship is causal and therefore do not smoke. Sociologists are very much concerned with the question of correlation and causation because much of their data is correlational.

Sociologists have developed a branch of correlational analysis, called path analysis, precisely to determine causation from correlations (Blalock). Before a correlation may imply causation, certain requirements must be met. These requirements include: (1) the causal variable must temporally precede the variable it causes, and (2) certain relationships between the causal variable and other variables must be met.

If a high correlation was found between the age of the teacher and the students' grades, it does not necessarily mean that older teachers are more experienced, teach better, and give higher grades.

Neither does it necessarily imply that older teachers are soft touches, don't care, and give higher grades. Some other explanation might also explain the results. The correlation means that older teachers give higher grades; younger teachers give lower grades.

It does not explain why this is the case. A simple correlation may be interpreted in a number of different ways: as a measure of linear relationship, as the slope of the regression line of z-scores, and, when squared, as the proportion of variance accounted for by knowing one of the variables.
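These interpretations can be checked numerically. The sketch below uses a tiny made-up data set (an assumption for illustration only) and computes the same quantity three ways: directly as r, as the slope of the least-squares line through the z-scores, and, after squaring, as the share of the variance in y accounted for by the regression on x.

```python
import math

# Hypothetical data, chosen only to keep the arithmetic easy to follow.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

# 1. The correlation coefficient itself.
r = sxy / math.sqrt(sxx * syy)

# 2. Slope of the regression line fitted to z-scores.
sd_x = math.sqrt(sxx / n)
sd_y = math.sqrt(syy / n)
zx = [(x - mean_x) / sd_x for x in xs]
zy = [(y - mean_y) / sd_y for y in ys]
slope_z = sum(a * b for a, b in zip(zx, zy)) / sum(a * a for a in zx)

# 3. Proportion of variance in y accounted for by the line fitted to raw data.
slope = sxy / sxx
intercept = mean_y - slope * mean_x
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
explained = 1 - sse / syy

print("r = %.4f, z-score slope = %.4f" % (r, slope_z))
print("r squared = %.4f, variance explained = %.4f" % (r ** 2, explained))
```

For this data set the z-score slope equals r, and r squared equals the proportion of variance explained (0.6).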

All these interpretations are correct and in a certain sense mean the same thing. A number of qualities which might affect the size of the correlation coefficient were identified. They included missing parts of the distribution, outliers, and common variables. The CPI affects nearly all Americans because of the many ways it is used. One of its biggest uses is as a measure of inflation. In the following table, x is the year and y is the CPI.

In the example, notice the pattern of the points compared to the line. Although the correlation coefficient is significant, the pattern in the scatterplot indicates that a curve would be a more appropriate model to use than a line. In this example, a statistician should prefer to use other methods to fit a curve to this data, rather than model the data with the line we found. In addition to doing the calculations, it is always important to look at the scatterplot when deciding whether a linear model is appropriate.

For example, you could add more current years of data and see how that affects the model. Is r significant? Is the fit better with the addition of the new points?

If any point is above Y2 or below Y3 then the point is considered to be an outlier. Use the following information to answer the next four exercises. The scatter plot shows the relationship between hours spent studying and exam scores. The line shown is the calculated line of best fit. The correlation coefficient is 0. A point is removed, and the line of best fit is recalculated.

The new correlation coefficient is 0. Does the point appear to have been an outlier? The potential outlier flattened the slope of the line of best fit because it was below the data set.

It made the line of best fit less accurate as a predictor for the data. Are you more or less confident in the predictive ability of the new line of best fit? The Sum of Squared Errors for a data set of 18 numbers is What is the standard deviation?

What is the cutoff for the vertical distance that a point can be from the line of best fit to be considered an outlier? The height sidewalk to roof of notable tall buildings in America is compared to the number of stories of the building beginning at street level.

Ornithologists, scientists who study birds, tag sparrow hawks in 13 different colonies to study their population. They gather data for the percent of new sparrow hawks in each colony and the percent of those that have returned from migration.

Percent return: 74; 66; 81; 52; 73; 62; 52; 45; 62; 46; 60; 46; 38
Percent new: 5; 6; 8; 11; 12; 15; 16; 17; 18; 18; 19; 20;

The slope of the regression line is The slope tells us that for each percentage increase in returning birds, the percentage of new birds in the colony decreases by If we examine r2, we see that only The ordered pair (66, 6) generates the largest residual of 6.

If we remove this data pair, we see only an adjusted slope of -. In other words, even though this data generates the largest residual, it is not an outlier, nor is the data pair an influential point.

The following table shows data on average per capita coffee consumption and heart disease rate in a random sample of 10 countries. A researcher is investigating whether population impacts homicide rate. He uses demographic data from Detroit, MI to compare homicide rates and the number of the population that are white males. Use the data to determine the linear-regression line equation with the outliers removed.

Is there a linear correlation for the data set with outliers removed? Justify your answer. If we remove the two service academies the tuition is? This allows us to say there is a fairly strong linear association between tuition costs and salaries if the service academies are removed from the data set. The average number of people in a family that attended college for various years is given in Figure.

The percent of female wage and salary workers who are paid hourly rates is given in Figure for the years to Use the following information to answer the next two exercises. The cost of a leading liquid laundry detergent in different sizes is given in Figure. According to a flyer by a Prudential Insurance Company representative, the costs of approximate probate fees and taxes for selected net taxable estates are as follows:

Figure shows the average heights for American boys in We are interested in whether there is a relationship between the ranking of a state and the area of the state.

Identifying Outliers

We could guess at outliers by looking at a graph of the scatterplot and best-fit line. We will call these lines Y2 and Y3. As we did with the equation of the regression line and the correlation coefficient, we will use technology to calculate this standard deviation for us.

Numerical Identification of Outliers

In Figure, the first two columns are the third-exam and final-exam data. The standard deviation of the residuals is calculated from the SSE as: s = sqrt(SSE / (n - 2)), where n is the number of data points. How does the outlier affect the best fit line?

Numerical Identification of Outliers: Calculating s and Finding Outliers Manually

If you do not have the function LinRegTTest, then you can calculate the outlier in the first example by doing the following.
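As a rough sketch of what such a manual calculation can look like, the Python below fits the least-squares line, computes the standard deviation of the residuals from the SSE, and flags any point whose residual is more than two standard deviations from the line. The data set, the function names, and the use of a two-standard-deviation cutoff are assumptions for illustration; they are not the exam data discussed in the text.

```python
import math

def fit_line(xs, ys):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def flag_outliers(xs, ys, k=2.0):
    """Return (s, points whose residual is more than k*s from the line)."""
    a, b = fit_line(xs, ys)
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    sse = sum(e ** 2 for e in residuals)
    s = math.sqrt(sse / (len(xs) - 2))  # standard deviation of the residuals
    flagged = [(x, y) for x, y, e in zip(xs, ys, residuals) if abs(e) > k * s]
    return s, flagged

# Hypothetical data: y = 2x everywhere except at x = 5, where y is 10 too high.
xs = list(range(1, 11))
ys = [2 * x for x in xs]
ys[4] = 20

s, flagged = flag_outliers(xs, ys)
print("s = %.3f, cutoff = %.3f" % (s, 2 * s))
print("potential outliers:", flagged)
```

Note that the suspect point itself inflates s, so in very small samples a genuine outlier can raise the cutoff enough to escape being flagged; a scatter plot remains a useful cross-check.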

Calculate the least squares line. Draw the line on the scatterplot. Find the correlation coefficient. Is it significant? What is the average CPI for the year ? See Figure. The correlation between the original 10 data points is 0.

But when this outlier is removed, the correlation drops to 0. Also, notice how the regression equation originally has a slope greater than 0, but with the outlier removed the slope is practically 0. This example is somewhat exaggerated, but it illustrates the effect an outlier can have on the correlation and regression equation.

Such points are referred to as influential outliers. As this example illustrates, an outlier can strongly influence both the regression equation and the correlation. Identify the potential outlier in the scatter plot.
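A similar situation can be reproduced with made-up numbers. In the sketch below (an illustration only, not the data behind the figure), nine points form a 3-by-3 grid with no linear trend, and a tenth point at (10, 10) is added as a hypothetical influential outlier; the slope and correlation are computed with and without it.

```python
import math

def slope_and_r(points):
    """Least-squares slope and Pearson r for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

# Nine grid points with no linear relationship between x and y.
cluster = [(x, y) for x in (1, 2, 3) for y in (1, 2, 3)]
outlier = (10, 10)  # hypothetical influential point, far from the cluster

slope_with, r_with = slope_and_r(cluster + [outlier])
slope_without, r_without = slope_and_r(cluster)

print("with outlier:    slope = %.3f, r = %.3f" % (slope_with, r_with))
print("without outlier: slope = %.3f, r = %.3f" % (slope_without, r_without))
```

With the outlier included, both the slope and r are about 0.91; with it removed, both are zero. Reporting both sets of results lets the reader see how much of the relationship rests on a single point.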

The standard deviation of the residuals or errors is approximately 8. The outlier appears to be at 6, Fifty-eight is 24 units from In Table, the first two columns are the third-exam and final-exam data.

Rather than calculate the value of s ourselves, we can find s using the computer or calculator. Compare these values to the residuals in column four of the table. The only such data point is the student who had a grade of 65 on the third exam and on the final exam; the residual for this student is Numerically and graphically, we have identified the point 65, as an outlier. We should re-examine the data for this point to see if there are any problems with the data.

If there is an error, we should fix the error if possible, or delete the data. If the data is correct, we would leave it in the data set. For this problem, we will suppose that we examined the data and found that this outlier data was an error. Therefore we will continue on and delete the outlier, so that we can explore how it affects the results, as a learning experience. Compute a new best-fit line and correlation coefficient using the ten remaining points.

Using the LinRegTTest, the new line of best fit and the correlation coefficient are:. This means that the new line is a better fit to the ten remaining data values. The line can better predict the final exam score given the third exam score. If you do not have the function LinRegTTest, then you can calculate the outlier in the first example by doing the following.

We call that point a potential outlier. For this example, we will delete it. Remember, we do not always delete an outlier. When outliers are deleted, the researcher should either record that data was deleted, and why, or the researcher should provide results both with and without the deleted data.

If data is erroneous and the correct values are known e. The next step is to compute a new best-fit line using the ten remaining points. The new line of best fit and the correlation coefficient are:.

Is this the same as the prediction made using the original line?


