Correlation and Regression

We often want to examine the relationship between two continuous variables. Correlation and regression methods help us to do this. For example, consider the data taken from the bp100.sav file. Figure 49 shows a plot of diastolic blood pressure (dia) against age (age). We may like to know how close this relationship is to a straight line. To do this, we calculate a value known as the correlation coefficient r:

$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \, \sum (y_i - \bar{y})^2}}$$

where x refers to age, and y refers to diastolic blood pressure. In our example r = 0.322. The correlation coefficient varies between -1 and +1. A value of 0 would indicate no linear relationship between the variables at all. A value of -1 would indicate that as one variable increased the other would decrease at a constant rate, and that their relationship could be perfectly described by a straight line. Similarly, an r value of +1 would indicate that as one variable increased the other would increase at a constant rate. So in our example there does appear to be some positive relationship between diastolic blood pressure and age.
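As a minimal sketch of this calculation, the following Python computes r directly from its definition. The ages and pressures are made-up illustrative values, not the bp100.sav data, and the function name pearson_r is hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient computed from its definition."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares about the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Illustrative values only (not taken from bp100.sav)
age = [25, 32, 41, 47, 53, 60, 68]
dia = [72, 80, 76, 85, 79, 90, 84]
r = pearson_r(age, dia)
print(round(r, 3))
```

A perfectly linear increasing relationship, such as pearson_r([1, 2, 3], [2, 4, 6]), gives +1, and a perfectly linear decreasing one gives -1.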

We may also want to describe the straight line which best fits our data. To do this we use regression techniques. The process of regression is very similar to ANOVA, in that we attempt to explain the variance in our dependent variable by using potential explanatory variables. Again we would obtain an ANOVA table with a similar interpretation to before. This time we take the null hypothesis to be that there is no relationship between blood pressure and age, and thus the best straight line describing this would be a horizontal line with slope 0. Our alternative hypothesis would be that the slope of the line is other than 0. We find the reduction in SS by finding how much the observed values deviate from this line.

In SPSS this regression line can be fitted as follows. Choose Statistics, Regression, Linear. In the Dependent box place dia and in the Independent box place age, then click OK. You should obtain the output shown in Figure 50.


Figure 49

From Figure 50, we can form an ANOVA table. Here we need to sum the regression and residual SS to give our total SS. It can be seen that the regression line explains 1607.1 of the total SS. This gives an F value of 12.36 with 1 and 107 df (p<0.001). We are also given the equation of the best fitting line. This is:

$y = bx + c$

Where b = 0.269, se = 0.076
c = 68.983, se = 3.554

The standard errors show us how precise the estimates are. We can get SPSS to produce the associated confidence intervals for us, but you can see already (using ±2 se) that the slope of this line may vary from around 0.12 to 0.42.

The R square value of 0.104 says that we have explained about 10% of the total variance.
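The quantities quoted above can be checked by hand from the sums of squares. The sketch below uses the figures that SPSS reports for the regression of dia on age (regression SS, residual SS, slope and its standard error); the variable names are illustrative only.

```python
# Values reported by SPSS for the regression of dia on age
ss_regression = 1607.10199
ss_residual = 13915.09984
df_regression = 1
df_residual = 107
slope, se_slope = 0.268630, 0.076416

ss_total = ss_regression + ss_residual       # total SS is the sum of the two
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

f_value = ms_regression / ms_residual        # ratio of the two mean squares
r_square = ss_regression / ss_total          # proportion of total SS explained
t_value = slope / se_slope                   # t statistic for the slope

print(f"F = {f_value:.2f}, R square = {r_square:.3f}, t = {t_value:.3f}")
```

With a single explanatory variable the square of the t statistic for the slope equals the F value, which is a useful check on the arithmetic.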


Figure 50
 

To add an additional variable to the regression equation we would again use the residual SS as a representation of the variance left to be explained.

In many respects regression modelling is very similar to ANOVA. Figure 51 illustrates the effect of fitting separate regression equations for males and females. If there is no effect of sex, we would obtain two lines with a similar intercept and a similar slope. If the effect of sex is constant over age, we would obtain two parallel lines, with different intercepts and the same slope. If the effect of sex changes with age (an interaction), the two lines would have different slopes.


Figure 51

In SPSS we cannot fit categorical variables directly. We must create what are called indicator variables. The number of indicator variables should be one less than the number of groups in the grouping variable. Each indicator is coded 1 for the group it represents and 0 for all other groups. So for sex, we choose Transform, Recode, Into Different Variables..., and put sex in the Numeric Variable -> Output Variable box. In the Output Variable box, in the Name box, enter sex1, then click Change. Then click on Old and New Values.... In the Old Value box, next to Value:, enter 1; in the New Value box, next to Value:, enter 1, then click Add. Then return to the Old Value box and click on the All other values option; in the New Value box, next to Value:, enter 0, then click Add, Continue, OK (see Figure 52).
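The same recoding can be sketched outside SPSS. The Python below builds one 0/1 indicator per group, leaving out a reference group so that the number of indicators is one less than the number of groups. The helper make_indicators and the codes used (1 = male, 2 = female, and three places coded 1 to 3) are assumptions for illustration; the coding in bp100.sav may differ.

```python
def make_indicators(values, reference):
    """Return one 0/1 indicator list per group, omitting the reference group."""
    groups = sorted(set(values) - {reference})
    return {g: [1 if v == g else 0 for v in values] for g in groups}

# Hypothetical codes: 1 = male, 2 = female
sex = [1, 2, 2, 1, 2]
sex1 = make_indicators(sex, reference=2)[1]
print(sex1)  # [1, 0, 0, 1, 0]

# A three-group variable needs two indicators
place = [1, 3, 2, 1, 3]
place_ind = make_indicators(place, reference=3)
place1, place2 = place_ind[1], place_ind[2]
```

Cases in the reference group have 0 on every indicator, so their mean is carried by the constant term of the regression.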


Figure 52

If we now repeat the regression with the addition of sex1 to the Independent list we will obtain the output shown in Figure 53.


Figure 53
 

Figure 53 suggests that there is no effect of sex. It gives the mean blood pressure of males as being slightly less (0.901, se 2.350, p=0.702) than that of females. However, from Figure 51, we saw that there could be some interaction effect, so we should fit an interaction term. To do this in SPSS we multiply sex1 by age to create another new variable, say agesex. We now add agesex to the Independent list in the regression procedure. This gives us Figure 54.


Figure 54

Figure 54 shows that there is a significant interaction effect. It also gives us two regression equations for diastolic blood pressure:

Females: predicted dia = 0.437age + 61.770

Males: predicted dia = (0.437 - 0.383)age + (61.770 + 15.990) = 0.054age + 77.760.
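To see how the single fitted model yields both equations, the sketch below evaluates it with sex1 set to 0 (females) and 1 (males). The coefficients are those quoted above from Figure 54; the function name predict_dia is hypothetical.

```python
def predict_dia(age, sex1):
    """Predicted diastolic pressure from the model with an age-by-sex interaction."""
    intercept, b_age, b_sex1, b_agesex = 61.770, 0.437, 15.990, -0.383
    agesex = sex1 * age                      # interaction term, as created in SPSS
    return intercept + b_age * age + b_sex1 * sex1 + b_agesex * agesex

# Compare predictions for females (sex1 = 0) and males (sex1 = 1)
for age in (30, 50, 70):
    print(age, round(predict_dia(age, 0), 1), round(predict_dia(age, 1), 1))
```

Because the slopes differ, the two lines are not parallel; with these coefficients they cross where 15.990 = 0.383age, at an age of roughly 42.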


Figure 55

Before we can accept any model we should check how well it fits over all the values. As in ANOVA, we can calculate each residual as the observed value minus the predicted value. These values should have a mean of zero and a standard deviation equal to the square root of the value in the MS column of the residual row. Again, these values may be standardized so that they can be compared with the standard normal distribution. If we plot these values against the predicted values, then we should see that most of the standardized residuals lie between -2 and +2. Also, there should be no pattern to these residuals. Any pattern will indicate that we have not fitted our model correctly. For example, if the relationship is not linear, the plotted residuals will show a systematic pattern. Figure 55 gives such a residual plot. We can see that there does not appear to be any pattern in the residuals and that most do lie within the ±2 limits. How to obtain these plots is described later.
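A minimal sketch of the residual calculation, assuming a single explanatory variable: the line is fitted by least squares, each residual is divided by the square root of the residual mean square, and the standardized residuals are counted against the ±2 limits. The data are made-up illustrative values, not the bp100.sav data.

```python
import math

def fit_line(x, y):
    """Least-squares slope and intercept for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

# Illustrative values only (not taken from bp100.sav)
age = [25, 32, 41, 47, 53, 60, 68, 36, 58, 44]
dia = [72, 80, 76, 85, 79, 90, 84, 70, 88, 81]

slope, intercept = fit_line(age, dia)
predicted = [intercept + slope * a for a in age]
residuals = [obs - pred for obs, pred in zip(dia, predicted)]

# Residual SD = square root of the residual mean square (residual SS / residual df)
df_residual = len(age) - 2
residual_sd = math.sqrt(sum(e ** 2 for e in residuals) / df_residual)
z_residuals = [e / residual_sd for e in residuals]

inside = sum(1 for z in z_residuals if abs(z) <= 2)
print(f"mean residual = {sum(residuals) / len(residuals):.6f}")
print(f"{inside} of {len(z_residuals)} standardized residuals lie within +/-2")
```

Plotting z_residuals against predicted gives the kind of residual plot described above.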


Figure 56

We can also produce a normal cumulative probability plot for the residuals. This plots the cumulative proportion of values expected to fall below each residual, assuming a standard normal distribution, against the observed cumulative proportion. This plot should follow a straight line. Any divergence from this line is a sign that the residuals are not normally distributed and that there is some further work required on the model. Figure 56 gives this normal plot for our regression model. We can see that the residuals do follow the straight line fairly well. SPSS also gives a number of diagnostic tests, which give a p value, for those more interested in mechanical statistics. These tests are not rigorous, and even when some of these tests indicate a problem with the residuals, an actual plot may reveal that the problem is not as great as the test suggests. Of course the converse may also be true.
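The points of such a plot can be sketched as follows. Each sorted standardized residual is paired with its observed cumulative proportion and with the proportion expected under a standard normal distribution. The plotting position (i - 0.5)/n is one common choice and is an assumption here; SPSS may use a different formula. The residual values are illustrative.

```python
from statistics import NormalDist

def pp_points(z_residuals):
    """Return (observed, expected) cumulative proportions for a normal P-P plot."""
    n = len(z_residuals)
    ordered = sorted(z_residuals)
    # Observed proportion: (i - 0.5) / n for the i-th smallest residual
    observed = [(i - 0.5) / n for i in range(1, n + 1)]
    # Expected proportion: standard normal cumulative probability at each residual
    expected = [NormalDist().cdf(z) for z in ordered]
    return list(zip(observed, expected))

# Illustrative standardized residuals
z = [-1.4, -0.9, -0.5, -0.2, 0.0, 0.3, 0.6, 1.0, 1.5]
for obs, exp in pp_points(z):
    print(f"{obs:.3f}  {exp:.3f}")
```

If the residuals are close to normal, the two columns are close to each other, so the plotted points lie near the diagonal line.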

We may have a large number of possible explanatory variables, and we would like to know which ones are the best for predicting blood pressure. SPSS lets you perform what is called a stepwise regression procedure. Stepwise regression works by finding the variable that is most strongly and significantly related to the dependent variable and fitting it. Having fitted this variable, it searches for the next most strongly related variable and fits that if it is significant. It then checks whether the first variable is still significantly related to the dependent variable. If not, it is removed. The process continues until no variable can be added to or removed from the equation.

To perform a stepwise multiple regression procedure in SPSS, we should enter all our potential explanatory variables into the Independent box, then in the Method: box change Enter to Stepwise. To obtain the residual diagnostics, click on Plots, select Normal Probability Plot, place *zresid in the Y: box and *adjpred in the X: box, then Continue. For confidence intervals, click on Statistics and select Confidence Intervals and Descriptives.

By default SPSS uses a significance level of 0.05 for entry and 0.1 for removal.
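The add-then-check-for-removal loop described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the SPSS implementation: it uses fixed F-to-enter and F-to-remove thresholds (3.84 and 2.71, chosen here for illustration) instead of the p-value criteria, it does not handle collinear variables or a perfect fit, and the names residual_ss and stepwise are hypothetical.

```python
def residual_ss(columns, y):
    """Residual SS from least-squares regression of y on columns plus a constant."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    b = [A[i][k] / A[i][i] for i in range(k)]
    fitted = [sum(bi * xi for bi, xi in zip(b, row)) for row in X]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def stepwise(data, y, f_enter=3.84, f_remove=2.71):
    """Stepwise selection using F-to-enter and F-to-remove thresholds."""
    selected, n = [], len(y)
    while True:
        changed = False
        # Entry step: find the excluded variable with the largest F above f_enter
        rss_now = residual_ss([data[v] for v in selected], y)
        best, best_f = None, f_enter
        for v in data:
            if v in selected:
                continue
            rss_new = residual_ss([data[u] for u in selected + [v]], y)
            f = (rss_now - rss_new) / (rss_new / (n - len(selected) - 2))
            if f > best_f:
                best, best_f = v, f
        if best is not None:
            selected.append(best)
            changed = True
        # Removal step: drop a variable whose F has fallen below f_remove
        rss_full = residual_ss([data[v] for v in selected], y)
        df = n - len(selected) - 1
        for v in list(selected):
            rest = [data[u] for u in selected if u != v]
            f = (residual_ss(rest, y) - rss_full) / (rss_full / df)
            if f < f_remove:
                selected.remove(v)
                changed = True
                break
        if not changed:
            return selected


# Made-up data: y depends on x1 only; x2 is unrelated to y by construction
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [1, 1, -1, -1, -1, -1, 1, 1]
noise = [0.5, -0.5, -0.5, 0.5, 0.5, -0.5, -0.5, 0.5]
y = [3 + 2 * a + e for a, e in zip(x1, noise)]
print(stepwise({"x1": x1, "x2": x2}, y))  # ['x1']
```

Keeping the entry threshold stricter than the removal threshold, as SPSS does with its default significance levels, stops a variable from being added and removed in an endless cycle.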

Unfortunately it does not deal with factor or categorical variables very well using this procedure and we have to set up indicator or dummy variables for all categorical variables. Once a final model has been obtained you may go back to the ANOVA procedure and fit an identical model, and obtain categorical parameter estimates more easily.

Having chosen the above commands you will obtain the following SPSS output, shown in courier font. We shall go through this output and explain in italics what it all means. You need to set up two indicator variables (place1 and place2) for place, then include age, height, place1, place2, sex1 and weight as potential explanatory variables.

First we are given the mean and standard deviation for each variable. For the indicator variables the mean represents the proportion having the value 1, e.g. 32.1% are males.

Listwise Deletion of Missing Data
Variable   Mean       Std Dev   Label
DIA        80.872     11.988    Tensao Diastolica
AGE        44.257     14.360    Age
HEIGHT     1578.339   72.852    height
PLACE1     .495       .502
PLACE2     .220       .416
SEX1       .321       .469
WEIGHT     662.661    120.566   Weight
N of Cases = 109
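The link between an indicator's mean and a proportion can be checked with a short sketch. The count of 35 males is not stated in the source; it is inferred here from the reported mean of .321 with 109 cases and should be read as an assumption.

```python
import statistics

n = 109
n_males = 35                      # assumed: inferred from the reported mean of .321
sex1 = [1] * n_males + [0] * (n - n_males)

mean = statistics.mean(sex1)      # proportion of cases coded 1
sd = statistics.stdev(sex1)       # sample standard deviation of the 0/1 values
print(f"mean = {mean:.3f}, sd = {sd:.3f}")
```

Under this assumption both figures agree with the SEX1 row of the table to three decimal places.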

We now have a correlation matrix, showing how all the variables are related. If there is a very strong relationship between two or more of the explanatory variables, it may be better to include only one in the final model.

Correlation, 1-tailed Sig:
          DIA     AGE     HEIGHT  PLACE1  PLACE2  SEX1    WEIGHT
DIA       1.000   .322    -.013   -.217   .082    -.044   .230
          .       .000    .448    .012    .199    .326    .008
AGE       .322    1.000   -.218   -.101   -.002   -.026   .155
          .000    .       .011    .147    .493    .394    .054
HEIGHT    -.013   -.218   1.000   -.236   .105    .625    .439
          .448    .011    .       .007    .138    .000    .000
PLACE1    -.217   -.101   -.236   1.000   -.527   .026    -.356
          .012    .147    .007    .       .000    .394    .000
PLACE2    .082    -.002   .105    -.527   1.000   -.223   .212
          .199    .493    .138    .000    .       .010    .013
SEX1      -.044   -.026   .625    .026    -.223   1.000   .103
          .326    .394    .000    .394    .010    .       .144
WEIGHT    .230    .155    .439    -.356   .212    .103    1.000
          .008    .054    .000    .000    .013    .144    .
 

Equation Number 1 Dependent Variable.. DIA Tensao Diastolica

Descriptive Statistics are printed on Page 118

Block Number 1. Method: Stepwise Criteria PIN .0500 POUT .1000 

AGE HEIGHT PLACE1 PLACE2 SEX1 WEIGHT
 
 

The variable most strongly related to dia is selected and found to be age. From the R Square, age explains 10.4% of the total variation, or 1607.1 of the total SS. The regression coefficient for age is found to be 0.269, se = 0.0764, p = 0.0006. (Note that T squared, 3.515 x 3.515, is approximately 12.36, which equals F = 12.35779.)

Variable(s) Entered on Step Number 

1.. AGE Age
Multiple R  .32177 
R Square  .10354 
Adjusted R Square  .09516 
Standard Error  11.40384 
Analysis of Variance
DF  Sum of Squares  Mean Square 
Regression  1  1607.10199  1607.10199 
Residual  107  13915.09984  130.04766 
F = 12.35779 Signif F = .0006

---------------------- Variables in the Equation -----------------------
Variable  B  SE B  95% Confdnce Intrvl B  Beta 
AGE  .268630  .076416  .117144  .420115  .321770 
(Constant)  68.982855  3.553944  61.937575  76.028135 
----------- in ------------
Variable  T  Sig T 
AGE  3.515  .0006 
(Constant)  19.410  .0000 
SPSS now gives a list of those variables not in the regression equation, with a T value for each indicating by how much it would be likely to reduce the SS left to explain after fitting age. Of the remaining variables, place1 appears to explain the most.

------------- Variables not in the Equation -------------
 
Variable  Beta In  Partial  Min Toler  T  Sig T 
HEIGHT  .060524  .062382  .952339  .644  .5213 
PLACE1  -.186238  -.195688  .989749  -2.054  .0424 
PLACE2  .082369  .086995  .999997  .899  .3706 
SEX1  -.035266  -.037234  .999318  -.384  .7020 
WEIGHT  .184968  .193008  .976092  2.025  .0454 
Adding place1, we have now explained 13.8% of the total variation. The regression SS is now 2139.97, which means that place1 accounted for a reduction of 2139.97 - 1607.10 = 532.87 in the residual SS. We are again given the regression coefficients.
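The contribution of place1 can be checked from the two ANOVA tables. The sketch below takes the regression SS at steps 1 and 2 and the residual mean square at step 2 from the SPSS output, and forms the F statistic for adding place1 after age. The variable names are illustrative.

```python
import math

# Values reported by SPSS at steps 1 and 2
ss_reg_step1 = 1607.10199        # regression SS with age only
ss_reg_step2 = 2139.96642        # regression SS with age and place1
ms_residual_step2 = 126.24750    # residual mean square at step 2 (106 df)
ss_total = 1607.10199 + 13915.09984

extra_ss = ss_reg_step2 - ss_reg_step1       # SS explained by place1 after age
partial_f = extra_ss / ms_residual_step2     # 1 df in the numerator
r_square_step2 = ss_reg_step2 / ss_total

print(f"extra SS = {extra_ss:.2f}")
print(f"partial F = {partial_f:.3f}, |t| = {math.sqrt(partial_f):.3f}")
print(f"R square = {r_square_step2:.3f}")
```

The square root of this F statistic matches, apart from sign, the T value that SPSS reports for place1.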

* * * * M U L T I P L E R E G R E S S I O N * * * *

Equation Number 1 Dependent Variable.. DIA Tensao Diastolica

Variable(s) Entered on Step Number

2.. PLACE1
Multiple R  .37130 
R Square  .13786 
Adjusted R Square  .12160 
Standard Error  11.23599 
Analysis of Variance
DF  Sum of Squares  Mean Square 
Regression  2  2139.96642  1069.98321 
Residual  106  13382.23541  126.24750 
F = 8.47528 Signif F = .0004

---------------------- Variables in the Equation -----------------------
Variable  B  SE B  95% Confdnce Intrvl B  Beta 
AGE  .252887  .075680  .102844  .402930  .302913 
PLACE1  -4.445093  2.163635  -8.734710  -.155477  -.186238 
(Constant)  71.881712  3.775233  64.396944  79.366480 
----------- in ------------
Variable  T  Sig T 
AGE  3.342   .0012 
PLACE1  -2.054  .0424 
(Constant)  19.040  .0000 
Now we can see that no more variables can reduce the residual SS by a significant amount and so the process stops with age and place1 as the only independent predictive variables. We have a regression equation of:
 
for those for whom place1 = 0:  predicted dia = 71.9 + 0.25age
for those for whom place1 = 1:  predicted dia = (71.9 - 4.4) + 0.25age = 67.5 + 0.25age
------------- Variables not in the Equation -------------
 
Variable  Beta In  Partial  Min Toler  T  Sig T 
HEIGHT  .010883  .011028  .885209  .113  .9102 
PLACE2  -.021846  -.019960  .712336  -.205  .8383 
SEX1  -.030938  -.033299  .989197  -.341  .7335 
WEIGHT  .136394  .136156  .859131  1.408  .1620 
End Block Number 1 PIN = .050 Limits reached.
 
 
 

* * * * M U L T I P L E R E G R E S S I O N * * * *

Equation Number 1 Dependent Variable.. DIA Tensao Diastolica

Residuals Statistics:
           Min        Max       Mean      Std Dev   N
*PRED      72.7473    90.8483   80.8716   4.4513    109   Predicted values
*ZPRED     -1.8251    2.2413    .0000     1.0000    109   Standardized predicted values
*SEPRED    1.5152     2.8771    1.8363    .3220     109   Standard errors of predicted values
*ADJPRED   72.7353    91.0855   80.8735   4.4464    109   adjusted predicted values
*RESID     -22.5838   35.1517   .0000     11.1315   109   raw residuals
*ZRESID    -2.0100    3.1285    .0000     .9907     109   standardized residuals
*SRESID    -2.0568    3.2219    -.0001    1.0073    109   Studentized residuals
*DRESID    -23.6478   37.2825   -.0019    11.5086   109
*SDRESID   -2.0891    3.3763    .0031     1.0202    109
*MAHAL     .9733      6.0905    1.9817    1.1202    109
*COOK D    .0000      .2098     .0114     .0252     109
*LEVER     .0090      .0564     .0183     .0104     109   influential observations
 


Figure 57

Finally SPSS gives some plots of the residuals. You can see that the residuals are fairly satisfactory, with no great divergence from normality. Thus you can accept the model as being adequate.

Total Cases = 109

Hi-Res Chart # 15:Normal p-p plot of *zresid

Hi-Res Chart # 14:Scatterplot of *zresid with *adjpred
 


Figure 58

