We often want to examine the relationship between two continuous variables. Correlation and regression methods help us to do this. For example, consider the data taken from the bp100.sav file. Figure 49 shows a plot of diastolic blood pressure (dia) against age (age). We may like to know how close to a straight line this relationship is. To do this, we calculate a value known as the correlation coefficient r:
$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \, \sum (y_i - \bar{y})^2}}$$
Where x refers to age, and y refers to diastolic blood pressure. In our example r = 0.322. The correlation coefficient varies between -1 and +1. A value of 0 would indicate no linear relationship between the variables at all. A value of -1 would indicate that as one variable increased the other would decrease at a constant rate, and that their relationship could be perfectly described by a straight line. Similarly, an r value of +1 would indicate that as one variable increased the other would increase at a constant rate. So in our example there does appear to be some positive relationship between diastolic blood pressure and age.
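As a rough illustration of how the correlation coefficient is computed, the following Python sketch applies the usual product-moment definition to two short lists. The function name `pearson_r` and the data values are made up for this example and are not taken from bp100.sav.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares about the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical ages and diastolic pressures (illustrative only)
age = [25, 32, 41, 48, 56, 63]
dia = [70, 78, 74, 85, 80, 88]
r = pearson_r(age, dia)
print(round(r, 3))
```

A perfectly increasing straight-line relationship gives r = +1 and a perfectly decreasing one gives r = -1, which is an easy way to check the function.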
We may also want to describe the straight line which best fits our data. To do this we use regression techniques. The process of regression is very similar to ANOVA, in that we attempt to explain the variance in our dependent variable by using potential explanatory variables. Again we would obtain an ANOVA table with a similar interpretation to before. This time we take the null hypothesis to be that there is no relationship between blood pressure and age, and thus the best straight line describing this would be a horizontal line with slope 0. Our alternative hypothesis would be that the slope of the line is other than 0. The total SS measures how much the observed values deviate from this horizontal line, and the regression SS is the reduction in SS obtained by fitting the sloped line instead.
In SPSS this regression line can be fitted as follows. Choose Statistics, Regression, Linear, in the Dependent box place dia and in the Independent box place age, OK. You should obtain the output shown in Figure 50.
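For readers without SPSS to hand, the same kind of least-squares line and its ANOVA decomposition can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `simple_regression` and the data are hypothetical, and the numbers it prints will not match Figure 50.

```python
def simple_regression(x, y):
    """Least-squares fit of y = slope * x + intercept with its ANOVA sums of squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Total SS: deviations from the horizontal line at the mean of y
    total_ss = sum((yi - mean_y) ** 2 for yi in y)
    # Residual SS: deviations from the fitted line
    residual_ss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    regression_ss = total_ss - residual_ss
    # F ratio on 1 and n - 2 degrees of freedom
    f_value = regression_ss / (residual_ss / (n - 2))
    return {
        "slope": slope,
        "intercept": intercept,
        "regression_ss": regression_ss,
        "residual_ss": residual_ss,
        "total_ss": total_ss,
        "f_value": f_value,
    }


# Hypothetical ages and diastolic pressures (illustrative only)
age = [25, 32, 41, 48, 56, 63]
dia = [70, 78, 74, 85, 80, 88]
fit = simple_regression(age, dia)
print(fit)
```

The regression SS and residual SS always add up to the total SS, which mirrors the way the ANOVA table is read.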
From Figure 50, we can form an ANOVA table. Here we need to sum the two SS in order to give our total SS. It can be seen that the regression line explains 1607.1 of the total SS. This gives an F value of 12.36 with 1,107 df (p<0.001). We are also given the equation of the best fitting line. This is:
$$y = \beta x + c$$
where $\beta$ = 0.269 (se = 0.076) and c = 68.983 (se = 3.554).
The standard errors show us how precise the estimates are. We can get SPSS to produce the associated confidence intervals for us, but you can see already (using ±2 se, that is 0.269 ± 2 × 0.076) that the slope of this line may vary from around 0.117 to 0.421.
The R square value of 0.103 says that we have explained about 10% of the total variance.
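The quantities quoted above can be checked by hand from the sums of squares. The short Python sketch below redoes the arithmetic using the full-precision values that SPSS reports for the regression of dia on age (regression SS 1607.10199, residual SS 13915.09984, slope 0.268630 with se 0.076416). The variable names are chosen here for illustration only.

```python
# Values read from the SPSS output for the regression of dia on age
regression_ss = 1607.10199
residual_ss = 13915.09984
regression_df = 1
residual_df = 107
slope = 0.268630
slope_se = 0.076416

# Total SS is the sum of the two SS in the ANOVA table
total_ss = regression_ss + residual_ss

# R square is the proportion of the total SS explained by the line
r_square = regression_ss / total_ss

# F is the regression mean square divided by the residual mean square
residual_ms = residual_ss / residual_df
f_value = (regression_ss / regression_df) / residual_ms

# With a single explanatory variable, the square of the t value for the slope equals F
t_value = slope / slope_se

print(round(r_square, 3), round(f_value, 2), round(t_value ** 2, 2))
```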
To add an additional variable to the regression equation we would again use the residual SS as a representation of the variance left to be explained.
In many respects regression modelling is very similar to ANOVA. Figure 51 illustrates the effect of fitting separate regression equations for males and females. If there is no effect of sex we would obtain two lines with a similar intercept and a similar slope. If the effect of sex is constant over age, then we would obtain two parallel lines, with different intercepts and the same slope.
In SPSS we cannot fit categorical variables directly in the regression procedure. We must create what are called indicator variables. The number of indicator variables should be one less than the number of groups the particular grouping variable has. Each indicator is coded as 1 for the group it represents and 0 for all other groups. So for sex, we choose Transform, Recode, Into Different Variables..., and put sex in the Numeric Variable -> Output Variable box. In the Output Variable box, in the Name box, enter sex1, then Change. Then click on Old and New Values.... In the Old Value box next to Value: enter 1, in the New Value box next to Value: enter 1, then Add. Then return to the Old Value box and click on the All other values option, then in the New Value box next to Value: enter 0, then Add, Continue, OK (see Figure 52).
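The coding rule for indicator variables can be sketched in Python as follows. This is a minimal illustration, not the SPSS recode itself: the function name `make_indicators`, the numeric group codes and the choice of reference group are all assumptions made for the example.

```python
def make_indicators(values, reference):
    """Return one 0/1 indicator list per group, leaving out the reference group.

    A grouping variable with k groups therefore yields k - 1 indicators.
    """
    levels = sorted(set(values))
    if reference not in levels:
        raise ValueError("reference group does not occur in the data")
    return {
        level: [1 if v == level else 0 for v in values]
        for level in levels
        if level != reference
    }


# Hypothetical codes: sex has two groups, place has three
sex = [1, 2, 2, 1]
place = [1, 2, 3, 1, 3]

sex_indicators = make_indicators(sex, reference=2)      # one indicator, like sex1
place_indicators = make_indicators(place, reference=3)  # two indicators, like place1 and place2
print(sex_indicators)
print(place_indicators)
```

The omitted reference group is the one for which every indicator is 0, so its mean is absorbed into the constant of the regression equation.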
If we now repeat the regression with the addition of sex1 to the Independent list we will obtain the output shown in Figure 53.
Figure 53 suggests that there is no effect of sex. It gives the mean blood pressure of males as being slightly (0.901, se 2.350, p=0.702) less than that of females. However, from Figure 51, we saw that there could be some interaction effect. So we should fit an interaction term. To do this in SPSS we multiply sex1 by age to create another new variable, say agesex. We now add agesex to the Independent list in the regression procedure. This gives us Figure 54.
Figure 54 shows that there is a significant interaction effect. It also gives us two regression equations for diastolic blood pressure:
Females: dia = 0.437age + 61.770
Males: dia = (0.437 - 0.383)age + (61.770 + 15.990) = 0.054age + 77.760.
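To see how the single fitted model with an interaction term yields the two equations, the sketch below evaluates it in Python using the coefficients quoted above. The function name `predict_dia` is hypothetical, and the coding sex1 = 1 for males and 0 for females follows the way the male equation is built from the female one.

```python
def predict_dia(age, sex1, constant=61.770, b_age=0.437, b_sex1=15.990, b_agesex=-0.383):
    """Predicted diastolic pressure from the model with an age-by-sex interaction."""
    agesex = sex1 * age  # the interaction variable: age for males, 0 for females
    return constant + b_age * age + b_sex1 * sex1 + b_agesex * agesex


for age in (30, 50, 70):
    female = predict_dia(age, sex1=0)
    male = predict_dia(age, sex1=1)
    print(age, round(female, 2), round(male, 2))

# Age at which the two fitted lines cross: 15.990 = 0.383 * age
crossing_age = 15.990 / 0.383
print(round(crossing_age, 1))
```

Because the slopes differ, the difference between males and females changes with age, which is what an interaction effect means here.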
Before we can accept any model we should check how well it fits over all the values. As with ANOVA, we can calculate the residuals as observed value - predicted value. These values should have a mean of zero and a variance given in the MS column for the residual row, so their standard deviation is the square root of the residual MS. Again these values may be standardized so that they can be compared with the standard normal distribution. If we plot these values against the predicted values, then we should see that most of the standardized residuals lie between -2 and +2. Also, there should be no pattern to these residuals. Any pattern indicates that we have not fitted our model correctly; for example, the relationship may not be linear. Figure 55 gives such a residual plot. We can see that there does not appear to be any pattern in the residuals and that most do lie within the ±2 limits. How to obtain these plots is described later.
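The calculation of standardized residuals can be sketched as below. This is an illustration only: the function name `standardized_residuals` and the small data set are hypothetical, and the residuals are scaled by the square root of the residual mean square, with residual df taken as n minus the number of explanatory variables minus 1.

```python
import math


def standardized_residuals(observed, predicted, n_predictors):
    """Residuals divided by the square root of the residual mean square."""
    residuals = [o - p for o, p in zip(observed, predicted)]
    residual_df = len(observed) - n_predictors - 1
    residual_ms = sum(e * e for e in residuals) / residual_df
    residual_sd = math.sqrt(residual_ms)
    return [e / residual_sd for e in residuals]


# Hypothetical observed values and the predictions from a fitted line y = 1.9x
x = [1, 2, 3, 4]
observed = [2, 4, 5, 8]
predicted = [1.9 * xi for xi in x]

z = standardized_residuals(observed, predicted, n_predictors=1)
outside = [zi for zi in z if abs(zi) > 2]  # values beyond the +/-2 limits
print([round(zi, 3) for zi in z], outside)
```

Plotting `z` against `predicted` gives the kind of residual plot described in the text.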
We can also produce a normal cumulative probability plot for the residuals. This plots the expected cumulative proportion of values that should be less than each residual, assuming a standard normal distribution, against the observed cumulative proportion. This plot should follow a straight line. Any divergence from this line is a sign that the residuals are not normally distributed and that there is some further work required on the model. Figure 56 gives this normal plot for our regression model. We can see that the residuals do follow the straight line fairly well. SPSS also gives a number of diagnostic tests, which give a p value for those more interested in mechanical statistics. These tests are not rigorous, and even when some of these tests indicate a problem with the residuals, an actual plot may reveal that the problem is not as great as the test suggests. Of course the converse may also be true.
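The coordinates of such a normal probability plot can be sketched with the Python standard library. This is a simplified illustration: the function name `pp_plot_points` is hypothetical, and the plotting position (i - 0.5)/n used for the observed cumulative proportion is an assumption, so SPSS may use a slightly different formula.

```python
from statistics import NormalDist


def pp_plot_points(z_residuals):
    """Pairs of (observed, expected) cumulative proportions for a normal P-P plot."""
    n = len(z_residuals)
    ordered = sorted(z_residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for i, z in enumerate(ordered, start=1):
        observed = (i - 0.5) / n      # assumed plotting position
        expected = std_normal.cdf(z)  # proportion expected below z under normality
        points.append((observed, expected))
    return points


# Hypothetical standardized residuals
points = pp_plot_points([-1.0, 0.0, 1.0])
for observed, expected in points:
    print(round(observed, 3), round(expected, 3))
```

If the residuals are close to normal, the points lie near the diagonal line where the observed and expected proportions are equal.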
We may have a large number of possible explanatory variables, and we would like to know which ones are best for predicting blood pressure. SPSS lets you perform what is called a stepwise regression procedure. Stepwise regression works by finding the variable which is most strongly and significantly related to the dependent variable and fitting it. Having fitted this variable, it searches for the next most strongly related variable that is still significant and fits that. It then checks whether the first variable is still significantly related to the dependent variable. If not, it is removed. The process continues until no variable can be added to or removed from the equation. In SPSS, to perform a stepwise multiple regression, enter all the potential explanatory variables into the Independent box, then in the Method: box change Enter to Stepwise. To obtain the residual diagnostics, click on Plots, select Normal Probability Plot, and place *zresid in the Y: box and *adjpred in the X: box, then Continue. For confidence intervals, click on Statistics and select Confidence Intervals and Descriptives.
By default SPSS uses a significance level of 0.05 for entry and 0.1 for removal.
Unfortunately it does not deal with factor or categorical variables very well using this procedure and we have to set up indicator or dummy variables for all categorical variables. Once a final model has been obtained you may go back to the ANOVA procedure and fit an identical model, and obtain categorical parameter estimates more easily.
Having chosen the above commands you will obtain the following SPSS output in courier font. We shall go through this output and explain what it all means in italics. You need to set up two indicator variables (place1 & place2) for place, then include age, height, place1, place2, sex1 and weight as potential explanatory variables.
* * * * M U L T I P L E R E G R E S S I O N * * * *
Listwise Deletion of Missing Data
| | Mean | Std Dev | Label |
| DIA | 80.872 | 11.988 | Tensao Diastolica |
| AGE | 44.257 | 14.360 | Age |
| HEIGHT | 1578.339 | 72.852 | height |
| PLACE1 | .495 | .502 | |
| PLACE2 | .220 | .416 | |
| SEX1 | .321 | .469 | |
| WEIGHT | 662.661 | 120.566 | Weight |
We now have a correlation matrix, showing how all the variables are related. If there is a very strong relationship between two or more of the explanatory variables, it may be better to include only one in the final model.
Correlation, 1-tailed Sig:
| | DIA | AGE | HEIGHT | PLACE1 | PLACE2 | SEX1 | WEIGHT |
| DIA | 1.000 | .322 | -.013 | -.217 | .082 | -.044 | .230 |
| Sig. | . | .000 | .448 | .012 | .199 | .326 | .008 |
| AGE | .322 | 1.000 | -.218 | -.101 | -.002 | -.026 | .155 |
| Sig. | .000 | . | .011 | .147 | .493 | .394 | .054 |
| HEIGHT | -.013 | -.218 | 1.000 | -.236 | .105 | .625 | .439 |
| Sig. | .448 | .011 | . | .007 | .138 | .000 | .000 |
| PLACE1 | -.217 | -.101 | -.236 | 1.000 | -.527 | .026 | -.356 |
| Sig. | .012 | .147 | .007 | . | .000 | .394 | .000 |
| PLACE2 | .082 | -.002 | .105 | -.527 | 1.000 | -.223 | .212 |
| Sig. | .199 | .493 | .138 | .000 | . | .010 | .013 |
| SEX1 | -.044 | -.026 | .625 | .026 | -.223 | 1.000 | .103 |
| Sig. | .326 | .394 | .000 | .394 | .010 | . | .144 |
| WEIGHT | .230 | .155 | .439 | -.356 | .212 | .103 | 1.000 |
| Sig. | .008 | .054 | .000 | .000 | .013 | .144 | . |
Descriptive Statistics are printed on Page 118
Block Number 1. Method: Stepwise Criteria PIN .0500 POUT .1000
AGE HEIGHT PLACE1 PLACE2 SEX1
WEIGHT
The variable most strongly related to dia is selected and found to be age. From the R Square, age explains 10.3% of the total variation or 1607.1 of the total SS. The regression coefficient for age is found to be 0.269, se = 0.0764, p = 0.001. (Note that with a single explanatory variable T squared equals F: 3.515 squared is approximately 12.36, and F = 12.35779.)
Variable(s) Entered on Step Number
1.. AGE Age
| Multiple R | .32177 |
| R Square | .10354 |
| Adjusted R Square | .09516 |
| Standard Error | 11.40384 |
| | DF | Sum of Squares | Mean Square |
| Regression | 1 | 1607.10199 | 1607.10199 |
| Residual | 107 | 13915.09984 | 130.04766 |
---------------------- Variables in the Equation -----------------------
| Variable | B | SE B | 95% Confdnce Intrvl B (lower) | 95% Confdnce Intrvl B (upper) | Beta |
| AGE | .268630 | .076416 | .117144 | .420115 | .321770 |
| (Constant) | 68.982855 | 3.553944 | 61.937575 | 76.028135 | |
| Variable | T | Sig T |
| AGE | 3.515 | .0006 |
| (Constant) | 19.410 | .0000 |
------------- Variables not in the Equation -------------
| Variable | Beta In | Partial | Min Toler | T | Sig T |
| HEIGHT | .060524 | .062382 | .952339 | .644 | .5213 |
| PLACE1 | -.186238 | -.195688 | .989749 | -2.054 | .0424 |
| PLACE2 | .082369 | .086995 | .999997 | .899 | .3706 |
| SEX1 | -.035266 | -.037234 | .999318 | -.384 | .7020 |
| WEIGHT | .184968 | .193008 | .976092 | 2.025 | .0454 |
* * * * M U L T I P L E R E G R E S S I O N * * * *
Equation Number 1 Dependent Variable.. DIA Tensao Diastolica
Variable(s) Entered on Step Number
2.. PLACE1
| Multiple R | .37130 |
| R Square | .13786 |
| Adjusted R Square | .12160 |
| Standard Error | 11.23599 |
| | DF | Sum of Squares | Mean Square |
| Regression | 2 | 2139.96642 | 1069.98321 |
| Residual | 106 | 13382.23541 | 126.24750 |
---------------------- Variables in the Equation -----------------------
| Variable | B | SE B | 95% Confdnce Intrvl B (lower) | 95% Confdnce Intrvl B (upper) | Beta |
| AGE | .252887 | .075680 | .102844 | .402930 | .302913 |
| PLACE1 | -4.445093 | 2.163635 | -8.734710 | -.155477 | -.186238 |
| (Constant) | 71.881712 | 3.775233 | 64.396944 | 79.366480 | |
| Variable | T | Sig T |
| AGE | 3.342 | .0012 |
| PLACE1 | -2.054 | .0424 |
| (Constant) | 19.040 | .0000 |
| for those for whom place1=0: | dîa = 71.9 + 0.25age |
| for those for whom place1 = 1: | dîa = (71.9 - 4.4) + 0.25age = 67.5 + 0.25age |
| Variable | Beta In | Partial | Min Toler | T | Sig T |
| HEIGHT | .010883 | .011028 | .885209 | .113 | .9102 |
| PLACE2 | -.021846 | -.019960 | .712336 | -.205 | .8383 |
| SEX1 | -.030938 | -.033299 | .989197 | -.341 | .7335 |
| WEIGHT | .136394 | .136156 | .859131 | 1.408 | .1620 |
* * * * M U L T I P L E R E G R E S S I O N * * * *
Equation Number 1 Dependent Variable.. DIA Tensao Diastolica
Residuals Statistics:
| | Min | Max | Mean | Std Dev | N | Description |
| *PRED | 72.7473 | 90.8483 | 80.8716 | 4.4513 | 109 | Predicted values |
| *ZPRED | -1.8251 | 2.2413 | .0000 | 1.0000 | 109 | Standardized predicted values |
| *SEPRED | 1.5152 | 2.8771 | 1.8363 | .3220 | 109 | Standard error of predicted values |
| *ADJPRED | 72.7353 | 91.0855 | 80.8735 | 4.4464 | 109 | Adjusted predicted values |
| *RESID | -22.5838 | 35.1517 | .0000 | 11.1315 | 109 | Raw residuals |
| *ZRESID | -2.0100 | 3.1285 | .0000 | .9907 | 109 | Standardized residuals |
| *SRESID | -2.0568 | 3.2219 | -.0001 | 1.0073 | 109 | Studentized residuals |
| *DRESID | -23.6478 | 37.2825 | -.0019 | 11.5086 | 109 | |
| *SDRESID | -2.0891 | 3.3763 | .0031 | 1.0202 | 109 | |
| *MAHAL | .9733 | 6.0905 | 1.9817 | 1.1202 | 109 | |
| *COOK D | .0000 | .2098 | .0114 | .0252 | 109 | |
| *LEVER | .0090 | .0564 | .0183 | .0104 | 109 | Influential observations |
Finally SPSS gives some plots of the residuals. You can see that the residuals are fairly satisfactory, with no great divergence from normality. Thus you can accept the model as being adequate.
Total Cases = 109
Hi-Res Chart # 15: Normal p-p plot of *zresid
Hi-Res Chart # 14: Scatterplot of *zresid with *adjpred