Regression and correlation

Problem

You want to perform linear regressions and/or correlations.

Solution

Some sample data to work with:

# Make some data
# X increases (noisily)
# Z increases slowly
# Y is constructed so it is inversely related to xvar and positively related to xvar*zvar
set.seed(955)
xvar <- 1:20 + rnorm(20,sd=3)
zvar <- 1:20/4 + rnorm(20,sd=2)
yvar <- -2*xvar + xvar*zvar/5 + 3 + rnorm(20,sd=4)

# Make a data frame with the variables
df <- data.frame(x=xvar, y=yvar, z=zvar)
#            x           y           z
# -4.252354091   4.5857688  1.89877152
#  1.702317971  -4.9027824 -0.82937359
#  4.323053753  -4.3076433 -1.31283495
#  1.780628408   0.2050367 -0.28479448
#  ... 

Correlation

# Correlation coefficient
cor(df$x, df$y)
# -0.7695378
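Beyond the coefficient itself, `cor.test()` reports whether a correlation differs significantly from zero. A minimal sketch with short hypothetical vectors:

```r
# Test whether a correlation differs significantly from zero
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)
ct <- cor.test(x, y)   # Pearson by default; method="spearman" is also accepted
ct$estimate            # the correlation coefficient r
ct$p.value             # two-sided p-value for the null hypothesis r = 0
```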

Correlation matrices (for multiple variables)

It is also possible to run correlations between many pairs of variables, using a matrix or data frame.

# A correlation matrix of the variables
cor(df)
#            x            y           z
# x  1.0000000 -0.769537849 0.491698938
# y -0.7695378  1.000000000 0.004172295
# z  0.4916989  0.004172295 1.000000000

# Print with only two decimal places
round(cor(df),2)
#       x     y    z
# x  1.00 -0.77 0.49
# y -0.77  1.00 0.00
# z  0.49  0.00 1.00
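If the data contain missing values, `cor()` returns NA unless told how to handle them; the `use` argument controls this. A small sketch with a hypothetical data frame `m`:

```r
# cor() returns NA when the input contains missing values
m <- data.frame(a = c(1, 2, NA, 4), b = c(2, 4, 6, 8))
cor(m$a, m$b)                        # NA, because of the missing value
cor(m$a, m$b, use = "complete.obs")  # drops incomplete rows first
```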

To visualize a correlation matrix, see ../../Graphs/Correlation matrix.

Linear regression

A linear regression where df$x is the predictor and df$y is the outcome. This can be done using two columns from a data frame, or with numeric vectors directly.

# These two commands fit the same model (the second labels the coefficient df$x):
fit <- lm(y ~ x, data=df)  # Using the columns x and y from the data frame
fit <- lm(df$y ~ df$x)     # Using the vectors df$x and df$y
fit
# Call:
# lm(formula = y ~ x, data = df)
# 
# Coefficients:
# (Intercept)            x  
#     -0.2278      -1.1829  

# This means that the predicted y = -0.2278 - 1.1829*x

# Get more detailed information:
summary(fit)
# Call:
# lm(formula = y ~ x, data = df)
#
# Residuals:
#      Min       1Q   Median       3Q      Max 
# -15.8922  -2.5114   0.2866   4.4646   9.3285 
#
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  -0.2278     2.6323  -0.087    0.932    
# x            -1.1829     0.2314  -5.113 7.28e-05 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
#
# Residual standard error: 6.506 on 18 degrees of freedom
# Multiple R-squared: 0.5922,   Adjusted R-squared: 0.5695 
# F-statistic: 26.14 on 1 and 18 DF,  p-value: 7.282e-05 
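The fitted model is an ordinary R object, so its pieces can be extracted programmatically instead of being read off the printed summary. A sketch with noise-free hypothetical data, so the coefficients come out exactly:

```r
# Extract coefficients and make predictions from an lm fit
d <- data.frame(x = 1:10)
d$y <- 2 * d$x + 1                          # exact line: intercept 1, slope 2
fit <- lm(y ~ x, data = d)
coef(fit)                                   # named vector: (Intercept) = 1, x = 2
predict(fit, newdata = data.frame(x = 11))  # 23
```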

To visualize the data with regression lines, see ../../Graphs/Scatterplots (ggplot2) and ../../Graphs/Scatterplot.

Linear regression with multiple predictors

Linear regression with yvar as the outcome, and xvar and zvar as predictors.

Note that the formula specified below does not test for interactions between x and z.

# These two commands fit the same model
fit2 <- lm(y ~ x + z, data=df)    # Using the columns x, y, and z from the data frame
fit2 <- lm(df$y ~ df$x + df$z)    # Using the vectors df$x, df$y, and df$z
fit2
# Call:
# lm(formula = y ~ x + z, data = df)
#
# Coefficients:
# (Intercept)            x            z  
#      -1.382       -1.564        1.858  

summary(fit2)
# Call:
# lm(formula = y ~ x + z, data = df)
#
# Residuals:
#    Min     1Q Median     3Q    Max 
# -7.974 -3.187 -1.205  3.847  7.524 
#
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  -1.3816     1.9878  -0.695  0.49644    
# x            -1.5642     0.1984  -7.883 4.46e-07 ***
# z             1.8578     0.4753   3.908  0.00113 ** 
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
#
# Residual standard error: 4.859 on 17 degrees of freedom
# Multiple R-squared: 0.7852,   Adjusted R-squared: 0.7599 
# F-statistic: 31.07 on 2 and 17 DF,  p-value: 2.1e-06 
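Whether an extra predictor is worth including can be checked by comparing nested models with `anova()`. A hedged sketch on simulated data (the names d, f1, and f2 are illustrative):

```r
# Compare nested models: does adding z improve on x alone?
set.seed(1)
d <- data.frame(x = 1:20, z = rnorm(20))
d$y <- d$x + 2 * d$z + rnorm(20)
f1 <- lm(y ~ x, data = d)
f2 <- lm(y ~ x + z, data = d)
anova(f1, f2)   # F-test for the added z term
```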

Interactions

Properly doing multiple regression and testing for interactions can be quite complex, and is not covered here. Here we simply fit a model with x, z, and the interaction between the two.

To model interactions between x and z, an x:z term must be added. Alternatively, the formula x*z expands to x + z + x:z.

# These are equivalent; the x*z expands to x + z + x:z
fit3 <- lm(y ~ x * z, data=df)
fit3 <- lm(y ~ x + z + x:z, data=df)
fit3
# Call:
# lm(formula = y ~ x + z + x:z, data = df)
#
# Coefficients:
# (Intercept)            x            z          x:z  
#      2.2820      -2.1311      -0.1068       0.2081  

summary(fit3)
# Call:
# lm(formula = y ~ x + z + x:z, data = df)
#
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -5.3045 -3.5998  0.3926  2.1376  8.3957 
#
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  2.28204    2.20064   1.037   0.3152    
# x           -2.13110    0.27406  -7.776    8e-07 ***
# z           -0.10682    0.84820  -0.126   0.9013    
# x:z          0.20814    0.07874   2.643   0.0177 *  
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
#
# Residual standard error: 4.178 on 16 degrees of freedom
# Multiple R-squared: 0.8505,   Adjusted R-squared: 0.8225 
# F-statistic: 30.34 on 3 and 16 DF,  p-value: 7.759e-07
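With an interaction term, the slope of x is no longer a single number: at a given value of z it is coef["x"] + coef["x:z"] * z. A sketch on simulated data (variable names are illustrative, and the true interaction coefficient is set to 0.5):

```r
# The effective slope of x depends on z when x:z is in the model
set.seed(42)
d <- data.frame(x = runif(50, 0, 10), z = runif(50, 0, 10))
d$y <- 1 + 2 * d$x + 0.5 * d$x * d$z + rnorm(50)
f <- lm(y ~ x * z, data = d)
b <- coef(f)
b["x"] + b["x:z"] * 5   # estimated slope of x when z = 5 (true value: 4.5)
```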