Recoding data

Problem

You want to recode data or calculate new data columns from existing ones.

Solution

Three ways of recoding data are shown below. The first uses R's built-in indexing and assignment, the second uses the match function, and the third uses the cut function.

The examples below will use this data:

# Read the sample data from an inline string
data <- read.table(header=TRUE, text='
 subject sex control cond1 cond2
       1   M     7.9  12.3  10.7
       2   F     6.3  10.6  11.1
       3   F     9.5  13.1  13.8
       4   M    11.5  13.4  12.9
 ')

Recoding a categorical variable

Code Male as 1 and Female as 2, and put it in a new column.

data$scode[data$sex=="M"] <- "1"
data$scode[data$sex=="F"] <- "2"

# Convert the column to a factor
data$scode <- factor(data$scode)
# subject sex control cond1 cond2 scode
#       1   M     7.9  12.3  10.7     1
#       2   F     6.3  10.6  11.1     2
#       3   F     9.5  13.1  13.8     2
#       4   M    11.5  13.4  12.9     1

Another way to do it is to use the match function:

oldvalues <-        c("M", "F")
newvalues <- factor(c("g1","g2"))  # Make this a factor

data$scode <- newvalues[ match(data$sex, oldvalues) ]
# subject sex control cond1 cond2  scode
#       1   M     7.9  12.3  10.7     g1
#       2   F     6.3  10.6  11.1     g2
#       3   F     9.5  13.1  13.8     g2
#       4   M    11.5  13.4  12.9     g1
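
To see why this works, it can help to look at what match returns on its own. The following is a minimal sketch using standalone vectors (sex, idx) that mirror the example data rather than the data frame itself; the "X" value is a made-up code used only to show what happens to unmatched input.

```r
# Hypothetical standalone vectors, mirroring the sex column above
sex       <- c("M", "F", "F", "M")
oldvalues <- c("M", "F")
newvalues <- factor(c("g1", "g2"))

# match() returns, for each element of sex, its position in oldvalues
idx <- match(sex, oldvalues)
idx
# [1] 1 2 2 1

# Indexing newvalues by those positions performs the recoding
recoded <- newvalues[idx]

# A value that is not in oldvalues gives NA instead of an error
unmatched <- match(c("M", "X"), oldvalues)
unmatched
# [1]  1 NA
```

Because unmatched values become NA, it is worth checking the recoded column for unexpected NAs when the data may contain codes that are not listed in oldvalues.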

If, instead of creating a new column, you just want to rename the levels from "M" and "F" to something else, see Renaming levels of a factor.

Recoding a continuous variable into a categorical variable

Mark those whose control measurement is <7 as "low", and those with >=7 as "high":

data$category[data$control< 7] <- "low"
data$category[data$control>=7] <- "high"
# Convert the column to a factor
data$category <- factor(data$category)
# subject sex control cond1 cond2 scode category
#       1   M     7.9  12.3  10.7    g1     high
#       2   F     6.3  10.6  11.1    g2      low
#       3   F     9.5  13.1  13.8    g2     high
#       4   M    11.5  13.4  12.9    g1     high
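
For a two-way split like this one, the same result can also be sketched in a single step with ifelse. This is an alternative to the assignments above, not a replacement for them; the standalone control vector below simply repeats the values from the example data.

```r
# Control measurements copied from the example data
control <- c(7.9, 6.3, 9.5, 11.5)

# ifelse() picks "low" where the test is TRUE and "high" elsewhere
category <- factor(ifelse(control < 7, "low", "high"))
as.character(category)
# [1] "high" "low"  "high" "high"
```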

With the cut function, you specify boundaries and the resulting values:

data$category <- cut(data$control,
                     breaks=c(-Inf, 7, 9, Inf),
                     labels=c("low","medium","high"))
# subject sex control cond1 cond2 scode category
#       1   M     7.9  12.3  10.7    g1   medium
#       2   F     6.3  10.6  11.1    g2      low
#       3   F     9.5  13.1  13.8    g2     high
#       4   M    11.5  13.4  12.9    g1     high

By default, the ranges are open on the left, and closed on the right, as in (7,9]. To set it so that ranges are closed on the left and open on the right, like [7,9), use right=FALSE.
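
The difference only shows up for values that fall exactly on a boundary. The sketch below uses two made-up values, 7 and 9, chosen to sit on the breaks; they are not taken from the example data.

```r
# Hypothetical values that lie exactly on the break points
control <- c(7, 9)
breaks  <- c(-Inf, 7, 9, Inf)
labs    <- c("low", "medium", "high")

# Default (right=TRUE): intervals are (-Inf,7], (7,9], (9,Inf]
default_cut <- cut(control, breaks=breaks, labels=labs)
as.character(default_cut)
# [1] "low"    "medium"

# right=FALSE: intervals are [-Inf,7), [7,9), [9,Inf)
left_closed <- cut(control, breaks=breaks, labels=labs, right=FALSE)
as.character(left_closed)
# [1] "medium" "high"
```

Note that the earlier rule of "<7 is low, >=7 is high" treats 7 as belonging to the upper group, which corresponds to right=FALSE rather than the default.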

Calculating a new continuous variable

Suppose you want to add a new column with the sum of the three measurements.

data$total <- data$control + data$cond1 + data$cond2
# subject sex control cond1 cond2 scode category total
#       1   M     7.9  12.3  10.7    g1   medium  30.9
#       2   F     6.3  10.6  11.1    g2      low  28.0
#       3   F     9.5  13.1  13.8    g2     high  36.4
#       4   M    11.5  13.4  12.9    g1     high  37.8
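
When there are many columns to add up, writing out each term becomes tedious. One possible alternative is rowSums, sketched below on a small data frame that repeats the three measurement columns from the example data.

```r
# Measurement columns copied from the example data
data <- data.frame(
  control = c(7.9, 6.3, 9.5, 11.5),
  cond1   = c(12.3, 10.6, 13.1, 13.4),
  cond2   = c(10.7, 11.1, 13.8, 12.9)
)

# Sum across the named columns for each row
data$total <- rowSums(data[, c("control", "cond1", "cond2")])
```

rowSums also accepts na.rm=TRUE to skip missing values, whereas the + operator returns NA for any row that contains one.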