Converting data between wide and long format
Problem
You want to do convert data from a wide format to a long format.
Many functions in R expect data to be in a long format rather than a wide format. Programs like SPSS, however, often use wide-formatted data.
Solution
There are two methods for this. One uses the functions melt()
and dcast()
from a package called reshape2
. The other uses a function called reshape()
, which is confusingly not part of the reshape2 package; it is part of the base install of R. The melt and cast methods are very powerful and simpler to use, so only they will be covered here.
Sample data
These data frames hold the same data, but in wide and long formats. They will each be converted to the other format below.
origdata.wide <- read.table(header=T, con <- textConnection(' subject sex control cond1 cond2 1 M 7.9 12.3 10.7 2 F 6.3 10.6 11.1 3 F 9.5 13.1 13.8 4 M 11.5 13.4 12.9 ')) close(con) # Make sure the subject column is a factor origdata.wide$subject <- factor(origdata.wide$subject)
origdata.long <- read.table(header=T, con <- textConnection(' subject sex condition measurement 1 M control 7.9 1 M cond1 12.3 1 M cond2 10.7 2 F control 6.3 2 F cond1 10.6 2 F cond2 11.1 3 F control 9.5 3 F cond1 13.1 3 F cond2 13.8 4 M control 11.5 4 M cond1 13.4 4 M cond2 12.9 ')) close(con) # Make sure the subject column is a factor origdata.long$subject <- factor(origdata.long$subject)
From wide to long
origdata.wide # subject sex control cond1 cond2 # 1 M 7.9 12.3 10.7 # 2 F 6.3 10.6 11.1 # 3 F 9.5 13.1 13.8 # 4 M 11.5 13.4 12.9 library(reshape2) data.long <- melt(origdata.wide, # ID variables - all the variables to keep but not split apart on id.vars=c("subject","sex"), # The source columns measure.vars=c("control", "cond1", "cond2" ), # Name of the destination column that will identify the original # column that the measurement came from variable.name="condition" ) # subject sex condition value # 1 M control 7.9 # 2 F control 6.3 # 3 F control 9.5 # 4 M control 11.5 # 1 M cond1 12.3 # 2 F cond1 10.6 # 3 F cond1 13.1 # 4 M cond1 13.4 # 1 M cond2 10.7 # 2 F cond2 11.1 # 3 F cond2 13.8 # 4 M cond2 12.9
Optional: A few things to make the data look nicer.
# Rename the "value" column to "measurement" names(data.long)[names(data.long)=="value"] <- "measurement" # Rename factor names from "cond1" and "cond2" to "first" and "second" levels(data.long$condition)[levels(data.long$condition)=="cond1"] <- "first" levels(data.long$condition)[levels(data.long$condition)=="cond2"] <- "second" # Sort by subject first, then by condition data.long <- data.long[ order(data.long$subject, data.long$condition), ] # subject sex condition measurement # 1 M control 7.9 # 1 M first 12.3 # 1 M second 10.7 # 2 F control 6.3 # 2 F first 10.6 # 2 F second 11.1 # 3 F control 9.5 # 3 F first 13.1 # 3 F second 13.8 # 4 M control 11.5 # 4 M first 13.4 # 4 M second 12.9
From long to wide
The following code uses dcast
to reshape the data. This function is meant for data frames; if you are working with arrays or matrices, use acast
instead.
origdata.long # subject sex condition measurement # 1 M control 7.9 # 1 M cond1 12.3 # 1 M cond2 10.7 # 2 F control 6.3 # 2 F cond1 10.6 # 2 F cond2 11.1 # 3 F control 9.5 # 3 F cond1 13.1 # 3 F cond2 13.8 # 4 M control 11.5 # 4 M cond1 13.4 # 4 M cond2 12.9 # From the source: # "subject" and "sex" are columns we want to keep the same # "condition" is the column that contains the names of the new column to put things in # "measurement" holds the measurements library(reshape2) data.wide <- dcast(origdata.long, subject + sex ~ condition, value_var="measurement") # subject sex cond1 cond2 control # 1 M 12.3 10.7 7.9 # 2 F 10.6 11.1 6.3 # 3 F 13.1 13.8 9.5 # 4 M 13.4 12.9 11.5
Optional: A few things to make the data look nicer.
# Rename cond1 to first, and cond2 to second names(data.wide)[names(data.wide)=="cond1"] <- "first" names(data.wide)[names(data.wide)=="cond2"] <- "second" # Reorder the columns data.wide <- data.wide[, c(1,2,5,3,4)] # subject sex control first second # 1 M 7.9 12.3 10.7 # 2 F 6.3 10.6 11.1 # 3 F 9.5 13.1 13.8 # 4 M 11.5 13.4 12.9
Note that the mapping from factor levels (e.g., first
) to column names (cond1
) is not explicitly set. The argument varying=...
maps the factor levels in their natural order to the new column names. The order of factor levels can be changed before reshaping, or the columns can be re-ordered afterward.