1.4. Migrating from SAS to R: A Skill Conversion Guide
1.4.15. Recoding Variables in SAS vs R
1. Introduction
Recoding variables is a common data management task in clinical trials, especially for SDTM domains such as AE (Adverse Events), DM (Demographics), and LB (Laboratory). Recoding allows you to:
- Collapse categories (e.g., recode AESEV from "MILD"/"MODERATE"/"SEVERE" to "NON-SERIOUS"/"SERIOUS")
- Reverse code or group lab results (e.g., recode LBORRES to "LOW"/"NORMAL"/"HIGH")
- Assign new values for analysis or reporting
It is best practice to create new variables for recoded values to preserve the original data.
2. Basic Recoding: SAS vs R
| Task | SAS | R (base/car) |
|---|---|---|
| Recode with IF/THEN | IF AESEV="SEVERE" THEN AESER="Y"; | ifelse(AESEV == "SEVERE", "Y", "N") |
| Recode with array/loop | ARRAY aesev{*} AESEV1-AESEV4; ... | lapply() or mutate(across()) |
| Recode with value labels | PROC FORMAT | factor() or recode() (car package) |
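The value-label row of the table can be illustrated with a minimal base R sketch. The display labels ("Mild", "Moderate", "Severe") and the object names are assumptions chosen for illustration, not a standard. Unlike a SAS format, which only changes how values are displayed, factor() returns a new object holding the labelled values.

```r
# Hypothetical severity values, used only to illustrate factor() labelling
aesev <- c("MILD", "SEVERE", "MODERATE", "MILD")

# levels = raw values as stored; labels = display values (illustrative choice)
aesev_lbl <- factor(aesev,
                    levels = c("MILD", "MODERATE", "SEVERE"),
                    labels = c("Mild", "Moderate", "Severe"))

print(aesev_lbl)
# The underlying integer codes follow the order given in `levels`
print(as.integer(aesev_lbl))  # 1 3 2 1
```

Fixing the level order explicitly also controls the order in which categories appear in later tables and plots.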
3. Input Example (SDTM AE Domain)
Suppose you have the following AE data:
| USUBJID | AEDECOD | AESEV | AESER |
|---|---|---|---|
| 01-001 | HEADACHE | MILD | N |
| 01-002 | NAUSEA | MODERATE | N |
| 01-003 | DIZZINESS | SEVERE | Y |
| 01-004 | FATIGUE | SEVERE | Y |
| 01-005 | NAUSEA | MILD | N |
4. Recoding a Single Variable
SAS Example:
data ae;
set ae;
if AESEV = "SEVERE" then AESER = "Y";
else AESER = "N";
run;
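- Recodes AESER: sets it to "Y" if AESEV is "SEVERE", otherwise "N". This derivation is for illustration only; in real studies, seriousness is assessed separately from severity.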
R Example (base):
# Dummy AE data
ae <- data.frame(
USUBJID = c("01-001", "01-002", "01-003", "01-004", "01-005"),
AEDECOD = c("HEADACHE", "NAUSEA", "DIZZINESS", "FATIGUE", "NAUSEA"),
AESEV = c("MILD", "MODERATE", "SEVERE", "SEVERE", "MILD"),
AESER = c("N", "N", "Y", "Y", "N")
)
ae$AESER <- ifelse(ae$AESEV == "SEVERE", "Y", "N")
- Uses ifelse() for recoding.
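One difference worth checking when porting the SAS IF/ELSE step above is how missing values behave. In SAS, a blank AESEV fails the comparison and falls into the ELSE branch ("N"), whereas in R a comparison with NA returns NA. The sketch below uses a hypothetical vector with one missing severity to show both behaviours.

```r
# Hypothetical AE severities with one missing value
aesev <- c("MILD", NA, "SEVERE")

# ifelse() with == propagates NA: the missing severity yields NA, not "N"
aeser_na <- ifelse(aesev == "SEVERE", "Y", "N")

# %in% never returns NA, so a missing severity falls into the "N" branch,
# which mirrors the SAS IF/ELSE behaviour
aeser_sas_like <- ifelse(aesev %in% "SEVERE", "Y", "N")

print(aeser_na)        # "N" NA  "Y"
print(aeser_sas_like)  # "N" "N" "Y"
```

Which behaviour is appropriate depends on the analysis; keeping NA is often the safer default because it makes missing inputs visible downstream.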
R Example (car package):
library(car)
# Dummy AE data already defined above
ae$AESEV_REC <- recode(ae$AESEV, "'MILD'='NON-SERIOUS'; 'MODERATE'='NON-SERIOUS'; 'SEVERE'='SERIOUS'")
- Collapses AESEV into "NON-SERIOUS" and "SERIOUS".
Expected Output:
| AESEV | AESER | AESEV_REC |
|---|---|---|
| MILD | N | NON-SERIOUS |
| MODERATE | N | NON-SERIOUS |
| SEVERE | Y | SERIOUS |
| SEVERE | Y | SERIOUS |
| MILD | N | NON-SERIOUS |
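A quick way to confirm that a recode produced the expected output is to cross-tabulate the original and recoded variables, much as one might with PROC FREQ in SAS. The sketch below rebuilds a small hypothetical data frame (ae_chk) and uses base ifelse() instead of car::recode() so that it runs without extra packages.

```r
# Hypothetical copy of the severity column from the AE example
ae_chk <- data.frame(
  AESEV = c("MILD", "MODERATE", "SEVERE", "SEVERE", "MILD")
)

# Same collapse as in the car example, written with base R
ae_chk$AESEV_REC <- ifelse(ae_chk$AESEV == "SEVERE", "SERIOUS", "NON-SERIOUS")

# Rows = original values, columns = recoded values; useNA exposes any missing values
chk <- table(ae_chk$AESEV, ae_chk$AESEV_REC, useNA = "ifany")
print(chk)
```

Each original value should land in exactly one recoded column; a non-zero count in an unexpected cell points to a mapping error.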
5. Recoding Multiple Variables (e.g., LBORRES in SDTM LB Domain)
Suppose you want to recode multiple lab result columns (e.g., LBORRES1, LBORRES2) to "LOW", "NORMAL", "HIGH" based on reference ranges.
SAS Example:
data lb;
  set lb;
  array lborres{2} LBORRES1 LBORRES2;
  /* $ 6 declares the new variables as character, long enough for "NORMAL" */
  array lbnorm{2} $ 6 LBNORM1 LBNORM2;
  do i = 1 to 2;
    if lborres{i} < 70 then lbnorm{i} = "LOW";
    else if lborres{i} > 110 then lbnorm{i} = "HIGH";
    else lbnorm{i} = "NORMAL";
  end;
  drop i;
run;
R Example (vectorized):
# Dummy LB data
lb <- data.frame(
LBORRES1 = c(65, 90, 115),
LBORRES2 = c(120, 80, 60)
)
lb$LBNORM1 <- ifelse(lb$LBORRES1 < 70, "LOW",
ifelse(lb$LBORRES1 > 110, "HIGH", "NORMAL"))
lb$LBNORM2 <- ifelse(lb$LBORRES2 < 70, "LOW",
ifelse(lb$LBORRES2 > 110, "HIGH", "NORMAL"))
Or, for many columns:
lab_cols <- c("LBORRES1", "LBORRES2")
lb[paste0("LBNORM", 1:2)] <- lapply(lb[lab_cols], function(x)
ifelse(x < 70, "LOW", ifelse(x > 110, "HIGH", "NORMAL")))
Expected Output:
| LBORRES1 | LBORRES2 | LBNORM1 | LBNORM2 |
|---|---|---|---|
| 65 | 120 | LOW | HIGH |
| 90 | 80 | NORMAL | NORMAL |
| 115 | 60 | HIGH | LOW |
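When columns have different reference ranges, the nested ifelse() logic can be wrapped in a small helper. The following is a sketch under stated assumptions: classify_range(), lb_demo, and the ranges list are hypothetical names, and the limits are made-up values rather than real reference ranges.

```r
# Hypothetical helper: classify numeric results against a reference range.
# `low` and `high` are treated as the inclusive bounds of the normal range.
classify_range <- function(x, low, high) {
  ifelse(x < low, "LOW", ifelse(x > high, "HIGH", "NORMAL"))
}

# Dummy LB data with one missing result
lb_demo <- data.frame(
  LBORRES1 = c(65, 90, 115, NA),
  LBORRES2 = c(120, 80, 60, 95)
)

# Illustrative (made-up) limits, one low/high pair per result column
ranges <- list(LBORRES1 = c(70, 110), LBORRES2 = c(70, 110))

# Create one LBNORMn column per LBORRESn column, leaving the originals intact
for (col in names(ranges)) {
  new_col <- sub("LBORRES", "LBNORM", col)
  lb_demo[[new_col]] <- classify_range(lb_demo[[col]], ranges[[col]][1], ranges[[col]][2])
}

print(lb_demo)
```

Note that the missing result stays NA here. In the SAS step, a missing numeric value compares lower than any number, so it would be labelled "LOW" unless missing values are handled explicitly.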
6. Recoding Continuous to Categorical (e.g., Age Groups in DM Domain)
R Example:
# Dummy DM data
dm <- data.frame(
AGE = c(12, 34, 70)
)
dm$AGEGRP <- cut(dm$AGE, breaks = c(0, 18, 65, Inf),
labels = c("Child", "Adult", "Senior"), right = FALSE)
- Categorizes AGE into "Child", "Adult", "Senior".
Expected Output:
| AGE | AGEGRP |
|---|---|
| 12 | Child |
| 34 | Adult |
| 70 | Senior |
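The right argument decides which group a boundary age belongs to, which matters for ages of exactly 18 or 65. The sketch below compares both settings on hypothetical boundary values.

```r
# Hypothetical ages sitting on or next to the cut points
age <- c(17, 18, 64, 65)
breaks <- c(0, 18, 65, Inf)
labels <- c("Child", "Adult", "Senior")

# right = FALSE: intervals are [0,18), [18,65), [65,Inf)
grp_left <- cut(age, breaks = breaks, labels = labels, right = FALSE)

# right = TRUE (the default): intervals are (0,18], (18,65], (65,Inf]
grp_right <- cut(age, breaks = breaks, labels = labels, right = TRUE)

print(data.frame(age, grp_left, grp_right))
# age 18 is "Adult" with right = FALSE but "Child" with right = TRUE
# age 65 is "Senior" with right = FALSE but "Adult" with right = TRUE
```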
7. Reverse Coding (e.g., Questionnaire Scores)
Suppose QSVAL is a 1–5 scale, and you want to reverse it.
R Example:
# Dummy QS data
qs <- data.frame(
QSVAL = 1:5
)
qs$QSVAL_REV <- 6 - qs$QSVAL
- 1→5, 2→4, 3→3, 4→2, 5→1
Expected Output:
| QSVAL | QSVAL_REV |
|---|---|
| 1 | 5 |
| 2 | 4 |
| 3 | 3 |
| 4 | 2 |
| 5 | 1 |
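The constant 6 in the example is simply the scale minimum plus the scale maximum. A small helper makes that explicit and works for other scales; reverse_score() is a hypothetical name introduced here for illustration.

```r
# Hypothetical helper: reverse a score on a scale from min_score to max_score
reverse_score <- function(x, min_score, max_score) {
  (min_score + max_score) - x
}

# 1-5 scale, including a missing response (NA is preserved)
rev_1_5 <- reverse_score(c(1, 3, 5, NA), min_score = 1, max_score = 5)
print(rev_1_5)  # 5 3 1 NA

# 0-4 scale: the constant becomes 4 instead of 6
rev_0_4 <- reverse_score(c(0, 2, 4), min_score = 0, max_score = 4)
print(rev_0_4)  # 4 2 0
```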
8. Beyond Basics: Advanced Recoding
- Multiple recodes at once: Use dplyr::mutate(across(...)) for many columns.
- Custom functions: Write your own recoding logic for complex SDTM domains.
- Factor recoding: Use forcats::fct_recode() for factor levels (e.g., DM$SEX).
R Example: dplyr/across
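library(dplyr)
# Dummy AE data already defined above
# Call car::recode() explicitly: dplyr also exports recode(), which masks
# car's version once dplyr is attached and does not accept this string syntax.
# Note: this overwrites the matched columns; add .names = "{.col}_GRP" to across() to keep the originals.
ae <- ae %>%
  mutate(across(starts_with("AESEV"),
                ~ car::recode(.x, "'MILD'='NON-SERIOUS';'MODERATE'='NON-SERIOUS';'SEVERE'='SERIOUS'")))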
R Example: forcats
library(forcats)
# Dummy DM data for SEX
dm <- data.frame(
SEX = c("F", "M", "F", "M")
)
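# Store the result in a new column so the original SEX values are preserved
dm$recoded_SEX <- fct_recode(dm$SEX, Female = "F", Male = "M")
Expected Output:
| SEX | recoded_SEX |
|---|---|
| F | Female |
| M | Male |
| F | Female |
| M | Male |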
9. Input and Output Table: Recoding Example (AE Domain)
Input Table:
| USUBJID | AEDECOD | AESEV |
|---|---|---|
| 01-001 | HEADACHE | MILD |
| 01-002 | NAUSEA | SEVERE |
R Recoding:
# Dummy AE data for recoding example
ae <- data.frame(
USUBJID = c("01-001", "01-002"),
AEDECOD = c("HEADACHE", "NAUSEA"),
AESEV = c("MILD", "SEVERE")
)
ae$AESER <- ifelse(ae$AESEV == "SEVERE", "Y", "N")
Output Table:
| USUBJID | AEDECOD | AESEV | AESER |
|---|---|---|---|
| 01-001 | HEADACHE | MILD | N |
| 01-002 | NAUSEA | SEVERE | Y |
10. Key Points and Best Practices
- Always create new variables for recoded values to preserve the original SDTM data.
- Use vectorized or column-wise functions (ifelse(), lapply(), mutate(across())) rather than row-by-row loops.
- For categorical/factor variables, use factor() or forcats for level recoding.
- Document your recoding logic for traceability and regulatory compliance.
- For complex recoding, custom functions or lookup tables can be helpful.
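The lookup-table approach mentioned in the last point can be sketched in base R with a named vector. The mapping and the "UNKNOWN" value below are hypothetical; the point is that the mapping lives in one documented object and unmapped values can be detected rather than silently recoded.

```r
# Hypothetical lookup table: names are source values, elements are recoded values
sev_map <- c(MILD = "NON-SERIOUS", MODERATE = "NON-SERIOUS", SEVERE = "SERIOUS")

# Dummy input containing one value that is absent from the lookup table
aesev <- c("MILD", "SEVERE", "MODERATE", "UNKNOWN")

# Indexing by name performs the recode; values without a match become NA
aesev_rec <- unname(sev_map[aesev])

# List source values that were present but had no mapping, for review
unmapped <- unique(aesev[is.na(aesev_rec) & !is.na(aesev)])

print(aesev_rec)  # "NON-SERIOUS" "SERIOUS" "NON-SERIOUS" NA
print(unmapped)   # "UNKNOWN"
```

Keeping the mapping in a single object also supports traceability, since the same table can be included in the recoding documentation.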