1.5. SAS PROCEDURES -> Equivalent in R
1.5.6. SAS PROC FORMAT and its equivalent in R Using dplyr::recode() + Named Vectors
Example Dataset: status_codes
| SUBJID | STATUS |
|---|---|
| 001 | C |
| 002 | W |
| 003 | L |
| 004 | S |
| 005 | NA |
Where:
- C = Completed
- W = Withdrawn
- L = Lost to Follow-up
- S = Screen Failure
We want to generate:
| SUBJID | STATUS | STATUS_DESC | STATUS_NUM |
|---|---|---|---|
| 001 | C | Completed | 1 |
| 002 | W | Withdrawn | 2 |
| 003 | L | Lost to Follow-up | 3 |
| 004 | S | Screen Failure | 4 |
| 005 | NA | MISSING | NA |
**PROC FORMAT SAS Code**
proc format;
value $statuslbl
'C' = 'Completed'
'W' = 'Withdrawn'
'L' = 'Lost to Follow-up'
'S' = 'Screen Failure';
value $statusnum
'C' = 1
'W' = 2
'L' = 3
'S' = 4;
run;
data mapped;
set status_codes;
length STATUS_DESC $25;
STATUS_DESC = put(STATUS, $statuslbl.);
STATUS_NUM = input(put(STATUS, $statusnum.), 8.);
run;
Explanation (SAS)
proc format: Defines named mappings from coded values to full text or numbers.value $statuslbl: Character format for descriptive labels.value $statusnum: Character-to-numeric mapping.put(STATUS, ...): Applies the format.input(..., 8.): Converts formatted strings into numeric values.- No explicit handling of missing codes here—they pass through unchanged unless specified.
**R code**
library(dplyr)
# Input data
status_codes <- tibble(
SUBJID = c("001", "002", "003", "004", "005"),
STATUS = c("C", "W", "L", "S", NA)
)
# Named vectors (like PROC FORMAT lookup tables)
status_label_map <- c(
"C" = "Completed",
"W" = "Withdrawn",
"L" = "Lost to Follow-up",
"S" = "Screen Failure"
)
status_num_map <- c(
"C" = 1,
"W" = 2,
"L" = 3,
"S" = 4
)
# Transformation
status_mapped <- status_codes %>%
mutate(
STATUS_DESC = recode(STATUS, !!!status_label_map, .missing = "MISSING"),
STATUS_NUM = recode(STATUS, !!!status_num_map)
)
Explanation (R)
named vectors: Define mappings like SAS formats, e.g.,"C" = "Completed".!!!status_label_map: The!!!"unquote-splices" the vector into therecode()call..missing = "MISSING": Explicitly assigns "MISSING" when STATUS is NA..default = "ERROR"(optional): Use this to flag values not included in the mapping.- Result: Adds new columns
STATUS_DESCandSTATUS_NUMbased on rawSTATUS.
Output Table
| SUBJID | STATUS | STATUS_DESC | STATUS_NUM |
|---|---|---|---|
| 001 | C | Completed | 1 |
| 002 | W | Withdrawn | 2 |
| 003 | L | Lost to Follow-up | 3 |
| 004 | S | Screen Failure | 4 |
| 005 | NA | MISSING | NA |
**Creating Multi-format codelist table**
Let’s build a systematic, scalable way in R to handle a multi-format codelist table—mimicking how SAS stores multiple PROC FORMAT entries (with different source and target types like character-to-character, numeric-to-character, etc.).
We'll walk through:
- Input Format Table Structure
- Logic to Apply: Based on type pairs
- Working Example (with code and outputs)
- Further exploration: scale, match with dplyr, wrap into a function
1. Input Codelist Format Table
We'll define a master codelist like this:
| FORMAT_NAME | TYPE_IN | TYPE_OUT | INPUT_VALUE | OUTPUT_VALUE |
|---|---|---|---|---|
| SEXLABEL | char | char | M | Male |
| SEXLABEL | char | char | F | Female |
| SEVCODE | char | num | MILD | 1 |
| SEVCODE | char | num | MODERATE | 2 |
| SEVCODE | char | num | SEVERE | 3 |
| ARMCDLAB | num | char | 1 | Placebo |
| ARMCDLAB | num | char | 2 | Drug A |
| LABSHIFT | num | num | -1 | 1 |
| LABSHIFT | num | num | 0 | 2 |
| LABSHIFT | num | num | 1 | 3 |
This is akin to a metadata codelist repository.
2. Logic for Reading and Applying Formats
We'll:
- Extract lookup pairs from the format table
- Detect
TYPE_INandTYPE_OUT - Apply mapping using
recode()for one-to-one translation, orleft_join()for more scalable operations
3. Working Example in R
Let's apply SEVCODE (char → numeric) on an AE dataset.
Dataset
ae <- tibble::tibble(
USUBJID = c("101-001", "101-002", "101-003"),
AESEV = c("MILD", "MODERATE", "SEVERE")
)
Codelist Filter for SEVCODE
format_table <- tibble::tibble(
FORMAT_NAME = c("SEVCODE", "SEVCODE", "SEVCODE"),
TYPE_IN = "char",
TYPE_OUT = "num",
INPUT_VALUE = c("MILD", "MODERATE", "SEVERE"),
OUTPUT_VALUE = c(1, 2, 3)
)
**Convert Codelist to Mapping Vector
sev_map <- setNames(format_table$OUTPUT_VALUE, format_table$INPUT_VALUE)
Apply Mapping
library(dplyr)
# Dummy data for execution
ae <- tibble::tibble(
USUBJID = c("101-001", "101-002", "101-003"),
AESEV = c("MILD", "MODERATE", "SEVERE")
)
format_table <- tibble::tibble(
FORMAT_NAME = c("SEVCODE", "SEVCODE", "SEVCODE"),
TYPE_IN = "char",
TYPE_OUT = "num",
INPUT_VALUE = c("MILD", "MODERATE", "SEVERE"),
OUTPUT_VALUE = c(1, 2, 3)
)
sev_map <- setNames(format_table$OUTPUT_VALUE, format_table$INPUT_VALUE)
ae_fmt <- ae %>%
mutate(SEV_CODE = recode(AESEV, !!!sev_map))
Output
| USUBJID | AESEV | SEV_CODE |
|---|---|---|
| 101-001 | MILD | 1 |
| 101-002 | MODERATE | 2 |
| 101-003 | SEVERE | 3 |
4. Automate for Any Format Name
Create a function that takes:
input_vectorformat_tableformat_name
apply_format <- function(values, format_tbl, format_name, default = NA) {
mapping <- format_tbl %>%
filter(FORMAT_NAME == format_name) %>%
select(INPUT_VALUE, OUTPUT_VALUE) %>%
deframe()
recode(values, !!!mapping, .default = default)
}
Use It:
ae %>%
mutate(SEV_CODE = apply_format(AESEV, format_table, "SEVCODE", default = as.numeric(NA)))
**Handling Missing Values with Factors in R**
library(dplyr)
library(forcats)
# Input data
dm <- tibble::tibble(
USUBJID = c("01-001", "01-002", "01-003", "01-004", "01-005"),
SEX = c("M", "F", "M", "F", NA),
ARMCD = c(1, 2, 2, 1, 1)
)
# Assign value labels using factors, handle missings explicitly
dm <- dm %>%
mutate(
SEX_LABEL = factor(SEX, levels = c("M", "F"), labels = c("Male", "Female")),
SEX_LABEL = forcats::fct_explicit_na(SEX_LABEL, na_level = "Missing"),
ARMCD_LABEL = factor(ARMCD, levels = c(1, 2), labels = c("Placebo", "Drug A"))
)
dm
factor(): Assigns value labels to codes.forcats::fct_explicit_na(): Makes missing values explicit as "Missing" in the factor.- Result:
SEX_LABELwill show "Missing" for NA values inSEX.
| USUBJID | SEX | ARMCD | SEX_LABEL | ARMCD_LABEL |
|---|---|---|---|---|
| 01-001 | M | 1 | Male | Placebo |
| 01-002 | F | 2 | Female | Drug A |
| 01-003 | M | 2 | Male | Drug A |
| 01-004 | F | 1 | Female | Placebo |
| 01-005 | NA | 1 | Missing | Placebo |
**Dropping Unused Factor Levels in R (with SDTM Examples)**
When working with SDTM datasets, you may assign factor levels (labels) that do not appear in your current data. This is useful for consistency across studies or when merging datasets. However, for analysis or reporting, you may want to remove these unused (empty) levels.
Example: SDTM DM Dataset with Extra Levels
library(dplyr)
library(forcats)
# Example DM data with only "M" and "F" present, but extra levels defined
dm <- tibble::tibble(
USUBJID = c("01-001", "01-002", "01-003"),
SEX = c("M", "F", "M")
)
# Assign factor with all possible SDTM levels (including "U" for Unknown)
dm <- dm %>%
mutate(
SEX_FACTOR = factor(SEX, levels = c("M", "F", "U"), labels = c("Male", "Female", "Unknown"))
)
levels(dm$SEX_FACTOR)
# [1] "Male" "Female" "Unknown"
Notice that "Unknown" is a level, but not present in the data.
Dropping Unused Levels
You can drop unused levels using droplevels():
dm$SEX_FACTOR_DROPPED <- droplevels(dm$SEX_FACTOR)
levels(dm$SEX_FACTOR_DROPPED)
# [1] "Male" "Female"
Or, when subsetting data, use the drop = TRUE argument:
# Subset to only Male subjects and drop unused levels
dm_male <- dm %>%
filter(SEX_FACTOR == "Male") %>%
mutate(SEX_FACTOR = droplevels(SEX_FACTOR))
levels(dm_male$SEX_FACTOR)
# [1] "Male"
Summary Table Example
summary(dm$SEX_FACTOR)
# Male Female Unknown
# 2 1 0
summary(dm$SEX_FACTOR_DROPPED)
# Male Female
# 2 1
- Use
droplevels()to remove empty levels from a factor variable. - When subsetting,
drop = TRUEin base R (e.g.,df[rows, cols, drop=TRUE]) or usedroplevels()after filtering in dplyr.
**Resource download links**
1.5.6.-SAS-PROC-FORMAT-and-its-equivalent-in-R-Using-recode()-Named-Vectors.zip