contact@a2zlearners.com

1.5. SAS PROCEDURES -> Equivalent in R

1.5.6. SAS PROC FORMAT and its equivalent in R Using dplyr::recode() + Named Vectors

Example Dataset: status_codes

SUBJID STATUS
001 C
002 W
003 L
004 S
005 NA

Where:

  • C = Completed
  • W = Withdrawn
  • L = Lost to Follow-up
  • S = Screen Failure

We want to generate:

SUBJID STATUS STATUS_DESC STATUS_NUM
001 C Completed 1
002 W Withdrawn 2
003 L Lost to Follow-up 3
004 S Screen Failure 4
005 NA MISSING NA

**PROC FORMAT SAS Code**

proc format;
  value $statuslbl
    'C' = 'Completed'
    'W' = 'Withdrawn'
    'L' = 'Lost to Follow-up'
    'S' = 'Screen Failure';
    
  value $statusnum
    'C' = 1
    'W' = 2
    'L' = 3
    'S' = 4;
run;

data mapped;
  set status_codes;
  length STATUS_DESC $25;
  STATUS_DESC = put(STATUS, $statuslbl.);
  STATUS_NUM = input(put(STATUS, $statusnum.), 8.);
run;

Explanation (SAS)

  • proc format: Defines named mappings from coded values to full text or numbers.
  • value $statuslbl: Character format for descriptive labels.
  • value $statusnum: Character-to-numeric mapping.
  • put(STATUS, ...): Applies the format.
  • input(..., 8.): Converts formatted strings into numeric values.
  • No explicit handling of missing codes here—they pass through unchanged unless specified.

**R code**

library(dplyr)

# Input data
status_codes <- tibble(
  SUBJID = c("001", "002", "003", "004", "005"),
  STATUS = c("C", "W", "L", "S", NA)
)

# Named vectors (like PROC FORMAT lookup tables)
status_label_map <- c(
  "C" = "Completed",
  "W" = "Withdrawn",
  "L" = "Lost to Follow-up",
  "S" = "Screen Failure"
)

status_num_map <- c(
  "C" = 1,
  "W" = 2,
  "L" = 3,
  "S" = 4
)

# Transformation
status_mapped <- status_codes %>%
  mutate(
    STATUS_DESC = recode(STATUS, !!!status_label_map, .missing = "MISSING"),
    STATUS_NUM  = recode(STATUS, !!!status_num_map)
  )

Explanation (R)

  • named vectors: Define mappings like SAS formats, e.g., "C" = "Completed".
  • !!!status_label_map: The !!! "unquote-splices" the vector into the recode() call.
  • .missing = "MISSING": Explicitly assigns "MISSING" when STATUS is NA.
  • .default = "ERROR" (optional): Use this to flag values not included in the mapping.
  • Result: Adds new columns STATUS_DESC and STATUS_NUM based on raw STATUS.

Output Table

SUBJID STATUS STATUS_DESC STATUS_NUM
001 C Completed 1
002 W Withdrawn 2
003 L Lost to Follow-up 3
004 S Screen Failure 4
005 NA MISSING NA

**Creating Multi-format codelist table**

Let’s build a systematic, scalable way in R to handle a multi-format codelist table—mimicking how SAS stores multiple PROC FORMAT entries (with different source and target types like character-to-character, numeric-to-character, etc.).

We'll walk through:

  • Input Format Table Structure
  • Logic to Apply: Based on type pairs
  • Working Example (with code and outputs)
  • Further exploration: scale, match with dplyr, wrap into a function

1. Input Codelist Format Table

We'll define a master codelist like this:

FORMAT_NAME TYPE_IN TYPE_OUT INPUT_VALUE OUTPUT_VALUE
SEXLABEL char char M Male
SEXLABEL char char F Female
SEVCODE char num MILD 1
SEVCODE char num MODERATE 2
SEVCODE char num SEVERE 3
ARMCDLAB num char 1 Placebo
ARMCDLAB num char 2 Drug A
LABSHIFT num num -1 1
LABSHIFT num num 0 2
LABSHIFT num num 1 3

This is akin to a metadata codelist repository.


2. Logic for Reading and Applying Formats

We'll:

  • Extract lookup pairs from the format table
  • Detect TYPE_IN and TYPE_OUT
  • Apply mapping using recode() for one-to-one translation, or left_join() for more scalable operations

3. Working Example in R

Let's apply SEVCODE (char → numeric) on an AE dataset.

Dataset

ae <- tibble::tibble(
  USUBJID = c("101-001", "101-002", "101-003"),
  AESEV = c("MILD", "MODERATE", "SEVERE")
)

Codelist Filter for SEVCODE

format_table <- tibble::tibble(
  FORMAT_NAME = c("SEVCODE", "SEVCODE", "SEVCODE"),
  TYPE_IN = "char",
  TYPE_OUT = "num",
  INPUT_VALUE = c("MILD", "MODERATE", "SEVERE"),
  OUTPUT_VALUE = c(1, 2, 3)
)

**Convert Codelist to Mapping Vector

sev_map <- setNames(format_table$OUTPUT_VALUE, format_table$INPUT_VALUE)

Apply Mapping

library(dplyr)
# Dummy data for execution
ae <- tibble::tibble(
  USUBJID = c("101-001", "101-002", "101-003"),
  AESEV = c("MILD", "MODERATE", "SEVERE")
)
format_table <- tibble::tibble(
  FORMAT_NAME = c("SEVCODE", "SEVCODE", "SEVCODE"),
  TYPE_IN = "char",
  TYPE_OUT = "num",
  INPUT_VALUE = c("MILD", "MODERATE", "SEVERE"),
  OUTPUT_VALUE = c(1, 2, 3)
)
sev_map <- setNames(format_table$OUTPUT_VALUE, format_table$INPUT_VALUE)
ae_fmt <- ae %>%
  mutate(SEV_CODE = recode(AESEV, !!!sev_map))

Output

USUBJID AESEV SEV_CODE
101-001 MILD 1
101-002 MODERATE 2
101-003 SEVERE 3

4. Automate for Any Format Name

Create a function that takes:

  • input_vector
  • format_table
  • format_name
apply_format <- function(values, format_tbl, format_name, default = NA) {
  mapping <- format_tbl %>%
    filter(FORMAT_NAME == format_name) %>%
    select(INPUT_VALUE, OUTPUT_VALUE) %>%
    deframe()
  recode(values, !!!mapping, .default = default)
}

Use It:

ae %>%
  mutate(SEV_CODE = apply_format(AESEV, format_table, "SEVCODE", default = as.numeric(NA)))

**Handling Missing Values with Factors in R**

library(dplyr)
library(forcats)

# Input data
dm <- tibble::tibble(
  USUBJID = c("01-001", "01-002", "01-003", "01-004", "01-005"),
  SEX = c("M", "F", "M", "F", NA),
  ARMCD = c(1, 2, 2, 1, 1)
)

# Assign value labels using factors, handle missings explicitly
dm <- dm %>%
  mutate(
    SEX_LABEL = factor(SEX, levels = c("M", "F"), labels = c("Male", "Female")),
    SEX_LABEL = forcats::fct_explicit_na(SEX_LABEL, na_level = "Missing"),
    ARMCD_LABEL = factor(ARMCD, levels = c(1, 2), labels = c("Placebo", "Drug A"))
  )

dm
  • factor(): Assigns value labels to codes.
  • forcats::fct_explicit_na(): Makes missing values explicit as "Missing" in the factor.
  • Result: SEX_LABEL will show "Missing" for NA values in SEX.
USUBJID SEX ARMCD SEX_LABEL ARMCD_LABEL
01-001 M 1 Male Placebo
01-002 F 2 Female Drug A
01-003 M 2 Male Drug A
01-004 F 1 Female Placebo
01-005 NA 1 Missing Placebo

**Dropping Unused Factor Levels in R (with SDTM Examples)**

When working with SDTM datasets, you may assign factor levels (labels) that do not appear in your current data. This is useful for consistency across studies or when merging datasets. However, for analysis or reporting, you may want to remove these unused (empty) levels.

Example: SDTM DM Dataset with Extra Levels

library(dplyr)
library(forcats)

# Example DM data with only "M" and "F" present, but extra levels defined
dm <- tibble::tibble(
  USUBJID = c("01-001", "01-002", "01-003"),
  SEX = c("M", "F", "M")
)

# Assign factor with all possible SDTM levels (including "U" for Unknown)
dm <- dm %>%
  mutate(
    SEX_FACTOR = factor(SEX, levels = c("M", "F", "U"), labels = c("Male", "Female", "Unknown"))
  )
levels(dm$SEX_FACTOR)
# [1] "Male"    "Female"  "Unknown"

Notice that "Unknown" is a level, but not present in the data.

Dropping Unused Levels

You can drop unused levels using droplevels():

dm$SEX_FACTOR_DROPPED <- droplevels(dm$SEX_FACTOR)
levels(dm$SEX_FACTOR_DROPPED)
# [1] "Male"   "Female"

Or, when subsetting data, use the drop = TRUE argument:

# Subset to only Male subjects and drop unused levels
dm_male <- dm %>%
  filter(SEX_FACTOR == "Male") %>%
  mutate(SEX_FACTOR = droplevels(SEX_FACTOR))

levels(dm_male$SEX_FACTOR)
# [1] "Male"

Summary Table Example

summary(dm$SEX_FACTOR)
#   Male Female Unknown 
#      2      1      0 

summary(dm$SEX_FACTOR_DROPPED)
#   Male Female 
#      2      1 
  • Use droplevels() to remove empty levels from a factor variable.
  • When subsetting, drop = TRUE in base R (e.g., df[rows, cols, drop=TRUE]) or use droplevels() after filtering in dplyr.

**Resource download links**

1.5.6.-SAS-PROC-FORMAT-and-its-equivalent-in-R-Using-recode()-Named-Vectors.zip