contact@a2zlearners.com

1.4. Migrating from SAS to R: A Skill Conversion Guide

1.4.2. Applying Metadata and Outputting SAS/XPT Files with haven

This guide compares SAS and R approaches to handling metadata and exporting datasets, covering everything from basic labels to advanced metadata preservation across systems.

1. Metadata Fundamentals: SAS vs R

Feature SAS Native R with haven/xportr
Dataset Labels Automatic support Via attributes
Variable Labels Automatic support Stored as attributes
Value Labels PROC FORMAT and labeled values labelled class and attributes
Formats Built-in formatting system Custom attributes
Length Control Explicit variable lengths SASlength attribute
Persistence Stored in dataset Needs explicit preservation

2. Basic Metadata Application

This section demonstrates how to apply and manage metadata in both SAS and R, focusing on dataset labels, variable labels, and value labels. In R, the recommended approach for production and regulatory workflows is to use metadata tables with functions like xportr_label() and xportr_df_label() for consistent, auditable, and scalable metadata management. Manual assignment is possible for quick tasks, but metadata-driven methods ensure reproducibility and compliance across multiple datasets and domains.

1. Dataset Labels

Capability SAS R
Add dataset label data mydata(label="Subject Data"); attr(df, "label") <- "Subject Data"
View dataset label proc contents data=mydata; attr(df, "label")
Preserve on export Automatic Via haven::write_sas() or xportr::xportr_write()

SAS Example

/* Create a dataset with a label */
data dm(label="Demographics Domain");
  input USUBJID $ SUBJID $ SEX $ AGE RFSTDTC $10. RFENDTC $10.;
  format RFSTDTC RFENDTC yymmdd10.;
  label
    USUBJID = "Unique Subject Identifier"
    SUBJID = "Subject Identifier for the Study"
    SEX = "Sex"
    AGE = "Age"
    RFSTDTC = "Subject Reference Start Date/Time"
    RFENDTC = "Subject Reference End Date/Time";
  datalines;
STUDY1-001 001 M 45 2022-03-15 2022-09-30
STUDY1-002 002 F 62 2022-03-20 2022-10-05
STUDY1-003 003 M 57 2022-03-22 2022-09-28
;
run;

/* View dataset label */
proc contents data=dm;
run;

Explanation (SAS):

  • Adds a dataset label directly in the data statement using the label= option
  • The label is stored as part of the dataset's metadata
  • proc contents displays the dataset label and other metadata

R Example

# Load required libraries
library(tibble)
library(dplyr)
library(xportr)

# Create Demographics dataset
dm <- tibble(
  USUBJID = c("STUDY1-001", "STUDY1-002", "STUDY1-003"),
  SUBJID = c("001", "002", "003"),
  SEX = c("M", "F", "M"),
  AGE = c(45, 62, 57),
  RFSTDTC = c("2022-03-15", "2022-03-20", "2022-03-22"),
  RFENDTC = c("2022-09-30", "2022-10-05", "2022-09-28")
)

# Create metadata table for dataset labels
df_labels <- tribble(
  ~dataset, ~label,
  "dm", "Demographics Domain",
  "vs", "Vital Signs Domain",
  "ae", "Adverse Events Domain"
)

# Apply the dataset label using xportr_df_label
dm <- dm %>%
  xportr_df_label(df_labels, domain = "dm") 

# Check result
attr(dm, "label")

Output:

[1] "Demographics Domain"

Explanation (R):

  • Two methods shown: direct attribute assignment or using xportr_df_label()
  • The xportr_df_label() function takes a metadata table with dataset names and labels
  • Labels are stored as attributes which can be viewed with attr() function
  • Using a metadata table allows for centralized label management across multiple SDTM domains

2. Variable (Column) Labels

Capability SAS R
Add variable label label age="Patient Age"; var_label(df$age) <- "Patient Age"
View variable labels proc contents data=mydata; map(df, attr, "label")
Apply multiple labels label statement with multiple vars xportr_label() with metadata table

SAS Example

/* Create dataset with variable labels */
data lb;
  input USUBJID $ LBTEST $ LBORRES LBDTC $10.;
  format LBDTC yymmdd10.;
  label 
    USUBJID = "Unique Subject Identifier"
    LBTEST = "Laboratory Test"
    LBORRES = "Result or Finding in Original Units"
    LBDTC = "Date/Time of Specimen Collection";
  datalines;
STUDY1-001 GLUCOSE 105.2 2022-03-16
STUDY1-002 GLUCOSE 98.6 2022-03-21
STUDY1-003 GLUCOSE 112.4 2022-03-23
;
run;

/* View variable labels */
proc contents data=lb;
run;

Explanation (SAS):

  • Uses the label statement to assign descriptive labels to variables in the LB (Laboratory) domain
  • Each variable can have one label up to 256 characters
  • Labels appear in procedures output, making results more readable
  • proc contents displays all variable labels

R Example


# Load necessary libraries
library(haven)
library(dplyr)
library(purrr)
library(xportr)
library(tibble)

# Step 1: Create laboratory data
lb <- tibble(
  USUBJID = c("STUDY1-001", "STUDY1-002", "STUDY1-003"),
  LBTEST  = c("GLUCOSE", "GLUCOSE", "GLUCOSE"),
  LBORRES = c(105.2, 98.6, 112.4),
  LBDTC   = c("2022-03-16", "2022-03-21", "2022-03-23")
)

# Step 2: Optional manual labeling (if not using metadata)
lb$LBORRES <- labelled(lb$LBORRES, label = "Result or Finding in Original Units")
lb$LBTEST  <- labelled(lb$LBTEST, label = "Laboratory Test")

# Step 3: Create metadata for xportr labeling
v_labels <- tribble(
  ~dataset, ~variable, ~label,
  "lb", "USUBJID", "Unique Subject Identifier",
  "lb", "LBTEST",  "Laboratory Test",
  "lb", "LBORRES", "Result or Finding in Original Units",
  "lb", "LBDTC",   "Date/Time of Specimen Collection"
)

# Step 4: Apply labels using xportr
lb <- lb %>%
  xportr_label(v_labels)

# Step 5: Check variable labels
map_chr(lb, ~ attr(.x, "label"))

Explanation (R):

  • You can assign variable labels directly using labelled() or var_label(), but for production and SDTM, use xportr_label() with a metadata table.
  • The xportr_label() function reads a metadata table (with dataset, variable, and label columns) and applies the correct label to each variable as an attribute.
  • Labels are stored as attributes and can be checked with map_chr(df, ~ attr(.x, "label")).
  • This metadata-driven approach ensures consistency and makes it easy to update or audit labels across multiple datasets.
  • The example shows both manual and metadata-driven labeling, but for regulatory workflows, prefer the metadata table method.

Input Table:

USUBJID LBTEST LBORRES LBDTC
STUDY1-001 GLUCOSE 105.2 2022-03-16
STUDY1-002 GLUCOSE 98.6 2022-03-21
STUDY1-003 GLUCOSE 112.4 2022-03-23

Expected Output After Labelling:

> # Step 3: View value labels
> attr(ae$AESEV, "labels")
    MILD MODERATE   SEVERE 
       1        2        3 
> attr(ae$AESER, "labels")
    Yes      No Unknown 
    "Y"     "N"     "U" 
> 
> # Step 4: Convert labelled vectors to factors for display/analysis
> ae$AESEV_F <- as_factor(ae$AESEV)
> 
> # Step 5: Compare numeric and factor versions
> head(data.frame(
+   numeric = ae$AESEV,
+   factor  = ae$AESEV_F
+ ))
  numeric   factor
1       1     MILD
2       2 MODERATE
3       1     MILD

3. Value Labels (Formats)

Capability SAS R
Create value labels proc format; value sexfmt 1='Male' 2='Female'; df$sex <- labelled(df$sex, c("Male"=1, "Female"=2))
Apply formats format sex sexfmt.; Already applied with labelled()
View formats proc format; run; val_labels(df$sex)
Convert to factors Use formatted value as_factor(df$sex)

SAS Example

/* Create formats for SDTM */
proc format;
  value sexfmt 'M'='Male' 'F'='Female' 'U'='Unknown';
  value ynfmt 'Y'='Yes' 'N'='No' 'U'='Unknown';
  value sevfmt 1='MILD' 2='MODERATE' 3='SEVERE';
run;

/* Apply formats to AE dataset */
data ae;
  input USUBJID $ AETERM $ AESEV AESER $;
  format AESEV sevfmt. AESER ynfmt.;
  datalines;
STUDY1-001 Headache 1 N
STUDY1-002 Nausea 2 N
STUDY1-003 Rash 1 N
;
run;

/* View data with formats */
proc print data=ae;
run;

Explanation (SAS):

  • Creates a format using proc format that maps numeric codes to severity levels
  • Applies the format to the AESEV variable using the format statement
  • The format affects how values are displayed but doesn't change underlying data
  • SAS procedures use formatted values in tables and graphs

R Example

library(haven)
library(tibble)

# Step 1: Create Adverse Events data with coded severity
ae <- tibble(
  USUBJID = c("STUDY1-001", "STUDY1-002", "STUDY1-003"),
  AETERM  = c("Headache", "Nausea", "Rash"),
  AESEV   = c(1, 2, 1),
  AESER   = c("N", "N", "N")
)

# Step 2: Apply value labels using haven::labelled()
ae$AESEV <- labelled(
  ae$AESEV,
  labels = c("MILD" = 1, "MODERATE" = 2, "SEVERE" = 3)
)

ae$AESER <- labelled(
  ae$AESER,
  labels = c("Yes" = "Y", "No" = "N", "Unknown" = "U")
)

# Step 3: View value labels
attr(ae$AESEV, "labels")
attr(ae$AESER, "labels")

# Step 4: Convert labelled vectors to factors for display/analysis
ae$AESEV_F <- as_factor(ae$AESEV)

# Step 5: Compare numeric and factor versions
head(data.frame(
  numeric = ae$AESEV,
  factor  = ae$AESEV_F
))

Explanation (R):

  • Creates value labels for the adverse event severity codes
  • Uses labelled() function to attach value labels to variables
  • Value labels map numeric/character codes to descriptive text
  • The original values remain unchanged but gain metadata
  • as_factor() converts labelled variables to factors for analysis/plotting
  • Follows CDISC SDTM standards for adverse event data

Expected Output After Value Labelling:

> # Step 3: View value labels
> attr(ae$AESEV, "labels")
    MILD MODERATE   SEVERE 
       1        2        3 
> attr(ae$AESER, "labels")
    Yes      No Unknown 
    "Y"     "N"     "U" 
> 
> # Step 4: Convert labelled vectors to factors for display/analysis
> ae$AESEV_F <- as_factor(ae$AESEV)
> 
> # Step 5: Compare numeric and factor versions
> head(data.frame(
+   numeric = ae$AESEV,
+   factor  = ae$AESEV_F
+ ))
  numeric   factor
1       1     MILD
2       2 MODERATE
3       1     MILD

3. Variable Properties: Length and Types

Capability SAS R
Set variable length length USUBJID $200; xportr_length() with metadata
Set variable type attrib x length=8 format=8.; R types + SAS type attributes
Column order Order in data step or retain select() or xportr_order()

SAS Example

/* Set variable lengths for DM domain */
data dm_v2;
  length USUBJID $200 SUBJID $20;
  set dm;
run;

/* Alternative with attrib */
data dm_v2;
  attrib 
    USUBJID length=$200 label="Unique Subject Identifier"
    SUBJID length=$20 label="Subject Identifier for the Study"
    SEX length=$1 label="Sex"
    AGE length=8 format=3. label="Age"
    RFSTDTC length=$10 format=yymmdd10. label="Subject Reference Start Date/Time"
    RFENDTC length=$10 format=yymmdd10. label="Subject Reference End Date/Time";
  set dm;
run;

Explanation (SAS):

  • Uses length statement to set character (with $) and numeric lengths following SDTM guidelines
  • attrib statement combines length, format, and label in one statement
  • Setting correct lengths is crucial for data integrity and CDISC compliance
  • Character lengths prevent truncation; numeric lengths affect precision

R Example

# ONLY run these lines after a fresh R restart
library(haven)
library(dplyr)
library(tibble)
library(xportr)
library(purrr) # For map_int

# Check package versions (paste the output of these)
print(sessionInfo())
print(packageVersion("xportr"))
print(packageVersion("dplyr")) # xportr often relies on dplyr
# Create mock dm dataset
dm <- tibble(
  USUBJID  = c("STUDY1-001", "STUDY1-002"),
  SUBJID   = c("001", "002"),
  SEX      = c("M", "F"),
  AGE      = c(34, 29),
  RFSTDTC  = c("2022-03-01", "2022-03-02"),
  RFENDTC  = c("2022-03-20", "2022-03-22")
)

# Create metadata
metadata <- tribble(
  ~dataset, ~variable, ~label, ~length, ~type,
  "dm", "USUBJID", "Unique Subject Identifier", 200, "text",
  "dm", "SUBJID", "Subject Identifier for the Study", 20, "text",
  "dm", "SEX", "Sex", 1, "text",
  "dm", "AGE", "Age", 8, "numeric",
  "dm", "RFSTDTC", "Subject Reference Start Date/Time", 10, "date",
  "dm", "RFENDTC", "Subject Reference End Date/Time", 10, "date"
)

# Apply metadata using xportr_metadata()
dm_with_metadata <- dm %>%
  xportr_metadata(metadata = metadata, domain = "dm")

# Access the metadata directly
# More robust way to access the attribute
metadata <- attr(dm_with_metadata, "_xportr.df_metadata_")

# Extract the SAS lengths
sas_lengths <- metadata %>%
  select(variable, length) %>%
  deframe()

print(sas_lengths)

Explanation (R):

  • Uses xportr_length() to set SAS-compatible lengths via attributes
  • Lengths are stored as "SASlength" attributes on each column
  • Types can be set with xportr_type() to ensure proper export
  • A single metadata table can store all properties for centralized management
  • Length attributes are used when writing to SAS format for CDISC compliant submission

Expected Output After Setting Lengths:

> print(sas_lengths)
USUBJID  SUBJID     SEX     AGE RFSTDTC RFENDTC 
    200      20       1       8      10      10 

4. Writing Data with Metadata

Capability SAS R
Export to XPT proc cport data=ds file="file.xpt"; write_xpt(df, "file.xpt")
Export to SAS7BDAT libname out "path"; data out.ds; set ds; run; write_sas(df, "file.sas7bdat")
Metadata preservation Automatic Attributes preserved by haven/xportr
Compress data compress=yes option compress=TRUE option
Dataset label on export Part of dataset label= parameter

SAS Example

/* Export SDTM DM domain to XPT file (transport format) */
proc cport data=dm 
  file="c:\sdtm\dm.xpt";
run;

/* Export to SAS dataset */
libname sdtm "c:\sdtm";
data sdtm.dm;
  set dm;
run;

Explanation (SAS):

  • proc cport creates a transport file (.xpt) for cross-platform use
  • Native SAS dataset created by writing to a libname
  • All metadata (labels, formats, lengths) is automatically preserved
  • Transport files are commonly used for regulatory submissions

R Example

library(haven)
library(xportr)

# Method 1: Using haven
write_xpt(dm, "c:/sdtm/dm_haven.xpt", version = 5)
write_sas(dm, "c:/sdtm/dm_haven.sas7bdat")

# Method 2: Using xportr (preserves metadata better)
dm %>%
  xportr_write("c:/sdtm/dm_xportr.xpt")

# Method 3: With dataset label specified at write time
dm %>%
  xportr_write("c:/sdtm/dm_labelled.xpt", 
               label = "Demographics Domain")

Explanation (R):

  • write_xpt() from haven creates SAS transport files
  • write_sas() creates SAS7BDAT native format files
  • xportr_write() offers enhanced metadata preservation
  • The version = 5 parameter ensures XPT V5 compatibility (FDA standard)
  • Dataset labels can be applied or overridden at export time
  • Attributes from R (labels, formats) are preserved in export
  • Critical for regulatory submissions like FDA and PMDA

Input Dataset:

USUBJID SUBJID SEX AGE RFSTDTC RFENDTC
STUDY1-001 001 M 45 2022-03-15 2022-09-30
STUDY1-002 002 F 62 2022-03-20 2022-10-05
STUDY1-003 003 M 57 2022-03-22 2022-09-28

Expected SAS Output (when reading exported XPT):

PROC CONTENTS of dm dataset

Dataset Name: DM
Label: Demographics Domain

Variables:
#  Variable   Type  Length  Format   Label
1  USUBJID    Char  200              Unique Subject Identifier
2  SUBJID     Char  20               Subject Identifier for the Study
3  SEX        Char  1                Sex
4  AGE        Num   8                Age
5  RFSTDTC    Char  10     YYMMDD10. Subject Reference Start Date/Time
6  RFENDTC    Char  10     YYMMDD10. Subject Reference End Date/Time

5. Beyond Basics: Controlling Output XPT Standards

Feature SAS R with haven/xportr
XPT Version V5 default in proc cport version = "5" or "8"
Variable name length 8 chars (V5), 32 chars (V8) Truncated based on version
Label truncation 40 chars (V5), 200 chars (V8) Truncated based on version
Column ordering As in dataset or retain xportr_order() or select()
SAS engine compliance Automatic Validation with xportr_validate()

SAS Example

/* Export with variable order control for SDTM compliance */
data ex_ordered;
  retain STUDYID USUBJID EXSEQ EXTRT EXDOSE EXDOSU EXSTDTC EXENDTC;
  set ex;
run;

proc cport data=ex_ordered 
  file="c:\sdtm\ex_ordered.xpt";
run;

Explanation (SAS):

  • retain statement controls the order of variables in the EX (Exposure) domain
  • Variables are exported in the SDTM-specified order
  • Transport files have specific limitations on name and label lengths
  • SAS handles compliance with XPT standards automatically
  • Critical for CDISC compliance in regulatory submissions

R Example

# Load necessary libraries
library(xportr)
library(tibble)
library(dplyr)

# Create exposure domain data
ex <- tibble(
  USUBJID = c("STUDY1-001", "STUDY1-002", "STUDY1-003"),
  STUDYID = rep("STUDY1", 3),
  EXSEQ = c(1, 1, 1),
  EXTRT = c("Compound X", "Compound X", "Compound X"),
  EXSTDTC = c("2022-03-15", "2022-03-20", "2022-03-22"),
  EXENDTC = c("2022-09-30", "2022-10-05", "2022-09-28"),
  EXDOSE = c(100, 100, 100),
  EXDOSU = c("mg", "mg", "mg")
)

# Define SDTM standard column order
ex_order <- c("STUDYID", "USUBJID", "EXSEQ", "EXTRT", "EXDOSE", "EXDOSU", "EXSTDTC", "EXENDTC")

# 1. Prepare the data frame by reordering columns
ex_export <- ex %>%
  select(all_of(ex_order))

# 2. Validate the prepared data frame.
# This function prints messages to the console but does not return the data frame.
xpt_validate(ex_export)

# 3. Write the original data frame (ex_export) to the XPT file.
xportr_write(ex_export, path = "C:/Users/gnana/OneDrive/Documents/R Programming/ex_v5.xpt")


#Round trip validation output to verify the exported xpt file

# Specify the full path to your exported file
file_path <- "C:/Users/gnana/OneDrive/Documents/R Programming/ex_v5.xpt"

# --- Validation Step ---
# Attempt to read the .xpt file.
# A successful read indicates a valid file structure.
tryCatch({
  re_imported_data <- read_xpt(file_path)
  
  # If successful, print a confirmation and view the data
  print("Validation successful: The XPT file was read without errors.")
  print(head(re_imported_data))
  
}, error = function(e) {
  
  # If an error occurs, the file is not valid
  print(paste("Validation failed. Error reading XPT file:", e$message))
  
})

Output:

[1] "Validation successful: The XPT file was read without errors."
# A tibble: 3 × 8
  STUDYID USUBJID    EXSEQ EXTRT      EXDOSE EXDOSU EXSTDTC   EXENDTC
  <chr>   <chr>      <dbl> <chr>       <dbl> <chr>  <chr>     <chr>  
1 STUDY1  STUDY1-001     1 Compound X    100 mg     2022-03-… 2022-0…
2 STUDY1  STUDY1-002     1 Compound X    100 mg     2022-03-… 2022-1…
3 STUDY1  STUDY1-003     1 Compound X    100 mg     2022-03-… 2022-0…

Explanation (R):

  • The SDTM column order is set using select(all_of(ex_order)) for explicit control.
  • xpt_validate() checks the data frame for XPT compliance, printing issues to the console.
  • xportr_write() exports the data to an XPT file, preserving metadata and structure.
  • After export, a round-trip validation is performed by reading the XPT file back with read_xpt(). If successful, the file structure and data integrity are confirmed.
  • This workflow ensures that exported files meet regulatory requirements (e.g., FDA, PMDA) for variable names, labels, and ordering, and that the output is verifiable and reproducible.
  • Use version 5 for 8-character variable names and 40-character labels; version 8 allows longer names/labels if required.

6. Beyond Basics: Date Handling and Special Types

Capability SAS R
Date storage Days since 1960-01-01 Date class or numeric with attribute
Datetime storage Seconds since 1960-01-01 POSIXct with attribute
Format specification format date date9.; Custom attributes
Special formats Various SAS formats Custom implementation required

SAS Example

/* Create SDTM VS domain with various date formats */
data vs;
  input USUBJID $ VISITNUM VISIT $ VSTESTCD $ VSDTC $10. VSTIME $8. VSSTRESN;
  VSDTM = dhms(input(VSDTC, yymmdd10.), 0, 0, input(VSTIME, time8.));
  format VSDTC yymmdd10. VSTIME time8. VSDTM datetime19.;
  datalines;
STUDY1-001 1 BASELINE SYSBP 2022-03-15 08:30:00 120.5
STUDY1-002 1 BASELINE SYSBP 2022-03-20 09:15:45 118.0
STUDY1-003 1 BASELINE SYSBP 2022-03-22 10:45:30 135.5
;
run;

proc print data=vs;
run;

Explanation (SAS):

  • SAS stores dates as days since January 1, 1960
  • Datetime values are stored as seconds since January 1, 1960
  • Time values are stored as seconds since midnight
  • Combines date and time into datetime using dhms() function
  • Formats control how dates, times, and vital sign results are displayed
  • Various specialized formats available for different needs

R Example

# Load required libraries
library(haven)
library(tibble)
library(hms)      # For handling time-only data

# Create the vital signs dataframe
vs <- tibble(
  USUBJID   = c("STUDY1-001", "STUDY1-002", "STUDY1-003"),
  VISITNUM  = c(1, 1, 1),
  VISIT     = c("BASELINE", "BASELINE", "BASELINE"),
  VSTESTCD  = c("SYSBP", "SYSBP", "SYSBP"),
  VSDTC     = c("2022-03-15", "2022-03-20", "2022-03-22"),
  VSTIME    = c("08:30:00", "09:15:45", "10:45:30"),
  VSSTRESN  = c(120.5, 118.0, 135.5)
)

# Combine date and time to create a POSIXct datetime object
# This is done before the original columns are converted to ensure clean input
vs$VSDTM <- as.POSIXct(paste(vs$VSDTC, vs$VSTIME), tz = "UTC")

# Convert date and time columns to their proper R data classes
vs$VSDTC <- as.Date(vs$VSDTC)        # Convert to Date class
vs$VSTIME <- as_hms(vs$VSTIME)      # Convert to hms class (which is numeric)

# Assign SAS formats for compatibility
attr(vs$VSDTC,  "format.sas") <- "YYMMDD10."
attr(vs$VSTIME, "format.sas") <- "TIME8."
attr(vs$VSDTM,  "format.sas") <- "DATETIME19."

# Write the dataset to an XPT file
# write_xpt correctly handles Date, hms, and POSIXct objects
write_xpt(vs, path = "C:/Users/gnana/OneDrive/Documents/R Programming/vs.xpt")

# Read the XPT file back in to verify the fix
vs_check <- read_xpt("C:/Users/gnana/OneDrive/Documents/R Programming/vs.xpt")

# Display the result to confirm it worked
print(vs_check)

# Check the formats of the columns in the read dataset
# 1. Check the format for the Date column (VSDTC)
attr(vs_check$VSDTC, "format.sas")

# 2. Check the format for the Time column (VSTIME)
attr(vs_check$VSTIME, "format.sas")

# 3. Check the format for the Datetime column (VSDTM)
attr(vs_check$VSDTM, "format.sas")

# 4. Check a column with no format
attr(vs_check$USUBJID, "format.sas")

Explanation (R):

  • R stores dates as days since January 1, 1970 (Date class)
  • POSIXct stores datetimes as seconds since January 1, 1970
  • Creates combined datetime from separate date and time columns
  • haven converts between R and SAS date epochs automatically
  • SAS formats can be stored as attributes for export
  • Format attributes control how dates and times appear in SAS after export

Input Date Data:

USUBJID VISITNUM VISIT VSTESTCD VSDTC VSTIME VSSTRESN
STUDY1-001 1 BASELINE SYSBP 2022-03-15 08:30:00 120.5
STUDY1-002 1 BASELINE SYSBP 2022-03-20 09:15:45 118.0
STUDY1-003 1 BASELINE SYSBP 2022-03-22 10:45:30 135.5

Output in R:

> print(vs_check)
# A tibble: 3 × 8
  USUBJID    VISITNUM VISIT    VSTESTCD VSDTC      VSTIME   VSSTRESN VSDTM              
  <chr>      <chr>    <dbl> <chr>       <date>     <time>      <dbl> <dttm>             
1 STUDY1-001        1 BASELINE SYSBP    2022-03-15 08:30:00     120. 2022-03-15 08:30:00
2 STUDY1-002        1 BASELINE SYSBP    2022-03-20 09:15:45     118  2022-03-20 09:15:45
3 STUDY1-003        1 BASELINE SYSBP    2022-03-22 10:45:30     136. 2022-03-22 10:45:30
> 
> # Check the formats of the columns in the read dataset
> # 1. Check the format for the Date column (VSDTC)
> attr(vs_check$VSDTC, "format.sas")
[1] "YYMMDD10"
> 
> # 2. Check the format for the Time column (VSTIME)
> attr(vs_check$VSTIME, "format.sas")
[1] "TIME8"
> 
> # 3. Check the format for the Datetime column (VSDTM)
> attr(vs_check$VSDTM, "format.sas")
[1] "DATETIME19"
> 
> # 4. Check a column with no format
> attr(vs_check$USUBJID, "format.sas")
NULL

7. Beyond Basics: Creating SAS Catalogs and Format Libraries

SAS format catalogs (.sas7bcat) store format definitions that can be used across multiple datasets. In R, we can simulate this through centralized metadata tables.

Capability SAS R
Format library proc format library=lib; Central metadata table
Catalog export proc cport formats Not directly supported
Format association libname fmtlib (formats); xportr_format() with metadata
Format inheritance Through library reference Through metadata application

SAS Example

/* Create a format catalog for SDTM domains */
libname fmtlib "c:\sdtm\formats";

proc format library=fmtlib;
  value sexfmt 'M'='Male' 'F'='Female' 'U'='Unknown';
  value ynfmt 'Y'='Yes' 'N'='No' 'U'='Unknown';
  value agegrpfmt 1='<30 years' 2='30-60 years' 3='>=60 years';
  value racefmt 1='WHITE' 2='BLACK OR AFRICAN AMERICAN' 3='ASIAN' 4='OTHER';
run;

/* Use the format library in DM domain */
libname sdtm "c:\sdtm\data";

data sdtm.dm;
  set dm;
  /* Create derived age group variable */
  if AGE < 30 then AGEGRP = 1;
  else if AGE >= 30 and AGE < 60 then AGEGRP = 2;
  else if AGE >= 60 then AGEGRP = 3;
  format SEX sexfmt. AGEGRP agegrpfmt.;
run;

Explanation (SAS):

  • Creates a format catalog in the fmtlib library with SDTM-relevant formats
  • Format definitions are stored separately from datasets
  • Includes formats for sex, yes/no flags, age groups, and visit numbers
  • Formats can be shared across multiple SDTM domains
  • Libraries can be referenced to use stored formats
  • Standard approach in clinical research for consistency

R Equivalent Approach

# Load necessary libraries
library(tibble)
library(dplyr)
library(haven)

# 1. Dummy DM (Demographics) Dataset
# ----------------------------------------------------------------
dm <- tibble(
  STUDYID  = "XYZ123",
  DOMAIN   = "DM",
  USUBJID  = paste("XYZ123-0", 10:19, sep = ""),
  SUBJID   = as.character(10:19),
  RFSTDTC  = "2024-01-15",
  RFENDTC  = "2024-06-20",
  SITEID   = sample(c("S101", "S102", "S103"), 10, replace = TRUE),
  AGE      = sample(25:75, 10, replace = TRUE),
  AGEU     = "YEARS",
  SEX      = sample(c("M", "F", "U"), 10, replace = TRUE, prob = c(0.45, 0.45, 0.1)),
  RACE     = sample(c("WHITE", "BLACK OR AFRICAN AMERICAN", "ASIAN", "OTHER"), 10, replace = TRUE),
  ETHNIC   = sample(c("HISPANIC OR LATINO", "NOT HISPANIC OR LATINO"), 10, replace = TRUE),
  ARMCD    = sample(c("PBO", "TRT1"), 10, replace = TRUE),
  ARM      = ifelse(ARMCD == "PBO", "Placebo", "Treatment 100mg"),
  ACTARMCD = ARMCD,
  ACTARM   = ARM,
  COUNTRY  = "USA"
)

# 2. Central Format Library
#    All 'value's are characters, so the tibble can be created.
# ----------------------------------------------------------------
format_library <- tribble(
  ~format_name, ~value, ~label,
  "sexfmt", "M", "Male",
  "sexfmt", "F", "Female", 
  "sexfmt", "U", "Unknown",
  "racefmt", 1, "WHITE",
  "racefmt", 2, "BLACK OR AFRICAN AMERICAN",
  "racefmt", 3, "ASIAN",
  "racefmt", 4, "OTHER",
  "ynfmt", "Y", "Yes",
  "ynfmt", "N", "No",
  "ynfmt", "U", "Unknown",
  "agegrpfmt", 1, "<30 years",
  "agegrpfmt", 2, "30-60 years",
  "agegrpfmt", 3, ">=60 years"
)

# 3. Format Metadata
#    This table links a variable in a dataset to a format name.
#    Note: We will format a new numeric variable 'RACEN' instead of 'RACE'.
# ----------------------------------------------------------------
format_meta <- tribble(
  ~dataset, ~variable, ~format,
  "dm", "SEX",    "sexfmt",
  "dm", "AGEGRP", "agegrpfmt",
  "dm", "RACEN",  "racefmt"  # We will create RACEN to hold the numeric codes
)

# 4. Create Derived Variables for Formatting
#    We need numeric versions of AGEGRP and RACE to match our formats.
# ----------------------------------------------------------------
dm_to_format <- dm %>%
  mutate(
    # Create numeric AGEGRP from AGE
    AGEGRP = case_when(
      AGE < 30             ~ 1,
      AGE >= 30 & AGE < 60 ~ 2,
      AGE >= 60            ~ 3
    ),
    # Create numeric RACEN from the character RACE variable
    RACEN = case_when(
      RACE == "WHITE"                     ~ 1,
      RACE == "BLACK OR AFRICAN AMERICAN" ~ 2,
      RACE == "ASIAN"                     ~ 3,
      RACE == "OTHER"                     ~ 4
    )
  )

# 5. The "Smart" Formatting Function
#    This function correctly handles type conversion before applying labels.
# ----------------------------------------------------------------
apply_sdtm_formats <- function(df, dataset_name, format_meta, format_library) {
  dataset_formats <- format_meta %>%
    filter(dataset == dataset_name)
  
  for (i in 1:nrow(dataset_formats)) {
    var <- dataset_formats$variable[i]
    fmt <- dataset_formats$format[i]
    
    # Special case for date formats
    if (fmt %in% c("YYMMDD10.", "DATETIME19.")) {
      if (var %in% names(df)) {
        attr(df[[var]], "format.sas") <- fmt
      }
      next
    }
    
    # Extract format definition
    format_def <- format_library %>%
      filter(format_name == fmt)
    
    if (nrow(format_def) == 0) next
    
    # Create named vector for labelled
    values <- format_def$value
    names(values) <- format_def$label
    format_vector <- values
    
    # Apply to variable if it exists
    if (var %in% names(df)) {
      df[[var]] <- haven::labelled(df[[var]], format_vector)
      attr(df[[var]], "format.sas") <- paste0(fmt, ".")
    }
  }
  
  return(df)
}

# 6. Apply the Formats
# ----------------------------------------------------------------
dm_formatted <- dm_to_format %>%
  xportr_df_label(ds_metadata, domain = "dm") %>%
  xportr_label(var_metadata, domain = "dm") %>%
  xportr_length(var_metadata, domain = "dm") %>%
  apply_sdtm_formats(var_meta = var_metadata %>% filter(dataset == "dm"), format_lib = format_library)

print(purrr::map(dm_formatted, attr, "labels"))

# 7. Write the file (Unchanged)
xpt_path <- "C:/Users/gnana/OneDrive/Documents/R Programming/dm.xpt"
haven::write_xpt(dm_formatted, path = xpt_path, version = 5)
cat(paste("Success. The compliant XPT file has been written to:", xpt_path, "\n"))

# 8. Verification Script (Will now show the correct output)
cat(paste("\nReading file from:", xpt_path, "\n\n"))
dm_read_back <- haven::read_xpt(xpt_path)

# Check 1: Dataset Label
cat("\n--- 1. Dataset Label Verification ---\n")
print(attr(dm_read_back, "label"))

# Check 2: Variable Labels
cat("\n--- 2. Variable Label Verification ---\n")
print(purrr::map(dm_read_back, attr, "label"))

# Check 3: Value Labels (Formats)
cat("\n--- 3. Value Label (Format) Verification ---\n")
print(purrr::map(dm_read_back, attr, "labels"))

# Check 4: Formats
sapply(vs_check, function(x) attr(x, "format.sas"))

# Use haven::as_factor() to apply the labels from the file
dm_as_factor <- haven::as_factor(dm_read_back)

# Print the resulting data frame
print(dm_as_factor)
str(dm_as_factor)

Output:

--- 1. Dataset Label Verification ---
> print(attr(dm_read_back, "label"))
[1] "Demographics Domain"
> 
> # Check 2: Variable Labels
> cat("\n--- 2. Variable Label Verification ---\n")

--- 2. Variable Label Verification ---
> print(purrr::map(dm_read_back, attr, "label"))
$STUDYID
[1] "Study Identifier"

$USUBJID
[1] "Unique Subject Identifier"

$SUBJID
[1] "Subject Identifier for the Study"

$SITEID
[1] "Study Site Identifier"

$SEX
NULL

$RACE
NULL

$AGE
[1] "Age"

$RFSTDTC
[1] "Subject Reference Start Date"

$RFENDTC
[1] "Subject Reference End Date"

$AGEGRP
NULL

> 
> # Check 3: Value Labels (Formats)
> cat("\n--- 3. Value Label (Format) Verification ---\n")

--- 3. Value Label (Format) Verification ---
> print(purrr::map(dm_read_back, attr, "labels"))
$STUDYID
NULL

$USUBJID
NULL

$SUBJID
NULL

$SITEID
NULL

$SEX
NULL

$RACE
NULL

$AGE
NULL

$RFSTDTC
NULL

$RFENDTC
NULL

$AGEGRP
NULL

> 
> # Check 4: Formats
> sapply(vs_check, function(x) attr(x, "format.sas"))
$USUBJID
NULL

$VISITNUM
NULL

$VISIT
NULL

$VSTESTCD
NULL

$VSDTC
[1] "YYMMDD10"

$VSTIME
[1] "TIME8"

$VSSTRESN
NULL

$VSDTM
[1] "DATETIME19"

> 
> # Use haven::as_factor() to apply the labels from the file
> dm_as_factor <- haven::as_factor(dm_read_back)
> 
> # Print the resulting data frame
> print(dm_as_factor)
# A tibble: 3 × 10
  STUDYID USUBJID    SUBJID SITEID SEX    RACE   AGE RFSTDTC    RFENDTC    AGEGRP
  <chr>   <chr>      <chr>  <chr>  <chr> <dbl> <dbl> <date>     <date>      <dbl>
1 STUDY1  STUDY1-001 001    01     M         1    45 2022-03-15 2022-09-30      2
2 STUDY1  STUDY1-002 002    01     F         2    62 2022-03-15 2022-09-30      3
3 STUDY1  STUDY1-003 003    02     M         3    57 2022-03-15 2022-09-30      2
> str(dm_as_factor)
tibble [3 × 10] (S3: tbl_df/tbl/data.frame)
 $ STUDYID: chr [1:3] "STUDY1" "STUDY1" "STUDY1"
  ..- attr(*, "label")= chr "Study Identifier"
 $ USUBJID: chr [1:3] "STUDY1-001" "STUDY1-002" "STUDY1-003"
  ..- attr(*, "label")= chr "Unique Subject Identifier"
 $ SUBJID : chr [1:3] "001" "002" "003"
  ..- attr(*, "label")= chr "Subject Identifier for the Study"
 $ SITEID : chr [1:3] "01" "01" "02"
  ..- attr(*, "label")= chr "Study Site Identifier"
 $ SEX    : chr [1:3] "M" "F" "M"
  ..- attr(*, "format.sas")= chr "sexfmt"
 $ RACE   : num [1:3] 1 2 3
  ..- attr(*, "format.sas")= chr "racefmt"
 $ AGE    : num [1:3] 45 62 57
  ..- attr(*, "label")= chr "Age"
 $ RFSTDTC: Date[1:3], format: "2022-03-15" "2022-03-15" "2022-03-15"
 $ RFENDTC: Date[1:3], format: "2022-09-30" "2022-09-30" "2022-09-30"
 $ AGEGRP : num [1:3] 2 3 2
  ..- attr(*, "format.sas")= chr "agegrpfm"
 - attr(*, "label")= chr "Demographics Domain"

Note: Check 3: Value Labels (Formats) - Value Labels Disappear

  • haven::write_xpt() does not preserve value labels (attr(x, "labels")) in the .xpt output. It complies with the SAS Transport v5 format spec but omits formats because:
    • SAS Transport files only store formats as text references, not the format definitions.
    • haven doesn't write the necessary format catalogs (.sas7bcat) which SAS uses to interpret those formats during reading.

So even though your variables have labels attributes attached in memory, they’re stripped when writing to the file—and therefore missing when read back.

If preserving formats is critical:

  • Consider storing the format definitions externally (e.g., using JSON, RDS) and re-applying them after reading the .xpt.
  • Alternatively, switch to sas7bdat format if working in an environment that supports it, or use haven::write_sas() which allows the export of both data and formats via catalog.

If your downstream system supports it, use:

haven::write_sas(dm_prepped, path = "dm.sas7bdat")

This format does preserve value labels and is more flexible than .xpt, though not FDA-submission compliant.

Explanation (R):

  • Demonstrates how to automate export of multiple SDTM domains using a control table and functional programming.
  • Each domain is exported by retrieving the corresponding data frame from the environment, applying metadata (labels, formats, lengths), and writing to XPT using xportr_write().
  • The process uses purrr::pmap_chr() to iterate over each row of the control table, ensuring consistent metadata application and export for each dataset.
  • This approach is scalable for large studies, maintains reproducibility, and centralizes export logic for easier maintenance.

9. Best Practices for Metadata and Exports

Area Best Practice
Metadata Storage Centralize in shared tables/databases
Validation Check round-trip data/metadata preservation
Variable Names Follow 8-char rule for V5 XPT (regulatory)
Labels Keep under 40 chars for V5 XPT compatibility
Formats Store centrally and apply consistently
Date Handling Test date/datetime round-trips
Character Encoding Use UTF-8 when possible; test special characters
Export Path Use file.path() for cross-platform compatibility
Documentation Document metadata sources and mapping rules
Automation Use control tables for multi-dataset exports
Version Control Track metadata tables in version control

**Resource download links**

1.4.2.-Applying-Metadata-and-Outputting-SAS-XPT-Files-with-haven.zip