1.4.2. Applying Metadata and Outputting SAS/XPT Files with haven

1.4. Migrating from SAS to R: A Skill Conversion Guide

1.4.2. Applying Metadata and Outputting SAS/XPT Files with haven

This guide compares SAS and R approaches to handling metadata and exporting datasets, covering everything from basic labels to advanced metadata preservation across systems.

1. Metadata Fundamentals: SAS vs R

Feature	SAS Native	R with haven/xportr
Dataset Labels	Automatic support	Via attributes
Variable Labels	Automatic support	Stored as attributes
Value Labels	PROC FORMAT and labeled values	labelled class and attributes
Formats	Built-in formatting system	Custom attributes
Length Control	Explicit variable lengths	SASlength attribute
Persistence	Stored in dataset	Needs explicit preservation

2. Basic Metadata Application

This section demonstrates how to apply and manage metadata in both SAS and R, focusing on dataset labels, variable labels, and value labels. In R, the recommended approach for production and regulatory workflows is to use metadata tables with functions like xportr_label() and xportr_df_label() for consistent, auditable, and scalable metadata management. Manual assignment is possible for quick tasks, but metadata-driven methods ensure reproducibility and compliance across multiple datasets and domains.

1. Dataset Labels

Capability	SAS	R
Add dataset label	`data mydata(label="Subject Data");`	`attr(df, "label") <- "Subject Data"`
View dataset label	`proc contents data=mydata;`	`attr(df, "label")`
Preserve on export	Automatic	Via `haven::write_sas()` or `xportr::xportr_write()`

SAS Example

/* Create a dataset with a label */
data dm(label="Demographics Domain");
  input USUBJID $ SUBJID $ SEX $ AGE RFSTDTC $10. RFENDTC $10.;
  format RFSTDTC RFENDTC yymmdd10.;
  label
    USUBJID = "Unique Subject Identifier"
    SUBJID = "Subject Identifier for the Study"
    SEX = "Sex"
    AGE = "Age"
    RFSTDTC = "Subject Reference Start Date/Time"
    RFENDTC = "Subject Reference End Date/Time";
  datalines;
STUDY1-001 001 M 45 2022-03-15 2022-09-30
STUDY1-002 002 F 62 2022-03-20 2022-10-05
STUDY1-003 003 M 57 2022-03-22 2022-09-28
;
run;

/* View dataset label */
proc contents data=dm;
run;

Explanation (SAS):

Adds a dataset label directly in the data statement using the label= option
The label is stored as part of the dataset's metadata
proc contents displays the dataset label and other metadata

R Example

# Load required libraries
library(tibble)
library(dplyr)
library(xportr)

# Create Demographics dataset
dm <- tibble(
  USUBJID = c("STUDY1-001", "STUDY1-002", "STUDY1-003"),
  SUBJID = c("001", "002", "003"),
  SEX = c("M", "F", "M"),
  AGE = c(45, 62, 57),
  RFSTDTC = c("2022-03-15", "2022-03-20", "2022-03-22"),
  RFENDTC = c("2022-09-30", "2022-10-05", "2022-09-28")
)

# Create metadata table for dataset labels
df_labels <- tribble(
  ~dataset, ~label,
  "dm", "Demographics Domain",
  "vs", "Vital Signs Domain",
  "ae", "Adverse Events Domain"
)

# Apply the dataset label using xportr_df_label
dm <- dm %>%
  xportr_df_label(df_labels, domain = "dm") 

# Check result
attr(dm, "label")

Output:

[1] "Demographics Domain"

Explanation (R):

Two methods shown: direct attribute assignment or using xportr_df_label()
The xportr_df_label() function takes a metadata table with dataset names and labels
Labels are stored as attributes which can be viewed with attr() function
Using a metadata table allows for centralized label management across multiple SDTM domains

2. Variable (Column) Labels

Capability	SAS	R
Add variable label	`label age="Patient Age";`	`var_label(df$age) <- "Patient Age"`
View variable labels	`proc contents data=mydata;`	`map(df, attr, "label")`
Apply multiple labels	`label` statement with multiple vars	`xportr_label()` with metadata table

SAS Example

/* Create dataset with variable labels */
data lb;
  input USUBJID $ LBTEST $ LBORRES LBDTC $10.;
  format LBDTC yymmdd10.;
  label 
    USUBJID = "Unique Subject Identifier"
    LBTEST = "Laboratory Test"
    LBORRES = "Result or Finding in Original Units"
    LBDTC = "Date/Time of Specimen Collection";
  datalines;
STUDY1-001 GLUCOSE 105.2 2022-03-16
STUDY1-002 GLUCOSE 98.6 2022-03-21
STUDY1-003 GLUCOSE 112.4 2022-03-23
;
run;

/* View variable labels */
proc contents data=lb;
run;

Explanation (SAS):

Uses the label statement to assign descriptive labels to variables in the LB (Laboratory) domain
Each variable can have one label up to 256 characters
Labels appear in procedures output, making results more readable
proc contents displays all variable labels

R Example


# Load necessary libraries
library(haven)
library(dplyr)
library(purrr)
library(xportr)
library(tibble)

# Step 1: Create laboratory data
lb <- tibble(
  USUBJID = c("STUDY1-001", "STUDY1-002", "STUDY1-003"),
  LBTEST  = c("GLUCOSE", "GLUCOSE", "GLUCOSE"),
  LBORRES = c(105.2, 98.6, 112.4),
  LBDTC   = c("2022-03-16", "2022-03-21", "2022-03-23")
)

# Step 2: Optional manual labeling (if not using metadata)
lb$LBORRES <- labelled(lb$LBORRES, label = "Result or Finding in Original Units")
lb$LBTEST  <- labelled(lb$LBTEST, label = "Laboratory Test")

# Step 3: Create metadata for xportr labeling
v_labels <- tribble(
  ~dataset, ~variable, ~label,
  "lb", "USUBJID", "Unique Subject Identifier",
  "lb", "LBTEST",  "Laboratory Test",
  "lb", "LBORRES", "Result or Finding in Original Units",
  "lb", "LBDTC",   "Date/Time of Specimen Collection"
)

# Step 4: Apply labels using xportr
lb <- lb %>%
  xportr_label(v_labels)

# Step 5: Check variable labels
map_chr(lb, ~ attr(.x, "label"))

Explanation (R):

You can assign variable labels directly using labelled() or var_label(), but for production and SDTM, use xportr_label() with a metadata table.
The xportr_label() function reads a metadata table (with dataset, variable, and label columns) and applies the correct label to each variable as an attribute.
Labels are stored as attributes and can be checked with map_chr(df, ~ attr(.x, "label")).
This metadata-driven approach ensures consistency and makes it easy to update or audit labels across multiple datasets.
The example shows both manual and metadata-driven labeling, but for regulatory workflows, prefer the metadata table method.

Input Table:

USUBJID	LBTEST	LBORRES	LBDTC
STUDY1-001	GLUCOSE	105.2	2022-03-16
STUDY1-002	GLUCOSE	98.6	2022-03-21
STUDY1-003	GLUCOSE	112.4	2022-03-23

Expected Output After Labelling:

> # Step 3: View value labels
> attr(ae$AESEV, "labels")
    MILD MODERATE   SEVERE 
       1        2        3 
> attr(ae$AESER, "labels")
    Yes      No Unknown 
    "Y"     "N"     "U" 
> 
> # Step 4: Convert labelled vectors to factors for display/analysis
> ae$AESEV_F <- as_factor(ae$AESEV)
> 
> # Step 5: Compare numeric and factor versions
> head(data.frame(
+   numeric = ae$AESEV,
+   factor  = ae$AESEV_F
+ ))
  numeric   factor
1       1     MILD
2       2 MODERATE
3       1     MILD

3. Value Labels (Formats)

Capability	SAS	R
Create value labels	`proc format; value sexfmt 1='Male' 2='Female';`	`df$sex <- labelled(df$sex, c("Male"=1, "Female"=2))`
Apply formats	`format sex sexfmt.;`	Already applied with `labelled()`
View formats	`proc format; run;`	`val_labels(df$sex)`
Convert to factors	Use formatted value	`as_factor(df$sex)`

SAS Example

/* Create formats for SDTM */
proc format;
  value sexfmt 'M'='Male' 'F'='Female' 'U'='Unknown';
  value ynfmt 'Y'='Yes' 'N'='No' 'U'='Unknown';
  value sevfmt 1='MILD' 2='MODERATE' 3='SEVERE';
run;

/* Apply formats to AE dataset */
data ae;
  input USUBJID $ AETERM $ AESEV AESER $;
  format AESEV sevfmt. AESER ynfmt.;
  datalines;
STUDY1-001 Headache 1 N
STUDY1-002 Nausea 2 N
STUDY1-003 Rash 1 N
;
run;

/* View data with formats */
proc print data=ae;
run;

Explanation (SAS):

Creates a format using proc format that maps numeric codes to severity levels
Applies the format to the AESEV variable using the format statement
The format affects how values are displayed but doesn't change underlying data
SAS procedures use formatted values in tables and graphs

R Example

library(haven)
library(tibble)

# Step 1: Create Adverse Events data with coded severity
ae <- tibble(
  USUBJID = c("STUDY1-001", "STUDY1-002", "STUDY1-003"),
  AETERM  = c("Headache", "Nausea", "Rash"),
  AESEV   = c(1, 2, 1),
  AESER   = c("N", "N", "N")
)

# Step 2: Apply value labels using haven::labelled()
ae$AESEV <- labelled(
  ae$AESEV,
  labels = c("MILD" = 1, "MODERATE" = 2, "SEVERE" = 3)
)

ae$AESER <- labelled(
  ae$AESER,
  labels = c("Yes" = "Y", "No" = "N", "Unknown" = "U")
)

# Step 3: View value labels
attr(ae$AESEV, "labels")
attr(ae$AESER, "labels")

# Step 4: Convert labelled vectors to factors for display/analysis
ae$AESEV_F <- as_factor(ae$AESEV)

# Step 5: Compare numeric and factor versions
head(data.frame(
  numeric = ae$AESEV,
  factor  = ae$AESEV_F
))

Explanation (R):

Creates value labels for the adverse event severity codes
Uses labelled() function to attach value labels to variables
Value labels map numeric/character codes to descriptive text
The original values remain unchanged but gain metadata
as_factor() converts labelled variables to factors for analysis/plotting
Follows CDISC SDTM standards for adverse event data

Expected Output After Value Labelling:

> # Step 3: View value labels
> attr(ae$AESEV, "labels")
    MILD MODERATE   SEVERE 
       1        2        3 
> attr(ae$AESER, "labels")
    Yes      No Unknown 
    "Y"     "N"     "U" 
> 
> # Step 4: Convert labelled vectors to factors for display/analysis
> ae$AESEV_F <- as_factor(ae$AESEV)
> 
> # Step 5: Compare numeric and factor versions
> head(data.frame(
+   numeric = ae$AESEV,
+   factor  = ae$AESEV_F
+ ))
  numeric   factor
1       1     MILD
2       2 MODERATE
3       1     MILD

3. Variable Properties: Length and Types

Capability	SAS	R
Set variable length	`length USUBJID $200;`	`xportr_length()` with metadata
Set variable type	`attrib x length=8 format=8.;`	R types + SAS type attributes
Column order	Order in `data` step or `retain`	`select()` or `xportr_order()`

SAS Example

/* Set variable lengths for DM domain */
data dm_v2;
  length USUBJID $200 SUBJID $20;
  set dm;
run;

/* Alternative with attrib */
data dm_v2;
  attrib 
    USUBJID length=$200 label="Unique Subject Identifier"
    SUBJID length=$20 label="Subject Identifier for the Study"
    SEX length=$1 label="Sex"
    AGE length=8 format=3. label="Age"
    RFSTDTC length=$10 format=yymmdd10. label="Subject Reference Start Date/Time"
    RFENDTC length=$10 format=yymmdd10. label="Subject Reference End Date/Time";
  set dm;
run;

Explanation (SAS):

Uses length statement to set character (with $) and numeric lengths following SDTM guidelines
attrib statement combines length, format, and label in one statement
Setting correct lengths is crucial for data integrity and CDISC compliance
Character lengths prevent truncation; numeric lengths affect precision

R Example

# ONLY run these lines after a fresh R restart
library(haven)
library(dplyr)
library(tibble)
library(xportr)
library(purrr) # For map_int

# Check package versions (paste the output of these)
print(sessionInfo())
print(packageVersion("xportr"))
print(packageVersion("dplyr")) # xportr often relies on dplyr
# Create mock dm dataset
dm <- tibble(
  USUBJID  = c("STUDY1-001", "STUDY1-002"),
  SUBJID   = c("001", "002"),
  SEX      = c("M", "F"),
  AGE      = c(34, 29),
  RFSTDTC  = c("2022-03-01", "2022-03-02"),
  RFENDTC  = c("2022-03-20", "2022-03-22")
)

# Create metadata
metadata <- tribble(
  ~dataset, ~variable, ~label, ~length, ~type,
  "dm", "USUBJID", "Unique Subject Identifier", 200, "text",
  "dm", "SUBJID", "Subject Identifier for the Study", 20, "text",
  "dm", "SEX", "Sex", 1, "text",
  "dm", "AGE", "Age", 8, "numeric",
  "dm", "RFSTDTC", "Subject Reference Start Date/Time", 10, "date",
  "dm", "RFENDTC", "Subject Reference End Date/Time", 10, "date"
)

# Apply metadata using xportr_metadata()
dm_with_metadata <- dm %>%
  xportr_metadata(metadata = metadata, domain = "dm")

# Access the metadata directly
# More robust way to access the attribute
metadata <- attr(dm_with_metadata, "_xportr.df_metadata_")

# Extract the SAS lengths
sas_lengths <- metadata %>%
  select(variable, length) %>%
  deframe()

print(sas_lengths)

Explanation (R):

Uses xportr_length() to set SAS-compatible lengths via attributes
Lengths are stored as "SASlength" attributes on each column
Types can be set with xportr_type() to ensure proper export
A single metadata table can store all properties for centralized management
Length attributes are used when writing to SAS format for CDISC compliant submission

Expected Output After Setting Lengths:

> print(sas_lengths)
USUBJID  SUBJID     SEX     AGE RFSTDTC RFENDTC 
    200      20       1       8      10      10

4. Writing Data with Metadata

Capability	SAS	R
Export to XPT	`proc cport data=ds file="file.xpt";`	`write_xpt(df, "file.xpt")`
Export to SAS7BDAT	`libname out "path"; data out.ds; set ds; run;`	`write_sas(df, "file.sas7bdat")`
Metadata preservation	Automatic	Attributes preserved by haven/xportr
Compress data	`compress=yes` option	`compress=TRUE` option
Dataset label on export	Part of dataset	`label=` parameter

SAS Example

/* Export SDTM DM domain to XPT file (transport format) */
proc cport data=dm 
  file="c:\sdtm\dm.xpt";
run;

/* Export to SAS dataset */
libname sdtm "c:\sdtm";
data sdtm.dm;
  set dm;
run;

Explanation (SAS):

proc cport creates a transport file (.xpt) for cross-platform use
Native SAS dataset created by writing to a libname
All metadata (labels, formats, lengths) is automatically preserved
Transport files are commonly used for regulatory submissions

R Example

library(haven)
library(xportr)

# Method 1: Using haven
write_xpt(dm, "c:/sdtm/dm_haven.xpt", version = 5)
write_sas(dm, "c:/sdtm/dm_haven.sas7bdat")

# Method 2: Using xportr (preserves metadata better)
dm %>%
  xportr_write("c:/sdtm/dm_xportr.xpt")

# Method 3: With dataset label specified at write time
dm %>%
  xportr_write("c:/sdtm/dm_labelled.xpt", 
               label = "Demographics Domain")

Explanation (R):

write_xpt() from haven creates SAS transport files
write_sas() creates SAS7BDAT native format files
xportr_write() offers enhanced metadata preservation
The version = 5 parameter ensures XPT V5 compatibility (FDA standard)
Dataset labels can be applied or overridden at export time
Attributes from R (labels, formats) are preserved in export
Critical for regulatory submissions like FDA and PMDA

Input Dataset:

USUBJID	SUBJID	SEX	AGE	RFSTDTC	RFENDTC
STUDY1-001	001	M	45	2022-03-15	2022-09-30
STUDY1-002	002	F	62	2022-03-20	2022-10-05
STUDY1-003	003	M	57	2022-03-22	2022-09-28

Expected SAS Output (when reading exported XPT):

PROC CONTENTS of dm dataset

Dataset Name: DM
Label: Demographics Domain

Variables:
#  Variable   Type  Length  Format   Label
1  USUBJID    Char  200              Unique Subject Identifier
2  SUBJID     Char  20               Subject Identifier for the Study
3  SEX        Char  1                Sex
4  AGE        Num   8                Age
5  RFSTDTC    Char  10     YYMMDD10. Subject Reference Start Date/Time
6  RFENDTC    Char  10     YYMMDD10. Subject Reference End Date/Time

5. Beyond Basics: Controlling Output XPT Standards

Feature	SAS	R with haven/xportr
XPT Version	V5 default in `proc cport`	`version = "5"` or `"8"`
Variable name length	8 chars (V5), 32 chars (V8)	Truncated based on version
Label truncation	40 chars (V5), 200 chars (V8)	Truncated based on version
Column ordering	As in dataset or `retain`	`xportr_order()` or `select()`
SAS engine compliance	Automatic	Validation with `xportr_validate()`

SAS Example

/* Export with variable order control for SDTM compliance */
data ex_ordered;
  retain STUDYID USUBJID EXSEQ EXTRT EXDOSE EXDOSU EXSTDTC EXENDTC;
  set ex;
run;

proc cport data=ex_ordered 
  file="c:\sdtm\ex_ordered.xpt";
run;

Explanation (SAS):

retain statement controls the order of variables in the EX (Exposure) domain
Variables are exported in the SDTM-specified order
Transport files have specific limitations on name and label lengths
SAS handles compliance with XPT standards automatically
Critical for CDISC compliance in regulatory submissions

R Example

# Load necessary libraries
library(xportr)
library(tibble)
library(dplyr)

# Create exposure domain data
ex <- tibble(
  USUBJID = c("STUDY1-001", "STUDY1-002", "STUDY1-003"),
  STUDYID = rep("STUDY1", 3),
  EXSEQ = c(1, 1, 1),
  EXTRT = c("Compound X", "Compound X", "Compound X"),
  EXSTDTC = c("2022-03-15", "2022-03-20", "2022-03-22"),
  EXENDTC = c("2022-09-30", "2022-10-05", "2022-09-28"),
  EXDOSE = c(100, 100, 100),
  EXDOSU = c("mg", "mg", "mg")
)

# Define SDTM standard column order
ex_order <- c("STUDYID", "USUBJID", "EXSEQ", "EXTRT", "EXDOSE", "EXDOSU", "EXSTDTC", "EXENDTC")

# 1. Prepare the data frame by reordering columns
ex_export <- ex %>%
  select(all_of(ex_order))

# 2. Validate the prepared data frame.
# This function prints messages to the console but does not return the data frame.
xpt_validate(ex_export)

# 3. Write the original data frame (ex_export) to the XPT file.
xportr_write(ex_export, path = "C:/Users/gnana/OneDrive/Documents/R Programming/ex_v5.xpt")


#Round trip validation output to verify the exported xpt file

# Specify the full path to your exported file
file_path <- "C:/Users/gnana/OneDrive/Documents/R Programming/ex_v5.xpt"

# --- Validation Step ---
# Attempt to read the .xpt file.
# A successful read indicates a valid file structure.
tryCatch({
  re_imported_data <- read_xpt(file_path)
  
  # If successful, print a confirmation and view the data
  print("Validation successful: The XPT file was read without errors.")
  print(head(re_imported_data))
  
}, error = function(e) {
  
  # If an error occurs, the file is not valid
  print(paste("Validation failed. Error reading XPT file:", e$message))
  
})

Output:

[1] "Validation successful: The XPT file was read without errors."
# A tibble: 3 × 8
  STUDYID USUBJID    EXSEQ EXTRT      EXDOSE EXDOSU EXSTDTC   EXENDTC
  <chr>   <chr>      <dbl> <chr>       <dbl> <chr>  <chr>     <chr>  
1 STUDY1  STUDY1-001     1 Compound X    100 mg     2022-03-… 2022-0…
2 STUDY1  STUDY1-002     1 Compound X    100 mg     2022-03-… 2022-1…
3 STUDY1  STUDY1-003     1 Compound X    100 mg     2022-03-… 2022-0…

Explanation (R):

The SDTM column order is set using select(all_of(ex_order)) for explicit control.
xpt_validate() checks the data frame for XPT compliance, printing issues to the console.
xportr_write() exports the data to an XPT file, preserving metadata and structure.
After export, a round-trip validation is performed by reading the XPT file back with read_xpt(). If successful, the file structure and data integrity are confirmed.
This workflow ensures that exported files meet regulatory requirements (e.g., FDA, PMDA) for variable names, labels, and ordering, and that the output is verifiable and reproducible.
Use version 5 for 8-character variable names and 40-character labels; version 8 allows longer names/labels if required.

6. Beyond Basics: Date Handling and Special Types

Capability	SAS	R
Date storage	Days since 1960-01-01	Date class or numeric with attribute
Datetime storage	Seconds since 1960-01-01	POSIXct with attribute
Format specification	`format date date9.;`	Custom attributes
Special formats	Various SAS formats	Custom implementation required

SAS Example

/* Create SDTM VS domain with various date formats */
data vs;
  input USUBJID $ VISITNUM VISIT $ VSTESTCD $ VSDTC $10. VSTIME $8. VSSTRESN;
  VSDTM = dhms(input(VSDTC, yymmdd10.), 0, 0, input(VSTIME, time8.));
  format VSDTC yymmdd10. VSTIME time8. VSDTM datetime19.;
  datalines;
STUDY1-001 1 BASELINE SYSBP 2022-03-15 08:30:00 120.5
STUDY1-002 1 BASELINE SYSBP 2022-03-20 09:15:45 118.0
STUDY1-003 1 BASELINE SYSBP 2022-03-22 10:45:30 135.5
;
run;

proc print data=vs;
run;

Explanation (SAS):

SAS stores dates as days since January 1, 1960
Datetime values are stored as seconds since January 1, 1960
Time values are stored as seconds since midnight
Combines date and time into datetime using dhms() function
Formats control how dates, times, and vital sign results are displayed
Various specialized formats available for different needs

R Example

# Load required libraries
library(haven)
library(tibble)
library(hms)      # For handling time-only data

# Create the vital signs dataframe
vs <- tibble(
  USUBJID   = c("STUDY1-001", "STUDY1-002", "STUDY1-003"),
  VISITNUM  = c(1, 1, 1),
  VISIT     = c("BASELINE", "BASELINE", "BASELINE"),
  VSTESTCD  = c("SYSBP", "SYSBP", "SYSBP"),
  VSDTC     = c("2022-03-15", "2022-03-20", "2022-03-22"),
  VSTIME    = c("08:30:00", "09:15:45", "10:45:30"),
  VSSTRESN  = c(120.5, 118.0, 135.5)
)

# Combine date and time to create a POSIXct datetime object
# This is done before the original columns are converted to ensure clean input
vs$VSDTM <- as.POSIXct(paste(vs$VSDTC, vs$VSTIME), tz = "UTC")

# Convert date and time columns to their proper R data classes
vs$VSDTC <- as.Date(vs$VSDTC)        # Convert to Date class
vs$VSTIME <- as_hms(vs$VSTIME)      # Convert to hms class (which is numeric)

# Assign SAS formats for compatibility
attr(vs$VSDTC,  "format.sas") <- "YYMMDD10."
attr(vs$VSTIME, "format.sas") <- "TIME8."
attr(vs$VSDTM,  "format.sas") <- "DATETIME19."

# Write the dataset to an XPT file
# write_xpt correctly handles Date, hms, and POSIXct objects
write_xpt(vs, path = "C:/Users/gnana/OneDrive/Documents/R Programming/vs.xpt")

# Read the XPT file back in to verify the fix
vs_check <- read_xpt("C:/Users/gnana/OneDrive/Documents/R Programming/vs.xpt")

# Display the result to confirm it worked
print(vs_check)

# Check the formats of the columns in the read dataset
# 1. Check the format for the Date column (VSDTC)
attr(vs_check$VSDTC, "format.sas")

# 2. Check the format for the Time column (VSTIME)
attr(vs_check$VSTIME, "format.sas")

# 3. Check the format for the Datetime column (VSDTM)
attr(vs_check$VSDTM, "format.sas")

# 4. Check a column with no format
attr(vs_check$USUBJID, "format.sas")

Explanation (R):

R stores dates as days since January 1, 1970 (Date class)
POSIXct stores datetimes as seconds since January 1, 1970
Creates combined datetime from separate date and time columns
haven converts between R and SAS date epochs automatically
SAS formats can be stored as attributes for export
Format attributes control how dates and times appear in SAS after export

Input Date Data:

USUBJID	VISITNUM	VISIT	VSTESTCD	VSDTC	VSTIME	VSSTRESN
STUDY1-001	1	BASELINE	SYSBP	2022-03-15	08:30:00	120.5
STUDY1-002	1	BASELINE	SYSBP	2022-03-20	09:15:45	118.0
STUDY1-003	1	BASELINE	SYSBP	2022-03-22	10:45:30	135.5

Output in R:

> print(vs_check)
# A tibble: 3 × 8
  USUBJID    VISITNUM VISIT    VSTESTCD VSDTC      VSTIME   VSSTRESN VSDTM              
  <chr>      <chr>    <dbl> <chr>       <date>     <time>      <dbl> <dttm>             
1 STUDY1-001        1 BASELINE SYSBP    2022-03-15 08:30:00     120. 2022-03-15 08:30:00
2 STUDY1-002        1 BASELINE SYSBP    2022-03-20 09:15:45     118  2022-03-20 09:15:45
3 STUDY1-003        1 BASELINE SYSBP    2022-03-22 10:45:30     136. 2022-03-22 10:45:30
> 
> # Check the formats of the columns in the read dataset
> # 1. Check the format for the Date column (VSDTC)
> attr(vs_check$VSDTC, "format.sas")
[1] "YYMMDD10"
> 
> # 2. Check the format for the Time column (VSTIME)
> attr(vs_check$VSTIME, "format.sas")
[1] "TIME8"
> 
> # 3. Check the format for the Datetime column (VSDTM)
> attr(vs_check$VSDTM, "format.sas")
[1] "DATETIME19"
> 
> # 4. Check a column with no format
> attr(vs_check$USUBJID, "format.sas")
NULL

7. Beyond Basics: Creating SAS Catalogs and Format Libraries

SAS format catalogs (.sas7bcat) store format definitions that can be used across multiple datasets. In R, we can simulate this through centralized metadata tables.

Capability	SAS	R
Format library	`proc format library=lib;`	Central metadata table
Catalog export	`proc cport` formats	Not directly supported
Format association	`libname fmtlib (formats);`	`xportr_format()` with metadata
Format inheritance	Through library reference	Through metadata application

SAS Example

/* Create a format catalog for SDTM domains */
libname fmtlib "c:\sdtm\formats";

proc format library=fmtlib;
  value sexfmt 'M'='Male' 'F'='Female' 'U'='Unknown';
  value ynfmt 'Y'='Yes' 'N'='No' 'U'='Unknown';
  value agegrpfmt 1='<30 years' 2='30-60 years' 3='>=60 years';
  value racefmt 1='WHITE' 2='BLACK OR AFRICAN AMERICAN' 3='ASIAN' 4='OTHER';
run;

/* Use the format library in DM domain */
libname sdtm "c:\sdtm\data";

data sdtm.dm;
  set dm;
  /* Create derived age group variable */
  if AGE < 30 then AGEGRP = 1;
  else if AGE >= 30 and AGE < 60 then AGEGRP = 2;
  else if AGE >= 60 then AGEGRP = 3;
  format SEX sexfmt. AGEGRP agegrpfmt.;
run;

Explanation (SAS):

Creates a format catalog in the fmtlib library with SDTM-relevant formats
Format definitions are stored separately from datasets
Includes formats for sex, yes/no flags, age groups, and visit numbers
Formats can be shared across multiple SDTM domains
Libraries can be referenced to use stored formats
Standard approach in clinical research for consistency

R Equivalent Approach

# Load necessary libraries
library(tibble)
library(dplyr)
library(haven)

# 1. Dummy DM (Demographics) Dataset
# ----------------------------------------------------------------
dm <- tibble(
  STUDYID  = "XYZ123",
  DOMAIN   = "DM",
  USUBJID  = paste("XYZ123-0", 10:19, sep = ""),
  SUBJID   = as.character(10:19),
  RFSTDTC  = "2024-01-15",
  RFENDTC  = "2024-06-20",
  SITEID   = sample(c("S101", "S102", "S103"), 10, replace = TRUE),
  AGE      = sample(25:75, 10, replace = TRUE),
  AGEU     = "YEARS",
  SEX      = sample(c("M", "F", "U"), 10, replace = TRUE, prob = c(0.45, 0.45, 0.1)),
  RACE     = sample(c("WHITE", "BLACK OR AFRICAN AMERICAN", "ASIAN", "OTHER"), 10, replace = TRUE),
  ETHNIC   = sample(c("HISPANIC OR LATINO", "NOT HISPANIC OR LATINO"), 10, replace = TRUE),
  ARMCD    = sample(c("PBO", "TRT1"), 10, replace = TRUE),
  ARM      = ifelse(ARMCD == "PBO", "Placebo", "Treatment 100mg"),
  ACTARMCD = ARMCD,
  ACTARM   = ARM,
  COUNTRY  = "USA"
)

# 2. Central Format Library
#    All 'value's are characters, so the tibble can be created.
# ----------------------------------------------------------------
format_library <- tribble(
  ~format_name, ~value, ~label,
  "sexfmt", "M", "Male",
  "sexfmt", "F", "Female", 
  "sexfmt", "U", "Unknown",
  "racefmt", 1, "WHITE",
  "racefmt", 2, "BLACK OR AFRICAN AMERICAN",
  "racefmt", 3, "ASIAN",
  "racefmt", 4, "OTHER",
  "ynfmt", "Y", "Yes",
  "ynfmt", "N", "No",
  "ynfmt", "U", "Unknown",
  "agegrpfmt", 1, "<30 years",
  "agegrpfmt", 2, "30-60 years",
  "agegrpfmt", 3, ">=60 years"
)

# 3. Format Metadata
#    This table links a variable in a dataset to a format name.
#    Note: We will format a new numeric variable 'RACEN' instead of 'RACE'.
# ----------------------------------------------------------------
format_meta <- tribble(
  ~dataset, ~variable, ~format,
  "dm", "SEX",    "sexfmt",
  "dm", "AGEGRP", "agegrpfmt",
  "dm", "RACEN",  "racefmt"  # We will create RACEN to hold the numeric codes
)

# 4. Create Derived Variables for Formatting
#    We need numeric versions of AGEGRP and RACE to match our formats.
# ----------------------------------------------------------------
dm_to_format <- dm %>%
  mutate(
    # Create numeric AGEGRP from AGE
    AGEGRP = case_when(
      AGE < 30             ~ 1,
      AGE >= 30 & AGE < 60 ~ 2,
      AGE >= 60            ~ 3
    ),
    # Create numeric RACEN from the character RACE variable
    RACEN = case_when(
      RACE == "WHITE"                     ~ 1,
      RACE == "BLACK OR AFRICAN AMERICAN" ~ 2,
      RACE == "ASIAN"                     ~ 3,
      RACE == "OTHER"                     ~ 4
    )
  )

# 5. The "Smart" Formatting Function
#    This function correctly handles type conversion before applying labels.
# ----------------------------------------------------------------
apply_sdtm_formats <- function(df, dataset_name, format_meta, format_library) {
  dataset_formats <- format_meta %>%
    filter(dataset == dataset_name)
  
  for (i in 1:nrow(dataset_formats)) {
    var <- dataset_formats$variable[i]
    fmt <- dataset_formats$format[i]
    
    # Special case for date formats
    if (fmt %in% c("YYMMDD10.", "DATETIME19.")) {
      if (var %in% names(df)) {
        attr(df[[var]], "format.sas") <- fmt
      }
      next
    }
    
    # Extract format definition
    format_def <- format_library %>%
      filter(format_name == fmt)
    
    if (nrow(format_def) == 0) next
    
    # Create named vector for labelled
    values <- format_def$value
    names(values) <- format_def$label
    format_vector <- values
    
    # Apply to variable if it exists
    if (var %in% names(df)) {
      df[[var]] <- haven::labelled(df[[var]], format_vector)
      attr(df[[var]], "format.sas") <- paste0(fmt, ".")
    }
  }
  
  return(df)
}

# 6. Apply the Formats
# ----------------------------------------------------------------
dm_formatted <- dm_to_format %>%
  xportr_df_label(ds_metadata, domain = "dm") %>%
  xportr_label(var_metadata, domain = "dm") %>%
  xportr_length(var_metadata, domain = "dm") %>%
  apply_sdtm_formats(var_meta = var_metadata %>% filter(dataset == "dm"), format_lib = format_library)

print(purrr::map(dm_formatted, attr, "labels"))

# 7. Write the file (Unchanged)
xpt_path <- "C:/Users/gnana/OneDrive/Documents/R Programming/dm.xpt"
haven::write_xpt(dm_formatted, path = xpt_path, version = 5)
cat(paste("Success. The compliant XPT file has been written to:", xpt_path, "\n"))

# 8. Verification Script (Will now show the correct output)
cat(paste("\nReading file from:", xpt_path, "\n\n"))
dm_read_back <- haven::read_xpt(xpt_path)

# Check 1: Dataset Label
cat("\n--- 1. Dataset Label Verification ---\n")
print(attr(dm_read_back, "label"))

# Check 2: Variable Labels
cat("\n--- 2. Variable Label Verification ---\n")
print(purrr::map(dm_read_back, attr, "label"))

# Check 3: Value Labels (Formats)
cat("\n--- 3. Value Label (Format) Verification ---\n")
print(purrr::map(dm_read_back, attr, "labels"))

# Check 4: Formats
sapply(vs_check, function(x) attr(x, "format.sas"))

# Use haven::as_factor() to apply the labels from the file
dm_as_factor <- haven::as_factor(dm_read_back)

# Print the resulting data frame
print(dm_as_factor)
str(dm_as_factor)

Output:

--- 1. Dataset Label Verification ---
> print(attr(dm_read_back, "label"))
[1] "Demographics Domain"
> 
> # Check 2: Variable Labels
> cat("\n--- 2. Variable Label Verification ---\n")

--- 2. Variable Label Verification ---
> print(purrr::map(dm_read_back, attr, "label"))
$STUDYID
[1] "Study Identifier"

$USUBJID
[1] "Unique Subject Identifier"

$SUBJID
[1] "Subject Identifier for the Study"

$SITEID
[1] "Study Site Identifier"

$SEX
NULL

$RACE
NULL

$AGE
[1] "Age"

$RFSTDTC
[1] "Subject Reference Start Date"

$RFENDTC
[1] "Subject Reference End Date"

$AGEGRP
NULL

> 
> # Check 3: Value Labels (Formats)
> cat("\n--- 3. Value Label (Format) Verification ---\n")

--- 3. Value Label (Format) Verification ---
> print(purrr::map(dm_read_back, attr, "labels"))
$STUDYID
NULL

$USUBJID
NULL

$SUBJID
NULL

$SITEID
NULL

$SEX
NULL

$RACE
NULL

$AGE
NULL

$RFSTDTC
NULL

$RFENDTC
NULL

$AGEGRP
NULL

> 
> # Check 4: Formats
> sapply(vs_check, function(x) attr(x, "format.sas"))
$USUBJID
NULL

$VISITNUM
NULL

$VISIT
NULL

$VSTESTCD
NULL

$VSDTC
[1] "YYMMDD10"

$VSTIME
[1] "TIME8"

$VSSTRESN
NULL

$VSDTM
[1] "DATETIME19"

> 
> # Use haven::as_factor() to apply the labels from the file
> dm_as_factor <- haven::as_factor(dm_read_back)
> 
> # Print the resulting data frame
> print(dm_as_factor)
# A tibble: 3 × 10
  STUDYID USUBJID    SUBJID SITEID SEX    RACE   AGE RFSTDTC    RFENDTC    AGEGRP
  <chr>   <chr>      <chr>  <chr>  <chr> <dbl> <dbl> <date>     <date>      <dbl>
1 STUDY1  STUDY1-001 001    01     M         1    45 2022-03-15 2022-09-30      2
2 STUDY1  STUDY1-002 002    01     F         2    62 2022-03-15 2022-09-30      3
3 STUDY1  STUDY1-003 003    02     M         3    57 2022-03-15 2022-09-30      2
> str(dm_as_factor)
tibble [3 × 10] (S3: tbl_df/tbl/data.frame)
 $ STUDYID: chr [1:3] "STUDY1" "STUDY1" "STUDY1"
  ..- attr(*, "label")= chr "Study Identifier"
 $ USUBJID: chr [1:3] "STUDY1-001" "STUDY1-002" "STUDY1-003"
  ..- attr(*, "label")= chr "Unique Subject Identifier"
 $ SUBJID : chr [1:3] "001" "002" "003"
  ..- attr(*, "label")= chr "Subject Identifier for the Study"
 $ SITEID : chr [1:3] "01" "01" "02"
  ..- attr(*, "label")= chr "Study Site Identifier"
 $ SEX    : chr [1:3] "M" "F" "M"
  ..- attr(*, "format.sas")= chr "sexfmt"
 $ RACE   : num [1:3] 1 2 3
  ..- attr(*, "format.sas")= chr "racefmt"
 $ AGE    : num [1:3] 45 62 57
  ..- attr(*, "label")= chr "Age"
 $ RFSTDTC: Date[1:3], format: "2022-03-15" "2022-03-15" "2022-03-15"
 $ RFENDTC: Date[1:3], format: "2022-09-30" "2022-09-30" "2022-09-30"
 $ AGEGRP : num [1:3] 2 3 2
  ..- attr(*, "format.sas")= chr "agegrpfm"
 - attr(*, "label")= chr "Demographics Domain"

Note: Check 3: Value Labels (Formats) - Value Labels Disappear

haven::write_xpt() does not preserve value labels (attr(x, "labels")) in the .xpt output. It complies with the SAS Transport v5 format spec but omits formats because:
- SAS Transport files only store formats as text references, not the format definitions.
- haven doesn't write the necessary format catalogs (.sas7bcat) which SAS uses to interpret those formats during reading.

So even though your variables have labels attributes attached in memory, they’re stripped when writing to the file—and therefore missing when read back.

If preserving formats is critical:

Consider storing the format definitions externally (e.g., using JSON, RDS) and re-applying them after reading the .xpt.
Alternatively, switch to sas7bdat format if working in an environment that supports it, or use haven::write_sas() which allows the export of both data and formats via catalog.

If your downstream system supports it, use:

haven::write_sas(dm_prepped, path = "dm.sas7bdat")

This format does preserve value labels and is more flexible than .xpt, though not FDA-submission compliant.

Explanation (R):

Demonstrates how to automate export of multiple SDTM domains using a control table and functional programming.
Each domain is exported by retrieving the corresponding data frame from the environment, applying metadata (labels, formats, lengths), and writing to XPT using xportr_write().
The process uses purrr::pmap_chr() to iterate over each row of the control table, ensuring consistent metadata application and export for each dataset.
This approach is scalable for large studies, maintains reproducibility, and centralizes export logic for easier maintenance.

9. Best Practices for Metadata and Exports

Area	Best Practice
Metadata Storage	Centralize in shared tables/databases
Validation	Check round-trip data/metadata preservation
Variable Names	Follow 8-char rule for V5 XPT (regulatory)
Labels	Keep under 40 chars for V5 XPT compatibility
Formats	Store centrally and apply consistently
Date Handling	Test date/datetime round-trips
Character Encoding	Use UTF-8 when possible; test special characters
Export Path	Use `file.path()` for cross-platform compatibility
Documentation	Document metadata sources and mapping rules
Automation	Use control tables for multi-dataset exports
Version Control	Track metadata tables in version control

Resource download links

1.4.2.-Applying-Metadata-and-Outputting-SAS-XPT-Files-with-haven.zip

⁂

1.4. Migrating from SAS to R: A Skill Conversion Guide