1.4. Migrating from SAS to R: A Skill Conversion Guide
1.4.2. Applying Metadata and Outputting SAS/XPT Files with haven
This guide compares SAS and R approaches to handling metadata and exporting datasets, covering everything from basic labels to advanced metadata preservation across systems.
1. Metadata Fundamentals: SAS vs R
| Feature | SAS Native | R with haven/xportr |
|---|---|---|
| Dataset Labels | Automatic support | Via attributes |
| Variable Labels | Automatic support | Stored as attributes |
| Value Labels | PROC FORMAT and labeled values | labelled class and attributes |
| Formats | Built-in formatting system | Custom attributes |
| Length Control | Explicit variable lengths | SASlength attribute |
| Persistence | Stored in dataset | Needs explicit preservation |
2. Basic Metadata Application
This section demonstrates how to apply and manage metadata in both SAS and R, focusing on dataset labels, variable labels, and value labels. In R, the recommended approach for production and regulatory workflows is to use metadata tables with functions like xportr_label() and xportr_df_label() for consistent, auditable, and scalable metadata management. Manual assignment is possible for quick tasks, but metadata-driven methods ensure reproducibility and compliance across multiple datasets and domains.
1. Dataset Labels
| Capability | SAS | R |
|---|---|---|
| Add dataset label | data mydata(label="Subject Data"); |
attr(df, "label") <- "Subject Data" |
| View dataset label | proc contents data=mydata; |
attr(df, "label") |
| Preserve on export | Automatic | Via haven::write_sas() or xportr::xportr_write() |
SAS Example
/* Create a dataset with a label */
data dm(label="Demographics Domain");
input USUBJID $ SUBJID $ SEX $ AGE RFSTDTC $10. RFENDTC $10.;
format RFSTDTC RFENDTC yymmdd10.;
label
USUBJID = "Unique Subject Identifier"
SUBJID = "Subject Identifier for the Study"
SEX = "Sex"
AGE = "Age"
RFSTDTC = "Subject Reference Start Date/Time"
RFENDTC = "Subject Reference End Date/Time";
datalines;
STUDY1-001 001 M 45 2022-03-15 2022-09-30
STUDY1-002 002 F 62 2022-03-20 2022-10-05
STUDY1-003 003 M 57 2022-03-22 2022-09-28
;
run;
/* View dataset label */
proc contents data=dm;
run;
Explanation (SAS):
- Adds a dataset label directly in the
datastatement using thelabel=option - The label is stored as part of the dataset's metadata
proc contentsdisplays the dataset label and other metadata
R Example
# Load required libraries
library(tibble)
library(dplyr)
library(xportr)
# Create Demographics dataset
dm <- tibble(
USUBJID = c("STUDY1-001", "STUDY1-002", "STUDY1-003"),
SUBJID = c("001", "002", "003"),
SEX = c("M", "F", "M"),
AGE = c(45, 62, 57),
RFSTDTC = c("2022-03-15", "2022-03-20", "2022-03-22"),
RFENDTC = c("2022-09-30", "2022-10-05", "2022-09-28")
)
# Create metadata table for dataset labels
df_labels <- tribble(
~dataset, ~label,
"dm", "Demographics Domain",
"vs", "Vital Signs Domain",
"ae", "Adverse Events Domain"
)
# Apply the dataset label using xportr_df_label
dm <- dm %>%
xportr_df_label(df_labels, domain = "dm")
# Check result
attr(dm, "label")
Output:
[1] "Demographics Domain"
Explanation (R):
- Two methods shown: direct attribute assignment or using
xportr_df_label() - The
xportr_df_label()function takes a metadata table with dataset names and labels - Labels are stored as attributes which can be viewed with
attr()function - Using a metadata table allows for centralized label management across multiple SDTM domains
2. Variable (Column) Labels
| Capability | SAS | R |
|---|---|---|
| Add variable label | label age="Patient Age"; |
var_label(df$age) <- "Patient Age" |
| View variable labels | proc contents data=mydata; |
map(df, attr, "label") |
| Apply multiple labels | label statement with multiple vars |
xportr_label() with metadata table |
SAS Example
/* Create dataset with variable labels */
data lb;
input USUBJID $ LBTEST $ LBORRES LBDTC $10.;
format LBDTC yymmdd10.;
label
USUBJID = "Unique Subject Identifier"
LBTEST = "Laboratory Test"
LBORRES = "Result or Finding in Original Units"
LBDTC = "Date/Time of Specimen Collection";
datalines;
STUDY1-001 GLUCOSE 105.2 2022-03-16
STUDY1-002 GLUCOSE 98.6 2022-03-21
STUDY1-003 GLUCOSE 112.4 2022-03-23
;
run;
/* View variable labels */
proc contents data=lb;
run;
Explanation (SAS):
- Uses the
labelstatement to assign descriptive labels to variables in the LB (Laboratory) domain - Each variable can have one label up to 256 characters
- Labels appear in procedures output, making results more readable
proc contentsdisplays all variable labels
R Example
# Load necessary libraries
library(haven)
library(dplyr)
library(purrr)
library(xportr)
library(tibble)
# Step 1: Create laboratory data
lb <- tibble(
USUBJID = c("STUDY1-001", "STUDY1-002", "STUDY1-003"),
LBTEST = c("GLUCOSE", "GLUCOSE", "GLUCOSE"),
LBORRES = c(105.2, 98.6, 112.4),
LBDTC = c("2022-03-16", "2022-03-21", "2022-03-23")
)
# Step 2: Optional manual labeling (if not using metadata)
lb$LBORRES <- labelled(lb$LBORRES, label = "Result or Finding in Original Units")
lb$LBTEST <- labelled(lb$LBTEST, label = "Laboratory Test")
# Step 3: Create metadata for xportr labeling
v_labels <- tribble(
~dataset, ~variable, ~label,
"lb", "USUBJID", "Unique Subject Identifier",
"lb", "LBTEST", "Laboratory Test",
"lb", "LBORRES", "Result or Finding in Original Units",
"lb", "LBDTC", "Date/Time of Specimen Collection"
)
# Step 4: Apply labels using xportr
lb <- lb %>%
xportr_label(v_labels)
# Step 5: Check variable labels
map_chr(lb, ~ attr(.x, "label"))
Explanation (R):
- You can assign variable labels directly using
labelled()orvar_label(), but for production and SDTM, usexportr_label()with a metadata table. - The
xportr_label()function reads a metadata table (with dataset, variable, and label columns) and applies the correct label to each variable as an attribute. - Labels are stored as attributes and can be checked with
map_chr(df, ~ attr(.x, "label")). - This metadata-driven approach ensures consistency and makes it easy to update or audit labels across multiple datasets.
- The example shows both manual and metadata-driven labeling, but for regulatory workflows, prefer the metadata table method.
Input Table:
| USUBJID | LBTEST | LBORRES | LBDTC |
|---|---|---|---|
| STUDY1-001 | GLUCOSE | 105.2 | 2022-03-16 |
| STUDY1-002 | GLUCOSE | 98.6 | 2022-03-21 |
| STUDY1-003 | GLUCOSE | 112.4 | 2022-03-23 |
Expected Output After Labelling:
> # Step 3: View value labels
> attr(ae$AESEV, "labels")
MILD MODERATE SEVERE
1 2 3
> attr(ae$AESER, "labels")
Yes No Unknown
"Y" "N" "U"
>
> # Step 4: Convert labelled vectors to factors for display/analysis
> ae$AESEV_F <- as_factor(ae$AESEV)
>
> # Step 5: Compare numeric and factor versions
> head(data.frame(
+ numeric = ae$AESEV,
+ factor = ae$AESEV_F
+ ))
numeric factor
1 1 MILD
2 2 MODERATE
3 1 MILD
3. Value Labels (Formats)
| Capability | SAS | R |
|---|---|---|
| Create value labels | proc format; value sexfmt 1='Male' 2='Female'; |
df$sex <- labelled(df$sex, c("Male"=1, "Female"=2)) |
| Apply formats | format sex sexfmt.; |
Already applied with labelled() |
| View formats | proc format; run; |
val_labels(df$sex) |
| Convert to factors | Use formatted value | as_factor(df$sex) |
SAS Example
/* Create formats for SDTM */
proc format;
value sexfmt 'M'='Male' 'F'='Female' 'U'='Unknown';
value ynfmt 'Y'='Yes' 'N'='No' 'U'='Unknown';
value sevfmt 1='MILD' 2='MODERATE' 3='SEVERE';
run;
/* Apply formats to AE dataset */
data ae;
input USUBJID $ AETERM $ AESEV AESER $;
format AESEV sevfmt. AESER ynfmt.;
datalines;
STUDY1-001 Headache 1 N
STUDY1-002 Nausea 2 N
STUDY1-003 Rash 1 N
;
run;
/* View data with formats */
proc print data=ae;
run;
Explanation (SAS):
- Creates a format using
proc formatthat maps numeric codes to severity levels - Applies the format to the AESEV variable using the
formatstatement - The format affects how values are displayed but doesn't change underlying data
- SAS procedures use formatted values in tables and graphs
R Example
library(haven)
library(tibble)
# Step 1: Create Adverse Events data with coded severity
ae <- tibble(
USUBJID = c("STUDY1-001", "STUDY1-002", "STUDY1-003"),
AETERM = c("Headache", "Nausea", "Rash"),
AESEV = c(1, 2, 1),
AESER = c("N", "N", "N")
)
# Step 2: Apply value labels using haven::labelled()
ae$AESEV <- labelled(
ae$AESEV,
labels = c("MILD" = 1, "MODERATE" = 2, "SEVERE" = 3)
)
ae$AESER <- labelled(
ae$AESER,
labels = c("Yes" = "Y", "No" = "N", "Unknown" = "U")
)
# Step 3: View value labels
attr(ae$AESEV, "labels")
attr(ae$AESER, "labels")
# Step 4: Convert labelled vectors to factors for display/analysis
ae$AESEV_F <- as_factor(ae$AESEV)
# Step 5: Compare numeric and factor versions
head(data.frame(
numeric = ae$AESEV,
factor = ae$AESEV_F
))
Explanation (R):
- Creates value labels for the adverse event severity codes
- Uses
labelled()function to attach value labels to variables - Value labels map numeric/character codes to descriptive text
- The original values remain unchanged but gain metadata
as_factor()converts labelled variables to factors for analysis/plotting- Follows CDISC SDTM standards for adverse event data
Expected Output After Value Labelling:
> # Step 3: View value labels
> attr(ae$AESEV, "labels")
MILD MODERATE SEVERE
1 2 3
> attr(ae$AESER, "labels")
Yes No Unknown
"Y" "N" "U"
>
> # Step 4: Convert labelled vectors to factors for display/analysis
> ae$AESEV_F <- as_factor(ae$AESEV)
>
> # Step 5: Compare numeric and factor versions
> head(data.frame(
+ numeric = ae$AESEV,
+ factor = ae$AESEV_F
+ ))
numeric factor
1 1 MILD
2 2 MODERATE
3 1 MILD
3. Variable Properties: Length and Types
| Capability | SAS | R |
|---|---|---|
| Set variable length | length USUBJID $200; |
xportr_length() with metadata |
| Set variable type | attrib x length=8 format=8.; |
R types + SAS type attributes |
| Column order | Order in data step or retain |
select() or xportr_order() |
SAS Example
/* Set variable lengths for DM domain */
data dm_v2;
length USUBJID $200 SUBJID $20;
set dm;
run;
/* Alternative with attrib */
data dm_v2;
attrib
USUBJID length=$200 label="Unique Subject Identifier"
SUBJID length=$20 label="Subject Identifier for the Study"
SEX length=$1 label="Sex"
AGE length=8 format=3. label="Age"
RFSTDTC length=$10 format=yymmdd10. label="Subject Reference Start Date/Time"
RFENDTC length=$10 format=yymmdd10. label="Subject Reference End Date/Time";
set dm;
run;
Explanation (SAS):
- Uses
lengthstatement to set character (with $) and numeric lengths following SDTM guidelines attribstatement combines length, format, and label in one statement- Setting correct lengths is crucial for data integrity and CDISC compliance
- Character lengths prevent truncation; numeric lengths affect precision
R Example
# ONLY run these lines after a fresh R restart
library(haven)
library(dplyr)
library(tibble)
library(xportr)
library(purrr) # For map_int
# Check package versions (paste the output of these)
print(sessionInfo())
print(packageVersion("xportr"))
print(packageVersion("dplyr")) # xportr often relies on dplyr
# Create mock dm dataset
dm <- tibble(
USUBJID = c("STUDY1-001", "STUDY1-002"),
SUBJID = c("001", "002"),
SEX = c("M", "F"),
AGE = c(34, 29),
RFSTDTC = c("2022-03-01", "2022-03-02"),
RFENDTC = c("2022-03-20", "2022-03-22")
)
# Create metadata
metadata <- tribble(
~dataset, ~variable, ~label, ~length, ~type,
"dm", "USUBJID", "Unique Subject Identifier", 200, "text",
"dm", "SUBJID", "Subject Identifier for the Study", 20, "text",
"dm", "SEX", "Sex", 1, "text",
"dm", "AGE", "Age", 8, "numeric",
"dm", "RFSTDTC", "Subject Reference Start Date/Time", 10, "date",
"dm", "RFENDTC", "Subject Reference End Date/Time", 10, "date"
)
# Apply metadata using xportr_metadata()
dm_with_metadata <- dm %>%
xportr_metadata(metadata = metadata, domain = "dm")
# Access the metadata directly
# More robust way to access the attribute
metadata <- attr(dm_with_metadata, "_xportr.df_metadata_")
# Extract the SAS lengths
sas_lengths <- metadata %>%
select(variable, length) %>%
deframe()
print(sas_lengths)
Explanation (R):
- Uses
xportr_length()to set SAS-compatible lengths via attributes - Lengths are stored as "SASlength" attributes on each column
- Types can be set with
xportr_type()to ensure proper export - A single metadata table can store all properties for centralized management
- Length attributes are used when writing to SAS format for CDISC compliant submission
Expected Output After Setting Lengths:
> print(sas_lengths)
USUBJID SUBJID SEX AGE RFSTDTC RFENDTC
200 20 1 8 10 10
4. Writing Data with Metadata
| Capability | SAS | R |
|---|---|---|
| Export to XPT | proc cport data=ds file="file.xpt"; |
write_xpt(df, "file.xpt") |
| Export to SAS7BDAT | libname out "path"; data out.ds; set ds; run; |
write_sas(df, "file.sas7bdat") |
| Metadata preservation | Automatic | Attributes preserved by haven/xportr |
| Compress data | compress=yes option |
compress=TRUE option |
| Dataset label on export | Part of dataset | label= parameter |
SAS Example
/* Export SDTM DM domain to XPT file (transport format) */
proc cport data=dm
file="c:\sdtm\dm.xpt";
run;
/* Export to SAS dataset */
libname sdtm "c:\sdtm";
data sdtm.dm;
set dm;
run;
Explanation (SAS):
proc cportcreates a transport file (.xpt) for cross-platform use- Native SAS dataset created by writing to a libname
- All metadata (labels, formats, lengths) is automatically preserved
- Transport files are commonly used for regulatory submissions
R Example
library(haven)
library(xportr)
# Method 1: Using haven
write_xpt(dm, "c:/sdtm/dm_haven.xpt", version = 5)
write_sas(dm, "c:/sdtm/dm_haven.sas7bdat")
# Method 2: Using xportr (preserves metadata better)
dm %>%
xportr_write("c:/sdtm/dm_xportr.xpt")
# Method 3: With dataset label specified at write time
dm %>%
xportr_write("c:/sdtm/dm_labelled.xpt",
label = "Demographics Domain")
Explanation (R):
write_xpt()from haven creates SAS transport fileswrite_sas()creates SAS7BDAT native format filesxportr_write()offers enhanced metadata preservation- The
version = 5parameter ensures XPT V5 compatibility (FDA standard) - Dataset labels can be applied or overridden at export time
- Attributes from R (labels, formats) are preserved in export
- Critical for regulatory submissions like FDA and PMDA
Input Dataset:
| USUBJID | SUBJID | SEX | AGE | RFSTDTC | RFENDTC |
|---|---|---|---|---|---|
| STUDY1-001 | 001 | M | 45 | 2022-03-15 | 2022-09-30 |
| STUDY1-002 | 002 | F | 62 | 2022-03-20 | 2022-10-05 |
| STUDY1-003 | 003 | M | 57 | 2022-03-22 | 2022-09-28 |
Expected SAS Output (when reading exported XPT):
PROC CONTENTS of dm dataset
Dataset Name: DM
Label: Demographics Domain
Variables:
# Variable Type Length Format Label
1 USUBJID Char 200 Unique Subject Identifier
2 SUBJID Char 20 Subject Identifier for the Study
3 SEX Char 1 Sex
4 AGE Num 8 Age
5 RFSTDTC Char 10 YYMMDD10. Subject Reference Start Date/Time
6 RFENDTC Char 10 YYMMDD10. Subject Reference End Date/Time
5. Beyond Basics: Controlling Output XPT Standards
| Feature | SAS | R with haven/xportr |
|---|---|---|
| XPT Version | V5 default in proc cport |
version = "5" or "8" |
| Variable name length | 8 chars (V5), 32 chars (V8) | Truncated based on version |
| Label truncation | 40 chars (V5), 200 chars (V8) | Truncated based on version |
| Column ordering | As in dataset or retain |
xportr_order() or select() |
| SAS engine compliance | Automatic | Validation with xportr_validate() |
SAS Example
/* Export with variable order control for SDTM compliance */
data ex_ordered;
retain STUDYID USUBJID EXSEQ EXTRT EXDOSE EXDOSU EXSTDTC EXENDTC;
set ex;
run;
proc cport data=ex_ordered
file="c:\sdtm\ex_ordered.xpt";
run;
Explanation (SAS):
retainstatement controls the order of variables in the EX (Exposure) domain- Variables are exported in the SDTM-specified order
- Transport files have specific limitations on name and label lengths
- SAS handles compliance with XPT standards automatically
- Critical for CDISC compliance in regulatory submissions
R Example
# Load necessary libraries
library(xportr)
library(tibble)
library(dplyr)
# Create exposure domain data
ex <- tibble(
USUBJID = c("STUDY1-001", "STUDY1-002", "STUDY1-003"),
STUDYID = rep("STUDY1", 3),
EXSEQ = c(1, 1, 1),
EXTRT = c("Compound X", "Compound X", "Compound X"),
EXSTDTC = c("2022-03-15", "2022-03-20", "2022-03-22"),
EXENDTC = c("2022-09-30", "2022-10-05", "2022-09-28"),
EXDOSE = c(100, 100, 100),
EXDOSU = c("mg", "mg", "mg")
)
# Define SDTM standard column order
ex_order <- c("STUDYID", "USUBJID", "EXSEQ", "EXTRT", "EXDOSE", "EXDOSU", "EXSTDTC", "EXENDTC")
# 1. Prepare the data frame by reordering columns
ex_export <- ex %>%
select(all_of(ex_order))
# 2. Validate the prepared data frame.
# This function prints messages to the console but does not return the data frame.
xpt_validate(ex_export)
# 3. Write the original data frame (ex_export) to the XPT file.
xportr_write(ex_export, path = "C:/Users/gnana/OneDrive/Documents/R Programming/ex_v5.xpt")
#Round trip validation output to verify the exported xpt file
# Specify the full path to your exported file
file_path <- "C:/Users/gnana/OneDrive/Documents/R Programming/ex_v5.xpt"
# --- Validation Step ---
# Attempt to read the .xpt file.
# A successful read indicates a valid file structure.
tryCatch({
re_imported_data <- read_xpt(file_path)
# If successful, print a confirmation and view the data
print("Validation successful: The XPT file was read without errors.")
print(head(re_imported_data))
}, error = function(e) {
# If an error occurs, the file is not valid
print(paste("Validation failed. Error reading XPT file:", e$message))
})
Output:
[1] "Validation successful: The XPT file was read without errors."
# A tibble: 3 × 8
STUDYID USUBJID EXSEQ EXTRT EXDOSE EXDOSU EXSTDTC EXENDTC
<chr> <chr> <dbl> <chr> <dbl> <chr> <chr> <chr>
1 STUDY1 STUDY1-001 1 Compound X 100 mg 2022-03-… 2022-0…
2 STUDY1 STUDY1-002 1 Compound X 100 mg 2022-03-… 2022-1…
3 STUDY1 STUDY1-003 1 Compound X 100 mg 2022-03-… 2022-0…
Explanation (R):
- The SDTM column order is set using
select(all_of(ex_order))for explicit control. xpt_validate()checks the data frame for XPT compliance, printing issues to the console.xportr_write()exports the data to an XPT file, preserving metadata and structure.- After export, a round-trip validation is performed by reading the XPT file back with
read_xpt(). If successful, the file structure and data integrity are confirmed. - This workflow ensures that exported files meet regulatory requirements (e.g., FDA, PMDA) for variable names, labels, and ordering, and that the output is verifiable and reproducible.
- Use version 5 for 8-character variable names and 40-character labels; version 8 allows longer names/labels if required.
6. Beyond Basics: Date Handling and Special Types
| Capability | SAS | R |
|---|---|---|
| Date storage | Days since 1960-01-01 | Date class or numeric with attribute |
| Datetime storage | Seconds since 1960-01-01 | POSIXct with attribute |
| Format specification | format date date9.; |
Custom attributes |
| Special formats | Various SAS formats | Custom implementation required |
SAS Example
/* Create SDTM VS domain with various date formats */
data vs;
input USUBJID $ VISITNUM VISIT $ VSTESTCD $ VSDTC $10. VSTIME $8. VSSTRESN;
VSDTM = dhms(input(VSDTC, yymmdd10.), 0, 0, input(VSTIME, time8.));
format VSDTC yymmdd10. VSTIME time8. VSDTM datetime19.;
datalines;
STUDY1-001 1 BASELINE SYSBP 2022-03-15 08:30:00 120.5
STUDY1-002 1 BASELINE SYSBP 2022-03-20 09:15:45 118.0
STUDY1-003 1 BASELINE SYSBP 2022-03-22 10:45:30 135.5
;
run;
proc print data=vs;
run;
Explanation (SAS):
- SAS stores dates as days since January 1, 1960
- Datetime values are stored as seconds since January 1, 1960
- Time values are stored as seconds since midnight
- Combines date and time into datetime using
dhms()function - Formats control how dates, times, and vital sign results are displayed
- Various specialized formats available for different needs
R Example
# Load required libraries
library(haven)
library(tibble)
library(hms) # For handling time-only data
# Create the vital signs dataframe
vs <- tibble(
USUBJID = c("STUDY1-001", "STUDY1-002", "STUDY1-003"),
VISITNUM = c(1, 1, 1),
VISIT = c("BASELINE", "BASELINE", "BASELINE"),
VSTESTCD = c("SYSBP", "SYSBP", "SYSBP"),
VSDTC = c("2022-03-15", "2022-03-20", "2022-03-22"),
VSTIME = c("08:30:00", "09:15:45", "10:45:30"),
VSSTRESN = c(120.5, 118.0, 135.5)
)
# Combine date and time to create a POSIXct datetime object
# This is done before the original columns are converted to ensure clean input
vs$VSDTM <- as.POSIXct(paste(vs$VSDTC, vs$VSTIME), tz = "UTC")
# Convert date and time columns to their proper R data classes
vs$VSDTC <- as.Date(vs$VSDTC) # Convert to Date class
vs$VSTIME <- as_hms(vs$VSTIME) # Convert to hms class (which is numeric)
# Assign SAS formats for compatibility
attr(vs$VSDTC, "format.sas") <- "YYMMDD10."
attr(vs$VSTIME, "format.sas") <- "TIME8."
attr(vs$VSDTM, "format.sas") <- "DATETIME19."
# Write the dataset to an XPT file
# write_xpt correctly handles Date, hms, and POSIXct objects
write_xpt(vs, path = "C:/Users/gnana/OneDrive/Documents/R Programming/vs.xpt")
# Read the XPT file back in to verify the fix
vs_check <- read_xpt("C:/Users/gnana/OneDrive/Documents/R Programming/vs.xpt")
# Display the result to confirm it worked
print(vs_check)
# Check the formats of the columns in the read dataset
# 1. Check the format for the Date column (VSDTC)
attr(vs_check$VSDTC, "format.sas")
# 2. Check the format for the Time column (VSTIME)
attr(vs_check$VSTIME, "format.sas")
# 3. Check the format for the Datetime column (VSDTM)
attr(vs_check$VSDTM, "format.sas")
# 4. Check a column with no format
attr(vs_check$USUBJID, "format.sas")
Explanation (R):
- R stores dates as days since January 1, 1970 (Date class)
- POSIXct stores datetimes as seconds since January 1, 1970
- Creates combined datetime from separate date and time columns
- haven converts between R and SAS date epochs automatically
- SAS formats can be stored as attributes for export
- Format attributes control how dates and times appear in SAS after export
Input Date Data:
| USUBJID | VISITNUM | VISIT | VSTESTCD | VSDTC | VSTIME | VSSTRESN |
|---|---|---|---|---|---|---|
| STUDY1-001 | 1 | BASELINE | SYSBP | 2022-03-15 | 08:30:00 | 120.5 |
| STUDY1-002 | 1 | BASELINE | SYSBP | 2022-03-20 | 09:15:45 | 118.0 |
| STUDY1-003 | 1 | BASELINE | SYSBP | 2022-03-22 | 10:45:30 | 135.5 |
Output in R:
> print(vs_check)
# A tibble: 3 × 8
USUBJID VISITNUM VISIT VSTESTCD VSDTC VSTIME VSSTRESN VSDTM
<chr> <chr> <dbl> <chr> <date> <time> <dbl> <dttm>
1 STUDY1-001 1 BASELINE SYSBP 2022-03-15 08:30:00 120. 2022-03-15 08:30:00
2 STUDY1-002 1 BASELINE SYSBP 2022-03-20 09:15:45 118 2022-03-20 09:15:45
3 STUDY1-003 1 BASELINE SYSBP 2022-03-22 10:45:30 136. 2022-03-22 10:45:30
>
> # Check the formats of the columns in the read dataset
> # 1. Check the format for the Date column (VSDTC)
> attr(vs_check$VSDTC, "format.sas")
[1] "YYMMDD10"
>
> # 2. Check the format for the Time column (VSTIME)
> attr(vs_check$VSTIME, "format.sas")
[1] "TIME8"
>
> # 3. Check the format for the Datetime column (VSDTM)
> attr(vs_check$VSDTM, "format.sas")
[1] "DATETIME19"
>
> # 4. Check a column with no format
> attr(vs_check$USUBJID, "format.sas")
NULL
7. Beyond Basics: Creating SAS Catalogs and Format Libraries
SAS format catalogs (.sas7bcat) store format definitions that can be used across multiple datasets. In R, we can simulate this through centralized metadata tables.
| Capability | SAS | R |
|---|---|---|
| Format library | proc format library=lib; |
Central metadata table |
| Catalog export | proc cport formats |
Not directly supported |
| Format association | libname fmtlib (formats); |
xportr_format() with metadata |
| Format inheritance | Through library reference | Through metadata application |
SAS Example
/* Create a format catalog for SDTM domains */
libname fmtlib "c:\sdtm\formats";
proc format library=fmtlib;
value sexfmt 'M'='Male' 'F'='Female' 'U'='Unknown';
value ynfmt 'Y'='Yes' 'N'='No' 'U'='Unknown';
value agegrpfmt 1='<30 years' 2='30-60 years' 3='>=60 years';
value racefmt 1='WHITE' 2='BLACK OR AFRICAN AMERICAN' 3='ASIAN' 4='OTHER';
run;
/* Use the format library in DM domain */
libname sdtm "c:\sdtm\data";
data sdtm.dm;
set dm;
/* Create derived age group variable */
if AGE < 30 then AGEGRP = 1;
else if AGE >= 30 and AGE < 60 then AGEGRP = 2;
else if AGE >= 60 then AGEGRP = 3;
format SEX sexfmt. AGEGRP agegrpfmt.;
run;
Explanation (SAS):
- Creates a format catalog in the fmtlib library with SDTM-relevant formats
- Format definitions are stored separately from datasets
- Includes formats for sex, yes/no flags, age groups, and visit numbers
- Formats can be shared across multiple SDTM domains
- Libraries can be referenced to use stored formats
- Standard approach in clinical research for consistency
R Equivalent Approach
# Load necessary libraries
library(tibble)
library(dplyr)
library(haven)
# 1. Dummy DM (Demographics) Dataset
# ----------------------------------------------------------------
dm <- tibble(
STUDYID = "XYZ123",
DOMAIN = "DM",
USUBJID = paste("XYZ123-0", 10:19, sep = ""),
SUBJID = as.character(10:19),
RFSTDTC = "2024-01-15",
RFENDTC = "2024-06-20",
SITEID = sample(c("S101", "S102", "S103"), 10, replace = TRUE),
AGE = sample(25:75, 10, replace = TRUE),
AGEU = "YEARS",
SEX = sample(c("M", "F", "U"), 10, replace = TRUE, prob = c(0.45, 0.45, 0.1)),
RACE = sample(c("WHITE", "BLACK OR AFRICAN AMERICAN", "ASIAN", "OTHER"), 10, replace = TRUE),
ETHNIC = sample(c("HISPANIC OR LATINO", "NOT HISPANIC OR LATINO"), 10, replace = TRUE),
ARMCD = sample(c("PBO", "TRT1"), 10, replace = TRUE),
ARM = ifelse(ARMCD == "PBO", "Placebo", "Treatment 100mg"),
ACTARMCD = ARMCD,
ACTARM = ARM,
COUNTRY = "USA"
)
# 2. Central Format Library
# All 'value's are characters, so the tibble can be created.
# ----------------------------------------------------------------
format_library <- tribble(
~format_name, ~value, ~label,
"sexfmt", "M", "Male",
"sexfmt", "F", "Female",
"sexfmt", "U", "Unknown",
"racefmt", 1, "WHITE",
"racefmt", 2, "BLACK OR AFRICAN AMERICAN",
"racefmt", 3, "ASIAN",
"racefmt", 4, "OTHER",
"ynfmt", "Y", "Yes",
"ynfmt", "N", "No",
"ynfmt", "U", "Unknown",
"agegrpfmt", 1, "<30 years",
"agegrpfmt", 2, "30-60 years",
"agegrpfmt", 3, ">=60 years"
)
# 3. Format Metadata
# This table links a variable in a dataset to a format name.
# Note: We will format a new numeric variable 'RACEN' instead of 'RACE'.
# ----------------------------------------------------------------
format_meta <- tribble(
~dataset, ~variable, ~format,
"dm", "SEX", "sexfmt",
"dm", "AGEGRP", "agegrpfmt",
"dm", "RACEN", "racefmt" # We will create RACEN to hold the numeric codes
)
# 4. Create Derived Variables for Formatting
# We need numeric versions of AGEGRP and RACE to match our formats.
# ----------------------------------------------------------------
dm_to_format <- dm %>%
mutate(
# Create numeric AGEGRP from AGE
AGEGRP = case_when(
AGE < 30 ~ 1,
AGE >= 30 & AGE < 60 ~ 2,
AGE >= 60 ~ 3
),
# Create numeric RACEN from the character RACE variable
RACEN = case_when(
RACE == "WHITE" ~ 1,
RACE == "BLACK OR AFRICAN AMERICAN" ~ 2,
RACE == "ASIAN" ~ 3,
RACE == "OTHER" ~ 4
)
)
# 5. The "Smart" Formatting Function
# This function correctly handles type conversion before applying labels.
# ----------------------------------------------------------------
apply_sdtm_formats <- function(df, dataset_name, format_meta, format_library) {
dataset_formats <- format_meta %>%
filter(dataset == dataset_name)
for (i in 1:nrow(dataset_formats)) {
var <- dataset_formats$variable[i]
fmt <- dataset_formats$format[i]
# Special case for date formats
if (fmt %in% c("YYMMDD10.", "DATETIME19.")) {
if (var %in% names(df)) {
attr(df[[var]], "format.sas") <- fmt
}
next
}
# Extract format definition
format_def <- format_library %>%
filter(format_name == fmt)
if (nrow(format_def) == 0) next
# Create named vector for labelled
values <- format_def$value
names(values) <- format_def$label
format_vector <- values
# Apply to variable if it exists
if (var %in% names(df)) {
df[[var]] <- haven::labelled(df[[var]], format_vector)
attr(df[[var]], "format.sas") <- paste0(fmt, ".")
}
}
return(df)
}
# 6. Apply the Formats
# ----------------------------------------------------------------
dm_formatted <- dm_to_format %>%
xportr_df_label(ds_metadata, domain = "dm") %>%
xportr_label(var_metadata, domain = "dm") %>%
xportr_length(var_metadata, domain = "dm") %>%
apply_sdtm_formats(var_meta = var_metadata %>% filter(dataset == "dm"), format_lib = format_library)
print(purrr::map(dm_formatted, attr, "labels"))
# 7. Write the file (Unchanged)
xpt_path <- "C:/Users/gnana/OneDrive/Documents/R Programming/dm.xpt"
haven::write_xpt(dm_formatted, path = xpt_path, version = 5)
cat(paste("Success. The compliant XPT file has been written to:", xpt_path, "\n"))
# 8. Verification Script (Will now show the correct output)
cat(paste("\nReading file from:", xpt_path, "\n\n"))
dm_read_back <- haven::read_xpt(xpt_path)
# Check 1: Dataset Label
cat("\n--- 1. Dataset Label Verification ---\n")
print(attr(dm_read_back, "label"))
# Check 2: Variable Labels
cat("\n--- 2. Variable Label Verification ---\n")
print(purrr::map(dm_read_back, attr, "label"))
# Check 3: Value Labels (Formats)
cat("\n--- 3. Value Label (Format) Verification ---\n")
print(purrr::map(dm_read_back, attr, "labels"))
# Check 4: Formats
sapply(vs_check, function(x) attr(x, "format.sas"))
# Use haven::as_factor() to apply the labels from the file
dm_as_factor <- haven::as_factor(dm_read_back)
# Print the resulting data frame
print(dm_as_factor)
str(dm_as_factor)
Output:
--- 1. Dataset Label Verification ---
> print(attr(dm_read_back, "label"))
[1] "Demographics Domain"
>
> # Check 2: Variable Labels
> cat("\n--- 2. Variable Label Verification ---\n")
--- 2. Variable Label Verification ---
> print(purrr::map(dm_read_back, attr, "label"))
$STUDYID
[1] "Study Identifier"
$USUBJID
[1] "Unique Subject Identifier"
$SUBJID
[1] "Subject Identifier for the Study"
$SITEID
[1] "Study Site Identifier"
$SEX
NULL
$RACE
NULL
$AGE
[1] "Age"
$RFSTDTC
[1] "Subject Reference Start Date"
$RFENDTC
[1] "Subject Reference End Date"
$AGEGRP
NULL
>
> # Check 3: Value Labels (Formats)
> cat("\n--- 3. Value Label (Format) Verification ---\n")
--- 3. Value Label (Format) Verification ---
> print(purrr::map(dm_read_back, attr, "labels"))
$STUDYID
NULL
$USUBJID
NULL
$SUBJID
NULL
$SITEID
NULL
$SEX
NULL
$RACE
NULL
$AGE
NULL
$RFSTDTC
NULL
$RFENDTC
NULL
$AGEGRP
NULL
>
> # Check 4: Formats
> sapply(vs_check, function(x) attr(x, "format.sas"))
$USUBJID
NULL
$VISITNUM
NULL
$VISIT
NULL
$VSTESTCD
NULL
$VSDTC
[1] "YYMMDD10"
$VSTIME
[1] "TIME8"
$VSSTRESN
NULL
$VSDTM
[1] "DATETIME19"
>
> # Use haven::as_factor() to apply the labels from the file
> dm_as_factor <- haven::as_factor(dm_read_back)
>
> # Print the resulting data frame
> print(dm_as_factor)
# A tibble: 3 × 10
STUDYID USUBJID SUBJID SITEID SEX RACE AGE RFSTDTC RFENDTC AGEGRP
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <date> <date> <dbl>
1 STUDY1 STUDY1-001 001 01 M 1 45 2022-03-15 2022-09-30 2
2 STUDY1 STUDY1-002 002 01 F 2 62 2022-03-15 2022-09-30 3
3 STUDY1 STUDY1-003 003 02 M 3 57 2022-03-15 2022-09-30 2
> str(dm_as_factor)
tibble [3 × 10] (S3: tbl_df/tbl/data.frame)
$ STUDYID: chr [1:3] "STUDY1" "STUDY1" "STUDY1"
..- attr(*, "label")= chr "Study Identifier"
$ USUBJID: chr [1:3] "STUDY1-001" "STUDY1-002" "STUDY1-003"
..- attr(*, "label")= chr "Unique Subject Identifier"
$ SUBJID : chr [1:3] "001" "002" "003"
..- attr(*, "label")= chr "Subject Identifier for the Study"
$ SITEID : chr [1:3] "01" "01" "02"
..- attr(*, "label")= chr "Study Site Identifier"
$ SEX : chr [1:3] "M" "F" "M"
..- attr(*, "format.sas")= chr "sexfmt"
$ RACE : num [1:3] 1 2 3
..- attr(*, "format.sas")= chr "racefmt"
$ AGE : num [1:3] 45 62 57
..- attr(*, "label")= chr "Age"
$ RFSTDTC: Date[1:3], format: "2022-03-15" "2022-03-15" "2022-03-15"
$ RFENDTC: Date[1:3], format: "2022-09-30" "2022-09-30" "2022-09-30"
$ AGEGRP : num [1:3] 2 3 2
..- attr(*, "format.sas")= chr "agegrpfm"
- attr(*, "label")= chr "Demographics Domain"
Note: Check 3: Value Labels (Formats) - Value Labels Disappear
haven::write_xpt()does not preserve value labels (attr(x, "labels")) in the.xptoutput. It complies with the SAS Transport v5 format spec but omits formats because:- SAS Transport files only store formats as text references, not the format definitions.
havendoesn't write the necessary format catalogs (.sas7bcat) which SAS uses to interpret those formats during reading.
So even though your variables have labels attributes attached in memory, they’re stripped when writing to the file—and therefore missing when read back.
If preserving formats is critical:
- Consider storing the format definitions externally (e.g., using JSON, RDS) and re-applying them after reading the
.xpt. - Alternatively, switch to
sas7bdatformat if working in an environment that supports it, or usehaven::write_sas()which allows the export of both data and formats via catalog.
If your downstream system supports it, use:
haven::write_sas(dm_prepped, path = "dm.sas7bdat")
This format does preserve value labels and is more flexible than .xpt, though not FDA-submission compliant.
Explanation (R):
- Demonstrates how to automate export of multiple SDTM domains using a control table and functional programming.
- Each domain is exported by retrieving the corresponding data frame from the environment, applying metadata (labels, formats, lengths), and writing to XPT using
xportr_write(). - The process uses
purrr::pmap_chr()to iterate over each row of the control table, ensuring consistent metadata application and export for each dataset. - This approach is scalable for large studies, maintains reproducibility, and centralizes export logic for easier maintenance.
9. Best Practices for Metadata and Exports
| Area | Best Practice |
|---|---|
| Metadata Storage | Centralize in shared tables/databases |
| Validation | Check round-trip data/metadata preservation |
| Variable Names | Follow 8-char rule for V5 XPT (regulatory) |
| Labels | Keep under 40 chars for V5 XPT compatibility |
| Formats | Store centrally and apply consistently |
| Date Handling | Test date/datetime round-trips |
| Character Encoding | Use UTF-8 when possible; test special characters |
| Export Path | Use file.path() for cross-platform compatibility |
| Documentation | Document metadata sources and mapping rules |
| Automation | Use control tables for multi-dataset exports |
| Version Control | Track metadata tables in version control |
**Resource download links**
1.4.2.-Applying-Metadata-and-Outputting-SAS-XPT-Files-with-haven.zip