1.4.1. Read SAS Datasets: R vs SAS Programming

1.4. Migrating from SAS to R: A Skill Conversion Guide

1.4.1 How to Read SAS Datasets with R Programming Using Haven Package: A Complete Guide

The transition from SAS to R has become increasingly important for data analysts and statisticians seeking more flexible, cost-effective solutions. The haven package provides a powerful bridge between these two environments, enabling seamless import and export of SAS datasets while preserving critical metadata like variable labels, value formats, and special missing values.

This comprehensive guide demonstrates how to leverage R's haven package for SAS data processing, comparing approaches side-by-side with traditional SAS programming. Whether you're a beginner exploring R or an advanced user seeking production-ready workflows, this guide covers everything from basic data import to sophisticated automation strategies.

1. Getting Started: Installation and Setup

1.1 Haven Package Installation

The haven package is your gateway to SAS data in R. Here are multiple installation approaches:

Basic Installation:

# Method 1: Install from CRAN (recommended)
install.packages("haven")

# Method 2: Install via tidyverse (includes haven)
install.packages("tidyverse")

# Load the package
library(haven)

# Verify installation
packageVersion("haven")

Complete Environment Setup:

# Recommended package ecosystem
packages_to_install <- c(
    "haven",        # SAS/SPSS/Stata import/export
    "tidyverse",    # Data manipulation ecosystem  
    "labelled",     # Enhanced labelled data support
    "sjlabelled",   # Additional label utilities
    "lubridate",    # Date/time handling
    "readr"         # Fast data reading
)

# Install all packages
install.packages(packages_to_install)

1.2 Understanding SAS File Formats

Before diving into code examples, it's essential to understand the different SAS file formats you'll encounter:

.sas7bdat - Native SAS dataset format containing data and metadata
.xpt - SAS transport format for cross-platform data sharing
.sas7bcat - SAS format catalog containing value label definitions
.sas7bndx - SAS index files for performance optimization

2. Basic Data Import: Your First Steps

Simple Dataset Reading

Let's start with the fundamental operation - reading a SAS dataset into R.

SAS Approach:

/* Set up SAS library */
libname mydata "C:/data";

/* Read dataset */
data work.patients;
  set mydata.patients;
run;

/* View dataset contents */
proc contents data=work.patients;
run;

R Haven Approach:

# Read SAS dataset directly
patients <- read_sas("C:/data/patients.sas7bdat")

# View dataset structure
str(patients)
glimpse(patients)  # if dplyr is loaded

Sample Input Dataset Structure:

Variable	Type	Label	Format
PATID	Numeric	Patient ID	8.
AGE	Numeric	Patient Age	3.
SEX	Character	Gender	$SEXFMT.
WEIGHT	Numeric	Weight (kg)	5.1
TREATMENT	Character	Treatment	$TRTFMT.
VISIT_DATE	Date	Visit Date	DATE9.

Expected R Output After read_sas():

# A tibble: 150 × 6
   PATID   AGE SEX   WEIGHT TREATMENT VISIT_DATE
   <dbl> <dbl> <chr>  <dbl> <chr>     <date>    
1   1001    45 M       82.5 A         2023-01-15
2   1002    62 F       64.3 B         2023-01-16
3   1003    33 M       91.0 A         2023-01-17
4   1004    29 F       67.8 C         2023-01-18
5   1005    51 F       75.2 B         2023-01-19

Variable Labels (preserved in attributes):
$PATID: "Patient ID"
$AGE: "Patient Age" 
$SEX: Format labels preserved as haven_labelled
$WEIGHT: "Weight (kg)"

3. Working with Labels and Formats

Variable Labels vs Value Labels

Understanding the difference between variable labels and value labels is crucial:

Variable Labels: Descriptive text for the variable itself (e.g., "Patient Age in Years")
Value Labels: Text descriptions for specific values (e.g., 1="Male", 2="Female")

SAS Format Creation and Application:

/* Create value formats */
proc format;
  value sexfmt 
    1 = 'Male'
    2 = 'Female';
  value trtfmt 
    'A' = 'Drug A'
    'B' = 'Drug B' 
    'C' = 'Placebo';
run;

/* Apply labels and formats */
data patients_labeled;
  set mydata.patients;
  
  label 
    age = "Patient Age (years)"
    sex = "Patient Gender"
    weight = "Weight in Kilograms"
    treatment = "Treatment Assignment";
  
  format sex sexfmt. treatment trtfmt.;
run;

R Haven Equivalent:

# Read with format catalog
patients <- read_sas("patients.sas7bdat", 
                     catalog_file = "formats.sas7bcat")

#Note: R cannot generate a .sas7bcat format catalog file, because that file type is proprietary to SAS

# Add variable labels
var_label(patients$AGE) <- "Patient Age (years)"
var_label(patients$SEX) <- "Patient Gender"
var_label(patients$WEIGHT) <- "Weight in Kilograms"
var_label(patients$TREATMENT) <- "Treatment Assignment"

# Create value labels manually
patients$SEX <- labelled(patients$SEX, c("Male" = 1, "Female" = 2))
patients$TREATMENT <- labelled(patients$TREATMENT, c("Drug A" = "A", "Drug B" = "B", "Placebo" = "C"))

# Convert to factors for analysis
patients_factor <- patients %>%
  mutate(
    SEX_F = as_factor(SEX),
    TREATMENT_F = as_factor(TREATMENT)
  )

Input: Raw SAS Data with Numeric Codes

SUBJID	SEX	RACE	STATUS
101	1	1	1
102	2	3	1
103	1	2	2
104	2	1	3

SAS Format Definitions:

SEX: 1='Male', 2='Female'
RACE: 1='White', 2='Black', 3='Asian'
STATUS: 1='Active', 2='Completed', 3='Discontinued'

Expected Output After as_factor() in R:

# A tibble: 4 × 4
  SUBJID SEX    RACE  STATUS      
   <dbl> <fct>  <fct> <fct>       
1    101 Male   White Active      
2    102 Female Asian Active      
3    103 Male   Black Completed   
4    104 Female White Discontinued

4. Handling Missing Values

Understanding SAS Missing Value System

SAS has a sophisticated missing value system that goes beyond simple NA values:

Standard Missing (.) - Regular missing value
Special Missing (.A to .Z) - Coded missing with specific meanings
Numeric Missing Codes - User-defined values like -9, -99

SAS Missing Value Example:

data missing_demo;
  input id value1 value2 $;
  datalines;
  1 25 A
  2 .  B
  3 .A C
  4 .Z .
  5 30 D
  ;
run;

/* Check missing patterns */
proc means data=missing_demo nmiss;
  var value1;
run;

/* Handle different missing types */
data clean_data;
  set missing_demo;
  
  if value1 = . then missing_flag = 1;
  if value1 = .A then reason = "Not Applicable";
  if value1 = .Z then reason = "Not Done";
  
  where value1 is not missing;
run;

R Haven Equivalent:

# Read with special missing values preserved
#missing_df <- read_sas("missing_demo.sas7bdat", user_na = TRUE)

# Create data with special missing values
missing_demo <- tibble(
  id = 1:5,
  value1 = c(25, NA, tagged_na("A"), tagged_na("Z"), 30),
  value2 = c("A", "B", "C", NA, "D")
)

# Handle different missing types
missing_demo <- missing_demo %>%
  mutate(
    missing_flag = ifelse(is.na(value1), 1, 0),
    missing_type = case_when(
      is_tagged_na(value1, "A") ~ "Not Applicable",
      is_tagged_na(value1, "Z") ~ "Not Done", 
      is.na(value1) ~ "Missing",
      TRUE ~ "Present"
    )
  )

# Filter non-missing values
clean_data <- missing_demo %>%
  filter(!is.na(value1))

Input: SAS Dataset with Special Missing Values

ID	VALUE1	VALUE2	REASON
1	25	Present
2	.	Missing	Standard Missing
3	.A	Present	Not Applicable
4	.Z	Present	Not Done
5	30	Present

Expected Output in R with Tagged NAs:

# A tibble: 5 × 4
     ID VALUE1 VALUE2  missing_type  
  <dbl>  <dbl> <chr>   <chr>         
1     1   25   Present Present       
2     2   NA   Missing Missing       
3     3   NA   Present Not Applicable
4     4   NA   Present Not Done      
5     5   30   Present Present       

# Verify tagged NA values:
is_tagged_na(data$VALUE1, 'A')  # Returns: FALSE FALSE TRUE FALSE FALSE
is_tagged_na(data$VALUE1, 'Z')  # Returns: FALSE FALSE FALSE TRUE FALSE

5. Metadata-Driven Automation

For production environments, metadata-driven approaches provide scalability and maintainability.

SAS Metadata-Driven Processing:

/* Control table for batch processing */
data import_control;
  input dataset $20. path $50. keep_vars $100. filter_condition $200.;
  datalines;
demographics /data/demo.sas7bdat USUBJID AGE SEX RACE where AGE >= 18
vitals /data/vitals.sas7bdat USUBJID VISITNUM SYSBP DIABP where SYSBP > 0
labs /data/labs.sas7bdat USUBJID LBTEST LBRESULT where LBRESULT is not missing
;
run;

/* Dynamic processing macro */
%macro batch_import;
  %let dsid = %sysfunc(open(import_control));
  %let nobs = %sysfunc(attrn(&dsid, nobs));
  %let rc = %sysfunc(close(&dsid));
  
  %do i = 1 %to &nobs;
    data _null_;
      set import_control(firstobs=&i obs=&i);
      call symputx('ds', dataset);
      call symputx('path', path);
      call symputx('keep', keep_vars);
      call symputx('filter', filter_condition);
    run;
    
    data &ds;
      set "&path" (keep=&keep);
      &filter;
    run;
  %end;
%mend;

%batch_import;

R Haven Metadata-Driven Processing:

library(tibble)
library(labelled)
library(readxl)     # Optional: for actual file reading
library(haven)      # Optional: for read_xpt/write_xpt
library(dplyr)
library(purrr)
library(rlang)
library(stringr)

# --- Define datasets with labels ---
DEMOGRAPHICS <- tibble(
  USUBJID = paste0("SUBJ", 1:5),
  AGE = c(19, 45, 30, 61, 22),
  SEX = labelled(c(1, 2, 1, 2, 1), c(MALE = 1, FEMALE = 2)),
  RACE = labelled(c("WHITE", "ASIAN", "BLACK", "WHITE", "OTHER"),
                  c("WHITE" = "WHITE", "ASIAN" = "ASIAN", "BLACK" = "BLACK", "OTHER" = "OTHER"))
)

var_label(DEMOGRAPHICS$USUBJID) <- "Unique Subject Identifier"
var_label(DEMOGRAPHICS$AGE) <- "Age at Screening"
var_label(DEMOGRAPHICS$SEX) <- "Biological Sex"
var_label(DEMOGRAPHICS$RACE) <- "Race of Subject"

VITALS <- tibble(
  USUBJID = paste0("SUBJ", 1:5),
  VISITNUM = 1:5,
  SYSBP = c(120, 145, 138, 135, 150),
  DIABP = c(80, 90, 92, 85, 88)
)

var_label(VITALS$USUBJID) <- "Unique Subject Identifier"
var_label(VITALS$VISITNUM) <- "Visit Number"
var_label(VITALS$SYSBP) <- "Systolic Blood Pressure"
var_label(VITALS$DIABP) <- "Diastolic Blood Pressure"

LABS <- tibble(
  USUBJID = paste0("SUBJ", 1:5),
  LBTEST = c("GLUCOSE", "ALT", "AST", "CHOLESTEROL", "HB"),
  LBRESULT = c(80, 34, 29, 198, 14)
)

var_label(LABS$USUBJID) <- "Unique Subject Identifier"
var_label(LABS$LBTEST) <- "Lab Test Name"
var_label(LABS$LBRESULT) <- "Lab Test Result"

# --- Import control ---
import_control <- tibble(
  dataset = c("demographics", "vitals", "labs"),
  path = c("demo.sas7bdat", "vitals.sas7bdat", "labs.sas7bdat"),
  col_select = c("USUBJID, AGE, SEX, RACE", "USUBJID, VISITNUM, SYSBP, DIABP", "USUBJID, LBTEST, LBRESULT"),
  filter_expr = c("AGE >= 18", "SYSBP > 0", "!is.na(LBRESULT)"),
  validation_rules = c("AGE > 0 & AGE < 150", "SYSBP >= 70 & SYSBP <= 250", "LBRESULT >= 0")
)

# --- Flexible import and validation ---
import_and_validate <- function(dataset, path, col_select, filter_expr, validation_rules) {
  cols <- str_split(col_select, ", ")[[1]]
  
  tryCatch({
    # Use actual data instead of reading files
    data <- switch(dataset,
                   demographics = DEMOGRAPHICS,
                   vitals = VITALS,
                   labs = LABS)
    
    data <- data %>% select(all_of(cols))
    
    if (!is.na(filter_expr)) {
      filter_condition <- parse_expr(filter_expr)
      data <- data %>% filter(!!filter_condition)
    }
    
    if (!is.na(validation_rules)) {
      validation_condition <- parse_expr(validation_rules)
      validation_failures <- data %>%
        filter(!(!!validation_condition)) %>%
        nrow()
      if (validation_failures > 0) {
        warning(paste("Dataset", dataset, "has", validation_failures, "validation failures"))
      }
    }
    
    attr(data, "import_timestamp") <- Sys.time()
    attr(data, "source_file") <- path
    attr(data, "validation_rules") <- validation_rules
    
    return(data)
    
  }, error = function(e) {
    warning(paste("Failed to import", dataset, ":", e$message))
    return(NULL)
  })
}

# --- Apply across all controls ---
imported_datasets <- import_control %>%
  pmap(import_and_validate) %>%
  set_names(import_control$dataset)

imported_datasets <- imported_datasets[!sapply(imported_datasets, is.null)]

# --- Summarize results ---
import_summary <- map_dfr(names(imported_datasets), ~{
  data <- imported_datasets[[.x]]
  tibble(
    dataset = .x,
    rows = nrow(data),
    columns = ncol(data),
    import_time = attr(data, "import_timestamp"),
    source_file = attr(data, "source_file")
  )
})

print(import_summary)

Input: Control Table for Batch Processing

Dataset	Path	Col_Select	Filter_Expr	Validation_Rules
demographics	/data/demo.sas7bdat	USUBJID, AGE, SEX, RACE	AGE >= 18	AGE > 0 & AGE < 150
vitals	/data/vitals.sas7bdat	USUBJID, VISITNUM, SYSBP, DIABP	SYSBP > 0	SYSBP >= 70 & SYSBP <= 250
labs	/data/labs.sas7bdat	USUBJID, LBTEST, LBRESULT	!is.na(LBRESULT)	LBRESULT >= 0

Expected Output: Import Summary Report

> print(import_summary)
# A tibble: 3 × 5
  dataset       rows columns import_time         source_file    
  <chr>        <int>   <int> <dttm>              <chr>          
1 demographics     5       4 2025-07-27 09:31:09 demo.sas7bdat  
2 vitals           5       4 2025-07-27 09:31:09 vitals.sas7bdat
3 labs             5       3 2025-07-27 09:31:09 labs.sas7bdat

6. Troubleshooting Common Issues

6.1 File Access and Encoding Problems

Issue: Special characters appear as question marks or garbled text Solution: Specify encoding explicitly

df <- read_sas("international_data.sas7bdat", encoding = "UTF-8")

Issue: Cannot read SAS files from network drives Solution: Use mapped drives or UNC paths with proper authentication

# Use mapped drive
df <- read_sas("Z:/shared_data/study.sas7bdat")

# Or full UNC path
df <- read_sas("//server/share/data/study.sas7bdat")

6.2 Memory and Performance Issues

Issue: R crashes when loading large SAS files Solution: Use selective column reading and chunked processing

# Read only essential columns
subset_df <- read_sas("large_file.sas7bdat", 
                      col_select = c(USUBJID, PRIMARY_ENDPOINT, TREATMENT))

# Monitor memory usage
cat("Memory usage:", format(pryr::mem_used(), units = "MB"), "\n")

R Code Explanation

read_sas() from the haven package reads SAS datasets directly into R.
The col_select argument allows you to specify only the columns you need, reducing memory usage and improving speed.
cat() prints the current memory usage using pryr::mem_used(), which is helpful for monitoring resource consumption when working with large files.

6.3 Format and Label Issues

Issue: Value labels not appearing correctly
Solution: Check format catalog import and apply as_factor()

# Verify format catalog import
df <- read_sas("data.sas7bdat", catalog_file = "formats.sas7bcat")

# Check if labels are preserved
sapply(df, function(x) !is.null(attr(x, "labels")))

# Convert to factors for display
df_factored <- df %>% mutate_if(is.labelled, as_factor)