contact@a2zlearners.com

1.5. SAS PROCEDURES -> Equivalent in R

1.5.8. PROC COMPARE in SAS vs R Equivalents

This guide compares SAS and R approaches for comparing datasets/dataframes, highlighting key capabilities, syntactic differences, and advanced techniques for both languages.

1. Basic Dataset Comparison

Capability SAS (PROC COMPARE) R
Compare two datasets PROC COMPARE base=ds1 compare=ds2; diffdf::diffdf(df1, df2)
Basic output Default summary report Console text output
Visual comparison Basic text output diffdf, daff::render_diff()
Key-based comparison ID statement keys parameter
Return value Log output List object with differences

SAS Example

/* Create two datasets for comparison */
data adverse_events_orig;
  input USUBJID $ AEDECOD $ AESTDT :date9. AEENDT :date9.;
  format AESTDT AEENDT date9.;
  datalines;
001-001 HEADACHE 10JAN2023 15JAN2023
001-002 FATIGUE 12JAN2023 19JAN2023
001-003 NAUSEA 15JAN2023 18JAN2023
001-004 FEVER 20JAN2023 22JAN2023
;
run;

data adverse_events_new;
  input USUBJID $ AEDECOD $ AESTDT :date9. AEENDT :date9.;
  format AESTDT AEENDT date9.;
  datalines;
001-001 HEADACHE 10JAN2023 15JAN2023
001-002 FATIGUE 14JAN2023 19JAN2023
001-003 NAUSEA 15JAN2023 20JAN2023
001-005 DIZZINESS 25JAN2023 28JAN2023
;
run;

/* Basic comparison */
proc compare base=adverse_events_orig compare=adverse_events_new;
run;

/* Comparison with specific ID variable */
proc compare base=adverse_events_orig compare=adverse_events_new;
  id USUBJID;
run;

Explanation (SAS):

  • PROC COMPARE compares two datasets (base and compare)
  • The default output shows differences in variables, observations, and values
  • ID statement specifies which variable(s) to use for matching observations
  • The report shows added/deleted observations, matching observations with differences
  • SAS generates a detailed text report with several sections

R Example

library(diffdf)
library(lubridate)

# Create two dataframes for comparison
adverse_events_orig <- data.frame(
  USUBJID = c("001-001", "001-002", "001-003", "001-004"),
  AEDECOD = c("HEADACHE", "FATIGUE", "NAUSEA", "FEVER"),
  AESTDT = as.Date(c("2023-01-10", "2023-01-12", "2023-01-15", "2023-01-20")),
  AEENDT = as.Date(c("2023-01-15", "2023-01-19", "2023-01-18", "2023-01-22"))
)

adverse_events_new <- data.frame(
  USUBJID = c("001-001", "001-002", "001-003", "001-005"),
  AEDECOD = c("HEADACHE", "FATIGUE", "NAUSEA", "DIZZINESS"),
  AESTDT = as.Date(c("2023-01-10", "2023-01-14", "2023-01-15", "2023-01-25")),
  AEENDT = as.Date(c("2023-01-15", "2023-01-19", "2023-01-20", "2023-01-28"))
)

# Basic comparison
diffdf(adverse_events_orig, adverse_events_new)

# Comparison with specific key variables
diffdf(adverse_events_orig, adverse_events_new, keys = "USUBJID")

# Store results for further processing
diff_results <- diffdf(adverse_events_orig, adverse_events_new, 
                      keys = "USUBJID")

Explanation (R):

  • diffdf package provides functionality similar to PROC COMPARE
  • diffdf() function compares two dataframes
  • The keys parameter identifies variables for matching observations
  • Results are printed to console by default but can be stored as an object
  • The output shows differences in variables, rows, and values

Input Data (Base/Original Dataset):

USUBJID AEDECOD AESTDT AEENDT
001-001 HEADACHE 2023-01-10 2023-01-15
001-002 FATIGUE 2023-01-12 2023-01-19
001-003 NAUSEA 2023-01-15 2023-01-18
001-004 FEVER 2023-01-20 2023-01-22

Input Data (Compare/New Dataset):

USUBJID AEDECOD AESTDT AEENDT
001-001 HEADACHE 2023-01-10 2023-01-15
001-002 FATIGUE 2023-01-14 2023-01-19
001-003 NAUSEA 2023-01-15 2023-01-20
001-005 DIZZINESS 2023-01-25 2023-01-28

Expected Output:

Differences found between the objects!

A summary is given below.

There are rows in BASE that are not in COMPARE!
All rows are shown in table below

==============================================
  USUBJID      AEDECOD      AESTDT    AEENDT  
----------------------------------------------
 001-004       FEVER     2023-01-20 2023-01-22
----------------------------------------------

There are rows in COMPARE that are not in BASE!
All rows are shown in table below

================================================
  USUBJID      AEDECOD        AESTDT     AEENDT  
------------------------------------------------
 001-005     DIZZINESS    2023-01-25  2023-01-28
------------------------------------------------

Not all Values Compared Equal
All rows are shown in table below

=============================
 Variable  No of Differences 
-----------------------------
  AESTDT          1         
  AEENDT          1         
-----------------------------


All rows are shown in table below

=================================================================
 VARIABLE   USUBJID   AEDECOD    BASE         COMPARE            
-----------------------------------------------------------------
  AESTDT    001-002   FATIGUE   2023-01-12   2023-01-14         
  AEENDT    001-003   NAUSEA    2023-01-18   2023-01-20         
-----------------------------------------------------------------

2. Comparison Options and Criteria

Capability SAS (PROC COMPARE) R
Compare specific variables VAR statement vars parameter
Exclude variables EXCLUDE statement exclude_vars parameter
Case sensitivity IGNORECASE option ignore.case parameter
Numeric tolerance CRITERION=value tol parameter
Suppress output NOSUMMARY option print = FALSE
Detailed report LISTALL option keep_unchanged_rows = TRUE

SAS Example

/* Compare specific variables with options */
proc compare base=adverse_events_orig compare=adverse_events_new
            listall criterion=0.01 ignorecase;
  id USUBJID;
  var AEDECOD AESTDT;
  exclude AEENDT;
run;

/* Return comparison information to DATA step */
proc compare base=adverse_events_orig compare=adverse_events_new 
            out=comparison_results outbase outcomp outdif
            noprint;
  id USUBJID;
run;

/* Check the comparison return code */
%macro check_compare;
  %if &sysinfo = 0 %then %put Datasets are identical;
  %else %if %eval(&sysinfo & 1) %then %put Base dataset has observations not in compare dataset;
  %else %if %eval(&sysinfo & 2) %then %put Compare dataset has observations not in base dataset;
  %else %if %eval(&sysinfo & 4) %then %put Variables have unequal values;
%mend;

%check_compare;

Explanation (SAS):

  • VAR statement specifies which variables to compare
  • EXCLUDE statement specifies variables to omit from comparison
  • CRITERION option specifies tolerance for numeric comparisons
  • IGNORECASE option makes string comparisons case-insensitive
  • LISTALL option shows all matching observations
  • OUT= creates a dataset with comparison results
  • OUTBASE, OUTCOMP, and OUTDIF create datasets with unmatched and differing observations
  • &SYSINFO macro variable contains bit flags indicating comparison results

R Example

# Load required libraries
library(diffdf)
library(lubridate)
library(dplyr)   # for advanced summary later

# ================================
# 1. Create example dataframes
# ================================
adverse_events_orig <- data.frame(
  USUBJID = c("001-001", "001-002", "001-003", "001-004"),
  AEDECOD = c("HEADACHE", "FATIGUE", "NAUSEA", "FEVER"),
  AESTDT  = as.Date(c("2023-01-10", "2023-01-12", "2023-01-15", "2023-01-20")),
  AEENDT  = as.Date(c("2023-01-15", "2023-01-19", "2023-01-18", "2023-01-22"))
)

adverse_events_new <- data.frame(
  USUBJID = c("001-001", "001-002", "001-003", "001-005"),
  AEDECOD = c("HEADACHE", "FATIGUE", "NAUSEA", "DIZZINESS"),
  AESTDT  = as.Date(c("2023-01-10", "2023-01-14", "2023-01-15", "2023-01-25")),
  AEENDT  = as.Date(c("2023-01-15", "2023-01-19", "2023-01-20", "2023-01-28"))
)

# ================================
# 2. Compare only selected columns
# ================================
# Example: Compare USUBJID, AEDECOD, AESTDT only (exclude AEENDT)
myvars <- c("USUBJID", "AEDECOD", "AESTDT")
result_selected <- diffdf(
  adverse_events_orig[myvars],
  adverse_events_new[myvars],
  keys = "USUBJID"
)

# ================================
# 3. Exclude columns (e.g., AEENDT)
# ================================
exclude <- "AEENDT"
vars <- setdiff(names(adverse_events_orig), exclude)
result_exclude <- diffdf(
  adverse_events_orig[vars],
  adverse_events_new[vars],
  keys = "USUBJID"
)

# ================================
# 4. Ignore case for a variable
# ================================
adverse_events_orig2 <- adverse_events_orig
adverse_events_new2 <- adverse_events_new
adverse_events_orig2$AEDECOD <- tolower(adverse_events_orig2$AEDECOD)
adverse_events_new2$AEDECOD <- tolower(adverse_events_new2$AEDECOD)
result_ignore_case <- diffdf(
  adverse_events_orig2,
  adverse_events_new2,
  keys = "USUBJID"
)
# Note: This mutates data, so use copies.

# ================================
# 5. Show unchanged, changed, and unique rows (full status)
# ================================
# Merge for row-by-row inspection
full_joined <- merge(
  adverse_events_orig,
  adverse_events_new,
  by = "USUBJID",
  suffixes = c("_orig", "_new"),
  all = TRUE
)

# For clarity, annotate row status:
full_joined$status <- with(full_joined,
  ifelse(
    is.na(AEDECOD_orig), "Only_in_new",
    ifelse(is.na(AEDECOD_new), "Only_in_orig",
      ifelse(
        AEDECOD_orig == AEDECOD_new &
        AESTDT_orig  == AESTDT_new  &
        AEENDT_orig  == AEENDT_new, "Unchanged", "Changed"
      )
    )
  )
)
# View the detailed summary
print(full_joined)

# ================================
# 6. (Optional) dplyr approach for summary
# ================================
# Fast status summary with dplyr (requires dplyr)
summary_df <- full_join(
  adverse_events_orig, adverse_events_new,
  by = "USUBJID",
  suffix = c("_orig", "_new")
) %>%
  mutate(
    status = case_when(
      is.na(AEDECOD_orig) ~ "Only_in_new",
      is.na(AEDECOD_new) ~ "Only_in_orig",
      AEDECOD_orig == AEDECOD_new &
        AESTDT_orig == AESTDT_new &
        AEENDT_orig == AEENDT_new ~ "Unchanged",
      TRUE ~ "Changed"
    )
  )

print(summary_df)

Explanation (R):

  • diffdf() parameters control comparison details:
    • vars specifies which variables to include
    • exclude_vars specifies variables to omit
    • tol sets tolerance for numeric comparisons
    • ignore.case makes string comparisons case-insensitive
    • keep_unchanged_rows includes matching observations in output
  • arsenal::comparedf() is an alternative providing similar functionality
  • waldo::compare() offers precise, detailed comparisons
  • Several packages offer comparison functions with different strengths

Input Data (Same as Previous Example)

Expected Output:

> print(summary_df)
  USUBJID AEDECOD_orig AESTDT_orig AEENDT_orig AEDECOD_new AESTDT_new AEENDT_new       status
1 001-001     HEADACHE  2023-01-10  2023-01-15    HEADACHE 2023-01-10 2023-01-15    Unchanged
2 001-002      FATIGUE  2023-01-12  2023-01-19     FATIGUE 2023-01-14 2023-01-19      Changed
3 001-003       NAUSEA  2023-01-15  2023-01-18      NAUSEA 2023-01-15 2023-01-20      Changed
4 001-004        FEVER  2023-01-20  2023-01-22        <NA>       <NA>       <NA> Only_in_orig
5 001-005         <NA>        <NA>        <NA>   DIZZINESS 2023-01-25 2023-01-28  Only_in_new

3. Visual and Interactive Comparisons

Capability SAS (PROC COMPARE) R
Visual output Limited text-based daff::render_diff(), HTML/interactive
Highlighted diffs Not built-in Color-coded in HTML output
Export differences Through OUT= datasets Multiple export options
Interactive exploration Not built-in Shiny apps, interactive tables

SAS Example

/* Generate HTML report of differences */
ods html file="comparison_report.html";

proc compare base=adverse_events_orig compare=adverse_events_new;
  id USUBJID;
run;

ods html close;

/* Custom report creation */
data compare_results;
  merge adverse_events_orig(in=a) 
        adverse_events_new(in=b rename=(AESTDT=AESTDT_new AEENDT=AEENDT_new));
  by USUBJID;
  in_orig = a;
  in_new = b;
  
  /* Flag differences */
  if a and b then do;
    if AESTDT ne AESTDT_new then AESTDT_diff = 1;
    if AEENDT ne AEENDT_new then AEENDT_diff = 1;
  end;
run;

proc print data=compare_results;
  var USUBJID in_orig in_new AEDECOD AESTDT AESTDT_new AESTDT_diff 
      AEENDT AEENDT_new AEENDT_diff;
run;

Explanation (SAS):

  • SAS provides primarily text-based comparison output
  • Custom DATA step merging can create more customized comparison datasets
  • Using ODS HTML generates HTML output, but without special formatting for differences
  • Flags can be created to identify differences for further processing
  • Basic reports can be created with PROC PRINT

R Example


library(daff)
library(DT)
library(dplyr)
library(htmltools)

# Using daff for visual comparison
diff_result <- daff::diff_data(adverse_events_orig, adverse_events_new,
                               )

# Render as HTML with color-coding
diff_html <- daff::render_diff(diff_result)
print(htmltools::HTML(diff_html))

# Export to HTML file
cat(diff_html, file = "comparison_report.html")

# Create interactive datatable for exploration
comparison_df <- bind_rows(
  adverse_events_orig %>% mutate(source = "Original"),
  adverse_events_new %>% mutate(source = "New")
) %>%
  arrange(USUBJID)

# Display as interactive table
datatable(comparison_df, 
          filter = 'top', 
          extensions = c('Buttons', 'ColReorder'),
          options = list(
            dom = 'Bfrtip',
            buttons = c('copy', 'csv', 'excel'),
            colReorder = TRUE
          ))

create_comparison_report <- function(df1, df2, key_cols, filename = NULL) {
  # Ensure key columns exist
  if(!all(key_cols %in% names(df1)) || !all(key_cols %in% names(df2))) {
    stop("Key columns must exist in both dataframes")
  }
  
  # For clarity, get a list of columns to compare (non-key intersection)
  value_cols <- intersect(setdiff(names(df1), key_cols), setdiff(names(df2), key_cols))
  
  # Modified dataframes for join
  df1_mod <- df1 %>%
    rename_with(~paste0(.x, "_orig"), -all_of(key_cols))
  df2_mod <- df2 %>%
    rename_with(~paste0(.x, "_new"), -all_of(key_cols))
  
  # Full outer join
  comparison <- full_join(df1_mod, df2_mod, by = key_cols)
  
  # Compare all matched value columns
  for(col in value_cols) {
    orig_col <- paste0(col, "_orig")
    new_col <- paste0(col, "_new")
    diff_col <- paste0(col, "_diff")
    
    # Column might not exist in both dataframes for all rows (due to join)
    comparison[[diff_col]] <- ifelse(
      is.na(comparison[[orig_col]]) & is.na(comparison[[new_col]]), FALSE,
      is.na(comparison[[orig_col]]) | is.na(comparison[[new_col]]) | (comparison[[orig_col]] != comparison[[new_col]])
    )
  }
  
  # Optionally, export
  if(!is.null(filename)) {
    write.csv(comparison, filename, row.names = FALSE)
  }
  
  return(comparison)
}

# Usage:
detailed_comparison <- create_comparison_report(
  adverse_events_orig, 
  adverse_events_new, 
  key_cols = "USUBJID", 
  filename = "detailed_comparison.csv"
)

print(detailed_comparison)
getwd()


datatable(
  detailed_comparison,
  options = list(
    pageLength = 50,
    rowCallback = JS(
      "function(row, data, index) {",
      "  for (var i=0; i<data.length; i++) {",
      "    if (data[i] === true || data[i] === 'TRUE') {",
      "      $('td:eq('+i+')', row).css('background-color', '#ffcccc');",
      "    }",
      "  }",
      "}"
    )
  )
)

Explanation (R):

  • daff package provides visualization of differences with color coding
  • DT::datatable() creates interactive tables for exploring differences
  • Custom functions can create detailed comparison dataframes
  • HTML output with color highlighting makes differences easy to spot
  • Multiple output formats available (HTML, CSV, Excel)
  • Interactive tables allow filtering, sorting, and exporting results

Expected Output: The visual output in R would highlight differences with colors:

  • Added rows in green
  • Deleted rows in red
  • Changed values highlighted with yellow background
  • Interactive tables allow sorting, filtering, and exploration

4. Beyond Basics: Complex Comparison Scenarios

Capability SAS (PROC COMPARE) R
Complex key matching Multiple ID variables Multiple keys variables
Custom comparators Limited built-in Custom functions
Comparison of list objects Not applicable listcompr package
Handling large datasets Dataset options Chunking/parallel processing
Structure vs. content Limited options Separate structural/content comparison

SAS Example

/* Complex comparison with multiple options */
proc compare base=adverse_events_orig compare=adverse_events_new
            criterion=0.01 
            method=absolute /* Use absolute differences */
            briefsummary /* Shorter summary */
            allstats /* All comparison statistics */
            allobs; /* Compare all observations */
  id USUBJID;
  var AESTDT AEENDT;
run;

/* Compare variable attributes */
proc compare base=adverse_events_orig compare=adverse_events_new attlenmac;
  var _all_;
run;

/* Handling large datasets */
%let memsize = 8G;
options compress=yes;

proc compare base=big_dataset(obs=1000) compare=new_big_dataset(obs=1000);
  id patient_id;
run;

Explanation (SAS):

  • CRITERION and METHOD control how numeric differences are calculated
  • BRIEFSUMMARY and ALLSTATS control output detail level
  • ALLOBS forces comparison of all observations
  • ATTLENMAC compares variable attributes
  • Dataset options like OBS= can limit comparison for large datasets
  • System options like MEMSIZE and COMPRESS help with large data

R Example

library(diffdf)
library(fuzzyjoin)
library(tictoc)
library(parallel)

adverse_events_orig <- data.frame(
  USUBJID = c("001-001", "001-002", "001-003", "001-004"),
  AEDECOD = c("HEADACHE", "FATIGUE", "NAUSEA", "FEVER"),
  AESTDT  = as.Date(c("2023-01-10", "2023-01-12", "2023-01-15", "2023-01-20")),
  AEENDT  = as.Date(c("2023-01-15", "2023-01-19", "2023-01-18", "2023-01-22"))
)

adverse_events_new <- data.frame(
  USUBJID = c("001-001", "001-002", "001-003", "001-005"),
  AEDECOD = c("HEADACHE", "FATIGUE", "NAUSEA", "DIZZINESS"),
  AESTDT  = as.Date(c("2023-01-10", "2023-01-14", "2023-01-15", "2023-01-25")),
  AEENDT  = as.Date(c("2023-01-15", "2023-01-19", "2023-01-20", "2023-01-28"))
)

# Custom comparison function with tolerance
compare_with_tolerance <- function(df1, df2, keys, numeric_vars, tol = 0.01) {
  # Perform basic comparison
  basic_diff <- diffdf(df1, df2, keys = keys)
  
  # Create custom tolerance comparison for numeric variables
  custom_results <- list()
  
  # For each key combination
  unique_keys <- unique(rbind(
    df1[, keys, drop = FALSE],
    df2[, keys, drop = FALSE]
  ))
  
  for(i in 1:nrow(unique_keys)) {
    key_vals <- unique_keys[i, , drop = FALSE]
    
    # Find matching rows
    df1_match <- merge(key_vals, df1)
    df2_match <- merge(key_vals, df2)
    
    if(nrow(df1_match) > 0 && nrow(df2_match) > 0) {
      # Compare numeric variables with tolerance
      for(var in numeric_vars) {
        if(var %in% names(df1_match) && var %in% names(df2_match)) {
          val1 <- df1_match[[var]]
          val2 <- df2_match[[var]]
          
          if(!is.na(val1) && !is.na(val2)) {
            diff <- abs(val1 - val2)
            if(diff <= tol) {
              # These values are considered equal with tolerance
              # Remove from basic differences if present
              # This is simplified; actual implementation would be more complex
            }
          }
        }
      }
    }
  }
  
  # Return enhanced results
  return(basic_diff)
}

fuzzy_compare <- function(df1, df2, keys, text_vars, max_dist = 2) {
  stopifnot(is.data.frame(df1), is.data.frame(df2))
  stopifnot(all(keys %in% names(df1)), all(keys %in% names(df2)))
  stopifnot(all(text_vars %in% names(df1)), all(text_vars %in% names(df2)))
  
  # Find common keys
  matched_keys <- dplyr::inner_join(
    df1[, keys, drop = FALSE],
    df2[, keys, drop = FALSE],
    by = keys
  )
  
  results <- list()
  
  for(i in seq_len(nrow(matched_keys))) {
    key_vals <- matched_keys[i, , drop = FALSE]
    row1 <- dplyr::inner_join(key_vals, df1, by = keys)
    row2 <- dplyr::inner_join(key_vals, df2, by = keys)
    
    for(var in text_vars) {
      val1 <- as.character(row1[[var]])
      val2 <- as.character(row2[[var]])
      
      # Only compare if not both NA and not exactly identical
      if (!(is.na(val1) && is.na(val2)) && (val1 != val2)) {
        # Fuzzy string distance on lower/trimmed (to allow for case or typo)
        text1 <- tolower(trimws(val1))
        text2 <- tolower(trimws(val2))
        dist_val <- as.integer(adist(text1, text2)[1])
        if (dist_val <= max_dist) {
          results[[length(results) + 1]] <- data.frame(
            as.list(key_vals),
            variable = var,
            value1 = val1,
            value2 = val2,
            distance = dist_val,
            stringsAsFactors = FALSE
          )
        }
      }
    }
  }
  if(length(results) > 0) {
    final_df <- bind_rows(results)
    rownames(final_df) <- NULL
    return(final_df)
  } else {
    return(data.frame())
  }
}

# Example usage
adverse_events_orig_extended <- adverse_events_orig
adverse_events_new_extended <- adverse_events_new

# Add some columns to demonstrate complex comparison
adverse_events_orig_extended$AESEV <- c("MILD", "MODERATE", "MILD", "SEVERE")
adverse_events_new_extended$AESEV <- c("MILD", "MODERATE", "mild", "SEVERE")

adverse_events_orig_extended$AEVALUE <- c(0.5, 1.2, 0.8, 2.1)
adverse_events_new_extended$AEVALUE <- c(0.51, 1.2, 0.79, 1.98)

# Custom comparison with tolerance
custom_comparison <- compare_with_tolerance(
  adverse_events_orig_extended, 
  adverse_events_new_extended,
  keys = "USUBJID",
  numeric_vars = c("AEVALUE"),
  tol = 0.05
)

print(custom_comparison)

# Fuzzy text comparison
text_diffs <- fuzzy_compare(
  adverse_events_orig_extended,
  adverse_events_new_extended,
  keys = "USUBJID",
  text_vars = c("AESEV"),
  max_dist = 2
)

# Display fuzzy matches
print(text_diffs)

Explanation (R):

  • Custom functions can implement specific comparison logic
  • fuzzyjoin allows approximate matching of text values
  • parallel package enables multi-core processing for large datasets
  • Chunking approach allows comparing large datasets without memory issues
  • Custom tolerances can be applied to specific numeric variables
  • Fuzzy text matching identifies minor spelling differences

Input Data (Extended for Advanced Examples): Original dataset with additional columns:

USUBJID AEDECOD AESTDT AEENDT AESEV AEVALUE
001-001 HEADACHE 2023-01-10 2023-01-15 MILD 0.5
001-002 FATIGUE 2023-01-12 2023-01-19 MODERATE 1.2
001-003 NAUSEA 2023-01-15 2023-01-18 MILD 0.8
001-004 FEVER 2023-01-20 2023-01-22 SEVERE 2.1

New dataset with additional columns:

USUBJID AEDECOD AESTDT AEENDT AESEV AEVALUE
001-001 HEADACHE 2023-01-10 2023-01-15 MILD 0.51
001-002 FATIGUE 2023-01-14 2023-01-19 MODERATE 1.2
001-003 NAUSEA 2023-01-15 2023-01-20 mild 0.79
001-005 DIZZINESS 2023-01-25 2023-01-28 SEVERE 1.98

Expected Output of Fuzzy Text Comparison:

print(custom_comparison)
Differences found between the objects!

Summary of BASE and COMPARE
  ====================================
    PROPERTY      BASE        COMP    
  ------------------------------------
      Name        df1         df2     
     Class     data.frame  data.frame 
    Rows(#)        4           4      
   Columns(#)      6           6      
  ------------------------------------


There are rows in BASE that are not in COMPARE !!
  =========
   USUBJID 
  ---------
   001-004 
  ---------


There are rows in COMPARE that are not in BASE !!
  =========
   USUBJID 
  ---------
   001-005 
  ---------


Not all Values Compared Equal
  =============================
   Variable  No of Differences 
  -----------------------------
  AESTDT          1         
  AEENDT          1         
  AESEV            1         
 AEVALUE           2         
  -----------------------------


  ===========================================
   VARIABLE  USUBJID     BASE      COMPARE   
  -------------------------------------------
    AESTDT   001-002  2023-01-12  2023-01-14 
  -------------------------------------------


  ===========================================
   VARIABLE  USUBJID     BASE      COMPARE   
  -------------------------------------------
    AEENDT   001-003  2023-01-18  2023-01-20 
  -------------------------------------------


  ==================================
   VARIABLE  USUBJID  BASE  COMPARE 
  ----------------------------------
    AESEV    001-003  MILD   mild   
  ----------------------------------


  ==================================
   VARIABLE  USUBJID  BASE  COMPARE 
  ----------------------------------
   AEVALUE   001-001  0.5    0.51   
   AEVALUE   001-003  0.8    0.79   
  ----------------------------------

> print(text_diffs)
  USUBJID variable value1 value2 distance
1 001-003    AESEV   MILD   mild        0

5. Comparing Dataset Structures

Capability SAS (PROC COMPARE) R
Variable types comparison ATTLEN option identical(sapply(df1, class), sapply(df2, class))
Variables existence Default output setdiff(names(df1), names(df2))
Metadata comparison Limited options compare_df_meta()
Labels and formats ATTRCHAR option Custom attributes comparison
Structure-only comparison NOVALUES option Structure-focused functions

SAS Example

/* Compare structure without values */
proc compare base=adverse_events_orig compare=adverse_events_new
            novalues /* Don't compare data values */
            outnoequal /* Output variables that aren't equal */
            attlen /* Compare lengths */
            attrchar; /* Compare character attributes */
  var _all_; 
run;

/* Create dataset with metadata */
proc contents data=adverse_events_orig out=ae_orig_meta; run;
proc contents data=adverse_events_new out=ae_new_meta; run;

/* Compare metadata */
proc compare base=ae_orig_meta compare=ae_new_meta;
  id name;
  var type length format label;
run;

Explanation (SAS):

  • NOVALUES option skips comparing actual values
  • ATTLEN and ATTRCHAR compare variable attributes
  • PROC CONTENTS with OUT= creates metadata datasets
  • These can be compared to find structural differences
  • Useful for validating dataset consistency

R Example


library(diffdf)
library(dplyr)
library(tibble)

# Function to compare dataframe structures
compare_structure <- function(df1, df2, name1 = "DF1", name2 = "DF2") {
  result <- list()
  
  # Compare column names
  cols1 <- names(df1)
  cols2 <- names(df2)
  
  result$columns_only_in_1 <- setdiff(cols1, cols2)
  result$columns_only_in_2 <- setdiff(cols2, cols1)
  result$common_columns <- intersect(cols1, cols2)
  
  # Compare column types for common columns
  if(length(result$common_columns) > 0) {
    types1 <- sapply(df1[result$common_columns], class)
    types2 <- sapply(df2[result$common_columns], class)
    
    # Find type differences
    type_diff <- which(types1 != types2)
    if(length(type_diff) > 0) {
      result$type_differences <- data.frame(
        column = names(type_diff),
        type1 = types1[type_diff],
        type2 = types2[type_diff],
        stringsAsFactors = FALSE
      )
    } else {
      result$type_differences <- data.frame()
    }
  }
  
  # Compare dimensions
  result$dim1 <- dim(df1)
  result$dim2 <- dim(df2)
  
  # Compare attributes (focusing on column-level attributes)
  result$attribute_differences <- list()
  
  for(col in result$common_columns) {
    attr1 <- attributes(df1[[col]])
    attr2 <- attributes(df2[[col]])
    
    # Remove class as it's already compared
    attr1$class <- NULL
    attr2$class <- NULL
    
    # Compare remaining attributes
    if(!identical(attr1, attr2)) {
      result$attribute_differences[[col]] <- list(
        name1 = attr1,
        name2 = attr2
      )
    }
  }
  
  # Create summary report
  cat("Structure Comparison:", name1, "vs", name2, "\n")
  cat("---------------------------------------\n")
  cat("Dimensions:", paste(result$dim1, collapse = "x"), "vs", 
      paste(result$dim2, collapse = "x"), "\n")
  
  if(length(result$columns_only_in_1) > 0) {
    cat("\nColumns only in", name1, ":\n")
    cat(paste(" -", result$columns_only_in_1), sep = "\n")
  }
  
  if(length(result$columns_only_in_2) > 0) {
    cat("\nColumns only in", name2, ":\n")
    cat(paste(" -", result$columns_only_in_2), sep = "\n")
  }
  
  if(nrow(result$type_differences) > 0) {
    cat("\nColumn type differences:\n")
    print(result$type_differences)
  }
  
  if(length(result$attribute_differences) > 0) {
    cat("\nColumn attribute differences found in", 
        length(result$attribute_differences), "columns\n")
  }
  
  # Return the full result object for further processing
  invisible(result)
}

# Create dataframes with structure differences
ae_orig_modified <- adverse_events_orig
ae_new_modified <- adverse_events_new

# Add/modify columns to demonstrate structure differences
ae_orig_modified$AESEV <- factor(c("MILD", "MODERATE", "MILD", "SEVERE"))
ae_new_modified$AESEV <- c("MILD", "MODERATE", "MILD", "SEVERE")
ae_new_modified$AECAT <- c("GI", "SYSTEMIC", "GI", "SYSTEMIC")

# Convert a date to character to create type mismatch
ae_new_modified$AESTDT <- as.character(ae_new_modified$AESTDT)

# Compare structures
structure_diff <- compare_structure(
  ae_orig_modified, 
  ae_new_modified,
  "Original", 
  "New"
)

# Alternative approach for metadata comparison
meta_compare <- function(df1, df2) {
  meta1 <- data.frame(
    variable = names(df1),
    type = sapply(df1, function(x) class(x)[1]),
    length = sapply(df1, function(x) if(is.character(x)) max(nchar(x), na.rm=TRUE) else NA),
    label = sapply(df1, function(x) { lbl <- attr(x, "label"); if(is.null(lbl)) NA_character_ else as.character(lbl) }),
    stringsAsFactors = FALSE
  )
  
  meta2 <- data.frame(
    variable = names(df2),
    type = sapply(df2, function(x) class(x)[1]),
    length = sapply(df2, function(x) if(is.character(x)) max(nchar(x), na.rm=TRUE) else NA),
    label = sapply(df2, function(x) { lbl <- attr(x, "label"); if(is.null(lbl)) NA_character_ else as.character(lbl) }),
    stringsAsFactors = FALSE
  )
  
  # Compare metadata using diffdf
  diffdf(meta1, meta2, keys = "variable")
}


# Run metadata comparison
meta_compare(ae_orig_modified, ae_new_modified)

Explanation (R):

  • compare_structure() function focuses solely on structure, not data values
  • Compares column names, types, dimensions, and column-level attributes
  • diffdf::compare_df_meta() compares metadata between dataframes
  • Custom functions can extract and compare specific structural elements
  • Useful for validating consistent data structures across datasets

Expected Output of Structure Comparison:

Structure Comparison: Original vs New 
---------------------------------------
Dimensions: 4x5 vs 4x6 

Columns only in New :
 - AECAT

Column type differences:
       column  type1     type2
AESTDT AESTDT   Date character
AESEV   AESEV factor character

Column attribute differences found in 2 columns

> meta_compare(ae_orig_modified, ae_new_modified)
Differences found between the objects!

Summary of BASE and COMPARE
  ====================================
    PROPERTY      BASE        COMP    
  ------------------------------------
      Name       meta1       meta2    
     Class     data.frame  data.frame 
    Rows(#)        5           6      
   Columns(#)      4           4      
  ------------------------------------


There are rows in COMPARE that are not in BASE !!
  ==========
   variable 
  ----------
    AECAT   
  ----------


Not all Values Compared Equal
  =============================
   Variable  No of Differences 
  -----------------------------
     type            2         
    length           3         
  -----------------------------


  =======================================
   VARIABLE  variable   BASE    COMPARE  
  ---------------------------------------
     type     AESEV    factor  character 
     type     AESTDT    Date   character 
  ---------------------------------------


  ===================================
   VARIABLE  variable  BASE  COMPARE 
  -----------------------------------
    length   AEDECOD     8      9    
    length    AESEV     NA      8    
    length    AESTDT    NA     10    
  -----------------------------------

6. Best Practices for Dataset Comparison

  • Establish Clear Comparison Goals

    • Define what constitutes a "match" vs. "difference"
    • Determine which differences matter for your analysis
    • Consider tolerance thresholds for numeric data
  • Handle Keys Properly

    • Use appropriate key variables for observation matching
    • Consider composite keys when a single field isn't unique
    • Check for duplicate keys before comparison
  • Validate Structures First

    • Compare dataset structures before comparing values
    • Handle missing variables appropriately
    • Check variable types for compatibility
  • Document Comparison Parameters

    • Record comparison options used
    • Document tolerance levels and exclusions
    • Include comparison settings in analysis metadata
  • Optimize for Large Data

    • Consider sampling for initial comparisons
    • Use chunking for large datasets
    • Apply parallel processing for complex comparisons
  • Create Meaningful Reports

    • Summarize differences in context
    • Highlight critical differences
    • Include counts and percentages for context
  • Follow Up on Differences

    • Investigate unexpected differences
    • Create standardized workflows for resolving differences
    • Document justified differences and their resolutions

**Resource download links**

1.5.8.-PROC-COMPARE-in-SAS-vs-R-Equivalents.zip