1.5.8. PROC COMPARE in SAS vs R Equivalents

1.5. SAS PROCEDURES -> Equivalent in R

1.5.8. PROC COMPARE in SAS vs R Equivalents

This guide compares SAS and R approaches for comparing datasets/dataframes, highlighting key capabilities, syntactic differences, and advanced techniques for both languages.

1. Basic Dataset Comparison

Capability	SAS (PROC COMPARE)	R
Compare two datasets	`PROC COMPARE base=ds1 compare=ds2;`	`diffdf::diffdf(df1, df2)`
Basic output	Default summary report	Console text output
Visual comparison	Basic text output	`diffdf`, `daff::render_diff()`
Key-based comparison	`ID` statement	`keys` parameter
Return value	Log output	List object with differences

SAS Example

/* Create two datasets for comparison */
data adverse_events_orig;
  input USUBJID $ AEDECOD $ AESTDT :date9. AEENDT :date9.;
  format AESTDT AEENDT date9.;
  datalines;
001-001 HEADACHE 10JAN2023 15JAN2023
001-002 FATIGUE 12JAN2023 19JAN2023
001-003 NAUSEA 15JAN2023 18JAN2023
001-004 FEVER 20JAN2023 22JAN2023
;
run;

data adverse_events_new;
  input USUBJID $ AEDECOD $ AESTDT :date9. AEENDT :date9.;
  format AESTDT AEENDT date9.;
  datalines;
001-001 HEADACHE 10JAN2023 15JAN2023
001-002 FATIGUE 14JAN2023 19JAN2023
001-003 NAUSEA 15JAN2023 20JAN2023
001-005 DIZZINESS 25JAN2023 28JAN2023
;
run;

/* Basic comparison */
proc compare base=adverse_events_orig compare=adverse_events_new;
run;

/* Comparison with specific ID variable */
proc compare base=adverse_events_orig compare=adverse_events_new;
  id USUBJID;
run;

Explanation (SAS):

PROC COMPARE compares two datasets (base and compare)
The default output shows differences in variables, observations, and values
ID statement specifies which variable(s) to use for matching observations
The report shows added/deleted observations, matching observations with differences
SAS generates a detailed text report with several sections

R Example

library(diffdf)
library(lubridate)

# Create two dataframes for comparison
adverse_events_orig <- data.frame(
  USUBJID = c("001-001", "001-002", "001-003", "001-004"),
  AEDECOD = c("HEADACHE", "FATIGUE", "NAUSEA", "FEVER"),
  AESTDT = as.Date(c("2023-01-10", "2023-01-12", "2023-01-15", "2023-01-20")),
  AEENDT = as.Date(c("2023-01-15", "2023-01-19", "2023-01-18", "2023-01-22"))
)

adverse_events_new <- data.frame(
  USUBJID = c("001-001", "001-002", "001-003", "001-005"),
  AEDECOD = c("HEADACHE", "FATIGUE", "NAUSEA", "DIZZINESS"),
  AESTDT = as.Date(c("2023-01-10", "2023-01-14", "2023-01-15", "2023-01-25")),
  AEENDT = as.Date(c("2023-01-15", "2023-01-19", "2023-01-20", "2023-01-28"))
)

# Basic comparison
diffdf(adverse_events_orig, adverse_events_new)

# Comparison with specific key variables
diffdf(adverse_events_orig, adverse_events_new, keys = "USUBJID")

# Store results for further processing
diff_results <- diffdf(adverse_events_orig, adverse_events_new, 
                      keys = "USUBJID")

Explanation (R):

diffdf package provides functionality similar to PROC COMPARE
diffdf() function compares two dataframes
The keys parameter identifies variables for matching observations
Results are printed to console by default but can be stored as an object
The output shows differences in variables, rows, and values

Input Data (Base/Original Dataset):

USUBJID	AEDECOD	AESTDT	AEENDT
001-001	HEADACHE	2023-01-10	2023-01-15
001-002	FATIGUE	2023-01-12	2023-01-19
001-003	NAUSEA	2023-01-15	2023-01-18
001-004	FEVER	2023-01-20	2023-01-22

Input Data (Compare/New Dataset):

USUBJID	AEDECOD	AESTDT	AEENDT
001-001	HEADACHE	2023-01-10	2023-01-15
001-002	FATIGUE	2023-01-14	2023-01-19
001-003	NAUSEA	2023-01-15	2023-01-20
001-005	DIZZINESS	2023-01-25	2023-01-28

Expected Output:

Differences found between the objects!

A summary is given below.

There are rows in BASE that are not in COMPARE!
All rows are shown in table below

==============================================
  USUBJID      AEDECOD      AESTDT    AEENDT  
----------------------------------------------
 001-004       FEVER     2023-01-20 2023-01-22
----------------------------------------------

There are rows in COMPARE that are not in BASE!
All rows are shown in table below

================================================
  USUBJID      AEDECOD        AESTDT     AEENDT  
------------------------------------------------
 001-005     DIZZINESS    2023-01-25  2023-01-28
------------------------------------------------

Not all Values Compared Equal
All rows are shown in table below

=============================
 Variable  No of Differences 
-----------------------------
  AESTDT          1         
  AEENDT          1         
-----------------------------


All rows are shown in table below

=================================================================
 VARIABLE   USUBJID   AEDECOD    BASE         COMPARE            
-----------------------------------------------------------------
  AESTDT    001-002   FATIGUE   2023-01-12   2023-01-14         
  AEENDT    001-003   NAUSEA    2023-01-18   2023-01-20         
-----------------------------------------------------------------

2. Comparison Options and Criteria

Capability	SAS (PROC COMPARE)	R
Compare specific variables	`VAR` statement	`vars` parameter
Exclude variables	`EXCLUDE` statement	`exclude_vars` parameter
Case sensitivity	`IGNORECASE` option	`ignore.case` parameter
Numeric tolerance	`CRITERION=value`	`tol` parameter
Suppress output	`NOSUMMARY` option	`print = FALSE`
Detailed report	`LISTALL` option	`keep_unchanged_rows = TRUE`

SAS Example

/* Compare specific variables with options */
proc compare base=adverse_events_orig compare=adverse_events_new
            listall criterion=0.01 ignorecase;
  id USUBJID;
  var AEDECOD AESTDT;
  exclude AEENDT;
run;

/* Return comparison information to DATA step */
proc compare base=adverse_events_orig compare=adverse_events_new 
            out=comparison_results outbase outcomp outdif
            noprint;
  id USUBJID;
run;

/* Check the comparison return code */
%macro check_compare;
  %if &sysinfo = 0 %then %put Datasets are identical;
  %else %if %eval(&sysinfo & 1) %then %put Base dataset has observations not in compare dataset;
  %else %if %eval(&sysinfo & 2) %then %put Compare dataset has observations not in base dataset;
  %else %if %eval(&sysinfo & 4) %then %put Variables have unequal values;
%mend;

%check_compare;

Explanation (SAS):

VAR statement specifies which variables to compare
EXCLUDE statement specifies variables to omit from comparison
CRITERION option specifies tolerance for numeric comparisons
IGNORECASE option makes string comparisons case-insensitive
LISTALL option shows all matching observations
OUT= creates a dataset with comparison results
OUTBASE, OUTCOMP, and OUTDIF create datasets with unmatched and differing observations
&SYSINFO macro variable contains bit flags indicating comparison results

R Example

# Load required libraries
library(diffdf)
library(lubridate)
library(dplyr)   # for advanced summary later

# ================================
# 1. Create example dataframes
# ================================
adverse_events_orig <- data.frame(
  USUBJID = c("001-001", "001-002", "001-003", "001-004"),
  AEDECOD = c("HEADACHE", "FATIGUE", "NAUSEA", "FEVER"),
  AESTDT  = as.Date(c("2023-01-10", "2023-01-12", "2023-01-15", "2023-01-20")),
  AEENDT  = as.Date(c("2023-01-15", "2023-01-19", "2023-01-18", "2023-01-22"))
)

adverse_events_new <- data.frame(
  USUBJID = c("001-001", "001-002", "001-003", "001-005"),
  AEDECOD = c("HEADACHE", "FATIGUE", "NAUSEA", "DIZZINESS"),
  AESTDT  = as.Date(c("2023-01-10", "2023-01-14", "2023-01-15", "2023-01-25")),
  AEENDT  = as.Date(c("2023-01-15", "2023-01-19", "2023-01-20", "2023-01-28"))
)

# ================================
# 2. Compare only selected columns
# ================================
# Example: Compare USUBJID, AEDECOD, AESTDT only (exclude AEENDT)
myvars <- c("USUBJID", "AEDECOD", "AESTDT")
result_selected <- diffdf(
  adverse_events_orig[myvars],
  adverse_events_new[myvars],
  keys = "USUBJID"
)

# ================================
# 3. Exclude columns (e.g., AEENDT)
# ================================
exclude <- "AEENDT"
vars <- setdiff(names(adverse_events_orig), exclude)
result_exclude <- diffdf(
  adverse_events_orig[vars],
  adverse_events_new[vars],
  keys = "USUBJID"
)

# ================================
# 4. Ignore case for a variable
# ================================
adverse_events_orig2 <- adverse_events_orig
adverse_events_new2 <- adverse_events_new
adverse_events_orig2$AEDECOD <- tolower(adverse_events_orig2$AEDECOD)
adverse_events_new2$AEDECOD <- tolower(adverse_events_new2$AEDECOD)
result_ignore_case <- diffdf(
  adverse_events_orig2,
  adverse_events_new2,
  keys = "USUBJID"
)
# Note: This mutates data, so use copies.

# ================================
# 5. Show unchanged, changed, and unique rows (full status)
# ================================
# Merge for row-by-row inspection
full_joined <- merge(
  adverse_events_orig,
  adverse_events_new,
  by = "USUBJID",
  suffixes = c("_orig", "_new"),
  all = TRUE
)

# For clarity, annotate row status:
full_joined$status <- with(full_joined,
  ifelse(
    is.na(AEDECOD_orig), "Only_in_new",
    ifelse(is.na(AEDECOD_new), "Only_in_orig",
      ifelse(
        AEDECOD_orig == AEDECOD_new &
        AESTDT_orig  == AESTDT_new  &
        AEENDT_orig  == AEENDT_new, "Unchanged", "Changed"
      )
    )
  )
)
# View the detailed summary
print(full_joined)

# ================================
# 6. (Optional) dplyr approach for summary
# ================================
# Fast status summary with dplyr (requires dplyr)
summary_df <- full_join(
  adverse_events_orig, adverse_events_new,
  by = "USUBJID",
  suffix = c("_orig", "_new")
) %>%
  mutate(
    status = case_when(
      is.na(AEDECOD_orig) ~ "Only_in_new",
      is.na(AEDECOD_new) ~ "Only_in_orig",
      AEDECOD_orig == AEDECOD_new &
        AESTDT_orig == AESTDT_new &
        AEENDT_orig == AEENDT_new ~ "Unchanged",
      TRUE ~ "Changed"
    )
  )

print(summary_df)

Explanation (R):

diffdf() parameters control comparison details:
- vars specifies which variables to include
- exclude_vars specifies variables to omit
- tol sets tolerance for numeric comparisons
- ignore.case makes string comparisons case-insensitive
- keep_unchanged_rows includes matching observations in output
arsenal::comparedf() is an alternative providing similar functionality
waldo::compare() offers precise, detailed comparisons
Several packages offer comparison functions with different strengths

Input Data (Same as Previous Example)

Expected Output:

> print(summary_df)
  USUBJID AEDECOD_orig AESTDT_orig AEENDT_orig AEDECOD_new AESTDT_new AEENDT_new       status
1 001-001     HEADACHE  2023-01-10  2023-01-15    HEADACHE 2023-01-10 2023-01-15    Unchanged
2 001-002      FATIGUE  2023-01-12  2023-01-19     FATIGUE 2023-01-14 2023-01-19      Changed
3 001-003       NAUSEA  2023-01-15  2023-01-18      NAUSEA 2023-01-15 2023-01-20      Changed
4 001-004        FEVER  2023-01-20  2023-01-22        <NA>       <NA>       <NA> Only_in_orig
5 001-005         <NA>        <NA>        <NA>   DIZZINESS 2023-01-25 2023-01-28  Only_in_new

3. Visual and Interactive Comparisons

Capability	SAS (PROC COMPARE)	R
Visual output	Limited text-based	`daff::render_diff()`, HTML/interactive
Highlighted diffs	Not built-in	Color-coded in HTML output
Export differences	Through `OUT=` datasets	Multiple export options
Interactive exploration	Not built-in	Shiny apps, interactive tables

SAS Example

/* Generate HTML report of differences */
ods html file="comparison_report.html";

proc compare base=adverse_events_orig compare=adverse_events_new;
  id USUBJID;
run;

ods html close;

/* Custom report creation */
data compare_results;
  merge adverse_events_orig(in=a) 
        adverse_events_new(in=b rename=(AESTDT=AESTDT_new AEENDT=AEENDT_new));
  by USUBJID;
  in_orig = a;
  in_new = b;
  
  /* Flag differences */
  if a and b then do;
    if AESTDT ne AESTDT_new then AESTDT_diff = 1;
    if AEENDT ne AEENDT_new then AEENDT_diff = 1;
  end;
run;

proc print data=compare_results;
  var USUBJID in_orig in_new AEDECOD AESTDT AESTDT_new AESTDT_diff 
      AEENDT AEENDT_new AEENDT_diff;
run;

Explanation (SAS):

SAS provides primarily text-based comparison output
Custom DATA step merging can create more customized comparison datasets
Using ODS HTML generates HTML output, but without special formatting for differences
Flags can be created to identify differences for further processing
Basic reports can be created with PROC PRINT

R Example


library(daff)
library(DT)
library(dplyr)
library(htmltools)

# Using daff for visual comparison
diff_result <- daff::diff_data(adverse_events_orig, adverse_events_new,
                               )

# Render as HTML with color-coding
diff_html <- daff::render_diff(diff_result)
print(htmltools::HTML(diff_html))

# Export to HTML file
cat(diff_html, file = "comparison_report.html")

# Create interactive datatable for exploration
comparison_df <- bind_rows(
  adverse_events_orig %>% mutate(source = "Original"),
  adverse_events_new %>% mutate(source = "New")
) %>%
  arrange(USUBJID)

# Display as interactive table
datatable(comparison_df, 
          filter = 'top', 
          extensions = c('Buttons', 'ColReorder'),
          options = list(
            dom = 'Bfrtip',
            buttons = c('copy', 'csv', 'excel'),
            colReorder = TRUE
          ))

create_comparison_report <- function(df1, df2, key_cols, filename = NULL) {
  # Ensure key columns exist
  if(!all(key_cols %in% names(df1)) || !all(key_cols %in% names(df2))) {
    stop("Key columns must exist in both dataframes")
  }
  
  # For clarity, get a list of columns to compare (non-key intersection)
  value_cols <- intersect(setdiff(names(df1), key_cols), setdiff(names(df2), key_cols))
  
  # Modified dataframes for join
  df1_mod <- df1 %>%
    rename_with(~paste0(.x, "_orig"), -all_of(key_cols))
  df2_mod <- df2 %>%
    rename_with(~paste0(.x, "_new"), -all_of(key_cols))
  
  # Full outer join
  comparison <- full_join(df1_mod, df2_mod, by = key_cols)
  
  # Compare all matched value columns
  for(col in value_cols) {
    orig_col <- paste0(col, "_orig")
    new_col <- paste0(col, "_new")
    diff_col <- paste0(col, "_diff")
    
    # Column might not exist in both dataframes for all rows (due to join)
    comparison[[diff_col]] <- ifelse(
      is.na(comparison[[orig_col]]) & is.na(comparison[[new_col]]), FALSE,
      is.na(comparison[[orig_col]]) | is.na(comparison[[new_col]]) | (comparison[[orig_col]] != comparison[[new_col]])
    )
  }
  
  # Optionally, export
  if(!is.null(filename)) {
    write.csv(comparison, filename, row.names = FALSE)
  }
  
  return(comparison)
}

# Usage:
detailed_comparison <- create_comparison_report(
  adverse_events_orig, 
  adverse_events_new, 
  key_cols = "USUBJID", 
  filename = "detailed_comparison.csv"
)

print(detailed_comparison)
getwd()


datatable(
  detailed_comparison,
  options = list(
    pageLength = 50,
    rowCallback = JS(
      "function(row, data, index) {",
      "  for (var i=0; i<data.length; i++) {",
      "    if (data[i] === true || data[i] === 'TRUE') {",
      "      $('td:eq('+i+')', row).css('background-color', '#ffcccc');",
      "    }",
      "  }",
      "}"
    )
  )
)

Explanation (R):

daff package provides visualization of differences with color coding
DT::datatable() creates interactive tables for exploring differences
Custom functions can create detailed comparison dataframes
HTML output with color highlighting makes differences easy to spot
Multiple output formats available (HTML, CSV, Excel)
Interactive tables allow filtering, sorting, and exporting results

Expected Output: The visual output in R would highlight differences with colors:

Added rows in green
Deleted rows in red
Changed values highlighted with yellow background
Interactive tables allow sorting, filtering, and exploration

4. Beyond Basics: Complex Comparison Scenarios

Capability	SAS (PROC COMPARE)	R
Complex key matching	Multiple `ID` variables	Multiple `keys` variables
Custom comparators	Limited built-in	Custom functions
Comparison of list objects	Not applicable	`listcompr` package
Handling large datasets	Dataset options	Chunking/parallel processing
Structure vs. content	Limited options	Separate structural/content comparison

SAS Example

/* Complex comparison with multiple options */
proc compare base=adverse_events_orig compare=adverse_events_new
            criterion=0.01 
            method=absolute /* Use absolute differences */
            briefsummary /* Shorter summary */
            allstats /* All comparison statistics */
            allobs; /* Compare all observations */
  id USUBJID;
  var AESTDT AEENDT;
run;

/* Compare variable attributes */
proc compare base=adverse_events_orig compare=adverse_events_new attlenmac;
  var _all_;
run;

/* Handling large datasets */
%let memsize = 8G;
options compress=yes;

proc compare base=big_dataset(obs=1000) compare=new_big_dataset(obs=1000);
  id patient_id;
run;

Explanation (SAS):

CRITERION and METHOD control how numeric differences are calculated
BRIEFSUMMARY and ALLSTATS control output detail level
ALLOBS forces comparison of all observations
ATTLENMAC compares variable attributes
Dataset options like OBS= can limit comparison for large datasets
System options like MEMSIZE and COMPRESS help with large data

R Example

library(diffdf)
library(fuzzyjoin)
library(tictoc)
library(parallel)

adverse_events_orig <- data.frame(
  USUBJID = c("001-001", "001-002", "001-003", "001-004"),
  AEDECOD = c("HEADACHE", "FATIGUE", "NAUSEA", "FEVER"),
  AESTDT  = as.Date(c("2023-01-10", "2023-01-12", "2023-01-15", "2023-01-20")),
  AEENDT  = as.Date(c("2023-01-15", "2023-01-19", "2023-01-18", "2023-01-22"))
)

adverse_events_new <- data.frame(
  USUBJID = c("001-001", "001-002", "001-003", "001-005"),
  AEDECOD = c("HEADACHE", "FATIGUE", "NAUSEA", "DIZZINESS"),
  AESTDT  = as.Date(c("2023-01-10", "2023-01-14", "2023-01-15", "2023-01-25")),
  AEENDT  = as.Date(c("2023-01-15", "2023-01-19", "2023-01-20", "2023-01-28"))
)

# Custom comparison function with tolerance
compare_with_tolerance <- function(df1, df2, keys, numeric_vars, tol = 0.01) {
  # Perform basic comparison
  basic_diff <- diffdf(df1, df2, keys = keys)
  
  # Create custom tolerance comparison for numeric variables
  custom_results <- list()
  
  # For each key combination
  unique_keys <- unique(rbind(
    df1[, keys, drop = FALSE],
    df2[, keys, drop = FALSE]
  ))
  
  for(i in 1:nrow(unique_keys)) {
    key_vals <- unique_keys[i, , drop = FALSE]
    
    # Find matching rows
    df1_match <- merge(key_vals, df1)
    df2_match <- merge(key_vals, df2)
    
    if(nrow(df1_match) > 0 && nrow(df2_match) > 0) {
      # Compare numeric variables with tolerance
      for(var in numeric_vars) {
        if(var %in% names(df1_match) && var %in% names(df2_match)) {
          val1 <- df1_match[[var]]
          val2 <- df2_match[[var]]
          
          if(!is.na(val1) && !is.na(val2)) {
            diff <- abs(val1 - val2)
            if(diff <= tol) {
              # These values are considered equal with tolerance
              # Remove from basic differences if present
              # This is simplified; actual implementation would be more complex
            }
          }
        }
      }
    }
  }
  
  # Return enhanced results
  return(basic_diff)
}

fuzzy_compare <- function(df1, df2, keys, text_vars, max_dist = 2) {
  stopifnot(is.data.frame(df1), is.data.frame(df2))
  stopifnot(all(keys %in% names(df1)), all(keys %in% names(df2)))
  stopifnot(all(text_vars %in% names(df1)), all(text_vars %in% names(df2)))
  
  # Find common keys
  matched_keys <- dplyr::inner_join(
    df1[, keys, drop = FALSE],
    df2[, keys, drop = FALSE],
    by = keys
  )
  
  results <- list()
  
  for(i in seq_len(nrow(matched_keys))) {
    key_vals <- matched_keys[i, , drop = FALSE]
    row1 <- dplyr::inner_join(key_vals, df1, by = keys)
    row2 <- dplyr::inner_join(key_vals, df2, by = keys)
    
    for(var in text_vars) {
      val1 <- as.character(row1[[var]])
      val2 <- as.character(row2[[var]])
      
      # Only compare if not both NA and not exactly identical
      if (!(is.na(val1) && is.na(val2)) && (val1 != val2)) {
        # Fuzzy string distance on lower/trimmed (to allow for case or typo)
        text1 <- tolower(trimws(val1))
        text2 <- tolower(trimws(val2))
        dist_val <- as.integer(adist(text1, text2)[1])
        if (dist_val <= max_dist) {
          results[[length(results) + 1]] <- data.frame(
            as.list(key_vals),
            variable = var,
            value1 = val1,
            value2 = val2,
            distance = dist_val,
            stringsAsFactors = FALSE
          )
        }
      }
    }
  }
  if(length(results) > 0) {
    final_df <- bind_rows(results)
    rownames(final_df) <- NULL
    return(final_df)
  } else {
    return(data.frame())
  }
}

# Example usage
adverse_events_orig_extended <- adverse_events_orig
adverse_events_new_extended <- adverse_events_new

# Add some columns to demonstrate complex comparison
adverse_events_orig_extended$AESEV <- c("MILD", "MODERATE", "MILD", "SEVERE")
adverse_events_new_extended$AESEV <- c("MILD", "MODERATE", "mild", "SEVERE")

adverse_events_orig_extended$AEVALUE <- c(0.5, 1.2, 0.8, 2.1)
adverse_events_new_extended$AEVALUE <- c(0.51, 1.2, 0.79, 1.98)

# Custom comparison with tolerance
custom_comparison <- compare_with_tolerance(
  adverse_events_orig_extended, 
  adverse_events_new_extended,
  keys = "USUBJID",
  numeric_vars = c("AEVALUE"),
  tol = 0.05
)

print(custom_comparison)

# Fuzzy text comparison
text_diffs <- fuzzy_compare(
  adverse_events_orig_extended,
  adverse_events_new_extended,
  keys = "USUBJID",
  text_vars = c("AESEV"),
  max_dist = 2
)

# Display fuzzy matches
print(text_diffs)

Explanation (R):

Custom functions can implement specific comparison logic
fuzzyjoin allows approximate matching of text values
parallel package enables multi-core processing for large datasets
Chunking approach allows comparing large datasets without memory issues
Custom tolerances can be applied to specific numeric variables
Fuzzy text matching identifies minor spelling differences

Input Data (Extended for Advanced Examples): Original dataset with additional columns:

USUBJID	AEDECOD	AESTDT	AEENDT	AESEV	AEVALUE
001-001	HEADACHE	2023-01-10	2023-01-15	MILD	0.5
001-002	FATIGUE	2023-01-12	2023-01-19	MODERATE	1.2
001-003	NAUSEA	2023-01-15	2023-01-18	MILD	0.8
001-004	FEVER	2023-01-20	2023-01-22	SEVERE	2.1

New dataset with additional columns:

USUBJID	AEDECOD	AESTDT	AEENDT	AESEV	AEVALUE
001-001	HEADACHE	2023-01-10	2023-01-15	MILD	0.51
001-002	FATIGUE	2023-01-14	2023-01-19	MODERATE	1.2
001-003	NAUSEA	2023-01-15	2023-01-20	mild	0.79
001-005	DIZZINESS	2023-01-25	2023-01-28	SEVERE	1.98

Expected Output of Fuzzy Text Comparison:

print(custom_comparison)
Differences found between the objects!

Summary of BASE and COMPARE
  ====================================
    PROPERTY      BASE        COMP    
  ------------------------------------
      Name        df1         df2     
     Class     data.frame  data.frame 
    Rows(#)        4           4      
   Columns(#)      6           6      
  ------------------------------------


There are rows in BASE that are not in COMPARE !!
  =========
   USUBJID 
  ---------
   001-004 
  ---------


There are rows in COMPARE that are not in BASE !!
  =========
   USUBJID 
  ---------
   001-005 
  ---------


Not all Values Compared Equal
  =============================
   Variable  No of Differences 
  -----------------------------
  AESTDT          1         
  AEENDT          1         
  AESEV            1         
 AEVALUE           2         
  -----------------------------


  ===========================================
   VARIABLE  USUBJID     BASE      COMPARE   
  -------------------------------------------
    AESTDT   001-002  2023-01-12  2023-01-14 
  -------------------------------------------


  ===========================================
   VARIABLE  USUBJID     BASE      COMPARE   
  -------------------------------------------
    AEENDT   001-003  2023-01-18  2023-01-20 
  -------------------------------------------


  ==================================
   VARIABLE  USUBJID  BASE  COMPARE 
  ----------------------------------
    AESEV    001-003  MILD   mild   
  ----------------------------------


  ==================================
   VARIABLE  USUBJID  BASE  COMPARE 
  ----------------------------------
   AEVALUE   001-001  0.5    0.51   
   AEVALUE   001-003  0.8    0.79   
  ----------------------------------

> print(text_diffs)
  USUBJID variable value1 value2 distance
1 001-003    AESEV   MILD   mild        0

5. Comparing Dataset Structures

Capability	SAS (PROC COMPARE)	R
Variable types comparison	`ATTLEN` option	`identical(sapply(df1, class), sapply(df2, class))`
Variables existence	Default output	`setdiff(names(df1), names(df2))`
Metadata comparison	Limited options	`compare_df_meta()`
Labels and formats	`ATTRCHAR` option	Custom attributes comparison
Structure-only comparison	`NOVALUES` option	Structure-focused functions

SAS Example

/* Compare structure without values */
proc compare base=adverse_events_orig compare=adverse_events_new
            novalues /* Don't compare data values */
            outnoequal /* Output variables that aren't equal */
            attlen /* Compare lengths */
            attrchar; /* Compare character attributes */
  var _all_; 
run;

/* Create dataset with metadata */
proc contents data=adverse_events_orig out=ae_orig_meta; run;
proc contents data=adverse_events_new out=ae_new_meta; run;

/* Compare metadata */
proc compare base=ae_orig_meta compare=ae_new_meta;
  id name;
  var type length format label;
run;

Explanation (SAS):

NOVALUES option skips comparing actual values
ATTLEN and ATTRCHAR compare variable attributes
PROC CONTENTS with OUT= creates metadata datasets
These can be compared to find structural differences
Useful for validating dataset consistency

R Example


library(diffdf)
library(dplyr)
library(tibble)

# Function to compare dataframe structures
compare_structure <- function(df1, df2, name1 = "DF1", name2 = "DF2") {
  result <- list()
  
  # Compare column names
  cols1 <- names(df1)
  cols2 <- names(df2)
  
  result$columns_only_in_1 <- setdiff(cols1, cols2)
  result$columns_only_in_2 <- setdiff(cols2, cols1)
  result$common_columns <- intersect(cols1, cols2)
  
  # Compare column types for common columns
  if(length(result$common_columns) > 0) {
    types1 <- sapply(df1[result$common_columns], class)
    types2 <- sapply(df2[result$common_columns], class)
    
    # Find type differences
    type_diff <- which(types1 != types2)
    if(length(type_diff) > 0) {
      result$type_differences <- data.frame(
        column = names(type_diff),
        type1 = types1[type_diff],
        type2 = types2[type_diff],
        stringsAsFactors = FALSE
      )
    } else {
      result$type_differences <- data.frame()
    }
  }
  
  # Compare dimensions
  result$dim1 <- dim(df1)
  result$dim2 <- dim(df2)
  
  # Compare attributes (focusing on column-level attributes)
  result$attribute_differences <- list()
  
  for(col in result$common_columns) {
    attr1 <- attributes(df1[[col]])
    attr2 <- attributes(df2[[col]])
    
    # Remove class as it's already compared
    attr1$class <- NULL
    attr2$class <- NULL
    
    # Compare remaining attributes
    if(!identical(attr1, attr2)) {
      result$attribute_differences[[col]] <- list(
        name1 = attr1,
        name2 = attr2
      )
    }
  }
  
  # Create summary report
  cat("Structure Comparison:", name1, "vs", name2, "\n")
  cat("---------------------------------------\n")
  cat("Dimensions:", paste(result$dim1, collapse = "x"), "vs", 
      paste(result$dim2, collapse = "x"), "\n")
  
  if(length(result$columns_only_in_1) > 0) {
    cat("\nColumns only in", name1, ":\n")
    cat(paste(" -", result$columns_only_in_1), sep = "\n")
  }
  
  if(length(result$columns_only_in_2) > 0) {
    cat("\nColumns only in", name2, ":\n")
    cat(paste(" -", result$columns_only_in_2), sep = "\n")
  }
  
  if(nrow(result$type_differences) > 0) {
    cat("\nColumn type differences:\n")
    print(result$type_differences)
  }
  
  if(length(result$attribute_differences) > 0) {
    cat("\nColumn attribute differences found in", 
        length(result$attribute_differences), "columns\n")
  }
  
  # Return the full result object for further processing
  invisible(result)
}

# Create dataframes with structure differences
ae_orig_modified <- adverse_events_orig
ae_new_modified <- adverse_events_new

# Add/modify columns to demonstrate structure differences
ae_orig_modified$AESEV <- factor(c("MILD", "MODERATE", "MILD", "SEVERE"))
ae_new_modified$AESEV <- c("MILD", "MODERATE", "MILD", "SEVERE")
ae_new_modified$AECAT <- c("GI", "SYSTEMIC", "GI", "SYSTEMIC")

# Convert a date to character to create type mismatch
ae_new_modified$AESTDT <- as.character(ae_new_modified$AESTDT)

# Compare structures
structure_diff <- compare_structure(
  ae_orig_modified, 
  ae_new_modified,
  "Original", 
  "New"
)

# Alternative approach for metadata comparison
meta_compare <- function(df1, df2) {
  meta1 <- data.frame(
    variable = names(df1),
    type = sapply(df1, function(x) class(x)[1]),
    length = sapply(df1, function(x) if(is.character(x)) max(nchar(x), na.rm=TRUE) else NA),
    label = sapply(df1, function(x) { lbl <- attr(x, "label"); if(is.null(lbl)) NA_character_ else as.character(lbl) }),
    stringsAsFactors = FALSE
  )
  
  meta2 <- data.frame(
    variable = names(df2),
    type = sapply(df2, function(x) class(x)[1]),
    length = sapply(df2, function(x) if(is.character(x)) max(nchar(x), na.rm=TRUE) else NA),
    label = sapply(df2, function(x) { lbl <- attr(x, "label"); if(is.null(lbl)) NA_character_ else as.character(lbl) }),
    stringsAsFactors = FALSE
  )
  
  # Compare metadata using diffdf
  diffdf(meta1, meta2, keys = "variable")
}


# Run metadata comparison
meta_compare(ae_orig_modified, ae_new_modified)

Explanation (R):

compare_structure() function focuses solely on structure, not data values
Compares column names, types, dimensions, and column-level attributes
diffdf::compare_df_meta() compares metadata between dataframes
Custom functions can extract and compare specific structural elements
Useful for validating consistent data structures across datasets

Expected Output of Structure Comparison:

Structure Comparison: Original vs New 
---------------------------------------
Dimensions: 4x5 vs 4x6 

Columns only in New :
 - AECAT

Column type differences:
       column  type1     type2
AESTDT AESTDT   Date character
AESEV   AESEV factor character

Column attribute differences found in 2 columns

> meta_compare(ae_orig_modified, ae_new_modified)
Differences found between the objects!

Summary of BASE and COMPARE
  ====================================
    PROPERTY      BASE        COMP    
  ------------------------------------
      Name       meta1       meta2    
     Class     data.frame  data.frame 
    Rows(#)        5           6      
   Columns(#)      4           4      
  ------------------------------------


There are rows in COMPARE that are not in BASE !!
  ==========
   variable 
  ----------
    AECAT   
  ----------


Not all Values Compared Equal
  =============================
   Variable  No of Differences 
  -----------------------------
     type            2         
    length           3         
  -----------------------------


  =======================================
   VARIABLE  variable   BASE    COMPARE  
  ---------------------------------------
     type     AESEV    factor  character 
     type     AESTDT    Date   character 
  ---------------------------------------


  ===================================
   VARIABLE  variable  BASE  COMPARE 
  -----------------------------------
    length   AEDECOD     8      9    
    length    AESEV     NA      8    
    length    AESTDT    NA     10    
  -----------------------------------

6. Best Practices for Dataset Comparison

Establish Clear Comparison Goals
- Define what constitutes a "match" vs. "difference"
- Determine which differences matter for your analysis
- Consider tolerance thresholds for numeric data
Handle Keys Properly
- Use appropriate key variables for observation matching
- Consider composite keys when a single field isn't unique
- Check for duplicate keys before comparison
Validate Structures First
- Compare dataset structures before comparing values
- Handle missing variables appropriately
- Check variable types for compatibility
Document Comparison Parameters
- Record comparison options used
- Document tolerance levels and exclusions
- Include comparison settings in analysis metadata
Optimize for Large Data
- Consider sampling for initial comparisons
- Use chunking for large datasets
- Apply parallel processing for complex comparisons
Create Meaningful Reports
- Summarize differences in context
- Highlight critical differences
- Include counts and percentages for context
Follow Up on Differences
- Investigate unexpected differences
- Create standardized workflows for resolving differences
- Document justified differences and their resolutions

Resource download links

1.5.8.-PROC-COMPARE-in-SAS-vs-R-Equivalents.zip

⁂

1.5. SAS PROCEDURES -> Equivalent in R