1.5. SAS PROCEDURES -> Equivalent in R
1.5.8. PROC COMPARE in SAS vs R Equivalents
This guide compares SAS and R approaches for comparing datasets/dataframes, highlighting key capabilities, syntactic differences, and advanced techniques for both languages.
1. Basic Dataset Comparison
| Capability | SAS (PROC COMPARE) | R |
|---|---|---|
| Compare two datasets | PROC COMPARE base=ds1 compare=ds2; |
diffdf::diffdf(df1, df2) |
| Basic output | Default summary report | Console text output |
| Visual comparison | Basic text output | diffdf, daff::render_diff() |
| Key-based comparison | ID statement |
keys parameter |
| Return value | Log output | List object with differences |
SAS Example
/* Create two datasets for comparison */
data adverse_events_orig;
input USUBJID $ AEDECOD $ AESTDT :date9. AEENDT :date9.;
format AESTDT AEENDT date9.;
datalines;
001-001 HEADACHE 10JAN2023 15JAN2023
001-002 FATIGUE 12JAN2023 19JAN2023
001-003 NAUSEA 15JAN2023 18JAN2023
001-004 FEVER 20JAN2023 22JAN2023
;
run;
data adverse_events_new;
input USUBJID $ AEDECOD $ AESTDT :date9. AEENDT :date9.;
format AESTDT AEENDT date9.;
datalines;
001-001 HEADACHE 10JAN2023 15JAN2023
001-002 FATIGUE 14JAN2023 19JAN2023
001-003 NAUSEA 15JAN2023 20JAN2023
001-005 DIZZINESS 25JAN2023 28JAN2023
;
run;
/* Basic comparison */
proc compare base=adverse_events_orig compare=adverse_events_new;
run;
/* Comparison with specific ID variable */
proc compare base=adverse_events_orig compare=adverse_events_new;
id USUBJID;
run;
Explanation (SAS):
PROC COMPAREcompares two datasets (baseandcompare)- The default output shows differences in variables, observations, and values
IDstatement specifies which variable(s) to use for matching observations- The report shows added/deleted observations, matching observations with differences
- SAS generates a detailed text report with several sections
R Example
library(diffdf)
library(lubridate)
# Create two dataframes for comparison
adverse_events_orig <- data.frame(
USUBJID = c("001-001", "001-002", "001-003", "001-004"),
AEDECOD = c("HEADACHE", "FATIGUE", "NAUSEA", "FEVER"),
AESTDT = as.Date(c("2023-01-10", "2023-01-12", "2023-01-15", "2023-01-20")),
AEENDT = as.Date(c("2023-01-15", "2023-01-19", "2023-01-18", "2023-01-22"))
)
adverse_events_new <- data.frame(
USUBJID = c("001-001", "001-002", "001-003", "001-005"),
AEDECOD = c("HEADACHE", "FATIGUE", "NAUSEA", "DIZZINESS"),
AESTDT = as.Date(c("2023-01-10", "2023-01-14", "2023-01-15", "2023-01-25")),
AEENDT = as.Date(c("2023-01-15", "2023-01-19", "2023-01-20", "2023-01-28"))
)
# Basic comparison
diffdf(adverse_events_orig, adverse_events_new)
# Comparison with specific key variables
diffdf(adverse_events_orig, adverse_events_new, keys = "USUBJID")
# Store results for further processing
diff_results <- diffdf(adverse_events_orig, adverse_events_new,
keys = "USUBJID")
Explanation (R):
diffdfpackage provides functionality similar toPROC COMPAREdiffdf()function compares two dataframes- The
keysparameter identifies variables for matching observations - Results are printed to console by default but can be stored as an object
- The output shows differences in variables, rows, and values
Input Data (Base/Original Dataset):
| USUBJID | AEDECOD | AESTDT | AEENDT |
|---|---|---|---|
| 001-001 | HEADACHE | 2023-01-10 | 2023-01-15 |
| 001-002 | FATIGUE | 2023-01-12 | 2023-01-19 |
| 001-003 | NAUSEA | 2023-01-15 | 2023-01-18 |
| 001-004 | FEVER | 2023-01-20 | 2023-01-22 |
Input Data (Compare/New Dataset):
| USUBJID | AEDECOD | AESTDT | AEENDT |
|---|---|---|---|
| 001-001 | HEADACHE | 2023-01-10 | 2023-01-15 |
| 001-002 | FATIGUE | 2023-01-14 | 2023-01-19 |
| 001-003 | NAUSEA | 2023-01-15 | 2023-01-20 |
| 001-005 | DIZZINESS | 2023-01-25 | 2023-01-28 |
Expected Output:
Differences found between the objects!
A summary is given below.
There are rows in BASE that are not in COMPARE!
All rows are shown in table below
==============================================
USUBJID AEDECOD AESTDT AEENDT
----------------------------------------------
001-004 FEVER 2023-01-20 2023-01-22
----------------------------------------------
There are rows in COMPARE that are not in BASE!
All rows are shown in table below
================================================
USUBJID AEDECOD AESTDT AEENDT
------------------------------------------------
001-005 DIZZINESS 2023-01-25 2023-01-28
------------------------------------------------
Not all Values Compared Equal
All rows are shown in table below
=============================
Variable No of Differences
-----------------------------
AESTDT 1
AEENDT 1
-----------------------------
All rows are shown in table below
=================================================================
VARIABLE USUBJID AEDECOD BASE COMPARE
-----------------------------------------------------------------
AESTDT 001-002 FATIGUE 2023-01-12 2023-01-14
AEENDT 001-003 NAUSEA 2023-01-18 2023-01-20
-----------------------------------------------------------------
2. Comparison Options and Criteria
| Capability | SAS (PROC COMPARE) | R |
|---|---|---|
| Compare specific variables | VAR statement |
vars parameter |
| Exclude variables | EXCLUDE statement |
exclude_vars parameter |
| Case sensitivity | IGNORECASE option |
ignore.case parameter |
| Numeric tolerance | CRITERION=value |
tol parameter |
| Suppress output | NOSUMMARY option |
print = FALSE |
| Detailed report | LISTALL option |
keep_unchanged_rows = TRUE |
SAS Example
/* Compare specific variables with options */
proc compare base=adverse_events_orig compare=adverse_events_new
listall criterion=0.01 ignorecase;
id USUBJID;
var AEDECOD AESTDT;
exclude AEENDT;
run;
/* Return comparison information to DATA step */
proc compare base=adverse_events_orig compare=adverse_events_new
out=comparison_results outbase outcomp outdif
noprint;
id USUBJID;
run;
/* Check the comparison return code */
%macro check_compare;
%if &sysinfo = 0 %then %put Datasets are identical;
%else %if %eval(&sysinfo & 1) %then %put Base dataset has observations not in compare dataset;
%else %if %eval(&sysinfo & 2) %then %put Compare dataset has observations not in base dataset;
%else %if %eval(&sysinfo & 4) %then %put Variables have unequal values;
%mend;
%check_compare;
Explanation (SAS):
VARstatement specifies which variables to compareEXCLUDEstatement specifies variables to omit from comparisonCRITERIONoption specifies tolerance for numeric comparisonsIGNORECASEoption makes string comparisons case-insensitiveLISTALLoption shows all matching observationsOUT=creates a dataset with comparison resultsOUTBASE,OUTCOMP, andOUTDIFcreate datasets with unmatched and differing observations&SYSINFOmacro variable contains bit flags indicating comparison results
R Example
# Load required libraries
library(diffdf)
library(lubridate)
library(dplyr) # for advanced summary later
# ================================
# 1. Create example dataframes
# ================================
adverse_events_orig <- data.frame(
USUBJID = c("001-001", "001-002", "001-003", "001-004"),
AEDECOD = c("HEADACHE", "FATIGUE", "NAUSEA", "FEVER"),
AESTDT = as.Date(c("2023-01-10", "2023-01-12", "2023-01-15", "2023-01-20")),
AEENDT = as.Date(c("2023-01-15", "2023-01-19", "2023-01-18", "2023-01-22"))
)
adverse_events_new <- data.frame(
USUBJID = c("001-001", "001-002", "001-003", "001-005"),
AEDECOD = c("HEADACHE", "FATIGUE", "NAUSEA", "DIZZINESS"),
AESTDT = as.Date(c("2023-01-10", "2023-01-14", "2023-01-15", "2023-01-25")),
AEENDT = as.Date(c("2023-01-15", "2023-01-19", "2023-01-20", "2023-01-28"))
)
# ================================
# 2. Compare only selected columns
# ================================
# Example: Compare USUBJID, AEDECOD, AESTDT only (exclude AEENDT)
myvars <- c("USUBJID", "AEDECOD", "AESTDT")
result_selected <- diffdf(
adverse_events_orig[myvars],
adverse_events_new[myvars],
keys = "USUBJID"
)
# ================================
# 3. Exclude columns (e.g., AEENDT)
# ================================
exclude <- "AEENDT"
vars <- setdiff(names(adverse_events_orig), exclude)
result_exclude <- diffdf(
adverse_events_orig[vars],
adverse_events_new[vars],
keys = "USUBJID"
)
# ================================
# 4. Ignore case for a variable
# ================================
adverse_events_orig2 <- adverse_events_orig
adverse_events_new2 <- adverse_events_new
adverse_events_orig2$AEDECOD <- tolower(adverse_events_orig2$AEDECOD)
adverse_events_new2$AEDECOD <- tolower(adverse_events_new2$AEDECOD)
result_ignore_case <- diffdf(
adverse_events_orig2,
adverse_events_new2,
keys = "USUBJID"
)
# Note: This mutates data, so use copies.
# ================================
# 5. Show unchanged, changed, and unique rows (full status)
# ================================
# Merge for row-by-row inspection
full_joined <- merge(
adverse_events_orig,
adverse_events_new,
by = "USUBJID",
suffixes = c("_orig", "_new"),
all = TRUE
)
# For clarity, annotate row status:
full_joined$status <- with(full_joined,
ifelse(
is.na(AEDECOD_orig), "Only_in_new",
ifelse(is.na(AEDECOD_new), "Only_in_orig",
ifelse(
AEDECOD_orig == AEDECOD_new &
AESTDT_orig == AESTDT_new &
AEENDT_orig == AEENDT_new, "Unchanged", "Changed"
)
)
)
)
# View the detailed summary
print(full_joined)
# ================================
# 6. (Optional) dplyr approach for summary
# ================================
# Fast status summary with dplyr (requires dplyr)
summary_df <- full_join(
adverse_events_orig, adverse_events_new,
by = "USUBJID",
suffix = c("_orig", "_new")
) %>%
mutate(
status = case_when(
is.na(AEDECOD_orig) ~ "Only_in_new",
is.na(AEDECOD_new) ~ "Only_in_orig",
AEDECOD_orig == AEDECOD_new &
AESTDT_orig == AESTDT_new &
AEENDT_orig == AEENDT_new ~ "Unchanged",
TRUE ~ "Changed"
)
)
print(summary_df)
Explanation (R):
diffdf()parameters control comparison details:varsspecifies which variables to includeexclude_varsspecifies variables to omittolsets tolerance for numeric comparisonsignore.casemakes string comparisons case-insensitivekeep_unchanged_rowsincludes matching observations in output
arsenal::comparedf()is an alternative providing similar functionalitywaldo::compare()offers precise, detailed comparisons- Several packages offer comparison functions with different strengths
Input Data (Same as Previous Example)
Expected Output:
> print(summary_df)
USUBJID AEDECOD_orig AESTDT_orig AEENDT_orig AEDECOD_new AESTDT_new AEENDT_new status
1 001-001 HEADACHE 2023-01-10 2023-01-15 HEADACHE 2023-01-10 2023-01-15 Unchanged
2 001-002 FATIGUE 2023-01-12 2023-01-19 FATIGUE 2023-01-14 2023-01-19 Changed
3 001-003 NAUSEA 2023-01-15 2023-01-18 NAUSEA 2023-01-15 2023-01-20 Changed
4 001-004 FEVER 2023-01-20 2023-01-22 <NA> <NA> <NA> Only_in_orig
5 001-005 <NA> <NA> <NA> DIZZINESS 2023-01-25 2023-01-28 Only_in_new
3. Visual and Interactive Comparisons
| Capability | SAS (PROC COMPARE) | R |
|---|---|---|
| Visual output | Limited text-based | daff::render_diff(), HTML/interactive |
| Highlighted diffs | Not built-in | Color-coded in HTML output |
| Export differences | Through OUT= datasets |
Multiple export options |
| Interactive exploration | Not built-in | Shiny apps, interactive tables |
SAS Example
/* Generate HTML report of differences */
ods html file="comparison_report.html";
proc compare base=adverse_events_orig compare=adverse_events_new;
id USUBJID;
run;
ods html close;
/* Custom report creation */
data compare_results;
merge adverse_events_orig(in=a)
adverse_events_new(in=b rename=(AESTDT=AESTDT_new AEENDT=AEENDT_new));
by USUBJID;
in_orig = a;
in_new = b;
/* Flag differences */
if a and b then do;
if AESTDT ne AESTDT_new then AESTDT_diff = 1;
if AEENDT ne AEENDT_new then AEENDT_diff = 1;
end;
run;
proc print data=compare_results;
var USUBJID in_orig in_new AEDECOD AESTDT AESTDT_new AESTDT_diff
AEENDT AEENDT_new AEENDT_diff;
run;
Explanation (SAS):
- SAS provides primarily text-based comparison output
- Custom DATA step merging can create more customized comparison datasets
- Using
ODS HTMLgenerates HTML output, but without special formatting for differences - Flags can be created to identify differences for further processing
- Basic reports can be created with PROC PRINT
R Example
library(daff)
library(DT)
library(dplyr)
library(htmltools)
# Using daff for visual comparison
diff_result <- daff::diff_data(adverse_events_orig, adverse_events_new,
)
# Render as HTML with color-coding
diff_html <- daff::render_diff(diff_result)
print(htmltools::HTML(diff_html))
# Export to HTML file
cat(diff_html, file = "comparison_report.html")
# Create interactive datatable for exploration
comparison_df <- bind_rows(
adverse_events_orig %>% mutate(source = "Original"),
adverse_events_new %>% mutate(source = "New")
) %>%
arrange(USUBJID)
# Display as interactive table
datatable(comparison_df,
filter = 'top',
extensions = c('Buttons', 'ColReorder'),
options = list(
dom = 'Bfrtip',
buttons = c('copy', 'csv', 'excel'),
colReorder = TRUE
))
create_comparison_report <- function(df1, df2, key_cols, filename = NULL) {
# Ensure key columns exist
if(!all(key_cols %in% names(df1)) || !all(key_cols %in% names(df2))) {
stop("Key columns must exist in both dataframes")
}
# For clarity, get a list of columns to compare (non-key intersection)
value_cols <- intersect(setdiff(names(df1), key_cols), setdiff(names(df2), key_cols))
# Modified dataframes for join
df1_mod <- df1 %>%
rename_with(~paste0(.x, "_orig"), -all_of(key_cols))
df2_mod <- df2 %>%
rename_with(~paste0(.x, "_new"), -all_of(key_cols))
# Full outer join
comparison <- full_join(df1_mod, df2_mod, by = key_cols)
# Compare all matched value columns
for(col in value_cols) {
orig_col <- paste0(col, "_orig")
new_col <- paste0(col, "_new")
diff_col <- paste0(col, "_diff")
# Column might not exist in both dataframes for all rows (due to join)
comparison[[diff_col]] <- ifelse(
is.na(comparison[[orig_col]]) & is.na(comparison[[new_col]]), FALSE,
is.na(comparison[[orig_col]]) | is.na(comparison[[new_col]]) | (comparison[[orig_col]] != comparison[[new_col]])
)
}
# Optionally, export
if(!is.null(filename)) {
write.csv(comparison, filename, row.names = FALSE)
}
return(comparison)
}
# Usage:
detailed_comparison <- create_comparison_report(
adverse_events_orig,
adverse_events_new,
key_cols = "USUBJID",
filename = "detailed_comparison.csv"
)
print(detailed_comparison)
getwd()
datatable(
detailed_comparison,
options = list(
pageLength = 50,
rowCallback = JS(
"function(row, data, index) {",
" for (var i=0; i<data.length; i++) {",
" if (data[i] === true || data[i] === 'TRUE') {",
" $('td:eq('+i+')', row).css('background-color', '#ffcccc');",
" }",
" }",
"}"
)
)
)
Explanation (R):
daffpackage provides visualization of differences with color codingDT::datatable()creates interactive tables for exploring differences- Custom functions can create detailed comparison dataframes
- HTML output with color highlighting makes differences easy to spot
- Multiple output formats available (HTML, CSV, Excel)
- Interactive tables allow filtering, sorting, and exporting results
Expected Output: The visual output in R would highlight differences with colors:
- Added rows in green
- Deleted rows in red
- Changed values highlighted with yellow background
- Interactive tables allow sorting, filtering, and exploration
4. Beyond Basics: Complex Comparison Scenarios
| Capability | SAS (PROC COMPARE) | R |
|---|---|---|
| Complex key matching | Multiple ID variables |
Multiple keys variables |
| Custom comparators | Limited built-in | Custom functions |
| Comparison of list objects | Not applicable | listcompr package |
| Handling large datasets | Dataset options | Chunking/parallel processing |
| Structure vs. content | Limited options | Separate structural/content comparison |
SAS Example
/* Complex comparison with multiple options */
proc compare base=adverse_events_orig compare=adverse_events_new
criterion=0.01
method=absolute /* Use absolute differences */
briefsummary /* Shorter summary */
allstats /* All comparison statistics */
allobs; /* Compare all observations */
id USUBJID;
var AESTDT AEENDT;
run;
/* Compare variable attributes */
proc compare base=adverse_events_orig compare=adverse_events_new attlenmac;
var _all_;
run;
/* Handling large datasets */
%let memsize = 8G;
options compress=yes;
proc compare base=big_dataset(obs=1000) compare=new_big_dataset(obs=1000);
id patient_id;
run;
Explanation (SAS):
CRITERIONandMETHODcontrol how numeric differences are calculatedBRIEFSUMMARYandALLSTATScontrol output detail levelALLOBSforces comparison of all observationsATTLENMACcompares variable attributes- Dataset options like
OBS=can limit comparison for large datasets - System options like
MEMSIZEandCOMPRESShelp with large data
R Example
library(diffdf)
library(fuzzyjoin)
library(tictoc)
library(parallel)
adverse_events_orig <- data.frame(
USUBJID = c("001-001", "001-002", "001-003", "001-004"),
AEDECOD = c("HEADACHE", "FATIGUE", "NAUSEA", "FEVER"),
AESTDT = as.Date(c("2023-01-10", "2023-01-12", "2023-01-15", "2023-01-20")),
AEENDT = as.Date(c("2023-01-15", "2023-01-19", "2023-01-18", "2023-01-22"))
)
adverse_events_new <- data.frame(
USUBJID = c("001-001", "001-002", "001-003", "001-005"),
AEDECOD = c("HEADACHE", "FATIGUE", "NAUSEA", "DIZZINESS"),
AESTDT = as.Date(c("2023-01-10", "2023-01-14", "2023-01-15", "2023-01-25")),
AEENDT = as.Date(c("2023-01-15", "2023-01-19", "2023-01-20", "2023-01-28"))
)
# Custom comparison function with tolerance
compare_with_tolerance <- function(df1, df2, keys, numeric_vars, tol = 0.01) {
# Perform basic comparison
basic_diff <- diffdf(df1, df2, keys = keys)
# Create custom tolerance comparison for numeric variables
custom_results <- list()
# For each key combination
unique_keys <- unique(rbind(
df1[, keys, drop = FALSE],
df2[, keys, drop = FALSE]
))
for(i in 1:nrow(unique_keys)) {
key_vals <- unique_keys[i, , drop = FALSE]
# Find matching rows
df1_match <- merge(key_vals, df1)
df2_match <- merge(key_vals, df2)
if(nrow(df1_match) > 0 && nrow(df2_match) > 0) {
# Compare numeric variables with tolerance
for(var in numeric_vars) {
if(var %in% names(df1_match) && var %in% names(df2_match)) {
val1 <- df1_match[[var]]
val2 <- df2_match[[var]]
if(!is.na(val1) && !is.na(val2)) {
diff <- abs(val1 - val2)
if(diff <= tol) {
# These values are considered equal with tolerance
# Remove from basic differences if present
# This is simplified; actual implementation would be more complex
}
}
}
}
}
}
# Return enhanced results
return(basic_diff)
}
fuzzy_compare <- function(df1, df2, keys, text_vars, max_dist = 2) {
stopifnot(is.data.frame(df1), is.data.frame(df2))
stopifnot(all(keys %in% names(df1)), all(keys %in% names(df2)))
stopifnot(all(text_vars %in% names(df1)), all(text_vars %in% names(df2)))
# Find common keys
matched_keys <- dplyr::inner_join(
df1[, keys, drop = FALSE],
df2[, keys, drop = FALSE],
by = keys
)
results <- list()
for(i in seq_len(nrow(matched_keys))) {
key_vals <- matched_keys[i, , drop = FALSE]
row1 <- dplyr::inner_join(key_vals, df1, by = keys)
row2 <- dplyr::inner_join(key_vals, df2, by = keys)
for(var in text_vars) {
val1 <- as.character(row1[[var]])
val2 <- as.character(row2[[var]])
# Only compare if not both NA and not exactly identical
if (!(is.na(val1) && is.na(val2)) && (val1 != val2)) {
# Fuzzy string distance on lower/trimmed (to allow for case or typo)
text1 <- tolower(trimws(val1))
text2 <- tolower(trimws(val2))
dist_val <- as.integer(adist(text1, text2)[1])
if (dist_val <= max_dist) {
results[[length(results) + 1]] <- data.frame(
as.list(key_vals),
variable = var,
value1 = val1,
value2 = val2,
distance = dist_val,
stringsAsFactors = FALSE
)
}
}
}
}
if(length(results) > 0) {
final_df <- bind_rows(results)
rownames(final_df) <- NULL
return(final_df)
} else {
return(data.frame())
}
}
# Example usage
adverse_events_orig_extended <- adverse_events_orig
adverse_events_new_extended <- adverse_events_new
# Add some columns to demonstrate complex comparison
adverse_events_orig_extended$AESEV <- c("MILD", "MODERATE", "MILD", "SEVERE")
adverse_events_new_extended$AESEV <- c("MILD", "MODERATE", "mild", "SEVERE")
adverse_events_orig_extended$AEVALUE <- c(0.5, 1.2, 0.8, 2.1)
adverse_events_new_extended$AEVALUE <- c(0.51, 1.2, 0.79, 1.98)
# Custom comparison with tolerance
custom_comparison <- compare_with_tolerance(
adverse_events_orig_extended,
adverse_events_new_extended,
keys = "USUBJID",
numeric_vars = c("AEVALUE"),
tol = 0.05
)
print(custom_comparison)
# Fuzzy text comparison
text_diffs <- fuzzy_compare(
adverse_events_orig_extended,
adverse_events_new_extended,
keys = "USUBJID",
text_vars = c("AESEV"),
max_dist = 2
)
# Display fuzzy matches
print(text_diffs)
Explanation (R):
- Custom functions can implement specific comparison logic
fuzzyjoinallows approximate matching of text valuesparallelpackage enables multi-core processing for large datasets- Chunking approach allows comparing large datasets without memory issues
- Custom tolerances can be applied to specific numeric variables
- Fuzzy text matching identifies minor spelling differences
Input Data (Extended for Advanced Examples): Original dataset with additional columns:
| USUBJID | AEDECOD | AESTDT | AEENDT | AESEV | AEVALUE |
|---|---|---|---|---|---|
| 001-001 | HEADACHE | 2023-01-10 | 2023-01-15 | MILD | 0.5 |
| 001-002 | FATIGUE | 2023-01-12 | 2023-01-19 | MODERATE | 1.2 |
| 001-003 | NAUSEA | 2023-01-15 | 2023-01-18 | MILD | 0.8 |
| 001-004 | FEVER | 2023-01-20 | 2023-01-22 | SEVERE | 2.1 |
New dataset with additional columns:
| USUBJID | AEDECOD | AESTDT | AEENDT | AESEV | AEVALUE |
|---|---|---|---|---|---|
| 001-001 | HEADACHE | 2023-01-10 | 2023-01-15 | MILD | 0.51 |
| 001-002 | FATIGUE | 2023-01-14 | 2023-01-19 | MODERATE | 1.2 |
| 001-003 | NAUSEA | 2023-01-15 | 2023-01-20 | mild | 0.79 |
| 001-005 | DIZZINESS | 2023-01-25 | 2023-01-28 | SEVERE | 1.98 |
Expected Output of Fuzzy Text Comparison:
print(custom_comparison)
Differences found between the objects!
Summary of BASE and COMPARE
====================================
PROPERTY BASE COMP
------------------------------------
Name df1 df2
Class data.frame data.frame
Rows(#) 4 4
Columns(#) 6 6
------------------------------------
There are rows in BASE that are not in COMPARE !!
=========
USUBJID
---------
001-004
---------
There are rows in COMPARE that are not in BASE !!
=========
USUBJID
---------
001-005
---------
Not all Values Compared Equal
=============================
Variable No of Differences
-----------------------------
AESTDT 1
AEENDT 1
AESEV 1
AEVALUE 2
-----------------------------
===========================================
VARIABLE USUBJID BASE COMPARE
-------------------------------------------
AESTDT 001-002 2023-01-12 2023-01-14
-------------------------------------------
===========================================
VARIABLE USUBJID BASE COMPARE
-------------------------------------------
AEENDT 001-003 2023-01-18 2023-01-20
-------------------------------------------
==================================
VARIABLE USUBJID BASE COMPARE
----------------------------------
AESEV 001-003 MILD mild
----------------------------------
==================================
VARIABLE USUBJID BASE COMPARE
----------------------------------
AEVALUE 001-001 0.5 0.51
AEVALUE 001-003 0.8 0.79
----------------------------------
> print(text_diffs)
USUBJID variable value1 value2 distance
1 001-003 AESEV MILD mild 0
5. Comparing Dataset Structures
| Capability | SAS (PROC COMPARE) | R |
|---|---|---|
| Variable types comparison | ATTLEN option |
identical(sapply(df1, class), sapply(df2, class)) |
| Variables existence | Default output | setdiff(names(df1), names(df2)) |
| Metadata comparison | Limited options | compare_df_meta() |
| Labels and formats | ATTRCHAR option |
Custom attributes comparison |
| Structure-only comparison | NOVALUES option |
Structure-focused functions |
SAS Example
/* Compare structure without values */
proc compare base=adverse_events_orig compare=adverse_events_new
novalues /* Don't compare data values */
outnoequal /* Output variables that aren't equal */
attlen /* Compare lengths */
attrchar; /* Compare character attributes */
var _all_;
run;
/* Create dataset with metadata */
proc contents data=adverse_events_orig out=ae_orig_meta; run;
proc contents data=adverse_events_new out=ae_new_meta; run;
/* Compare metadata */
proc compare base=ae_orig_meta compare=ae_new_meta;
id name;
var type length format label;
run;
Explanation (SAS):
NOVALUESoption skips comparing actual valuesATTLENandATTRCHARcompare variable attributesPROC CONTENTSwithOUT=creates metadata datasets- These can be compared to find structural differences
- Useful for validating dataset consistency
R Example
library(diffdf)
library(dplyr)
library(tibble)
# Function to compare dataframe structures
compare_structure <- function(df1, df2, name1 = "DF1", name2 = "DF2") {
result <- list()
# Compare column names
cols1 <- names(df1)
cols2 <- names(df2)
result$columns_only_in_1 <- setdiff(cols1, cols2)
result$columns_only_in_2 <- setdiff(cols2, cols1)
result$common_columns <- intersect(cols1, cols2)
# Compare column types for common columns
if(length(result$common_columns) > 0) {
types1 <- sapply(df1[result$common_columns], class)
types2 <- sapply(df2[result$common_columns], class)
# Find type differences
type_diff <- which(types1 != types2)
if(length(type_diff) > 0) {
result$type_differences <- data.frame(
column = names(type_diff),
type1 = types1[type_diff],
type2 = types2[type_diff],
stringsAsFactors = FALSE
)
} else {
result$type_differences <- data.frame()
}
}
# Compare dimensions
result$dim1 <- dim(df1)
result$dim2 <- dim(df2)
# Compare attributes (focusing on column-level attributes)
result$attribute_differences <- list()
for(col in result$common_columns) {
attr1 <- attributes(df1[[col]])
attr2 <- attributes(df2[[col]])
# Remove class as it's already compared
attr1$class <- NULL
attr2$class <- NULL
# Compare remaining attributes
if(!identical(attr1, attr2)) {
result$attribute_differences[[col]] <- list(
name1 = attr1,
name2 = attr2
)
}
}
# Create summary report
cat("Structure Comparison:", name1, "vs", name2, "\n")
cat("---------------------------------------\n")
cat("Dimensions:", paste(result$dim1, collapse = "x"), "vs",
paste(result$dim2, collapse = "x"), "\n")
if(length(result$columns_only_in_1) > 0) {
cat("\nColumns only in", name1, ":\n")
cat(paste(" -", result$columns_only_in_1), sep = "\n")
}
if(length(result$columns_only_in_2) > 0) {
cat("\nColumns only in", name2, ":\n")
cat(paste(" -", result$columns_only_in_2), sep = "\n")
}
if(nrow(result$type_differences) > 0) {
cat("\nColumn type differences:\n")
print(result$type_differences)
}
if(length(result$attribute_differences) > 0) {
cat("\nColumn attribute differences found in",
length(result$attribute_differences), "columns\n")
}
# Return the full result object for further processing
invisible(result)
}
# Create dataframes with structure differences
ae_orig_modified <- adverse_events_orig
ae_new_modified <- adverse_events_new
# Add/modify columns to demonstrate structure differences
ae_orig_modified$AESEV <- factor(c("MILD", "MODERATE", "MILD", "SEVERE"))
ae_new_modified$AESEV <- c("MILD", "MODERATE", "MILD", "SEVERE")
ae_new_modified$AECAT <- c("GI", "SYSTEMIC", "GI", "SYSTEMIC")
# Convert a date to character to create type mismatch
ae_new_modified$AESTDT <- as.character(ae_new_modified$AESTDT)
# Compare structures
structure_diff <- compare_structure(
ae_orig_modified,
ae_new_modified,
"Original",
"New"
)
# Alternative approach for metadata comparison
meta_compare <- function(df1, df2) {
meta1 <- data.frame(
variable = names(df1),
type = sapply(df1, function(x) class(x)[1]),
length = sapply(df1, function(x) if(is.character(x)) max(nchar(x), na.rm=TRUE) else NA),
label = sapply(df1, function(x) { lbl <- attr(x, "label"); if(is.null(lbl)) NA_character_ else as.character(lbl) }),
stringsAsFactors = FALSE
)
meta2 <- data.frame(
variable = names(df2),
type = sapply(df2, function(x) class(x)[1]),
length = sapply(df2, function(x) if(is.character(x)) max(nchar(x), na.rm=TRUE) else NA),
label = sapply(df2, function(x) { lbl <- attr(x, "label"); if(is.null(lbl)) NA_character_ else as.character(lbl) }),
stringsAsFactors = FALSE
)
# Compare metadata using diffdf
diffdf(meta1, meta2, keys = "variable")
}
# Run metadata comparison
meta_compare(ae_orig_modified, ae_new_modified)
Explanation (R):
compare_structure()function focuses solely on structure, not data values- Compares column names, types, dimensions, and column-level attributes
diffdf::compare_df_meta()compares metadata between dataframes- Custom functions can extract and compare specific structural elements
- Useful for validating consistent data structures across datasets
Expected Output of Structure Comparison:
Structure Comparison: Original vs New
---------------------------------------
Dimensions: 4x5 vs 4x6
Columns only in New :
- AECAT
Column type differences:
column type1 type2
AESTDT AESTDT Date character
AESEV AESEV factor character
Column attribute differences found in 2 columns
> meta_compare(ae_orig_modified, ae_new_modified)
Differences found between the objects!
Summary of BASE and COMPARE
====================================
PROPERTY BASE COMP
------------------------------------
Name meta1 meta2
Class data.frame data.frame
Rows(#) 5 6
Columns(#) 4 4
------------------------------------
There are rows in COMPARE that are not in BASE !!
==========
variable
----------
AECAT
----------
Not all Values Compared Equal
=============================
Variable No of Differences
-----------------------------
type 2
length 3
-----------------------------
=======================================
VARIABLE variable BASE COMPARE
---------------------------------------
type AESEV factor character
type AESTDT Date character
---------------------------------------
===================================
VARIABLE variable BASE COMPARE
-----------------------------------
length AEDECOD 8 9
length AESEV NA 8
length AESTDT NA 10
-----------------------------------
6. Best Practices for Dataset Comparison
Establish Clear Comparison Goals
- Define what constitutes a "match" vs. "difference"
- Determine which differences matter for your analysis
- Consider tolerance thresholds for numeric data
Handle Keys Properly
- Use appropriate key variables for observation matching
- Consider composite keys when a single field isn't unique
- Check for duplicate keys before comparison
Validate Structures First
- Compare dataset structures before comparing values
- Handle missing variables appropriately
- Check variable types for compatibility
Document Comparison Parameters
- Record comparison options used
- Document tolerance levels and exclusions
- Include comparison settings in analysis metadata
Optimize for Large Data
- Consider sampling for initial comparisons
- Use chunking for large datasets
- Apply parallel processing for complex comparisons
Create Meaningful Reports
- Summarize differences in context
- Highlight critical differences
- Include counts and percentages for context
Follow Up on Differences
- Investigate unexpected differences
- Create standardized workflows for resolving differences
- Document justified differences and their resolutions
**Resource download links**
1.5.8.-PROC-COMPARE-in-SAS-vs-R-Equivalents.zip