contact@a2zlearners.com

1.4. Migrating from SAS to R: A Skill Conversion Guide

1.4.4. SAS MERGE vs. dplyr joins

Both SAS's MERGE statement and R's dplyr package (specifically its *_join() functions) are used to combine data from multiple sources. While they achieve similar goals, they have distinct philosophies and functionalities.

Sample Input Tables

prod_id name
P001 Widget
P002 Gadget
P003 Doohickey
prod_id price
P001 10.00
P003 12.50

1. Full Merge / full_join()

Behavior

  • Includes every prod_id from both tables
  • Unmatched columns set to missing (. in SAS, NA in R)
  • Builds a superset of keys (union of left & right)
  • Equivalent to SQL FULL OUTER JOIN
/* SAS: full merge */
proc sort data=products;     by prod_id; run;
proc sort data=current_price; by prod_id; run;

data all;
  merge products current_price;
  by prod_id;
run;

proc print data=all; run;

Expected Output (SAS “all”):

prod_id name price
P001 Widget 10.00
P002 Gadget .
P003 Doohickey 12.50
library(dplyr)

products <- data.frame(
  prod_id = c("P001", "P002", "P003"),
  name = c("Widget", "Gadget", "Doohickey")
)
current_price <- data.frame(
  prod_id = c("P001", "P003"),
  price = c(10.00, 12.50)
)

all <- full_join(products, current_price, by = "prod_id")
print(all)
# returns the data frame 'all'

Returns (R “all”):

prod_id name price
P001 Widget 10.00
P002 Gadget NA
P003 Doohickey 12.50

2. Inner Merge / inner_join()

Behavior

  • Keeps only keys present in both tables
  • Equivalent to SQL INNER JOIN
  • Filters out any row missing in either side
  • Ensures complete data from both sources for returned keys
/* SAS: inner merge */
proc sort data=products;     by prod_id; run;
proc sort data=current_price; by prod_id; run;

data both;
  merge products(in=a) current_price(in=b);
  by prod_id;
  if a and b;
run;

proc print data=both; run;

Expected Output (“both”):

prod_id name price
P001 Widget 10.00
library(dplyr)

products <- data.frame(
  prod_id = c("P001", "P002", "P003"),
  name = c("Widget", "Gadget", "Doohickey")
)
current_price <- data.frame(
  prod_id = c("P001", "P003"),
  price = c(10.00, 12.50)
)

both <- inner_join(products, current_price, by = "prod_id")
print(both)
# returns 'both'

Returns (both):

prod_id name price
P001 Widget 10.00

3. Left-Only Merge / left_join()

Behavior

  • All rows from products (left)
  • Matching prices if available; otherwise missing
  • Preserves left-table row order by default
  • Equivalent to SQL LEFT OUTER JOIN
/* SAS: left-only merge */
proc sort data=products;     by prod_id; run;
proc sort data=current_price; by prod_id; run;

data prod_with_price;
  merge products(in=a) current_price;
  by prod_id;
  if a;
run;

proc print data=prod_with_price; run;

Expected Output:

prod_id name price
P001 Widget 10.00
P002 Gadget .
library(dplyr)

products <- data.frame(
  prod_id = c("P001", "P002", "P003"),
  name = c("Widget", "Gadget", "Doohickey")
)
current_price <- data.frame(
  prod_id = c("P001", "P003"),
  price = c(10.00, 12.50)
)

prod_with_price <- left_join(products, current_price, by = "prod_id")
print(prod_with_price)
# returns 'prod_with_price'

Returns:

prod_id name price
P001 Widget 10.00
P002 Gadget NA

4. Right-Only Merge / right_join()

Behavior

  • All rows from current_price (right)
  • Matching product names if available; otherwise missing
  • Preserves right-table row order for unmatched keys
  • Equivalent to SQL RIGHT OUTER JOIN
/* SAS: right-only merge */
proc sort data=products;     by prod_id; run;
proc sort data=current_price; by prod_id; run;

data price_with_prod;
  merge products current_price(in=b);
  by prod_id;
  if b;
run;

proc print data=price_with_prod; run;

Expected Output:

prod_id name price
P001 Widget 10.00
P003 Doohickey 12.50
library(dplyr)

products <- data.frame(
  prod_id = c("P001", "P002", "P003"),
  name = c("Widget", "Gadget", "Doohickey")
)
current_price <- data.frame(
  prod_id = c("P001", "P003"),
  price = c(10.00, 12.50)
)

price_with_prod <- right_join(products, current_price, by = "prod_id")
print(price_with_prod)
# returns 'price_with_prod'

Returns:

prod_id name price
P001 Widget 10.00
P003 Doohickey 12.50

5. Anti-Join / “Only-in-Left”

Behavior

  • Rows in products not in current_price
  • Equivalent to SQL LEFT JOIN … WHERE right.key IS NULL
  • Useful for identifying or excluding unmatched left-side records
proc sort data=products;     by prod_id; run;
proc sort data=current_price; by prod_id; run;

data only_products;
  merge products(in=a) current_price(in=b);
  by prod_id;
  if a and not b;
run;

proc print data=only_products; run;

Expected Output:

prod_id name
P002 Gadget
library(dplyr)

products <- data.frame(
  prod_id = c("P001", "P002", "P003"),
  name = c("Widget", "Gadget", "Doohickey")
)
current_price <- data.frame(
  prod_id = c("P001", "P003"),
  price = c(10.00, 12.50)
)

anti <- anti_join(products, current_price, by = "prod_id")
print(anti)
# returns 'anti'

Returns:

prod_id name
P002 Gadget

6. Semi-Join / “Filter Left by Match”

Behavior

  • Keep products that have a price
  • Show only columns from products
  • Equivalent to SQL WHERE EXISTS (SELECT 1 …)
  • Efficient for filtering without bringing extra columns
proc sort data=products;     by prod_id; run;
proc sort data=current_price; by prod_id; run;

data semi_products(keep=prod_id name);
  merge products(in=a) current_price(in=b);
  by prod_id;
  if a and b;
run;

proc print data=semi_products; run;

Expected Output:

prod_id name
P001 Widget
library(dplyr)

products <- data.frame(
  prod_id = c("P001", "P002", "P003"),
  name = c("Widget", "Gadget", "Doohickey")
)
current_price <- data.frame(
  prod_id = c("P001", "P003"),
  price = c(10.00, 12.50)
)

semi <- semi_join(products, current_price, by = "prod_id")
print(semi)
# returns 'semi'

Returns:

prod_id name
P001 Widget

7. Coalescing Columns (SAS-Style Overwrite)

Behavior

  • Simulate SAS’s “right overwrites left” for duplicate column names
  • price.x = left-side, price.y = right-side after full_join()
  • coalesce() picks the first non-missing value across arguments
library(dplyr)

products <- data.frame(
  prod_id = c("P001", "P002", "P003"),
  name = c("Widget", "Gadget", "Doohickey")
)
current_price <- data.frame(
  prod_id = c("P001", "P003"),
  price = c(10.00, 12.50)
)

tj <- full_join(products, current_price, by = "prod_id")
tj <- tj %>%
  mutate(price = coalesce(price, price)) %>%
  select(prod_id, name, price)
print(tj)

Returns (tj):

prod_id name price
P001 Widget 10.00
P002 Gadget NA
P003 Doohickey 12.50

8. Multi-Key Merge / multi-column join

Behavior

  • Match on all specified keys (e.g. region + prod_id)
  • Only rows where every key is equal will merge
  • Equivalent to SQL JOIN … ON key1=… AND key2=…

Input:

region prod_id qty
West P001 5
region prod_id price
West P001 10.00
proc sort data=A; by region prod_id; run;
proc sort data=B; by region prod_id; run;

data AB;
  merge A B;
  by region prod_id;
run;

proc print data=AB; run;

Expected Output:

region prod_id qty price
West P001 5 10.00
library(dplyr)

A <- data.frame(
  region = "West",
  prod_id = "P001",
  qty = 5
)
B <- data.frame(
  region = "West",
  prod_id = "P001",
  price = 10.00
)

AB <- full_join(A, B, by = c("region", "prod_id"))
print(AB)
# returns 'AB'

Returns:

region prod_id qty price
West P001 5 10.00

9. SAS PROC SQL vs. dplyr

Behavior

  • SQL FULL JOIN mirrors full_join()
  • coalesce() in SAS and R both choose the first non-missing key
  • Shows how to replicate DATA-step merges in PROC SQL
proc sql;
  create table sql_full as
  select coalesce(a.prod_id, b.prod_id) as prod_id,
         a.name, b.price
  from products as a
  full join current_price as b
    on a.prod_id = b.prod_id;
quit;

proc print data=sql_full; run;

Expected Output:

prod_id name price
P001 Widget 10.00
P002 Gadget .
P003 Doohickey 12.50
# R/dplyr: SQL-style full join + coalesce
# Define base product info
products_c <- data.frame(
  prod_id = c("P001", "P002", "P003"),
  name = c("Widget", "Gadget", NA)
)

# Define alternate product name and pricing
current_price_c <- data.frame(
  prod_id = c("P001", "P003"),
  name = c("SuperWidget", "MegaDoohickey"),
  price = c(10.00, 12.50)
)

# Perform full join and use coalesce on overlapping 'name'
sql_full <- full_join(products_c, current_price_c, by = "prod_id") %>%
  mutate(name = coalesce(name.x, name.y)) %>%
  select(prod_id, name, price)

print(sql_full)

Returns (sql_full):

prod_id name price
P001 Widget 10.00
P002 Gadget NA
P003 MegaDoohickey 12.50

**Comparison Table SAS MERGE vs dplr joins**

Feature SAS MERGE dplyr joins
Syntax data X; merge A B; by key; run; X <- join_fn(A, B, by="key")
Pre-sort required Yes No
Row selection flags (in=a), (in=b) Implicit in join choice
Missing value symbol . NA
Overwrite vs suffixes Right table overwrites left Keeps both via .x / .y
Order of output Sorted by BY variable Original order + unmatched rows
Multi-key merge by k1 k2; by = c("k1","k2")
SQL-style alternative proc sql; … full join …; Same SQL or *_join()
Anti / Semi Via IN= flags + IF subsets anti_join(), semi_join()
Column coalesce Manual assignment (price=b.price;) coalesce(price.x, price.y)

10. Tagging Each Record’s Source

Goal: After merging, add a variable source indicating where the row came from:

  • "left" = only in products
  • "right" = only in current_price
  • "both" = in both

A. SAS

proc sort data=products;     by prod_id; run;
proc sort data=current_price; by prod_id; run;

data tagged;
  merge products(in=inP) current_price(in=inC);
  by prod_id;

  length source $5;
  select;
    when (inP and inC)          source = "both";
    when (inP and not inC)      source = "left";
    when (not inP and inC)      source = "right";
    otherwise                   source = "";
  end;
run;

proc print data=tagged noobs; run;

Returns (SAS tagged):

prod_id name price source
P001 Widget 10.00 both
P002 Gadget . left
P003 Doohickey 12.50 right

B. R / dplyr

Method 1: case_when() on is.na()

library(dplyr)

products <- data.frame(
  prod_id = c("P001", "P002", "P003"),
  name = c("Widget", "Gadget", "Doohickey")
)
current_price <- data.frame(
  prod_id = c("P001", "P003"),
  price = c(10.00, 12.50)
)

tagged <- full_join(products, current_price, by = "prod_id") %>%
  mutate(source = case_when(
    !is.na(name) &  !is.na(price) ~ "both",
    !is.na(name) &   is.na(price) ~ "left",
     is.na(name) &  !is.na(price) ~ "right",
    TRUE                          ~ NA_character_
  ))

print(tagged)
# returns 'tagged'

Method 2: Flag columns before join

library(dplyr)

products <- data.frame(
  prod_id = c("P001", "P002", "P003"),
  name = c("Widget", "Gadget", "Doohickey")
)
current_price <- data.frame(
  prod_id = c("P001", "P003"),
  price = c(10.00, 12.50)
)

p2 <- products %>% mutate(in_products = TRUE)
c2 <- current_price %>% mutate(in_price = TRUE)

tagged2 <- full_join(p2, c2, by="prod_id") %>%
  mutate(source = case_when(
    in_products & in_price  ~ "both",
    in_products & !in_price ~ "left",
    !in_products & in_price ~ "right"
  )) %>%
  select(prod_id, name, price, source)

print(tagged2)
# returns 'tagged2'

Returns (both R methods):

prod_id name price source
P001 Widget 10.00 both
P002 Gadget NA left
P003 Doohickey 12.50 right

11. Explore additional information

  • data.table High-Speed Joins
    • Convert with setDT() and index via setkey().
    DT1[DT2] = fast right join; DT2[DT1] = fast left join.
    • Example:

    library(data.table)
    setDT(products); setDT(current_price)
    setkey(products, prod_id); setkey(current_price, prod_id)
    
    # Full join
    dt_all <- merge(products, current_price, by="prod_id", all=TRUE)
    
    # Right join
    dt_price_with_prod <- products[current_price, nomatch=0]
    
  • fuzzyjoin for Approximate Key Matching
    • Use stringdist_join() to match on string distance; interval_join() for ranges.
    • Example:

    library(fuzzyjoin)
    
    # Join where prod_id differs by ≤1 edit
    fuzzy_products <- stringdist_join(
      products, current_price,
      by = "prod_id",
      max_dist = 1,
      distance_col = "dist"
    )
    
  • Performance Tuning for Very Large Tables
    • Chunk large tables by hash or range, join in parallel, then combine with rbind().
    • In SAS, use HASH objects; in R, leverage data.table keys or on-disk tools (DuckDB, arrow).
    • Profile memory vs. CPU, adjust nthreads or set options(datatable.print.nrows=…).

  • Encapsulate Join Logic in Reusable Functions/Macros
    SAS Macro:

    %macro do_join(left=, right=, out=, type=full, by=);
      proc sort data=&left; by &by; run;
      proc sort data=&right; by &by; run;
      data &out;
        merge &left (in=a) &right (in=b);
        by &by;
        %if &type=inner %then %do; if a and b; %end;
        %else %if &type=left  %then %do; if a; %end;
        %else %if &type=right %then %do; if b; %end;
      run;
    %mend;
    
    %do_join(left=products, right=current_price,
             out=both, type=inner, by=prod_id);
    

    R Function:

    join_data <- function(df1, df2, by, type="full") {
      library(dplyr)
      join_fn <- switch(type,
                        full = full_join,
                        inner = inner_join,
                        left  = left_join,
                        right = right_join)
      join_fn(df1, df2, by = by)
    }
    
    # Usage:
    both <- join_data(products, current_price,
                      by = "prod_id", type = "inner")
    

12. Common Error/Warning Scenarios in SAS MERGE and dplyr Joins

12.1. Key Not Unique (Duplicate Keys)

Scenario:
Both SAS and dplyr expect keys to be unique for a simple merge/join. Duplicate keys can cause unexpected row duplication or mismatches.

Input Example:

prod_id name
P001 Widget
P001 WidgetX
P002 Gadget
prod_id price
P001 10.00
P002 15.00

R Code:

products_dup <- tibble(prod_id = c("P001", "P001", "P002"),
                       name = c("Widget", "WidgetX", "Gadget"))
current_price <- tibble(prod_id = c("P001", "P002"),
                        price = c(10.00, 15.00))

result <- left_join(products_dup, current_price, by = "prod_id")
print(result)

Output:

prod_id name price
P001 Widget 10.00
P001 WidgetX 10.00
P002 Gadget 15.00

Solution:
If duplicates are unintended, remove them before joining:

products_nodup <- products_dup %>% distinct(prod_id, .keep_all = TRUE)
result <- left_join(products_nodup, current_price, by = "prod_id")

If duplicates are valid, be aware that all combinations will be returned.


12.2. Missing Key Columns

Scenario:
The join key is missing in one or both tables.

Input Example:

prod_id name
P001 Widget
P002 Gadget
id price
P001 10.00
P002 15.00

R Code:

products <- tibble(prod_id = c("P001", "P002"), name = c("Widget", "Gadget"))
current_price <- tibble(id = c("P001", "P002"), price = c(10.00, 15.00))

# This will give a warning or error
result <- left_join(products, current_price, by = "prod_id")

Output:
Error: by = "prod_id" not found in one of the tables.

Solution:
Rename columns to match before joining:

current_price <- current_price %>% rename(prod_id = id)
result <- left_join(products, current_price, by = "prod_id")

12.3. Type Mismatch on Join Key

Scenario:
Key columns have different types (e.g., character vs. numeric).

Input Example:

prod_id (character) name
P001 Widget
P002 Gadget
prod_id (numeric) price
1 10.00
2 15.00

R Code:

products <- tibble(prod_id = c("P001", "P002"), name = c("Widget", "Gadget"))
current_price <- tibble(prod_id = c(1, 2), price = c(10.00, 15.00))

result <- left_join(products, current_price, by = "prod_id")
print(result)

Output:

prod_id name price
P001 Widget NA
P002 Gadget NA

Solution:
Convert types to match before joining:

current_price <- current_price %>% mutate(prod_id = paste0("P00", prod_id))
result <- left_join(products, current_price, by = "prod_id")

12.4. Many-to-Many Join (Multiple Matches)

Scenario:
Both tables have duplicate keys, resulting in a Cartesian product for those keys.

Input Example:

prod_id name
P001 Widget
P001 WidgetX
prod_id price
P001 10.00
P001 12.00

R Code:

products <- tibble(prod_id = c("P001", "P001"), name = c("Widget", "WidgetX"))
current_price <- tibble(prod_id = c("P001", "P001"), price = c(10.00, 12.00))

result <- left_join(products, current_price, by = "prod_id")
print(result)

Output:

prod_id name price
P001 Widget 10.00
P001 Widget 12.00
P001 WidgetX 10.00
P001 WidgetX 12.00

Solution:
If only one match is desired, deduplicate or aggregate before joining.


12.5. Handling NA/Missing Values in Keys

Scenario:
NA values in key columns can cause warnings or unexpected results.

Input Example:

prod_id name
P001 Widget
NA Gadget
prod_id price
P001 10.00
NA 15.00

R Code:

products <- tibble(prod_id = c("P001", NA), name = c("Widget", "Gadget"))
current_price <- tibble(prod_id = c("P001", NA), price = c(10.00, 15.00))

result <- left_join(products, current_price, by = "prod_id")
print(result)

Output:

prod_id name price
P001 Widget 10.00
NA Gadget 15.00

Solution:
Remove or impute NA keys before joining if not desired:

products_clean <- products %>% filter(!is.na(prod_id))
current_price_clean <- current_price %>% filter(!is.na(prod_id))
result <- left_join(products_clean, current_price_clean, by = "prod_id")

12.6. Column Name Conflicts

Scenario:
Non-key columns with the same name in both tables will be suffixed .x and .y in dplyr.

Input Example:

prod_id name
P001 Widget
P002 Gadget
prod_id name price
P001 WidgetA 10.00
P002 GadgetB 15.00

R Code:

products <- tibble(prod_id = c("P001", "P002"), name = c("Widget", "Gadget"))
current_price <- tibble(prod_id = c("P001", "P002"), name = c("WidgetA", "GadgetB"), price = c(10.00, 15.00))

result <- left_join(products, current_price, by = "prod_id")
print(result)

Output:

prod_id name.x name.y price
P001 Widget WidgetA 10.00
P002 Gadget GadgetB 15.00

Solution:
Use mutate() and coalesce() to resolve:

result <- result %>%
  mutate(name = coalesce(name.y, name.x)) %>%
  select(prod_id, name, price)

12.7. dplyr Warning: Duplicate Key Columns

Scenario:
dplyr will warn if the join key appears in both tables but with different types or duplicated names.

Solution:
Ensure join keys are named and typed identically in both tables.


12.8. General Best Practices
  • Always check for duplicate keys before joining.
  • Ensure key columns are present and of the same type.
  • Handle NA values in keys appropriately.
  • Use distinct() or group_by() + summarise() to deduplicate or aggregate as needed.
  • For large tables, consider using data.table for performance.

12.9. Summary Table: Error/Warning Scenarios and Solutions
Scenario Error/Warning Solution/Code Example
Duplicate keys Unexpected rows distinct() or aggregate
Missing key columns Error rename() columns
Type mismatch No matches mutate() to convert types
Many-to-many join Cartesian product Deduplicate before join
NA in keys NA matches filter(!is.na(key))
Column name conflict .x/.y suffixes coalesce() or select()
SAS not sorted Incorrect merge proc sort before merge
dplyr duplicate key Warning Ensure unique, matching keys

**Resource download links**

1.4.4.-SAS-MERGE-vs-dplyr-joins.zip