1.4.4. SAS MERGE vs. dplyr joins

1.4. Migrating from SAS to R: A Skill Conversion Guide

1.4.4. SAS MERGE vs. dplyr joins

Both SAS's MERGE statement and R's dplyr package (specifically its *_join() functions) are used to combine data from multiple sources. While they achieve similar goals, they have distinct philosophies and functionalities.

Sample Input Tables

prod_id	name
P001	Widget
P002	Gadget
P003	Doohickey

prod_id	price
P001	10.00
P003	12.50

1. Full Merge / full_join()

Behavior

Includes every prod_id from both tables
Unmatched columns set to missing (. in SAS, NA in R)
Builds a superset of keys (union of left & right)
Equivalent to SQL FULL OUTER JOIN

/* SAS: full merge */
proc sort data=products;     by prod_id; run;
proc sort data=current_price; by prod_id; run;

data all;
  merge products current_price;
  by prod_id;
run;

proc print data=all; run;

Expected Output (SAS “all”):

prod_id	name	price
P001	Widget	10.00
P002	Gadget	.
P003	Doohickey	12.50

library(dplyr)

products <- data.frame(
  prod_id = c("P001", "P002", "P003"),
  name = c("Widget", "Gadget", "Doohickey")
)
current_price <- data.frame(
  prod_id = c("P001", "P003"),
  price = c(10.00, 12.50)
)

all <- full_join(products, current_price, by = "prod_id")
print(all)
# returns the data frame 'all'

Returns (R “all”):

prod_id	name	price
P001	Widget	10.00
P002	Gadget	NA
P003	Doohickey	12.50

2. Inner Merge / inner_join()

Behavior

Keeps only keys present in both tables
Equivalent to SQL INNER JOIN
Filters out any row missing in either side
Ensures complete data from both sources for returned keys

/* SAS: inner merge */
proc sort data=products;     by prod_id; run;
proc sort data=current_price; by prod_id; run;

data both;
  merge products(in=a) current_price(in=b);
  by prod_id;
  if a and b;
run;

proc print data=both; run;

Expected Output (“both”):

prod_id	name	price
P001	Widget	10.00

library(dplyr)

products <- data.frame(
  prod_id = c("P001", "P002", "P003"),
  name = c("Widget", "Gadget", "Doohickey")
)
current_price <- data.frame(
  prod_id = c("P001", "P003"),
  price = c(10.00, 12.50)
)

both <- inner_join(products, current_price, by = "prod_id")
print(both)
# returns 'both'

Returns (both):

prod_id	name	price
P001	Widget	10.00

3. Left-Only Merge / left_join()

Behavior

All rows from products (left)
Matching prices if available; otherwise missing
Preserves left-table row order by default
Equivalent to SQL LEFT OUTER JOIN

/* SAS: left-only merge */
proc sort data=products;     by prod_id; run;
proc sort data=current_price; by prod_id; run;

data prod_with_price;
  merge products(in=a) current_price;
  by prod_id;
  if a;
run;

proc print data=prod_with_price; run;

Expected Output:

prod_id	name	price
P001	Widget	10.00
P002	Gadget	.

library(dplyr)

products <- data.frame(
  prod_id = c("P001", "P002", "P003"),
  name = c("Widget", "Gadget", "Doohickey")
)
current_price <- data.frame(
  prod_id = c("P001", "P003"),
  price = c(10.00, 12.50)
)

prod_with_price <- left_join(products, current_price, by = "prod_id")
print(prod_with_price)
# returns 'prod_with_price'

Returns:

prod_id	name	price
P001	Widget	10.00
P002	Gadget	NA

4. Right-Only Merge / right_join()

Behavior

All rows from current_price (right)
Matching product names if available; otherwise missing
Preserves right-table row order for unmatched keys
Equivalent to SQL RIGHT OUTER JOIN

/* SAS: right-only merge */
proc sort data=products;     by prod_id; run;
proc sort data=current_price; by prod_id; run;

data price_with_prod;
  merge products current_price(in=b);
  by prod_id;
  if b;
run;

proc print data=price_with_prod; run;

Expected Output:

prod_id	name	price
P001	Widget	10.00
P003	Doohickey	12.50

library(dplyr)

products <- data.frame(
  prod_id = c("P001", "P002", "P003"),
  name = c("Widget", "Gadget", "Doohickey")
)
current_price <- data.frame(
  prod_id = c("P001", "P003"),
  price = c(10.00, 12.50)
)

price_with_prod <- right_join(products, current_price, by = "prod_id")
print(price_with_prod)
# returns 'price_with_prod'

Returns:

prod_id	name	price
P001	Widget	10.00
P003	Doohickey	12.50

5. Anti-Join / “Only-in-Left”

Behavior

Rows in products not in current_price
Equivalent to SQL LEFT JOIN … WHERE right.key IS NULL
Useful for identifying or excluding unmatched left-side records

proc sort data=products;     by prod_id; run;
proc sort data=current_price; by prod_id; run;

data only_products;
  merge products(in=a) current_price(in=b);
  by prod_id;
  if a and not b;
run;

proc print data=only_products; run;

Expected Output:

prod_id	name
P002	Gadget

library(dplyr)

products <- data.frame(
  prod_id = c("P001", "P002", "P003"),
  name = c("Widget", "Gadget", "Doohickey")
)
current_price <- data.frame(
  prod_id = c("P001", "P003"),
  price = c(10.00, 12.50)
)

anti <- anti_join(products, current_price, by = "prod_id")
print(anti)
# returns 'anti'

Returns:

prod_id	name
P002	Gadget

6. Semi-Join / “Filter Left by Match”

Behavior

Keep products that have a price
Show only columns from products
Equivalent to SQL WHERE EXISTS (SELECT 1 …)
Efficient for filtering without bringing extra columns

proc sort data=products;     by prod_id; run;
proc sort data=current_price; by prod_id; run;

data semi_products(keep=prod_id name);
  merge products(in=a) current_price(in=b);
  by prod_id;
  if a and b;
run;

proc print data=semi_products; run;

Expected Output:

prod_id	name
P001	Widget

library(dplyr)

products <- data.frame(
  prod_id = c("P001", "P002", "P003"),
  name = c("Widget", "Gadget", "Doohickey")
)
current_price <- data.frame(
  prod_id = c("P001", "P003"),
  price = c(10.00, 12.50)
)

semi <- semi_join(products, current_price, by = "prod_id")
print(semi)
# returns 'semi'

Returns:

prod_id	name
P001	Widget

7. Coalescing Columns (SAS-Style Overwrite)

Behavior

Simulate SAS’s “right overwrites left” for duplicate column names
price.x = left-side, price.y = right-side after full_join()
coalesce() picks the first non-missing value across arguments

library(dplyr)

products <- data.frame(
  prod_id = c("P001", "P002", "P003"),
  name = c("Widget", "Gadget", "Doohickey")
)
current_price <- data.frame(
  prod_id = c("P001", "P003"),
  price = c(10.00, 12.50)
)

tj <- full_join(products, current_price, by = "prod_id")
tj <- tj %>%
  mutate(price = coalesce(price, price)) %>%
  select(prod_id, name, price)
print(tj)

Returns (tj):

prod_id	name	price
P001	Widget	10.00
P002	Gadget	NA
P003	Doohickey	12.50

8. Multi-Key Merge / multi-column join

Behavior

Match on all specified keys (e.g. region + prod_id)
Only rows where every key is equal will merge
Equivalent to SQL JOIN … ON key1=… AND key2=…

Input:

region	prod_id	qty
West	P001	5

region	prod_id	price
West	P001	10.00

proc sort data=A; by region prod_id; run;
proc sort data=B; by region prod_id; run;

data AB;
  merge A B;
  by region prod_id;
run;

proc print data=AB; run;

Expected Output:

region	prod_id	qty	price
West	P001	5	10.00

library(dplyr)

A <- data.frame(
  region = "West",
  prod_id = "P001",
  qty = 5
)
B <- data.frame(
  region = "West",
  prod_id = "P001",
  price = 10.00
)

AB <- full_join(A, B, by = c("region", "prod_id"))
print(AB)
# returns 'AB'

Returns:

region	prod_id	qty	price
West	P001	5	10.00

9. SAS PROC SQL vs. dplyr

Behavior

SQL FULL JOIN mirrors full_join()
coalesce() in SAS and R both choose the first non-missing key
Shows how to replicate DATA-step merges in PROC SQL

proc sql;
  create table sql_full as
  select coalesce(a.prod_id, b.prod_id) as prod_id,
         a.name, b.price
  from products as a
  full join current_price as b
    on a.prod_id = b.prod_id;
quit;

proc print data=sql_full; run;

Expected Output:

prod_id	name	price
P001	Widget	10.00
P002	Gadget	.
P003	Doohickey	12.50

# R/dplyr: SQL-style full join + coalesce
# Define base product info
products_c <- data.frame(
  prod_id = c("P001", "P002", "P003"),
  name = c("Widget", "Gadget", NA)
)

# Define alternate product name and pricing
current_price_c <- data.frame(
  prod_id = c("P001", "P003"),
  name = c("SuperWidget", "MegaDoohickey"),
  price = c(10.00, 12.50)
)

# Perform full join and use coalesce on overlapping 'name'
sql_full <- full_join(products_c, current_price_c, by = "prod_id") %>%
  mutate(name = coalesce(name.x, name.y)) %>%
  select(prod_id, name, price)

print(sql_full)

Returns (sql_full):

prod_id	name	price
P001	Widget	10.00
P002	Gadget	NA
P003	MegaDoohickey	12.50

Comparison Table SAS MERGE vs dplr joins

Feature	SAS MERGE	dplyr joins
Syntax	`data X; merge A B; by key; run;`	`X <- join_fn(A, B, by="key")`
Pre-sort required	Yes	No
Row selection flags	`(in=a)`, `(in=b)`	Implicit in join choice
Missing value symbol	`.`	`NA`
Overwrite vs suffixes	Right table overwrites left	Keeps both via `.x` / `.y`
Order of output	Sorted by BY variable	Original order + unmatched rows
Multi-key merge	`by k1 k2;`	`by = c("k1","k2")`
SQL-style alternative	`proc sql; … full join …;`	Same SQL or `*_join()`
Anti / Semi	Via IN= flags + IF subsets	`anti_join()`, `semi_join()`
Column coalesce	Manual assignment (`price=b.price;`)	`coalesce(price.x, price.y)`

10. Tagging Each Record’s Source

Goal: After merging, add a variable source indicating where the row came from:

"left" = only in products
"right" = only in current_price
"both" = in both

A. SAS

proc sort data=products;     by prod_id; run;
proc sort data=current_price; by prod_id; run;

data tagged;
  merge products(in=inP) current_price(in=inC);
  by prod_id;

  length source $5;
  select;
    when (inP and inC)          source = "both";
    when (inP and not inC)      source = "left";
    when (not inP and inC)      source = "right";
    otherwise                   source = "";
  end;
run;

proc print data=tagged noobs; run;

Returns (SAS tagged):

prod_id	name	price	source
P001	Widget	10.00	both
P002	Gadget	.	left
P003	Doohickey	12.50	right

B. R / dplyr

Method 1: case_when() on is.na()

library(dplyr)

products <- data.frame(
  prod_id = c("P001", "P002", "P003"),
  name = c("Widget", "Gadget", "Doohickey")
)
current_price <- data.frame(
  prod_id = c("P001", "P003"),
  price = c(10.00, 12.50)
)

tagged <- full_join(products, current_price, by = "prod_id") %>%
  mutate(source = case_when(
    !is.na(name) &  !is.na(price) ~ "both",
    !is.na(name) &   is.na(price) ~ "left",
     is.na(name) &  !is.na(price) ~ "right",
    TRUE                          ~ NA_character_
  ))

print(tagged)
# returns 'tagged'

Method 2: Flag columns before join

library(dplyr)

products <- data.frame(
  prod_id = c("P001", "P002", "P003"),
  name = c("Widget", "Gadget", "Doohickey")
)
current_price <- data.frame(
  prod_id = c("P001", "P003"),
  price = c(10.00, 12.50)
)

p2 <- products %>% mutate(in_products = TRUE)
c2 <- current_price %>% mutate(in_price = TRUE)

tagged2 <- full_join(p2, c2, by="prod_id") %>%
  mutate(source = case_when(
    in_products & in_price  ~ "both",
    in_products & !in_price ~ "left",
    !in_products & in_price ~ "right"
  )) %>%
  select(prod_id, name, price, source)

print(tagged2)
# returns 'tagged2'

Returns (both R methods):

prod_id	name	price	source
P001	Widget	10.00	both
P002	Gadget	NA	left
P003	Doohickey	12.50	right

11. Explore additional information

data.table High-Speed Joins
• Convert with setDT() and index via setkey().
• DT1[DT2] = fast right join; DT2[DT1] = fast left join.
• Example:

library(data.table)
setDT(products); setDT(current_price)
setkey(products, prod_id); setkey(current_price, prod_id)

# Full join
dt_all <- merge(products, current_price, by="prod_id", all=TRUE)

# Right join
dt_price_with_prod <- products[current_price, nomatch=0]

fuzzyjoin for Approximate Key Matching
• Use stringdist_join() to match on string distance; interval_join() for ranges.
• Example:

library(fuzzyjoin)

# Join where prod_id differs by ≤1 edit
fuzzy_products <- stringdist_join(
  products, current_price,
  by = "prod_id",
  max_dist = 1,
  distance_col = "dist"
)

Performance Tuning for Very Large Tables
• Chunk large tables by hash or range, join in parallel, then combine with rbind().
• In SAS, use HASH objects; in R, leverage data.table keys or on-disk tools (DuckDB, arrow).
• Profile memory vs. CPU, adjust nthreads or set options(datatable.print.nrows=…).

Encapsulate Join Logic in Reusable Functions/Macros
• SAS Macro:

%macro do_join(left=, right=, out=, type=full, by=);
  proc sort data=&left; by &by; run;
  proc sort data=&right; by &by; run;
  data &out;
    merge &left (in=a) &right (in=b);
    by &by;
    %if &type=inner %then %do; if a and b; %end;
    %else %if &type=left  %then %do; if a; %end;
    %else %if &type=right %then %do; if b; %end;
  run;
%mend;

%do_join(left=products, right=current_price,
         out=both, type=inner, by=prod_id);

• R Function:

join_data <- function(df1, df2, by, type="full") {
  library(dplyr)
  join_fn <- switch(type,
                    full = full_join,
                    inner = inner_join,
                    left  = left_join,
                    right = right_join)
  join_fn(df1, df2, by = by)
}

# Usage:
both <- join_data(products, current_price,
                  by = "prod_id", type = "inner")

12. Common Error/Warning Scenarios in SAS MERGE and dplyr Joins

12.1. Key Not Unique (Duplicate Keys)

Scenario:
Both SAS and dplyr expect keys to be unique for a simple merge/join. Duplicate keys can cause unexpected row duplication or mismatches.

Input Example:

prod_id	name
P001	Widget
P001	WidgetX
P002	Gadget

prod_id	price
P001	10.00
P002	15.00

R Code:

products_dup <- tibble(prod_id = c("P001", "P001", "P002"),
                       name = c("Widget", "WidgetX", "Gadget"))
current_price <- tibble(prod_id = c("P001", "P002"),
                        price = c(10.00, 15.00))

result <- left_join(products_dup, current_price, by = "prod_id")
print(result)

Output:

prod_id	name	price
P001	Widget	10.00
P001	WidgetX	10.00
P002	Gadget	15.00

Solution:
If duplicates are unintended, remove them before joining:

products_nodup <- products_dup %>% distinct(prod_id, .keep_all = TRUE)
result <- left_join(products_nodup, current_price, by = "prod_id")

If duplicates are valid, be aware that all combinations will be returned.

12.2. Missing Key Columns

Scenario:
The join key is missing in one or both tables.

Input Example:

prod_id	name
P001	Widget
P002	Gadget

id	price
P001	10.00
P002	15.00

R Code:

products <- tibble(prod_id = c("P001", "P002"), name = c("Widget", "Gadget"))
current_price <- tibble(id = c("P001", "P002"), price = c(10.00, 15.00))

# This will give a warning or error
result <- left_join(products, current_price, by = "prod_id")

Output:
Error: by = "prod_id" not found in one of the tables.

Solution:
Rename columns to match before joining:

current_price <- current_price %>% rename(prod_id = id)
result <- left_join(products, current_price, by = "prod_id")

12.3. Type Mismatch on Join Key

Scenario:
Key columns have different types (e.g., character vs. numeric).

Input Example:

prod_id (character)	name
P001	Widget
P002	Gadget

prod_id (numeric)	price
1	10.00
2	15.00

R Code:

products <- tibble(prod_id = c("P001", "P002"), name = c("Widget", "Gadget"))
current_price <- tibble(prod_id = c(1, 2), price = c(10.00, 15.00))

result <- left_join(products, current_price, by = "prod_id")
print(result)

Output:

prod_id	name	price
P001	Widget	NA
P002	Gadget	NA

Solution:
Convert types to match before joining:

current_price <- current_price %>% mutate(prod_id = paste0("P00", prod_id))
result <- left_join(products, current_price, by = "prod_id")

12.4. Many-to-Many Join (Multiple Matches)

Scenario:
Both tables have duplicate keys, resulting in a Cartesian product for those keys.

Input Example:

prod_id	name
P001	Widget
P001	WidgetX

prod_id	price
P001	10.00
P001	12.00

R Code:

products <- tibble(prod_id = c("P001", "P001"), name = c("Widget", "WidgetX"))
current_price <- tibble(prod_id = c("P001", "P001"), price = c(10.00, 12.00))

result <- left_join(products, current_price, by = "prod_id")
print(result)

Output:

prod_id	name	price
P001	Widget	10.00
P001	Widget	12.00
P001	WidgetX	10.00
P001	WidgetX	12.00

Solution:
If only one match is desired, deduplicate or aggregate before joining.

12.5. Handling NA/Missing Values in Keys

Scenario:
NA values in key columns can cause warnings or unexpected results.

Input Example:

prod_id	name
P001	Widget
NA	Gadget

prod_id	price
P001	10.00
NA	15.00

R Code:

products <- tibble(prod_id = c("P001", NA), name = c("Widget", "Gadget"))
current_price <- tibble(prod_id = c("P001", NA), price = c(10.00, 15.00))

result <- left_join(products, current_price, by = "prod_id")
print(result)

Output:

prod_id	name	price
P001	Widget	10.00
NA	Gadget	15.00

Solution:
Remove or impute NA keys before joining if not desired:

products_clean <- products %>% filter(!is.na(prod_id))
current_price_clean <- current_price %>% filter(!is.na(prod_id))
result <- left_join(products_clean, current_price_clean, by = "prod_id")

12.6. Column Name Conflicts

Scenario:
Non-key columns with the same name in both tables will be suffixed .x and .y in dplyr.

Input Example:

prod_id	name
P001	Widget
P002	Gadget

prod_id	name	price
P001	WidgetA	10.00
P002	GadgetB	15.00

R Code:

products <- tibble(prod_id = c("P001", "P002"), name = c("Widget", "Gadget"))
current_price <- tibble(prod_id = c("P001", "P002"), name = c("WidgetA", "GadgetB"), price = c(10.00, 15.00))

result <- left_join(products, current_price, by = "prod_id")
print(result)

Output:

prod_id	name.x	name.y	price
P001	Widget	WidgetA	10.00
P002	Gadget	GadgetB	15.00

Solution:
Use mutate() and coalesce() to resolve:

result <- result %>%
  mutate(name = coalesce(name.y, name.x)) %>%
  select(prod_id, name, price)

12.7. dplyr Warning: Duplicate Key Columns

Scenario:
dplyr will warn if the join key appears in both tables but with different types or duplicated names.

Solution:
Ensure join keys are named and typed identically in both tables.

12.8. General Best Practices

Always check for duplicate keys before joining.
Ensure key columns are present and of the same type.
Handle NA values in keys appropriately.
Use distinct() or group_by() + summarise() to deduplicate or aggregate as needed.
For large tables, consider using data.table for performance.

12.9. Summary Table: Error/Warning Scenarios and Solutions

Scenario	Error/Warning	Solution/Code Example
Duplicate keys	Unexpected rows	`distinct()` or aggregate
Missing key columns	Error	`rename()` columns
Type mismatch	No matches	`mutate()` to convert types
Many-to-many join	Cartesian product	Deduplicate before join
NA in keys	NA matches	`filter(!is.na(key))`
Column name conflict	`.x`/`.y` suffixes	`coalesce()` or `select()`
SAS not sorted	Incorrect merge	`proc sort` before `merge`
dplyr duplicate key	Warning	Ensure unique, matching keys

Resource download links

1.4.4.-SAS-MERGE-vs-dplyr-joins.zip

⁂

1.4. Migrating from SAS to R: A Skill Conversion Guide