1.4. Migrating from SAS to R: A Skill Conversion Guide
1.4.4. SAS MERGE vs. dplyr joins
Both SAS's MERGE statement and R's dplyr package (specifically its *_join() functions) are used to combine data from multiple sources. While they achieve similar goals, they have distinct philosophies and functionalities.
Sample Input Tables
| prod_id | name |
|---|---|
| P001 | Widget |
| P002 | Gadget |
| P003 | Doohickey |
| prod_id | price |
|---|---|
| P001 | 10.00 |
| P003 | 12.50 |
1. Full Merge / full_join()
Behavior
- Includes every
prod_idfrom both tables - Unmatched columns set to missing (
.in SAS,NAin R) - Builds a superset of keys (union of left & right)
- Equivalent to SQL FULL OUTER JOIN
/* SAS: full merge */
proc sort data=products; by prod_id; run;
proc sort data=current_price; by prod_id; run;
data all;
merge products current_price;
by prod_id;
run;
proc print data=all; run;
Expected Output (SAS “all”):
| prod_id | name | price |
|---|---|---|
| P001 | Widget | 10.00 |
| P002 | Gadget | . |
| P003 | Doohickey | 12.50 |
library(dplyr)
products <- data.frame(
prod_id = c("P001", "P002", "P003"),
name = c("Widget", "Gadget", "Doohickey")
)
current_price <- data.frame(
prod_id = c("P001", "P003"),
price = c(10.00, 12.50)
)
all <- full_join(products, current_price, by = "prod_id")
print(all)
# returns the data frame 'all'
Returns (R “all”):
| prod_id | name | price |
|---|---|---|
| P001 | Widget | 10.00 |
| P002 | Gadget | NA |
| P003 | Doohickey | 12.50 |
2. Inner Merge / inner_join()
Behavior
- Keeps only keys present in both tables
- Equivalent to SQL INNER JOIN
- Filters out any row missing in either side
- Ensures complete data from both sources for returned keys
/* SAS: inner merge */
proc sort data=products; by prod_id; run;
proc sort data=current_price; by prod_id; run;
data both;
merge products(in=a) current_price(in=b);
by prod_id;
if a and b;
run;
proc print data=both; run;
Expected Output (“both”):
| prod_id | name | price |
|---|---|---|
| P001 | Widget | 10.00 |
library(dplyr)
products <- data.frame(
prod_id = c("P001", "P002", "P003"),
name = c("Widget", "Gadget", "Doohickey")
)
current_price <- data.frame(
prod_id = c("P001", "P003"),
price = c(10.00, 12.50)
)
both <- inner_join(products, current_price, by = "prod_id")
print(both)
# returns 'both'
Returns (both):
| prod_id | name | price |
|---|---|---|
| P001 | Widget | 10.00 |
3. Left-Only Merge / left_join()
Behavior
- All rows from products (left)
- Matching prices if available; otherwise missing
- Preserves left-table row order by default
- Equivalent to SQL LEFT OUTER JOIN
/* SAS: left-only merge */
proc sort data=products; by prod_id; run;
proc sort data=current_price; by prod_id; run;
data prod_with_price;
merge products(in=a) current_price;
by prod_id;
if a;
run;
proc print data=prod_with_price; run;
Expected Output:
| prod_id | name | price |
|---|---|---|
| P001 | Widget | 10.00 |
| P002 | Gadget | . |
library(dplyr)
products <- data.frame(
prod_id = c("P001", "P002", "P003"),
name = c("Widget", "Gadget", "Doohickey")
)
current_price <- data.frame(
prod_id = c("P001", "P003"),
price = c(10.00, 12.50)
)
prod_with_price <- left_join(products, current_price, by = "prod_id")
print(prod_with_price)
# returns 'prod_with_price'
Returns:
| prod_id | name | price |
|---|---|---|
| P001 | Widget | 10.00 |
| P002 | Gadget | NA |
4. Right-Only Merge / right_join()
Behavior
- All rows from current_price (right)
- Matching product names if available; otherwise missing
- Preserves right-table row order for unmatched keys
- Equivalent to SQL RIGHT OUTER JOIN
/* SAS: right-only merge */
proc sort data=products; by prod_id; run;
proc sort data=current_price; by prod_id; run;
data price_with_prod;
merge products current_price(in=b);
by prod_id;
if b;
run;
proc print data=price_with_prod; run;
Expected Output:
| prod_id | name | price |
|---|---|---|
| P001 | Widget | 10.00 |
| P003 | Doohickey | 12.50 |
library(dplyr)
products <- data.frame(
prod_id = c("P001", "P002", "P003"),
name = c("Widget", "Gadget", "Doohickey")
)
current_price <- data.frame(
prod_id = c("P001", "P003"),
price = c(10.00, 12.50)
)
price_with_prod <- right_join(products, current_price, by = "prod_id")
print(price_with_prod)
# returns 'price_with_prod'
Returns:
| prod_id | name | price |
|---|---|---|
| P001 | Widget | 10.00 |
| P003 | Doohickey | 12.50 |
5. Anti-Join / “Only-in-Left”
Behavior
- Rows in products not in current_price
- Equivalent to SQL
LEFT JOIN … WHERE right.key IS NULL - Useful for identifying or excluding unmatched left-side records
proc sort data=products; by prod_id; run;
proc sort data=current_price; by prod_id; run;
data only_products;
merge products(in=a) current_price(in=b);
by prod_id;
if a and not b;
run;
proc print data=only_products; run;
Expected Output:
| prod_id | name |
|---|---|
| P002 | Gadget |
library(dplyr)
products <- data.frame(
prod_id = c("P001", "P002", "P003"),
name = c("Widget", "Gadget", "Doohickey")
)
current_price <- data.frame(
prod_id = c("P001", "P003"),
price = c(10.00, 12.50)
)
anti <- anti_join(products, current_price, by = "prod_id")
print(anti)
# returns 'anti'
Returns:
| prod_id | name |
|---|---|
| P002 | Gadget |
6. Semi-Join / “Filter Left by Match”
Behavior
- Keep products that have a price
- Show only columns from
products - Equivalent to SQL
WHERE EXISTS (SELECT 1 …) - Efficient for filtering without bringing extra columns
proc sort data=products; by prod_id; run;
proc sort data=current_price; by prod_id; run;
data semi_products(keep=prod_id name);
merge products(in=a) current_price(in=b);
by prod_id;
if a and b;
run;
proc print data=semi_products; run;
Expected Output:
| prod_id | name |
|---|---|
| P001 | Widget |
library(dplyr)
products <- data.frame(
prod_id = c("P001", "P002", "P003"),
name = c("Widget", "Gadget", "Doohickey")
)
current_price <- data.frame(
prod_id = c("P001", "P003"),
price = c(10.00, 12.50)
)
semi <- semi_join(products, current_price, by = "prod_id")
print(semi)
# returns 'semi'
Returns:
| prod_id | name |
|---|---|
| P001 | Widget |
7. Coalescing Columns (SAS-Style Overwrite)
Behavior
- Simulate SAS’s “right overwrites left” for duplicate column names
price.x= left-side,price.y= right-side afterfull_join()coalesce()picks the first non-missing value across arguments
library(dplyr)
products <- data.frame(
prod_id = c("P001", "P002", "P003"),
name = c("Widget", "Gadget", "Doohickey")
)
current_price <- data.frame(
prod_id = c("P001", "P003"),
price = c(10.00, 12.50)
)
tj <- full_join(products, current_price, by = "prod_id")
tj <- tj %>%
mutate(price = coalesce(price, price)) %>%
select(prod_id, name, price)
print(tj)
Returns (tj):
| prod_id | name | price |
|---|---|---|
| P001 | Widget | 10.00 |
| P002 | Gadget | NA |
| P003 | Doohickey | 12.50 |
8. Multi-Key Merge / multi-column join
Behavior
- Match on all specified keys (e.g.
region+prod_id) - Only rows where every key is equal will merge
- Equivalent to SQL
JOIN … ON key1=… AND key2=…
Input:
| region | prod_id | qty |
|---|---|---|
| West | P001 | 5 |
| region | prod_id | price |
|---|---|---|
| West | P001 | 10.00 |
proc sort data=A; by region prod_id; run;
proc sort data=B; by region prod_id; run;
data AB;
merge A B;
by region prod_id;
run;
proc print data=AB; run;
Expected Output:
| region | prod_id | qty | price |
|---|---|---|---|
| West | P001 | 5 | 10.00 |
library(dplyr)
A <- data.frame(
region = "West",
prod_id = "P001",
qty = 5
)
B <- data.frame(
region = "West",
prod_id = "P001",
price = 10.00
)
AB <- full_join(A, B, by = c("region", "prod_id"))
print(AB)
# returns 'AB'
Returns:
| region | prod_id | qty | price |
|---|---|---|---|
| West | P001 | 5 | 10.00 |
9. SAS PROC SQL vs. dplyr
Behavior
- SQL
FULL JOINmirrorsfull_join() coalesce()in SAS and R both choose the first non-missing key- Shows how to replicate DATA-step merges in PROC SQL
proc sql;
create table sql_full as
select coalesce(a.prod_id, b.prod_id) as prod_id,
a.name, b.price
from products as a
full join current_price as b
on a.prod_id = b.prod_id;
quit;
proc print data=sql_full; run;
Expected Output:
| prod_id | name | price |
|---|---|---|
| P001 | Widget | 10.00 |
| P002 | Gadget | . |
| P003 | Doohickey | 12.50 |
# R/dplyr: SQL-style full join + coalesce
# Define base product info
products_c <- data.frame(
prod_id = c("P001", "P002", "P003"),
name = c("Widget", "Gadget", NA)
)
# Define alternate product name and pricing
current_price_c <- data.frame(
prod_id = c("P001", "P003"),
name = c("SuperWidget", "MegaDoohickey"),
price = c(10.00, 12.50)
)
# Perform full join and use coalesce on overlapping 'name'
sql_full <- full_join(products_c, current_price_c, by = "prod_id") %>%
mutate(name = coalesce(name.x, name.y)) %>%
select(prod_id, name, price)
print(sql_full)
Returns (sql_full):
| prod_id | name | price |
|---|---|---|
| P001 | Widget | 10.00 |
| P002 | Gadget | NA |
| P003 | MegaDoohickey | 12.50 |
**Comparison Table SAS MERGE vs dplr joins**
| Feature | SAS MERGE | dplyr joins |
|---|---|---|
| Syntax | data X; merge A B; by key; run; |
X <- join_fn(A, B, by="key") |
| Pre-sort required | Yes | No |
| Row selection flags | (in=a), (in=b) |
Implicit in join choice |
| Missing value symbol | . |
NA |
| Overwrite vs suffixes | Right table overwrites left | Keeps both via .x / .y |
| Order of output | Sorted by BY variable | Original order + unmatched rows |
| Multi-key merge | by k1 k2; |
by = c("k1","k2") |
| SQL-style alternative | proc sql; … full join …; |
Same SQL or *_join() |
| Anti / Semi | Via IN= flags + IF subsets | anti_join(), semi_join() |
| Column coalesce | Manual assignment (price=b.price;) |
coalesce(price.x, price.y) |
10. Tagging Each Record’s Source
Goal: After merging, add a variable source indicating where the row came from:
"left"= only in products"right"= only in current_price"both"= in both
A. SAS
proc sort data=products; by prod_id; run;
proc sort data=current_price; by prod_id; run;
data tagged;
merge products(in=inP) current_price(in=inC);
by prod_id;
length source $5;
select;
when (inP and inC) source = "both";
when (inP and not inC) source = "left";
when (not inP and inC) source = "right";
otherwise source = "";
end;
run;
proc print data=tagged noobs; run;
Returns (SAS tagged):
| prod_id | name | price | source |
|---|---|---|---|
| P001 | Widget | 10.00 | both |
| P002 | Gadget | . | left |
| P003 | Doohickey | 12.50 | right |
B. R / dplyr
Method 1: case_when() on is.na()
library(dplyr)
products <- data.frame(
prod_id = c("P001", "P002", "P003"),
name = c("Widget", "Gadget", "Doohickey")
)
current_price <- data.frame(
prod_id = c("P001", "P003"),
price = c(10.00, 12.50)
)
tagged <- full_join(products, current_price, by = "prod_id") %>%
mutate(source = case_when(
!is.na(name) & !is.na(price) ~ "both",
!is.na(name) & is.na(price) ~ "left",
is.na(name) & !is.na(price) ~ "right",
TRUE ~ NA_character_
))
print(tagged)
# returns 'tagged'
Method 2: Flag columns before join
library(dplyr)
products <- data.frame(
prod_id = c("P001", "P002", "P003"),
name = c("Widget", "Gadget", "Doohickey")
)
current_price <- data.frame(
prod_id = c("P001", "P003"),
price = c(10.00, 12.50)
)
p2 <- products %>% mutate(in_products = TRUE)
c2 <- current_price %>% mutate(in_price = TRUE)
tagged2 <- full_join(p2, c2, by="prod_id") %>%
mutate(source = case_when(
in_products & in_price ~ "both",
in_products & !in_price ~ "left",
!in_products & in_price ~ "right"
)) %>%
select(prod_id, name, price, source)
print(tagged2)
# returns 'tagged2'
Returns (both R methods):
| prod_id | name | price | source |
|---|---|---|---|
| P001 | Widget | 10.00 | both |
| P002 | Gadget | NA | left |
| P003 | Doohickey | 12.50 | right |
11. Explore additional information
data.table High-Speed Joins
• Convert withsetDT()and index viasetkey().
•DT1[DT2]= fast right join;DT2[DT1]= fast left join.
• Example:library(data.table) setDT(products); setDT(current_price) setkey(products, prod_id); setkey(current_price, prod_id) # Full join dt_all <- merge(products, current_price, by="prod_id", all=TRUE) # Right join dt_price_with_prod <- products[current_price, nomatch=0]fuzzyjoin for Approximate Key Matching
• Usestringdist_join()to match on string distance;interval_join()for ranges.
• Example:library(fuzzyjoin) # Join where prod_id differs by ≤1 edit fuzzy_products <- stringdist_join( products, current_price, by = "prod_id", max_dist = 1, distance_col = "dist" )Performance Tuning for Very Large Tables
• Chunk large tables by hash or range, join in parallel, then combine withrbind().
• In SAS, use HASH objects; in R, leverage data.table keys or on-disk tools (DuckDB, arrow).
• Profile memory vs. CPU, adjustnthreadsor setoptions(datatable.print.nrows=…).Encapsulate Join Logic in Reusable Functions/Macros
• SAS Macro:%macro do_join(left=, right=, out=, type=full, by=); proc sort data=&left; by &by; run; proc sort data=&right; by &by; run; data &out; merge &left (in=a) &right (in=b); by &by; %if &type=inner %then %do; if a and b; %end; %else %if &type=left %then %do; if a; %end; %else %if &type=right %then %do; if b; %end; run; %mend; %do_join(left=products, right=current_price, out=both, type=inner, by=prod_id);• R Function:
join_data <- function(df1, df2, by, type="full") { library(dplyr) join_fn <- switch(type, full = full_join, inner = inner_join, left = left_join, right = right_join) join_fn(df1, df2, by = by) } # Usage: both <- join_data(products, current_price, by = "prod_id", type = "inner")
12. Common Error/Warning Scenarios in SAS MERGE and dplyr Joins
12.1. Key Not Unique (Duplicate Keys)
Scenario:
Both SAS and dplyr expect keys to be unique for a simple merge/join. Duplicate keys can cause unexpected row duplication or mismatches.
Input Example:
| prod_id | name |
|---|---|
| P001 | Widget |
| P001 | WidgetX |
| P002 | Gadget |
| prod_id | price |
|---|---|
| P001 | 10.00 |
| P002 | 15.00 |
R Code:
products_dup <- tibble(prod_id = c("P001", "P001", "P002"),
name = c("Widget", "WidgetX", "Gadget"))
current_price <- tibble(prod_id = c("P001", "P002"),
price = c(10.00, 15.00))
result <- left_join(products_dup, current_price, by = "prod_id")
print(result)
Output:
| prod_id | name | price |
|---|---|---|
| P001 | Widget | 10.00 |
| P001 | WidgetX | 10.00 |
| P002 | Gadget | 15.00 |
Solution:
If duplicates are unintended, remove them before joining:
products_nodup <- products_dup %>% distinct(prod_id, .keep_all = TRUE)
result <- left_join(products_nodup, current_price, by = "prod_id")
If duplicates are valid, be aware that all combinations will be returned.
12.2. Missing Key Columns
Scenario:
The join key is missing in one or both tables.
Input Example:
| prod_id | name |
|---|---|
| P001 | Widget |
| P002 | Gadget |
| id | price |
|---|---|
| P001 | 10.00 |
| P002 | 15.00 |
R Code:
products <- tibble(prod_id = c("P001", "P002"), name = c("Widget", "Gadget"))
current_price <- tibble(id = c("P001", "P002"), price = c(10.00, 15.00))
# This will give a warning or error
result <- left_join(products, current_price, by = "prod_id")
Output:
Error: by = "prod_id" not found in one of the tables.
Solution:
Rename columns to match before joining:
current_price <- current_price %>% rename(prod_id = id)
result <- left_join(products, current_price, by = "prod_id")
12.3. Type Mismatch on Join Key
Scenario:
Key columns have different types (e.g., character vs. numeric).
Input Example:
| prod_id (character) | name |
|---|---|
| P001 | Widget |
| P002 | Gadget |
| prod_id (numeric) | price |
|---|---|
| 1 | 10.00 |
| 2 | 15.00 |
R Code:
products <- tibble(prod_id = c("P001", "P002"), name = c("Widget", "Gadget"))
current_price <- tibble(prod_id = c(1, 2), price = c(10.00, 15.00))
result <- left_join(products, current_price, by = "prod_id")
print(result)
Output:
| prod_id | name | price |
|---|---|---|
| P001 | Widget | NA |
| P002 | Gadget | NA |
Solution:
Convert types to match before joining:
current_price <- current_price %>% mutate(prod_id = paste0("P00", prod_id))
result <- left_join(products, current_price, by = "prod_id")
12.4. Many-to-Many Join (Multiple Matches)
Scenario:
Both tables have duplicate keys, resulting in a Cartesian product for those keys.
Input Example:
| prod_id | name |
|---|---|
| P001 | Widget |
| P001 | WidgetX |
| prod_id | price |
|---|---|
| P001 | 10.00 |
| P001 | 12.00 |
R Code:
products <- tibble(prod_id = c("P001", "P001"), name = c("Widget", "WidgetX"))
current_price <- tibble(prod_id = c("P001", "P001"), price = c(10.00, 12.00))
result <- left_join(products, current_price, by = "prod_id")
print(result)
Output:
| prod_id | name | price |
|---|---|---|
| P001 | Widget | 10.00 |
| P001 | Widget | 12.00 |
| P001 | WidgetX | 10.00 |
| P001 | WidgetX | 12.00 |
Solution:
If only one match is desired, deduplicate or aggregate before joining.
12.5. Handling NA/Missing Values in Keys
Scenario:
NA values in key columns can cause warnings or unexpected results.
Input Example:
| prod_id | name |
|---|---|
| P001 | Widget |
| NA | Gadget |
| prod_id | price |
|---|---|
| P001 | 10.00 |
| NA | 15.00 |
R Code:
products <- tibble(prod_id = c("P001", NA), name = c("Widget", "Gadget"))
current_price <- tibble(prod_id = c("P001", NA), price = c(10.00, 15.00))
result <- left_join(products, current_price, by = "prod_id")
print(result)
Output:
| prod_id | name | price |
|---|---|---|
| P001 | Widget | 10.00 |
| NA | Gadget | 15.00 |
Solution:
Remove or impute NA keys before joining if not desired:
products_clean <- products %>% filter(!is.na(prod_id))
current_price_clean <- current_price %>% filter(!is.na(prod_id))
result <- left_join(products_clean, current_price_clean, by = "prod_id")
12.6. Column Name Conflicts
Scenario:
Non-key columns with the same name in both tables will be suffixed .x and .y in dplyr.
Input Example:
| prod_id | name |
|---|---|
| P001 | Widget |
| P002 | Gadget |
| prod_id | name | price |
|---|---|---|
| P001 | WidgetA | 10.00 |
| P002 | GadgetB | 15.00 |
R Code:
products <- tibble(prod_id = c("P001", "P002"), name = c("Widget", "Gadget"))
current_price <- tibble(prod_id = c("P001", "P002"), name = c("WidgetA", "GadgetB"), price = c(10.00, 15.00))
result <- left_join(products, current_price, by = "prod_id")
print(result)
Output:
| prod_id | name.x | name.y | price |
|---|---|---|---|
| P001 | Widget | WidgetA | 10.00 |
| P002 | Gadget | GadgetB | 15.00 |
Solution:
Use mutate() and coalesce() to resolve:
result <- result %>%
mutate(name = coalesce(name.y, name.x)) %>%
select(prod_id, name, price)
12.7. dplyr Warning: Duplicate Key Columns
Scenario:
dplyr will warn if the join key appears in both tables but with different types or duplicated names.
Solution:
Ensure join keys are named and typed identically in both tables.
12.8. General Best Practices
- Always check for duplicate keys before joining.
- Ensure key columns are present and of the same type.
- Handle NA values in keys appropriately.
- Use
distinct()orgroup_by()+summarise()to deduplicate or aggregate as needed. - For large tables, consider using
data.tablefor performance.
12.9. Summary Table: Error/Warning Scenarios and Solutions
| Scenario | Error/Warning | Solution/Code Example |
|---|---|---|
| Duplicate keys | Unexpected rows | distinct() or aggregate |
| Missing key columns | Error | rename() columns |
| Type mismatch | No matches | mutate() to convert types |
| Many-to-many join | Cartesian product | Deduplicate before join |
| NA in keys | NA matches | filter(!is.na(key)) |
| Column name conflict | .x/.y suffixes |
coalesce() or select() |
| SAS not sorted | Incorrect merge | proc sort before merge |
| dplyr duplicate key | Warning | Ensure unique, matching keys |
**Resource download links**
1.4.4.-SAS-MERGE-vs-dplyr-joins.zip