2.3.5. Joining Two Tables

1. Introduction

Joining (merging) tables is a core data wrangling skill in R. The dplyr package provides a family of join functions that let you combine information from two data frames based on common keys. This is essential for relational data, such as combining subject-level data with visit-level or event-level data in SDTM.

2. Why Join Tables?

Combine subject-level and event-level data for analysis.
Add metadata or lookup information to main datasets.
Filter or subset data based on presence in another table.
Prepare data for reporting, visualization, or modeling.

3. Example Data

Suppose you have two SDTM-style tables:

dm (Demographics):

USUBJID	AGE	SEX	ARM
01-701-101	34	M	Placebo
01-701-102	29	F	Drug X
01-701-103	67	F	Placebo
01-701-104	54	M	Drug X

ae (Adverse Events):

USUBJID	AEDECOD	AESER
01-701-101	HEADACHE	N
01-701-102	NAUSEA	N
01-701-103	VOMITING	Y
01-701-105	DIZZINESS	N

4. Types of Joins in dplyr

inner_join(x, y): Keep only rows with keys in both x and y.
left_join(x, y): Keep all rows from x, add columns from y where keys match.
right_join(x, y): Keep all rows from y, add columns from x where keys match.
full_join(x, y): Keep all rows from both x and y.
semi_join(x, y): Keep rows from x with a match in y (columns from x only).
anti_join(x, y): Keep rows from x with no match in y (columns from x only).

5. Join Examples

5.1. inner_join()

R Code:

library(dplyr)
inner_join(dm, ae, by = "USUBJID")

inner_join(dm, ae, by = "USUBJID"):
- Combines rows from dm and ae where USUBJID matches in both tables.
- Only keeps rows with matching keys in both tables.
- Returns all columns from both tables.

Input Tables:

USUBJID	AGE	SEX	ARM
01-701-101	34	M	Placebo
01-701-102	29	F	Drug X
01-701-103	67	F	Placebo
01-701-104	54	M	Drug X

USUBJID	AEDECOD	AESER
01-701-101	HEADACHE	N
01-701-102	NAUSEA	N
01-701-103	VOMITING	Y
01-701-105	DIZZINESS	N

Output Table:

USUBJID	AGE	SEX	ARM	AEDECOD	AESER
01-701-101	34	M	Placebo	HEADACHE	N
01-701-102	29	F	Drug X	NAUSEA	N
01-701-103	67	F	Placebo	VOMITING	Y

5.2. left_join()

R Code:

left_join(dm, ae, by = "USUBJID")

left_join(dm, ae, by = "USUBJID"):
- Keeps all rows from dm.
- Adds columns from ae where USUBJID matches.
- Fills with NA where there is no match in ae.

Output Table:

USUBJID	AGE	SEX	ARM	AEDECOD	AESER
01-701-101	34	M	Placebo	HEADACHE	N
01-701-102	29	F	Drug X	NAUSEA	N
01-701-103	67	F	Placebo	VOMITING	Y
01-701-104	54	M	Drug X	NA	NA

5.3. right_join()

R Code:

right_join(dm, ae, by = "USUBJID")

right_join(dm, ae, by = "USUBJID"):
- Keeps all rows from ae.
- Adds columns from dm where USUBJID matches.
- Fills with NA where there is no match in dm.

Output Table:

USUBJID	AGE	SEX	ARM	AEDECOD	AESER
01-701-101	34	M	Placebo	HEADACHE	N
01-701-102	29	F	Drug X	NAUSEA	N
01-701-103	67	F	Placebo	VOMITING	Y
01-701-105	NA	NA	NA	DIZZINESS	N

5.4. full_join()

R Code:

full_join(dm, ae, by = "USUBJID")

full_join(dm, ae, by = "USUBJID"):
- Keeps all rows from both dm and ae.
- Fills with NA where there is no match in either table.

Output Table:

USUBJID	AGE	SEX	ARM	AEDECOD	AESER
01-701-101	34	M	Placebo	HEADACHE	N
01-701-102	29	F	Drug X	NAUSEA	N
01-701-103	67	F	Placebo	VOMITING	Y
01-701-104	54	M	Drug X	NA	NA
01-701-105	NA	NA	NA	DIZZINESS	N

5.5. semi_join()

R Code:

semi_join(dm, ae, by = "USUBJID")

semi_join(dm, ae, by = "USUBJID"):
- Keeps only rows from dm where USUBJID exists in ae.
- Returns only columns from dm.

Output Table:

USUBJID	AGE	SEX	ARM
01-701-101	34	M	Placebo
01-701-102	29	F	Drug X
01-701-103	67	F	Placebo

5.6. anti_join()

R Code:

anti_join(dm, ae, by = "USUBJID")

anti_join(dm, ae, by = "USUBJID"):
- Keeps only rows from dm where USUBJID does NOT exist in ae.
- Returns only columns from dm.

Output Table:

USUBJID	AGE	SEX	ARM
01-701-104	54	M	Drug X

6. Beyond the Basics

Join by multiple keys:
Sometimes you need to join by more than one column (e.g., STUDYID and USUBJID).

Input Tables:

STUDYID USUBJID AGE

A1001 01-701-101 34

A1001 01-701-102 29

STUDYID USUBJID VISIT AEDECOD

A1001 01-701-101 1 HEADACHE

A1001 01-701-102 1 NAUSEA

R Code:
```
left_join(dm, ae, by = c("STUDYID", "USUBJID"))
```
- Joins on both STUDYID and USUBJID columns.
Output Table:

STUDYID USUBJID AGE VISIT AEDECOD

A1001 01-701-101 34 1 HEADACHE

A1001 01-701-102 29 1 NAUSEA

STUDYID	USUBJID	AGE
A1001	01-701-101	34
A1001	01-701-102	29

STUDYID	USUBJID	VISIT	AEDECOD
A1001	01-701-101	1	HEADACHE
A1001	01-701-102	1	NAUSEA

STUDYID	USUBJID	AGE	VISIT	AEDECOD
A1001	01-701-101	34	1	HEADACHE
A1001	01-701-102	29	1	NAUSEA

Suffixes for overlapping columns:
If both tables have columns with the same name (other than the join key), use suffix = c(".dm", ".ae") to distinguish them.

Input Tables:

USUBJID AGE ARM

01-701-101 34 Placebo

USUBJID AGE AEDECOD

01-701-101 35 HEADACHE

R Code:
```
left_join(dm, ae, by = "USUBJID", suffix = c(".dm", ".ae"))
```
- The AGE column from dm becomes AGE.dm, from ae becomes AGE.ae.
Output Table:

USUBJID AGE.dm ARM AGE.ae AEDECOD

01-701-101 34 Placebo 35 HEADACHE

USUBJID	AGE	ARM
01-701-101	34	Placebo

USUBJID	AGE	AEDECOD
01-701-101	35	HEADACHE

USUBJID	AGE.dm	ARM	AGE.ae	AEDECOD
01-701-101	34	Placebo	35	HEADACHE

Join with different column names:
If the key columns have different names in each table, specify them in by.

Input Tables:

SUBJID AGE

01-701-101 34

USUBJID AEDECOD

01-701-101 HEADACHE

R Code:
```
left_join(dm, ae, by = c("SUBJID" = "USUBJID"))
```
- Joins dm$SUBJID to ae$USUBJID.
Output Table:

SUBJID AGE AEDECOD

01-701-101 34 HEADACHE

SUBJID	AGE
01-701-101	34

USUBJID	AEDECOD
01-701-101	HEADACHE

SUBJID	AGE	AEDECOD
01-701-101	34	HEADACHE

Filtering joins for subsetting:
Use semi_join() and anti_join() to filter one table based on the presence or absence of keys in another.

Input Tables:

USUBJID AGE SEX ARM

01-701-101 34 M Placebo

01-701-102 29 F Drug X

01-701-103 67 F Placebo

01-701-104 54 M Drug X

USUBJID AEDECOD AESER

01-701-101 HEADACHE N

01-701-102 NAUSEA N

01-701-103 VOMITING Y

01-701-105 DIZZINESS N

R Code:
```
# Keep only subjects in dm who have an AE
semi_join(dm, ae, by = "USUBJID")
# Keep only subjects in dm who do NOT have an AE
anti_join(dm, ae, by = "USUBJID")
```
semi_join Output Table:

USUBJID AGE SEX ARM

01-701-101 34 M Placebo

01-701-102 29 F Drug X

01-701-103 67 F Placebo

anti_join Output Table:

USUBJID AGE SEX ARM

01-701-104 54 M Drug X

Chaining joins:
You can join multiple tables in sequence to build a comprehensive dataset.

Input Tables:

dm:

USUBJID AGE SEX ARM

01-701-101 34 M Placebo

01-701-102 29 F Drug X

ae:

USUBJID AEDECOD AESER

01-701-101 HEADACHE N

01-701-102 NAUSEA N

lb:

USUBJID PARAM VALUE

01-701-101 HGB 13.5

01-701-102 HGB 12.8

R Code:
```
dm %>%
  left_join(ae, by = "USUBJID") %>%
  left_join(lb, by = "USUBJID")
```
- Combines demographics, adverse events, and labs into one table by subject.
Output Table:

USUBJID AGE SEX ARM AEDECOD AESER PARAM VALUE

01-701-101 34 M Placebo HEADACHE N HGB 13.5

01-701-102 29 F Drug X NAUSEA N HGB 12.8

USUBJID	PARAM	VALUE
01-701-101	HGB	13.5
01-701-102	HGB	12.8

USUBJID	AGE	SEX	ARM	AEDECOD	AESER	PARAM	VALUE
01-701-101	34	M	Placebo	HEADACHE	N	HGB	13.5
01-701-102	29	F	Drug X	NAUSEA	N	HGB	12.8

Tagging Each Record’s Source After a Full Join

Sometimes, after merging two tables, you want to know where each record originated: only in the left table, only in the right table, or in both. This is useful for data reconciliation, audit, or reporting.

Example Data:

prod_id	name
P001	Widget
P002	Gadget

prod_id	price
P001	10.00
P003	12.50

Method 1: Use case_when() on is.na() after full_join()

R Code:

library(dplyr)
tagged <- full_join(products, current_price, by = "prod_id") %>%
  mutate(source = case_when(
    !is.na(name) &  !is.na(price) ~ "both",
    !is.na(name) &   is.na(price) ~ "left",
     is.na(name) &  !is.na(price) ~ "right",
    TRUE                          ~ NA_character_
  ))

full_join(products, current_price, by = "prod_id"): merges all rows from both tables.
mutate(source = case_when(...)): tags each row as "both", "left", or "right" based on which columns are present.

Output Table:

prod_id	name	price	source
P001	Widget	10.00	both
P002	Gadget	NA	left
P003	Doohickey	12.50	right

Method 2: Add flag columns before join

R Code:

p2 <- products %>% mutate(in_products = TRUE)
c2 <- current_price %>% mutate(in_price = TRUE)

tagged2 <- full_join(p2, c2, by = "prod_id") %>%
  mutate(source = case_when(
    in_products & in_price  ~ "both",
    in_products & !in_price ~ "left",
    !in_products & in_price ~ "right"
  )) %>%
  select(prod_id, name, price, source)

Adds logical flags to each table before joining.
After the join, uses these flags to determine the source of each row.

Output Table:

prod_id	name	price	source
P001	Widget	10.00	both
P002	Gadget	NA	left
P003	Doohickey	12.50	right

Explanation:
- "both": present in both tables (matched by prod_id)
- "left": present only in the left table (products)
- "right": present only in the right table (current_price)
- This approach is useful for tracking data provenance after merges.

Encapsulate Join Logic in Reusable Functions

For cleaner, more maintainable code, you can wrap join logic in a reusable R function. This is especially useful when you need to perform different types of joins in multiple places.

R Function:

join_data <- function(df1, df2, by, type = "full") {
  library(dplyr)
  join_fn <- switch(type,
                    full = full_join,
                    inner = inner_join,
                    left  = left_join,
                    right = right_join)
  join_fn(df1, df2, by = by)
}

df1, df2: Data frames to join.
by: Key(s) to join by (e.g., "prod_id" or c("USUBJID", "VISIT")).
type: Type of join ("full", "inner", "left", or "right").

Usage Example:

both <- join_data(products, current_price, by = "prod_id", type = "inner")

Performs an inner join between products and current_price by prod_id.

Output Table (for the example above):

prod_id	name	price
P001	Widget	10.00

Explanation:
- The function makes your join code concise and reusable.
- You can easily switch join types by changing the type argument.

7. Conclusion

Joins are essential for combining and analyzing relational data in R.
Use the appropriate join type for your analysis goal.
Always check your join keys and output for expected results.
Explore advanced join options for more complex data wrangling, including multi-key joins, suffixes, and chaining.

Resource download links

2.3.5.-Joining-Two-Tables.zip

⁂