2.3.5. Joining Two Tables
1. Introduction
Joining (merging) tables is a core data wrangling skill in R. The dplyr package provides a family of join functions that let you combine information from two data frames based on common keys. This is essential for relational data, such as combining subject-level data with visit-level or event-level data in SDTM.
2. Why Join Tables?
- Combine subject-level and event-level data for analysis.
- Add metadata or lookup information to main datasets.
- Filter or subset data based on presence in another table.
- Prepare data for reporting, visualization, or modeling.
3. Example Data
Suppose you have two SDTM-style tables:
dm (Demographics):
| USUBJID | AGE | SEX | ARM |
|---|---|---|---|
| 01-701-101 | 34 | M | Placebo |
| 01-701-102 | 29 | F | Drug X |
| 01-701-103 | 67 | F | Placebo |
| 01-701-104 | 54 | M | Drug X |
ae (Adverse Events):
| USUBJID | AEDECOD | AESER |
|---|---|---|
| 01-701-101 | HEADACHE | N |
| 01-701-102 | NAUSEA | N |
| 01-701-103 | VOMITING | Y |
| 01-701-105 | DIZZINESS | N |
4. Types of Joins in dplyr
- inner_join(x, y): Keep only rows with keys in both x and y.
- left_join(x, y): Keep all rows from x, add columns from y where keys match.
- right_join(x, y): Keep all rows from y, add columns from x where keys match.
- full_join(x, y): Keep all rows from both x and y.
- semi_join(x, y): Keep rows from x with a match in y (columns from x only).
- anti_join(x, y): Keep rows from x with no match in y (columns from x only).
5. Join Examples
5.1. inner_join()
R Code:
library(dplyr)
inner_join(dm, ae, by = "USUBJID")
inner_join(dm, ae, by = "USUBJID"):- Combines rows from
dmandaewhereUSUBJIDmatches in both tables. - Only keeps rows with matching keys in both tables.
- Returns all columns from both tables.
- Combines rows from
Input Tables:
| USUBJID | AGE | SEX | ARM |
|---|---|---|---|
| 01-701-101 | 34 | M | Placebo |
| 01-701-102 | 29 | F | Drug X |
| 01-701-103 | 67 | F | Placebo |
| 01-701-104 | 54 | M | Drug X |
| USUBJID | AEDECOD | AESER |
|---|---|---|
| 01-701-101 | HEADACHE | N |
| 01-701-102 | NAUSEA | N |
| 01-701-103 | VOMITING | Y |
| 01-701-105 | DIZZINESS | N |
Output Table:
| USUBJID | AGE | SEX | ARM | AEDECOD | AESER |
|---|---|---|---|---|---|
| 01-701-101 | 34 | M | Placebo | HEADACHE | N |
| 01-701-102 | 29 | F | Drug X | NAUSEA | N |
| 01-701-103 | 67 | F | Placebo | VOMITING | Y |
5.2. left_join()
R Code:
left_join(dm, ae, by = "USUBJID")
left_join(dm, ae, by = "USUBJID"):- Keeps all rows from
dm. - Adds columns from
aewhereUSUBJIDmatches. - Fills with NA where there is no match in
ae.
- Keeps all rows from
Output Table:
| USUBJID | AGE | SEX | ARM | AEDECOD | AESER |
|---|---|---|---|---|---|
| 01-701-101 | 34 | M | Placebo | HEADACHE | N |
| 01-701-102 | 29 | F | Drug X | NAUSEA | N |
| 01-701-103 | 67 | F | Placebo | VOMITING | Y |
| 01-701-104 | 54 | M | Drug X | NA | NA |
5.3. right_join()
R Code:
right_join(dm, ae, by = "USUBJID")
right_join(dm, ae, by = "USUBJID"):- Keeps all rows from
ae. - Adds columns from
dmwhereUSUBJIDmatches. - Fills with NA where there is no match in
dm.
- Keeps all rows from
Output Table:
| USUBJID | AGE | SEX | ARM | AEDECOD | AESER |
|---|---|---|---|---|---|
| 01-701-101 | 34 | M | Placebo | HEADACHE | N |
| 01-701-102 | 29 | F | Drug X | NAUSEA | N |
| 01-701-103 | 67 | F | Placebo | VOMITING | Y |
| 01-701-105 | NA | NA | NA | DIZZINESS | N |
5.4. full_join()
R Code:
full_join(dm, ae, by = "USUBJID")
full_join(dm, ae, by = "USUBJID"):- Keeps all rows from both
dmandae. - Fills with NA where there is no match in either table.
- Keeps all rows from both
Output Table:
| USUBJID | AGE | SEX | ARM | AEDECOD | AESER |
|---|---|---|---|---|---|
| 01-701-101 | 34 | M | Placebo | HEADACHE | N |
| 01-701-102 | 29 | F | Drug X | NAUSEA | N |
| 01-701-103 | 67 | F | Placebo | VOMITING | Y |
| 01-701-104 | 54 | M | Drug X | NA | NA |
| 01-701-105 | NA | NA | NA | DIZZINESS | N |
5.5. semi_join()
R Code:
semi_join(dm, ae, by = "USUBJID")
semi_join(dm, ae, by = "USUBJID"):- Keeps only rows from
dmwhereUSUBJIDexists inae. - Returns only columns from
dm.
- Keeps only rows from
Output Table:
| USUBJID | AGE | SEX | ARM |
|---|---|---|---|
| 01-701-101 | 34 | M | Placebo |
| 01-701-102 | 29 | F | Drug X |
| 01-701-103 | 67 | F | Placebo |
5.6. anti_join()
R Code:
anti_join(dm, ae, by = "USUBJID")
anti_join(dm, ae, by = "USUBJID"):- Keeps only rows from
dmwhereUSUBJIDdoes NOT exist inae. - Returns only columns from
dm.
- Keeps only rows from
Output Table:
| USUBJID | AGE | SEX | ARM |
|---|---|---|---|
| 01-701-104 | 54 | M | Drug X |
6. Beyond the Basics
Join by multiple keys:
Sometimes you need to join by more than one column (e.g.,STUDYIDandUSUBJID).Input Tables:
STUDYID USUBJID AGE A1001 01-701-101 34 A1001 01-701-102 29 STUDYID USUBJID VISIT AEDECOD A1001 01-701-101 1 HEADACHE A1001 01-701-102 1 NAUSEA R Code:
left_join(dm, ae, by = c("STUDYID", "USUBJID"))- Joins on both
STUDYIDandUSUBJIDcolumns.
Output Table:
STUDYID USUBJID AGE VISIT AEDECOD A1001 01-701-101 34 1 HEADACHE A1001 01-701-102 29 1 NAUSEA - Joins on both
Suffixes for overlapping columns:
If both tables have columns with the same name (other than the join key), usesuffix = c(".dm", ".ae")to distinguish them.Input Tables:
USUBJID AGE ARM 01-701-101 34 Placebo USUBJID AGE AEDECOD 01-701-101 35 HEADACHE R Code:
left_join(dm, ae, by = "USUBJID", suffix = c(".dm", ".ae"))- The
AGEcolumn fromdmbecomesAGE.dm, fromaebecomesAGE.ae.
Output Table:
USUBJID AGE.dm ARM AGE.ae AEDECOD 01-701-101 34 Placebo 35 HEADACHE - The
Join with different column names:
If the key columns have different names in each table, specify them inby.Input Tables:
SUBJID AGE 01-701-101 34 USUBJID AEDECOD 01-701-101 HEADACHE R Code:
left_join(dm, ae, by = c("SUBJID" = "USUBJID"))- Joins
dm$SUBJIDtoae$USUBJID.
Output Table:
SUBJID AGE AEDECOD 01-701-101 34 HEADACHE - Joins
Filtering joins for subsetting:
Usesemi_join()andanti_join()to filter one table based on the presence or absence of keys in another.Input Tables:
USUBJID AGE SEX ARM 01-701-101 34 M Placebo 01-701-102 29 F Drug X 01-701-103 67 F Placebo 01-701-104 54 M Drug X USUBJID AEDECOD AESER 01-701-101 HEADACHE N 01-701-102 NAUSEA N 01-701-103 VOMITING Y 01-701-105 DIZZINESS N R Code:
# Keep only subjects in dm who have an AE semi_join(dm, ae, by = "USUBJID") # Keep only subjects in dm who do NOT have an AE anti_join(dm, ae, by = "USUBJID")semi_join Output Table:
USUBJID AGE SEX ARM 01-701-101 34 M Placebo 01-701-102 29 F Drug X 01-701-103 67 F Placebo anti_join Output Table:
USUBJID AGE SEX ARM 01-701-104 54 M Drug X
Chaining joins:
You can join multiple tables in sequence to build a comprehensive dataset.Input Tables:
dm:
USUBJID AGE SEX ARM 01-701-101 34 M Placebo 01-701-102 29 F Drug X ae:
USUBJID AEDECOD AESER 01-701-101 HEADACHE N 01-701-102 NAUSEA N lb:
USUBJID PARAM VALUE 01-701-101 HGB 13.5 01-701-102 HGB 12.8 R Code:
dm %>% left_join(ae, by = "USUBJID") %>% left_join(lb, by = "USUBJID")- Combines demographics, adverse events, and labs into one table by subject.
Output Table:
USUBJID AGE SEX ARM AEDECOD AESER PARAM VALUE 01-701-101 34 M Placebo HEADACHE N HGB 13.5 01-701-102 29 F Drug X NAUSEA N HGB 12.8
- Tagging Each Record’s Source After a Full Join
Sometimes, after merging two tables, you want to know where each record originated: only in the left table, only in the right table, or in both. This is useful for data reconciliation, audit, or reporting.
Example Data:
| prod_id | name |
|---|---|
| P001 | Widget |
| P002 | Gadget |
| prod_id | price |
|---|---|
| P001 | 10.00 |
| P003 | 12.50 |
Method 1: Use case_when() on is.na() after full_join()
R Code:
library(dplyr)
tagged <- full_join(products, current_price, by = "prod_id") %>%
mutate(source = case_when(
!is.na(name) & !is.na(price) ~ "both",
!is.na(name) & is.na(price) ~ "left",
is.na(name) & !is.na(price) ~ "right",
TRUE ~ NA_character_
))
full_join(products, current_price, by = "prod_id"): merges all rows from both tables.mutate(source = case_when(...)): tags each row as "both", "left", or "right" based on which columns are present.
Output Table:
| prod_id | name | price | source |
|---|---|---|---|
| P001 | Widget | 10.00 | both |
| P002 | Gadget | NA | left |
| P003 | Doohickey | 12.50 | right |
Method 2: Add flag columns before join
R Code:
p2 <- products %>% mutate(in_products = TRUE)
c2 <- current_price %>% mutate(in_price = TRUE)
tagged2 <- full_join(p2, c2, by = "prod_id") %>%
mutate(source = case_when(
in_products & in_price ~ "both",
in_products & !in_price ~ "left",
!in_products & in_price ~ "right"
)) %>%
select(prod_id, name, price, source)
- Adds logical flags to each table before joining.
- After the join, uses these flags to determine the source of each row.
Output Table:
| prod_id | name | price | source |
|---|---|---|---|
| P001 | Widget | 10.00 | both |
| P002 | Gadget | NA | left |
| P003 | Doohickey | 12.50 | right |
- Explanation:
- "both": present in both tables (matched by
prod_id) - "left": present only in the left table (
products) - "right": present only in the right table (
current_price) - This approach is useful for tracking data provenance after merges.
- "both": present in both tables (matched by
- Encapsulate Join Logic in Reusable Functions
For cleaner, more maintainable code, you can wrap join logic in a reusable R function. This is especially useful when you need to perform different types of joins in multiple places.
R Function:
join_data <- function(df1, df2, by, type = "full") {
library(dplyr)
join_fn <- switch(type,
full = full_join,
inner = inner_join,
left = left_join,
right = right_join)
join_fn(df1, df2, by = by)
}
df1,df2: Data frames to join.by: Key(s) to join by (e.g.,"prod_id"orc("USUBJID", "VISIT")).type: Type of join ("full","inner","left", or"right").
Usage Example:
both <- join_data(products, current_price, by = "prod_id", type = "inner")
- Performs an inner join between
productsandcurrent_pricebyprod_id.
Output Table (for the example above):
| prod_id | name | price |
|---|---|---|
| P001 | Widget | 10.00 |
- Explanation:
- The function makes your join code concise and reusable.
- You can easily switch join types by changing the
typeargument.
7. Conclusion
- Joins are essential for combining and analyzing relational data in R.
- Use the appropriate join type for your analysis goal.
- Always check your join keys and output for expected results.
- Explore advanced join options for more complex data wrangling, including multi-key joins, suffixes, and chaining.