contact@a2zlearners.com

2.3.5. Joining Two Tables

1. Introduction

Joining (merging) tables is a core data wrangling skill in R. The dplyr package provides a family of join functions that let you combine information from two data frames based on common keys. This is essential for relational data, such as combining subject-level data with visit-level or event-level data in SDTM.


2. Why Join Tables?

  • Combine subject-level and event-level data for analysis.
  • Add metadata or lookup information to main datasets.
  • Filter or subset data based on presence in another table.
  • Prepare data for reporting, visualization, or modeling.

3. Example Data

Suppose you have two SDTM-style tables:

dm (Demographics):

USUBJID AGE SEX ARM
01-701-101 34 M Placebo
01-701-102 29 F Drug X
01-701-103 67 F Placebo
01-701-104 54 M Drug X

ae (Adverse Events):

USUBJID AEDECOD AESER
01-701-101 HEADACHE N
01-701-102 NAUSEA N
01-701-103 VOMITING Y
01-701-105 DIZZINESS N

4. Types of Joins in dplyr

  • inner_join(x, y): Keep only rows with keys in both x and y.
  • left_join(x, y): Keep all rows from x, add columns from y where keys match.
  • right_join(x, y): Keep all rows from y, add columns from x where keys match.
  • full_join(x, y): Keep all rows from both x and y.
  • semi_join(x, y): Keep rows from x with a match in y (columns from x only).
  • anti_join(x, y): Keep rows from x with no match in y (columns from x only).

5. Join Examples

5.1. inner_join()

R Code:

library(dplyr)
inner_join(dm, ae, by = "USUBJID")
  • inner_join(dm, ae, by = "USUBJID"):
    • Combines rows from dm and ae where USUBJID matches in both tables.
    • Only keeps rows with matching keys in both tables.
    • Returns all columns from both tables.

Input Tables:

USUBJID AGE SEX ARM
01-701-101 34 M Placebo
01-701-102 29 F Drug X
01-701-103 67 F Placebo
01-701-104 54 M Drug X
USUBJID AEDECOD AESER
01-701-101 HEADACHE N
01-701-102 NAUSEA N
01-701-103 VOMITING Y
01-701-105 DIZZINESS N

Output Table:

USUBJID AGE SEX ARM AEDECOD AESER
01-701-101 34 M Placebo HEADACHE N
01-701-102 29 F Drug X NAUSEA N
01-701-103 67 F Placebo VOMITING Y

5.2. left_join()

R Code:

left_join(dm, ae, by = "USUBJID")
  • left_join(dm, ae, by = "USUBJID"):
    • Keeps all rows from dm.
    • Adds columns from ae where USUBJID matches.
    • Fills with NA where there is no match in ae.

Output Table:

USUBJID AGE SEX ARM AEDECOD AESER
01-701-101 34 M Placebo HEADACHE N
01-701-102 29 F Drug X NAUSEA N
01-701-103 67 F Placebo VOMITING Y
01-701-104 54 M Drug X NA NA

5.3. right_join()

R Code:

right_join(dm, ae, by = "USUBJID")
  • right_join(dm, ae, by = "USUBJID"):
    • Keeps all rows from ae.
    • Adds columns from dm where USUBJID matches.
    • Fills with NA where there is no match in dm.

Output Table:

USUBJID AGE SEX ARM AEDECOD AESER
01-701-101 34 M Placebo HEADACHE N
01-701-102 29 F Drug X NAUSEA N
01-701-103 67 F Placebo VOMITING Y
01-701-105 NA NA NA DIZZINESS N

5.4. full_join()

R Code:

full_join(dm, ae, by = "USUBJID")
  • full_join(dm, ae, by = "USUBJID"):
    • Keeps all rows from both dm and ae.
    • Fills with NA where there is no match in either table.

Output Table:

USUBJID AGE SEX ARM AEDECOD AESER
01-701-101 34 M Placebo HEADACHE N
01-701-102 29 F Drug X NAUSEA N
01-701-103 67 F Placebo VOMITING Y
01-701-104 54 M Drug X NA NA
01-701-105 NA NA NA DIZZINESS N

5.5. semi_join()

R Code:

semi_join(dm, ae, by = "USUBJID")
  • semi_join(dm, ae, by = "USUBJID"):
    • Keeps only rows from dm where USUBJID exists in ae.
    • Returns only columns from dm.

Output Table:

USUBJID AGE SEX ARM
01-701-101 34 M Placebo
01-701-102 29 F Drug X
01-701-103 67 F Placebo

5.6. anti_join()

R Code:

anti_join(dm, ae, by = "USUBJID")
  • anti_join(dm, ae, by = "USUBJID"):
    • Keeps only rows from dm where USUBJID does NOT exist in ae.
    • Returns only columns from dm.

Output Table:

USUBJID AGE SEX ARM
01-701-104 54 M Drug X

6. Beyond the Basics

  • Join by multiple keys:
    Sometimes you need to join by more than one column (e.g., STUDYID and USUBJID).

    Input Tables:

    STUDYID USUBJID AGE
    A1001 01-701-101 34
    A1001 01-701-102 29
    STUDYID USUBJID VISIT AEDECOD
    A1001 01-701-101 1 HEADACHE
    A1001 01-701-102 1 NAUSEA

    R Code:

    left_join(dm, ae, by = c("STUDYID", "USUBJID"))
    
    • Joins on both STUDYID and USUBJID columns.

    Output Table:

    STUDYID USUBJID AGE VISIT AEDECOD
    A1001 01-701-101 34 1 HEADACHE
    A1001 01-701-102 29 1 NAUSEA

  • Suffixes for overlapping columns:
    If both tables have columns with the same name (other than the join key), use suffix = c(".dm", ".ae") to distinguish them.

    Input Tables:

    USUBJID AGE ARM
    01-701-101 34 Placebo
    USUBJID AGE AEDECOD
    01-701-101 35 HEADACHE

    R Code:

    left_join(dm, ae, by = "USUBJID", suffix = c(".dm", ".ae"))
    
    • The AGE column from dm becomes AGE.dm, from ae becomes AGE.ae.

    Output Table:

    USUBJID AGE.dm ARM AGE.ae AEDECOD
    01-701-101 34 Placebo 35 HEADACHE

  • Join with different column names:
    If the key columns have different names in each table, specify them in by.

    Input Tables:

    SUBJID AGE
    01-701-101 34
    USUBJID AEDECOD
    01-701-101 HEADACHE

    R Code:

    left_join(dm, ae, by = c("SUBJID" = "USUBJID"))
    
    • Joins dm$SUBJID to ae$USUBJID.

    Output Table:

    SUBJID AGE AEDECOD
    01-701-101 34 HEADACHE

  • Filtering joins for subsetting:
    Use semi_join() and anti_join() to filter one table based on the presence or absence of keys in another.

    Input Tables:

    USUBJID AGE SEX ARM
    01-701-101 34 M Placebo
    01-701-102 29 F Drug X
    01-701-103 67 F Placebo
    01-701-104 54 M Drug X
    USUBJID AEDECOD AESER
    01-701-101 HEADACHE N
    01-701-102 NAUSEA N
    01-701-103 VOMITING Y
    01-701-105 DIZZINESS N

    R Code:

    # Keep only subjects in dm who have an AE
    semi_join(dm, ae, by = "USUBJID")
    # Keep only subjects in dm who do NOT have an AE
    anti_join(dm, ae, by = "USUBJID")
    

    semi_join Output Table:

    USUBJID AGE SEX ARM
    01-701-101 34 M Placebo
    01-701-102 29 F Drug X
    01-701-103 67 F Placebo

    anti_join Output Table:

    USUBJID AGE SEX ARM
    01-701-104 54 M Drug X

  • Chaining joins:
    You can join multiple tables in sequence to build a comprehensive dataset.

    Input Tables:

    dm:

    USUBJID AGE SEX ARM
    01-701-101 34 M Placebo
    01-701-102 29 F Drug X

    ae:

    USUBJID AEDECOD AESER
    01-701-101 HEADACHE N
    01-701-102 NAUSEA N

    lb:

    USUBJID PARAM VALUE
    01-701-101 HGB 13.5
    01-701-102 HGB 12.8

    R Code:

    dm %>%
      left_join(ae, by = "USUBJID") %>%
      left_join(lb, by = "USUBJID")
    
    • Combines demographics, adverse events, and labs into one table by subject.

    Output Table:

    USUBJID AGE SEX ARM AEDECOD AESER PARAM VALUE
    01-701-101 34 M Placebo HEADACHE N HGB 13.5
    01-701-102 29 F Drug X NAUSEA N HGB 12.8

  • Tagging Each Record’s Source After a Full Join

Sometimes, after merging two tables, you want to know where each record originated: only in the left table, only in the right table, or in both. This is useful for data reconciliation, audit, or reporting.

Example Data:

prod_id name
P001 Widget
P002 Gadget
prod_id price
P001 10.00
P003 12.50

Method 1: Use case_when() on is.na() after full_join()

R Code:

library(dplyr)
tagged <- full_join(products, current_price, by = "prod_id") %>%
  mutate(source = case_when(
    !is.na(name) &  !is.na(price) ~ "both",
    !is.na(name) &   is.na(price) ~ "left",
     is.na(name) &  !is.na(price) ~ "right",
    TRUE                          ~ NA_character_
  ))
  • full_join(products, current_price, by = "prod_id"): merges all rows from both tables.
  • mutate(source = case_when(...)): tags each row as "both", "left", or "right" based on which columns are present.

Output Table:

prod_id name price source
P001 Widget 10.00 both
P002 Gadget NA left
P003 Doohickey 12.50 right

Method 2: Add flag columns before join

R Code:

p2 <- products %>% mutate(in_products = TRUE)
c2 <- current_price %>% mutate(in_price = TRUE)

tagged2 <- full_join(p2, c2, by = "prod_id") %>%
  mutate(source = case_when(
    in_products & in_price  ~ "both",
    in_products & !in_price ~ "left",
    !in_products & in_price ~ "right"
  )) %>%
  select(prod_id, name, price, source)
  • Adds logical flags to each table before joining.
  • After the join, uses these flags to determine the source of each row.

Output Table:

prod_id name price source
P001 Widget 10.00 both
P002 Gadget NA left
P003 Doohickey 12.50 right

  • Explanation:
    • "both": present in both tables (matched by prod_id)
    • "left": present only in the left table (products)
    • "right": present only in the right table (current_price)
    • This approach is useful for tracking data provenance after merges.

  • Encapsulate Join Logic in Reusable Functions

For cleaner, more maintainable code, you can wrap join logic in a reusable R function. This is especially useful when you need to perform different types of joins in multiple places.

R Function:

join_data <- function(df1, df2, by, type = "full") {
  library(dplyr)
  join_fn <- switch(type,
                    full = full_join,
                    inner = inner_join,
                    left  = left_join,
                    right = right_join)
  join_fn(df1, df2, by = by)
}
  • df1, df2: Data frames to join.
  • by: Key(s) to join by (e.g., "prod_id" or c("USUBJID", "VISIT")).
  • type: Type of join ("full", "inner", "left", or "right").

Usage Example:

both <- join_data(products, current_price, by = "prod_id", type = "inner")
  • Performs an inner join between products and current_price by prod_id.

Output Table (for the example above):

prod_id name price
P001 Widget 10.00
  • Explanation:
    • The function makes your join code concise and reusable.
    • You can easily switch join types by changing the type argument.

7. Conclusion

  • Joins are essential for combining and analyzing relational data in R.
  • Use the appropriate join type for your analysis goal.
  • Always check your join keys and output for expected results.
  • Explore advanced join options for more complex data wrangling, including multi-key joins, suffixes, and chaining.

**Resource download links**

2.3.5.-Joining-Two-Tables.zip