contact@a2zlearners.com

1.4. Migrating from SAS to R: A Skill Conversion Guide

1.4.14. Removing Duplicate Values in SAS and R

1. Introduction

Duplicate records are common in clinical trial datasets, especially when merging or stacking SDTM domains such as DM (Demographics) or AE (Adverse Events). Removing or extracting duplicates is a key data cleaning step for regulatory compliance, analysis, and reporting.

2. Clinical SDTM Example: AE Domain with Duplicates

Suppose we have an SDTM Adverse Events (AE) dataset with duplicate records:

STUDYID USUBJID AEDECOD AESER AESTDTC
ABC123 01-001 HEADACHE Y 2023-01-05
ABC123 01-002 NAUSEA N 2023-01-07
ABC123 01-001 HEADACHE Y 2023-01-05
ABC123 01-003 DIZZINESS N 2023-01-10
ABC123 01-002 NAUSEA N 2023-01-07
ABC123 01-004 FATIGUE N 2023-01-12
ABC123 01-003 DIZZINESS N 2023-01-10

3. Removing and Extracting Duplicates: SAS vs R

3.1 SAS: PROC SORT NODUPKEY and DUPOUT
/* Remove duplicates and extract duplicate records */
proc sort data=ae nodupkey dupout=ae_dups out=ae_nodup;
  by STUDYID USUBJID AEDECOD AESER AESTDTC;
run;
3.2 R: Base R and dplyr Equivalents
# Sample input data
ae <- data.frame(
  STUDYID = rep("ABC123", 7),
  USUBJID = c("01-001", "01-002", "01-001", "01-003", "01-002", "01-004", "01-003"),
  AEDECOD = c("HEADACHE", "NAUSEA", "HEADACHE", "DIZZINESS", "NAUSEA", "FATIGUE", "DIZZINESS"),
  AESER = c("Y", "N", "Y", "N", "N", "N", "N"),
  AESTDTC = as.Date(c("2023-01-05", "2023-01-07", "2023-01-05", "2023-01-10", "2023-01-07", "2023-01-12", "2023-01-10"))
)

# Remove duplicate rows (keep first occurrence)
ae_nodup <- unique(ae)

# Extract duplicate rows (all but the first occurrence)
ae_dups <- ae[duplicated(ae), ]

# Extract all rows that are part of a duplicate group (including first and subsequent)
ae_dups_all <- ae[duplicated(ae) | duplicated(ae, fromLast = TRUE), ]

With dplyr:

library(dplyr)
# Remove duplicates
ae_nodup <- ae %>% distinct()

# Extract duplicates (all but first occurrence)
ae_dups <- ae %>% filter(duplicated(.))

# Extract all rows that are part of a duplicate group
ae_dups_all <- ae %>% add_count(across(everything())) %>% filter(n > 1)

Explanation:

  • distinct() removes duplicate rows, keeping the first occurrence.
  • filter(duplicated(.)) extracts only the duplicate rows (all but the first occurrence).
  • add_count(across(everything())) %>% filter(n > 1) tags and extracts all rows that are part of a duplicate group (including the first and subsequent occurrences).

Expected Output:

ae_nodup:

STUDYID USUBJID AEDECOD AESER AESTDTC
ABC123 01-001 HEADACHE Y 2023-01-05
ABC123 01-002 NAUSEA N 2023-01-07
ABC123 01-003 DIZZINESS N 2023-01-10
ABC123 01-004 FATIGUE N 2023-01-12

ae_dups:

STUDYID USUBJID AEDECOD AESER AESTDTC
ABC123 01-001 HEADACHE Y 2023-01-05
ABC123 01-002 NAUSEA N 2023-01-07
ABC123 01-003 DIZZINESS N 2023-01-10

ae_dups_all:

STUDYID USUBJID AEDECOD AESER AESTDTC
ABC123 01-001 HEADACHE Y 2023-01-05
ABC123 01-001 HEADACHE Y 2023-01-05
ABC123 01-002 NAUSEA N 2023-01-07
ABC123 01-002 NAUSEA N 2023-01-07
ABC123 01-003 DIZZINESS N 2023-01-10
ABC123 01-003 DIZZINESS N 2023-01-10

dplyr equivalents produce the same outputs as above.


4. Removing Duplicates by Key Variables (Subset of Columns)

Often, you need to deduplicate by a subset of columns (e.g., by subject and event term).

SAS Example:

proc sort data=ae nodupkey out=ae_nodup_key;
  by USUBJID AEDECOD;
run;

Explanation:

  • The by USUBJID AEDECOD; statement tells SAS to consider only these columns for identifying duplicates.
  • nodupkey keeps only the first occurrence for each unique combination of USUBJID and AEDECOD.
  • The output dataset ae_nodup_key will have only one row per subject and event term.

R Example:

# Base R
ae_nodup_key <- ae[!duplicated(ae[c("USUBJID", "AEDECOD")]), ]

# dplyr
ae_nodup_key <- ae %>% distinct(USUBJID, AEDECOD, .keep_all = TRUE)

Explanation:

  • In base R, duplicated(ae[c("USUBJID", "AEDECOD")]) checks for duplicates based only on those columns.
  • The ! negates the logical vector, so only the first occurrence is kept.
  • In dplyr, distinct(USUBJID, AEDECOD, .keep_all = TRUE) keeps the first row for each unique combination of USUBJID and AEDECOD, retaining all columns.

Expected Output:

STUDYID USUBJID AEDECOD AESER AESTDTC
ABC123 01-001 HEADACHE Y 2023-01-05
ABC123 01-002 NAUSEA N 2023-01-07
ABC123 01-003 DIZZINESS N 2023-01-10
ABC123 01-004 FATIGUE N 2023-01-12

5. Daily Use Cases for Removing Duplicates

  • Merging Data from Multiple Sources: Prevents double-counting when stacking or merging datasets.
  • Data Cleaning Before Analysis: Ensures each subject/event is counted only once for summaries.
  • Generating Listings for Review: Removes duplicates before sending data for medical review or queries.
  • SDTM/ADaM Preparation: Ensures unique records per key variables as required by CDISC standards.
  • Checking for Data Entry Errors: Identifies accidental duplicate entries for the same subject/event.
  • Creating Unique Subject Lists: For randomization, safety listings, or subject disposition tables.

Example: Remove duplicate subjects in Demographics (DM) domain

SAS:

proc sort data=dm nodupkey out=dm_nodup;
  by STUDYID USUBJID;
run;

R:

# Dummy data for dm
dm <- data.frame(
  STUDYID = rep("ABC123", 5),
  USUBJID = c("01-001", "01-002", "01-001", "01-003", "01-004"),
  SEX = c("M", "F", "M", "F", "M"),
  AGE = c(34, 29, 34, 41, 38)
)

dm_nodup <- dm[!duplicated(dm[c("STUDYID", "USUBJID")]), ]
# or with dplyr
dm_nodup <- dm %>% distinct(STUDYID, USUBJID, .keep_all = TRUE)

6. Advanced Options and Techniques

6.1 Keeping the Last Occurrence Instead of the First

SAS:

proc sort data=ae;
  by STUDYID USUBJID AEDECOD AESER descending AESTDTC;
run;
proc sort data=ae nodupkey out=ae_last;
  by STUDYID USUBJID AEDECOD AESER;
run;

Expected Output:

STUDYID USUBJID AEDECOD AESER AESTDTC
ABC123 01-001 HEADACHE Y 2023-01-05
ABC123 01-002 NAUSEA N 2023-01-07
ABC123 01-003 DIZZINESS N 2023-01-10
ABC123 01-004 FATIGUE N 2023-01-12

R:

# Base R: Keep last occurrence
ae_last <- ae[!duplicated(ae, fromLast = TRUE), ]

# dplyr: Keep last by key
ae_last <- ae %>% arrange(STUDYID, USUBJID, AEDECOD, desc(AESTDTC)) %>%
  distinct(STUDYID, USUBJID, AEDECOD, .keep_all = TRUE)

Expected Output: (same as above, but if dates differ, the latest per group is kept)


6.2 Counting Duplicates

SAS:

proc freq data=ae;
  tables STUDYID*USUBJID*AEDECOD / noprint out=ae_counts;
run;

Expected Output:

STUDYID USUBJID AEDECOD COUNT
ABC123 01-001 HEADACHE 2
ABC123 01-002 NAUSEA 2
ABC123 01-003 DIZZINESS 2
ABC123 01-004 FATIGUE 1

R:

ae_counts <- ae %>% count(STUDYID, USUBJID, AEDECOD)

Expected Output: (same as above)


6.3 Flagging Duplicates

SAS:

data ae_flag;
  set ae;
  by STUDYID USUBJID AEDECOD AESER AESTDTC;
  if first.AEDECOD and last.AEDECOD then dup_flag=0;
  else dup_flag=1;
run;

Expected Output:

STUDYID USUBJID AEDECOD AESER AESTDTC dup_flag
... ... ... ... ... 0 or 1

R:

ae$dup_flag <- duplicated(ae) | duplicated(ae, fromLast = TRUE)

Expected Output:

STUDYID USUBJID AEDECOD AESER AESTDTC dup_flag
ABC123 01-001 HEADACHE Y 2023-01-05 TRUE
ABC123 01-002 NAUSEA N 2023-01-07 TRUE
... ... ... ... ... ...

6.4 Removing Duplicates with Conditions

SAS:

proc sort data=ae(where=(AESER='Y')) nodupkey out=ae_serious_nodup;
  by USUBJID AEDECOD;
run;

Expected Output:

USUBJID AEDECOD AESER ...
01-001 HEADACHE Y ...

R:

ae_serious_nodup <- ae %>% filter(AESER == "Y") %>% distinct(USUBJID, AEDECOD, .keep_all = TRUE)

Expected Output: (same as above)


7. Summary Table: SAS vs R Approaches

Task SAS (PROC SORT) R (Base) R (dplyr/tidyverse)
Remove all duplicates NODUPKEY unique() distinct()
Remove by key columns BY col1 col2 ... duplicated() distinct(col1, ...)
Extract duplicates DUPOUT= duplicated() filter(duplicated(.))
Extract all in dup group DUPOUT + manual duplicated() | duplicated(..., fromLast=TRUE) add_count() + filter(n > 1)
Keep last occurrence Sort desc + NODUPKEY fromLast=TRUE arrange() + distinct()
Count duplicates PROC FREQ table(), ave count()
Flag duplicates BY group logic duplicated() mutate()

**Resource download links**

1.4.14.-Removing-Duplicate-Values-in-SAS-and-R.zip