1.4.14. Removing Duplicate Values in SAS and R

1.4. Migrating from SAS to R: A Skill Conversion Guide

1.4.14. Removing Duplicate Values in SAS and R

1. Introduction

Duplicate records are common in clinical trial datasets, especially when merging or stacking SDTM domains such as DM (Demographics) or AE (Adverse Events). Removing or extracting duplicates is a key data cleaning step for regulatory compliance, analysis, and reporting.

2. Clinical SDTM Example: AE Domain with Duplicates

Suppose we have an SDTM Adverse Events (AE) dataset with duplicate records:

STUDYID	USUBJID	AEDECOD	AESER	AESTDTC
ABC123	01-001	HEADACHE	Y	2023-01-05
ABC123	01-002	NAUSEA	N	2023-01-07
ABC123	01-001	HEADACHE	Y	2023-01-05
ABC123	01-003	DIZZINESS	N	2023-01-10
ABC123	01-002	NAUSEA	N	2023-01-07
ABC123	01-004	FATIGUE	N	2023-01-12
ABC123	01-003	DIZZINESS	N	2023-01-10

3. Removing and Extracting Duplicates: SAS vs R

3.1 SAS: PROC SORT NODUPKEY and DUPOUT

/* Remove duplicates and extract duplicate records */
proc sort data=ae nodupkey dupout=ae_dups out=ae_nodup;
  by STUDYID USUBJID AEDECOD AESER AESTDTC;
run;

3.2 R: Base R and dplyr Equivalents

# Sample input data
ae <- data.frame(
  STUDYID = rep("ABC123", 7),
  USUBJID = c("01-001", "01-002", "01-001", "01-003", "01-002", "01-004", "01-003"),
  AEDECOD = c("HEADACHE", "NAUSEA", "HEADACHE", "DIZZINESS", "NAUSEA", "FATIGUE", "DIZZINESS"),
  AESER = c("Y", "N", "Y", "N", "N", "N", "N"),
  AESTDTC = as.Date(c("2023-01-05", "2023-01-07", "2023-01-05", "2023-01-10", "2023-01-07", "2023-01-12", "2023-01-10"))
)

# Remove duplicate rows (keep first occurrence)
ae_nodup <- unique(ae)

# Extract duplicate rows (all but the first occurrence)
ae_dups <- ae[duplicated(ae), ]

# Extract all rows that are part of a duplicate group (including first and subsequent)
ae_dups_all <- ae[duplicated(ae) | duplicated(ae, fromLast = TRUE), ]

With dplyr:

library(dplyr)
# Remove duplicates
ae_nodup <- ae %>% distinct()

# Extract duplicates (all but first occurrence)
ae_dups <- ae %>% filter(duplicated(.))

# Extract all rows that are part of a duplicate group
ae_dups_all <- ae %>% add_count(across(everything())) %>% filter(n > 1)

Explanation:

distinct() removes duplicate rows, keeping the first occurrence.
filter(duplicated(.)) extracts only the duplicate rows (all but the first occurrence).
add_count(across(everything())) %>% filter(n > 1) tags and extracts all rows that are part of a duplicate group (including the first and subsequent occurrences).

Expected Output:

ae_nodup:

STUDYID	USUBJID	AEDECOD	AESER	AESTDTC
ABC123	01-001	HEADACHE	Y	2023-01-05
ABC123	01-002	NAUSEA	N	2023-01-07
ABC123	01-003	DIZZINESS	N	2023-01-10
ABC123	01-004	FATIGUE	N	2023-01-12

ae_dups:

STUDYID	USUBJID	AEDECOD	AESER	AESTDTC
ABC123	01-001	HEADACHE	Y	2023-01-05
ABC123	01-002	NAUSEA	N	2023-01-07
ABC123	01-003	DIZZINESS	N	2023-01-10

ae_dups_all:

STUDYID	USUBJID	AEDECOD	AESER	AESTDTC
ABC123	01-001	HEADACHE	Y	2023-01-05
ABC123	01-001	HEADACHE	Y	2023-01-05
ABC123	01-002	NAUSEA	N	2023-01-07
ABC123	01-002	NAUSEA	N	2023-01-07
ABC123	01-003	DIZZINESS	N	2023-01-10
ABC123	01-003	DIZZINESS	N	2023-01-10

dplyr equivalents produce the same outputs as above.

4. Removing Duplicates by Key Variables (Subset of Columns)

Often, you need to deduplicate by a subset of columns (e.g., by subject and event term).

SAS Example:

proc sort data=ae nodupkey out=ae_nodup_key;
  by USUBJID AEDECOD;
run;

Explanation:

The by USUBJID AEDECOD; statement tells SAS to consider only these columns for identifying duplicates.
nodupkey keeps only the first occurrence for each unique combination of USUBJID and AEDECOD.
The output dataset ae_nodup_key will have only one row per subject and event term.

R Example:

# Base R
ae_nodup_key <- ae[!duplicated(ae[c("USUBJID", "AEDECOD")]), ]

# dplyr
ae_nodup_key <- ae %>% distinct(USUBJID, AEDECOD, .keep_all = TRUE)

Explanation:

In base R, duplicated(ae[c("USUBJID", "AEDECOD")]) checks for duplicates based only on those columns.
The ! negates the logical vector, so only the first occurrence is kept.
In dplyr, distinct(USUBJID, AEDECOD, .keep_all = TRUE) keeps the first row for each unique combination of USUBJID and AEDECOD, retaining all columns.

Expected Output:

STUDYID	USUBJID	AEDECOD	AESER	AESTDTC
ABC123	01-001	HEADACHE	Y	2023-01-05
ABC123	01-002	NAUSEA	N	2023-01-07
ABC123	01-003	DIZZINESS	N	2023-01-10
ABC123	01-004	FATIGUE	N	2023-01-12

5. Daily Use Cases for Removing Duplicates

Merging Data from Multiple Sources: Prevents double-counting when stacking or merging datasets.
Data Cleaning Before Analysis: Ensures each subject/event is counted only once for summaries.
Generating Listings for Review: Removes duplicates before sending data for medical review or queries.
SDTM/ADaM Preparation: Ensures unique records per key variables as required by CDISC standards.
Checking for Data Entry Errors: Identifies accidental duplicate entries for the same subject/event.
Creating Unique Subject Lists: For randomization, safety listings, or subject disposition tables.

Example: Remove duplicate subjects in Demographics (DM) domain

SAS:

proc sort data=dm nodupkey out=dm_nodup;
  by STUDYID USUBJID;
run;

# Dummy data for dm
dm <- data.frame(
  STUDYID = rep("ABC123", 5),
  USUBJID = c("01-001", "01-002", "01-001", "01-003", "01-004"),
  SEX = c("M", "F", "M", "F", "M"),
  AGE = c(34, 29, 34, 41, 38)
)

dm_nodup <- dm[!duplicated(dm[c("STUDYID", "USUBJID")]), ]
# or with dplyr
dm_nodup <- dm %>% distinct(STUDYID, USUBJID, .keep_all = TRUE)

6. Advanced Options and Techniques

6.1 Keeping the Last Occurrence Instead of the First

SAS:

proc sort data=ae;
  by STUDYID USUBJID AEDECOD AESER descending AESTDTC;
run;
proc sort data=ae nodupkey out=ae_last;
  by STUDYID USUBJID AEDECOD AESER;
run;

Expected Output:

STUDYID	USUBJID	AEDECOD	AESER	AESTDTC
ABC123	01-001	HEADACHE	Y	2023-01-05
ABC123	01-002	NAUSEA	N	2023-01-07
ABC123	01-003	DIZZINESS	N	2023-01-10
ABC123	01-004	FATIGUE	N	2023-01-12

# Base R: Keep last occurrence
ae_last <- ae[!duplicated(ae, fromLast = TRUE), ]

# dplyr: Keep last by key
ae_last <- ae %>% arrange(STUDYID, USUBJID, AEDECOD, desc(AESTDTC)) %>%
  distinct(STUDYID, USUBJID, AEDECOD, .keep_all = TRUE)

Expected Output: (same as above, but if dates differ, the latest per group is kept)

6.2 Counting Duplicates

SAS:

proc freq data=ae;
  tables STUDYID*USUBJID*AEDECOD / noprint out=ae_counts;
run;

Expected Output:

STUDYID	USUBJID	AEDECOD	COUNT
ABC123	01-001	HEADACHE	2
ABC123	01-002	NAUSEA	2
ABC123	01-003	DIZZINESS	2
ABC123	01-004	FATIGUE	1

ae_counts <- ae %>% count(STUDYID, USUBJID, AEDECOD)

Expected Output: (same as above)

6.3 Flagging Duplicates

SAS:

data ae_flag;
  set ae;
  by STUDYID USUBJID AEDECOD AESER AESTDTC;
  if first.AEDECOD and last.AEDECOD then dup_flag=0;
  else dup_flag=1;
run;

Expected Output:

STUDYID	USUBJID	AEDECOD	AESER	AESTDTC	dup_flag
...	...	...	...	...	0 or 1

ae$dup_flag <- duplicated(ae) | duplicated(ae, fromLast = TRUE)

Expected Output:

STUDYID	USUBJID	AEDECOD	AESER	AESTDTC	dup_flag
ABC123	01-001	HEADACHE	Y	2023-01-05	TRUE
ABC123	01-002	NAUSEA	N	2023-01-07	TRUE
...	...	...	...	...	...

6.4 Removing Duplicates with Conditions

SAS:

proc sort data=ae(where=(AESER='Y')) nodupkey out=ae_serious_nodup;
  by USUBJID AEDECOD;
run;

Expected Output:

USUBJID	AEDECOD	AESER	...
01-001	HEADACHE	Y	...

ae_serious_nodup <- ae %>% filter(AESER == "Y") %>% distinct(USUBJID, AEDECOD, .keep_all = TRUE)

Expected Output: (same as above)

7. Summary Table: SAS vs R Approaches

Task	SAS (PROC SORT)	R (Base)	R (dplyr/tidyverse)
Remove all duplicates	NODUPKEY	unique()	distinct()
Remove by key columns	BY col1 col2 ...	duplicated()	distinct(col1, ...)
Extract duplicates	DUPOUT=	duplicated()	filter(duplicated(.))
Extract all in dup group	DUPOUT + manual	duplicated() \| duplicated(..., fromLast=TRUE)	add_count() + filter(n > 1)
Keep last occurrence	Sort desc + NODUPKEY	fromLast=TRUE	arrange() + distinct()
Count duplicates	PROC FREQ	table(), ave	count()
Flag duplicates	BY group logic	duplicated()	mutate()

Resource download links

1.4.14.-Removing-Duplicate-Values-in-SAS-and-R.zip

⁂

1.4. Migrating from SAS to R: A Skill Conversion Guide