1.4. Migrating from SAS to R: A Skill Conversion Guide
1.4.14. Removing Duplicate Values in SAS and R
1. Introduction
Duplicate records are common in clinical trial datasets, especially when merging or stacking SDTM domains such as DM (Demographics) or AE (Adverse Events). Removing or extracting duplicates is a key data cleaning step for regulatory compliance, analysis, and reporting.
2. Clinical SDTM Example: AE Domain with Duplicates
Suppose we have an SDTM Adverse Events (AE) dataset with duplicate records:
| STUDYID | USUBJID | AEDECOD | AESER | AESTDTC |
|---|---|---|---|---|
| ABC123 | 01-001 | HEADACHE | Y | 2023-01-05 |
| ABC123 | 01-002 | NAUSEA | N | 2023-01-07 |
| ABC123 | 01-001 | HEADACHE | Y | 2023-01-05 |
| ABC123 | 01-003 | DIZZINESS | N | 2023-01-10 |
| ABC123 | 01-002 | NAUSEA | N | 2023-01-07 |
| ABC123 | 01-004 | FATIGUE | N | 2023-01-12 |
| ABC123 | 01-003 | DIZZINESS | N | 2023-01-10 |
3. Removing and Extracting Duplicates: SAS vs R
3.1 SAS: PROC SORT NODUPKEY and DUPOUT
/* Remove duplicates and extract duplicate records */
proc sort data=ae nodupkey dupout=ae_dups out=ae_nodup;
by STUDYID USUBJID AEDECOD AESER AESTDTC;
run;
3.2 R: Base R and dplyr Equivalents
# Sample input data
ae <- data.frame(
STUDYID = rep("ABC123", 7),
USUBJID = c("01-001", "01-002", "01-001", "01-003", "01-002", "01-004", "01-003"),
AEDECOD = c("HEADACHE", "NAUSEA", "HEADACHE", "DIZZINESS", "NAUSEA", "FATIGUE", "DIZZINESS"),
AESER = c("Y", "N", "Y", "N", "N", "N", "N"),
AESTDTC = as.Date(c("2023-01-05", "2023-01-07", "2023-01-05", "2023-01-10", "2023-01-07", "2023-01-12", "2023-01-10"))
)
# Remove duplicate rows (keep first occurrence)
ae_nodup <- unique(ae)
# Extract duplicate rows (all but the first occurrence)
ae_dups <- ae[duplicated(ae), ]
# Extract all rows that are part of a duplicate group (including first and subsequent)
ae_dups_all <- ae[duplicated(ae) | duplicated(ae, fromLast = TRUE), ]
With dplyr:
library(dplyr)
# Remove duplicates
ae_nodup <- ae %>% distinct()
# Extract duplicates (all but first occurrence)
ae_dups <- ae %>% filter(duplicated(.))
# Extract all rows that are part of a duplicate group
ae_dups_all <- ae %>% add_count(across(everything())) %>% filter(n > 1)
Explanation:
distinct()removes duplicate rows, keeping the first occurrence.filter(duplicated(.))extracts only the duplicate rows (all but the first occurrence).add_count(across(everything())) %>% filter(n > 1)tags and extracts all rows that are part of a duplicate group (including the first and subsequent occurrences).
Expected Output:
ae_nodup:
| STUDYID | USUBJID | AEDECOD | AESER | AESTDTC |
|---|---|---|---|---|
| ABC123 | 01-001 | HEADACHE | Y | 2023-01-05 |
| ABC123 | 01-002 | NAUSEA | N | 2023-01-07 |
| ABC123 | 01-003 | DIZZINESS | N | 2023-01-10 |
| ABC123 | 01-004 | FATIGUE | N | 2023-01-12 |
ae_dups:
| STUDYID | USUBJID | AEDECOD | AESER | AESTDTC |
|---|---|---|---|---|
| ABC123 | 01-001 | HEADACHE | Y | 2023-01-05 |
| ABC123 | 01-002 | NAUSEA | N | 2023-01-07 |
| ABC123 | 01-003 | DIZZINESS | N | 2023-01-10 |
ae_dups_all:
| STUDYID | USUBJID | AEDECOD | AESER | AESTDTC |
|---|---|---|---|---|
| ABC123 | 01-001 | HEADACHE | Y | 2023-01-05 |
| ABC123 | 01-001 | HEADACHE | Y | 2023-01-05 |
| ABC123 | 01-002 | NAUSEA | N | 2023-01-07 |
| ABC123 | 01-002 | NAUSEA | N | 2023-01-07 |
| ABC123 | 01-003 | DIZZINESS | N | 2023-01-10 |
| ABC123 | 01-003 | DIZZINESS | N | 2023-01-10 |
dplyr equivalents produce the same outputs as above.
4. Removing Duplicates by Key Variables (Subset of Columns)
Often, you need to deduplicate by a subset of columns (e.g., by subject and event term).
SAS Example:
proc sort data=ae nodupkey out=ae_nodup_key;
by USUBJID AEDECOD;
run;
Explanation:
- The
by USUBJID AEDECOD;statement tells SAS to consider only these columns for identifying duplicates. nodupkeykeeps only the first occurrence for each unique combination of USUBJID and AEDECOD.- The output dataset
ae_nodup_keywill have only one row per subject and event term.
R Example:
# Base R
ae_nodup_key <- ae[!duplicated(ae[c("USUBJID", "AEDECOD")]), ]
# dplyr
ae_nodup_key <- ae %>% distinct(USUBJID, AEDECOD, .keep_all = TRUE)
Explanation:
- In base R,
duplicated(ae[c("USUBJID", "AEDECOD")])checks for duplicates based only on those columns. - The
!negates the logical vector, so only the first occurrence is kept. - In dplyr,
distinct(USUBJID, AEDECOD, .keep_all = TRUE)keeps the first row for each unique combination of USUBJID and AEDECOD, retaining all columns.
Expected Output:
| STUDYID | USUBJID | AEDECOD | AESER | AESTDTC |
|---|---|---|---|---|
| ABC123 | 01-001 | HEADACHE | Y | 2023-01-05 |
| ABC123 | 01-002 | NAUSEA | N | 2023-01-07 |
| ABC123 | 01-003 | DIZZINESS | N | 2023-01-10 |
| ABC123 | 01-004 | FATIGUE | N | 2023-01-12 |
5. Daily Use Cases for Removing Duplicates
- Merging Data from Multiple Sources: Prevents double-counting when stacking or merging datasets.
- Data Cleaning Before Analysis: Ensures each subject/event is counted only once for summaries.
- Generating Listings for Review: Removes duplicates before sending data for medical review or queries.
- SDTM/ADaM Preparation: Ensures unique records per key variables as required by CDISC standards.
- Checking for Data Entry Errors: Identifies accidental duplicate entries for the same subject/event.
- Creating Unique Subject Lists: For randomization, safety listings, or subject disposition tables.
Example: Remove duplicate subjects in Demographics (DM) domain
SAS:
proc sort data=dm nodupkey out=dm_nodup;
by STUDYID USUBJID;
run;
R:
# Dummy data for dm
dm <- data.frame(
STUDYID = rep("ABC123", 5),
USUBJID = c("01-001", "01-002", "01-001", "01-003", "01-004"),
SEX = c("M", "F", "M", "F", "M"),
AGE = c(34, 29, 34, 41, 38)
)
dm_nodup <- dm[!duplicated(dm[c("STUDYID", "USUBJID")]), ]
# or with dplyr
dm_nodup <- dm %>% distinct(STUDYID, USUBJID, .keep_all = TRUE)
6. Advanced Options and Techniques
6.1 Keeping the Last Occurrence Instead of the First
SAS:
proc sort data=ae;
by STUDYID USUBJID AEDECOD AESER descending AESTDTC;
run;
proc sort data=ae nodupkey out=ae_last;
by STUDYID USUBJID AEDECOD AESER;
run;
Expected Output:
| STUDYID | USUBJID | AEDECOD | AESER | AESTDTC |
|---|---|---|---|---|
| ABC123 | 01-001 | HEADACHE | Y | 2023-01-05 |
| ABC123 | 01-002 | NAUSEA | N | 2023-01-07 |
| ABC123 | 01-003 | DIZZINESS | N | 2023-01-10 |
| ABC123 | 01-004 | FATIGUE | N | 2023-01-12 |
R:
# Base R: Keep last occurrence
ae_last <- ae[!duplicated(ae, fromLast = TRUE), ]
# dplyr: Keep last by key
ae_last <- ae %>% arrange(STUDYID, USUBJID, AEDECOD, desc(AESTDTC)) %>%
distinct(STUDYID, USUBJID, AEDECOD, .keep_all = TRUE)
Expected Output: (same as above, but if dates differ, the latest per group is kept)
6.2 Counting Duplicates
SAS:
proc freq data=ae;
tables STUDYID*USUBJID*AEDECOD / noprint out=ae_counts;
run;
Expected Output:
| STUDYID | USUBJID | AEDECOD | COUNT |
|---|---|---|---|
| ABC123 | 01-001 | HEADACHE | 2 |
| ABC123 | 01-002 | NAUSEA | 2 |
| ABC123 | 01-003 | DIZZINESS | 2 |
| ABC123 | 01-004 | FATIGUE | 1 |
R:
ae_counts <- ae %>% count(STUDYID, USUBJID, AEDECOD)
Expected Output: (same as above)
6.3 Flagging Duplicates
SAS:
data ae_flag;
set ae;
by STUDYID USUBJID AEDECOD AESER AESTDTC;
if first.AEDECOD and last.AEDECOD then dup_flag=0;
else dup_flag=1;
run;
Expected Output:
| STUDYID | USUBJID | AEDECOD | AESER | AESTDTC | dup_flag |
|---|---|---|---|---|---|
| ... | ... | ... | ... | ... | 0 or 1 |
R:
ae$dup_flag <- duplicated(ae) | duplicated(ae, fromLast = TRUE)
Expected Output:
| STUDYID | USUBJID | AEDECOD | AESER | AESTDTC | dup_flag |
|---|---|---|---|---|---|
| ABC123 | 01-001 | HEADACHE | Y | 2023-01-05 | TRUE |
| ABC123 | 01-002 | NAUSEA | N | 2023-01-07 | TRUE |
| ... | ... | ... | ... | ... | ... |
6.4 Removing Duplicates with Conditions
SAS:
proc sort data=ae(where=(AESER='Y')) nodupkey out=ae_serious_nodup;
by USUBJID AEDECOD;
run;
Expected Output:
| USUBJID | AEDECOD | AESER | ... |
|---|---|---|---|
| 01-001 | HEADACHE | Y | ... |
R:
ae_serious_nodup <- ae %>% filter(AESER == "Y") %>% distinct(USUBJID, AEDECOD, .keep_all = TRUE)
Expected Output: (same as above)
7. Summary Table: SAS vs R Approaches
| Task | SAS (PROC SORT) | R (Base) | R (dplyr/tidyverse) |
|---|---|---|---|
| Remove all duplicates | NODUPKEY | unique() | distinct() |
| Remove by key columns | BY col1 col2 ... | duplicated() | distinct(col1, ...) |
| Extract duplicates | DUPOUT= | duplicated() | filter(duplicated(.)) |
| Extract all in dup group | DUPOUT + manual | duplicated() | duplicated(..., fromLast=TRUE) | add_count() + filter(n > 1) |
| Keep last occurrence | Sort desc + NODUPKEY | fromLast=TRUE | arrange() + distinct() |
| Count duplicates | PROC FREQ | table(), ave | count() |
| Flag duplicates | BY group logic | duplicated() | mutate() |
**Resource download links**
1.4.14.-Removing-Duplicate-Values-in-SAS-and-R.zip