2.3.4. Separating Merging Cleaning Columns

2.3.4. Separating, merging, cleaning Columns

1. Introduction

In SDTM and clinical data, it's common to encounter columns that combine multiple pieces of information (e.g., AE term and severity, lab parameter and value). For tidy data and easier analysis, it's best to split these into separate columns. R's tidyr package provides separate() for splitting and unite() for merging columns. Cleaning column names is also essential for efficient coding.

2. Why Separate Columns?

Ensures each variable has its own column (tidy data principle).
Makes filtering, summarizing, and reporting easier.
Enables more flexible and accurate SDTM/clinical analysis.
Helps with data cleaning and preparation for regulatory submissions.

3. Example: SDTM AE Dataset with Combined Columns

Suppose you have an SDTM Adverse Events (AE) dataset where the AE term and severity are combined in a single column.

Input Table:

# Create the input table for AE dataset
ae <- data.frame(
  USUBJID = c("01-701-101", "01-701-102", "01-701-103", "01-701-104"),
  AE_TERM_SEV = c("Headache:MILD", "Nausea:MODERATE", "Vomiting:SEVERE", "Dizziness:MILD"),
  AESER = c("N", "N", "Y", "N")
)

# Display the input table
print(ae)

USUBJID	AE_TERM_SEV	AESER
01-701-101	Headache:MILD	N
01-701-102	Nausea:MODERATE	N
01-701-103	Vomiting:SEVERE	Y
01-701-104	Dizziness:MILD	N

4. Separating a Combined Column with separate()

Use separate() to split a column into two or more columns based on a separator.

R Code:

library(tidyr)
ae %>%
  separate(AE_TERM_SEV, into = c("AETERM", "AESEV"), sep = ":")

Output Table:

USUBJID	AETERM	AESEV	AESER
01-701-101	Headache	MILD	N
01-701-102	Nausea	MODERATE	N
01-701-103	Vomiting	SEVERE	Y
01-701-104	Dizziness	MILD	N

Explanation:
- Splits AE_TERM_SEV into AETERM and AESEV at the colon.
- Each variable is now in its own column for easier SDTM analysis.

5. Beyond the Basics: Advanced SDTM Examples

Separate into more than two columns:
If your SDTM column contains more than two pieces of information, specify more column names in into.

Input Table:
```
# Create the input table for AE_COMBINED dataset
ae <- data.frame(
  USUBJID = c("01-701-101", "01-701-102"),
  AE_COMBINED = c("Headache:MILD:2024-01-01", "Nausea:MODERATE:2024-01-02")
)

# Display the input table
print(ae)
```
USUBJID AE_COMBINED

01-701-101 Headache:MILD:2024-01-01

01-701-102 Nausea:MODERATE:2024-01-02

R Code:
```
ae %>% separate(AE_COMBINED, into = c("AETERM", "AESEV", "AESTDTC"), sep = ":")
```
- ae %>% passes the AE dataset to the next function using the pipe.
- separate(AE_COMBINED, into = c("AETERM", "AESEV", "AESTDTC"), sep = ":"):
  - Splits the AE_COMBINED column into three new columns: AETERM, AESEV, and AESTDTC.
  - The sep = ":" argument tells R to split at each colon.
  - Each part is assigned to the corresponding new column.
Output Table:

USUBJID AETERM AESEV AESTDTC

01-701-101 Headache MILD 2024-01-01

01-701-102 Nausea MODERATE 2024-01-02

USUBJID	AE_COMBINED
01-701-101	Headache:MILD:2024-01-01
01-701-102	Nausea:MODERATE:2024-01-02

USUBJID	AETERM	AESEV	AESTDTC
01-701-101	Headache	MILD	2024-01-01
01-701-102	Nausea	MODERATE	2024-01-02

Split by position:
Use sep = c(8) to split after the 8th character, for fixed-width SDTM fields.

Input Table:
```
# Create the input table for AE_FIXED dataset
ae <- data.frame(
  USUBJID = c("01-701-101", "01-701-102"),
  AE_FIXED = c("HeadacheMILD", "Nausea  MODERATE")
)

# Display the input table
print(ae)
```
USUBJID AE_FIXED

01-701-101 HeadacheMILD

01-701-102 Nausea MODERATE

R Code:
```
ae %>% separate(AE_FIXED, into = c("AETERM", "AESEV"), sep = c(8))
```
- ae %>% pipes the dataset to separate().
- separate(AE_FIXED, into = c("AETERM", "AESEV"), sep = c(8)):
  - Splits the AE_FIXED column into AETERM and AESEV after the 8th character.
  - Useful for fixed-width fields where the split position is known.
Output Table:

USUBJID AETERM AESEV

01-701-101 Headache MILD

01-701-102 Nausea MODERATE

USUBJID	AE_FIXED
01-701-101	HeadacheMILD
01-701-102	Nausea MODERATE

USUBJID	AETERM	AESEV
01-701-101	Headache	MILD
01-701-102	Nausea	MODERATE

Convert types automatically:
Use convert = TRUE to automatically convert separated columns to numeric, integer, etc.

Input Table:
```
# Create the input table for LB_COMBINED dataset
lb <- data.frame(
  USUBJID = c("01-701-101", "01-701-102"),
  LB_COMBINED = c("HGB:13.5", "HGB:12.8")
)

# Display the input table
print(lb)
```
USUBJID LB_COMBINED

01-701-101 HGB:13.5

01-701-102 HGB:12.8

R Code:
```
lb %>% separate(LB_COMBINED, into = c("PARAM", "VALUE"), sep = ":", convert = TRUE)
```
- lb %>% pipes the lab dataset to separate().
- separate(LB_COMBINED, into = c("PARAM", "VALUE"), sep = ":", convert = TRUE):
  - Splits LB_COMBINED into PARAM and VALUE at the colon.
  - convert = TRUE automatically converts VALUE to numeric.
Output Table:

USUBJID PARAM VALUE

01-701-101 HGB 13.5

01-701-102 HGB 12.8

USUBJID	LB_COMBINED
01-701-101	HGB:13.5
01-701-102	HGB:12.8

USUBJID	PARAM	VALUE
01-701-101	HGB	13.5
01-701-102	HGB	12.8

Keep original column:
Use remove = FALSE to keep the original column after separation.

Input Table:
```
# Create the input table for AE_TERM_SEV dataset
ae <- data.frame(
  USUBJID = c("01-701-101"),
  AE_TERM_SEV = c("Headache:MILD")
)

# Display the input table
print(ae)
```
USUBJID AE_TERM_SEV

01-701-101 Headache:MILD

R Code:
```
ae %>% separate(AE_TERM_SEV, into = c("AETERM", "AESEV"), sep = ":", remove = FALSE)
```
- ae %>% pipes the dataset to separate().
- separate(AE_TERM_SEV, into = c("AETERM", "AESEV"), sep = ":", remove = FALSE):
  - Splits AE_TERM_SEV into AETERM and AESEV at the colon.
  - remove = FALSE keeps the original AE_TERM_SEV column in the output.
Output Table:

USUBJID AE_TERM_SEV AETERM AESEV

01-701-101 Headache:MILD Headache MILD

USUBJID	AE_TERM_SEV
01-701-101	Headache:MILD

USUBJID	AE_TERM_SEV	AETERM	AESEV
01-701-101	Headache:MILD	Headache	MILD

Unite columns:
Use unite() to combine multiple columns into one (the reverse of separate()).

Input Table:
```
# Create the input table for AETERM and AESEV dataset
ae <- data.frame(
  USUBJID = c("01-701-101", "01-701-102"),
  AETERM = c("Headache", "Nausea"),
  AESEV = c("MILD", "MODERATE")
)

# Display the input table
print(ae)
```
USUBJID AETERM AESEV

01-701-101 Headache MILD

01-701-102 Nausea MODERATE

R Code:
```
ae %>% unite("AE_TERM_SEV", AETERM, AESEV, sep = ":")
```
- ae %>% pipes the dataset to unite().
- unite("AE_TERM_SEV", AETERM, AESEV, sep = ":"):
  - Combines AETERM and AESEV into a single column AE_TERM_SEV with a colon separator.
  - Useful for creating a combined field for export or reporting.
Output Table:

USUBJID AE_TERM_SEV

01-701-101 Headache:MILD

01-701-102 Nausea:MODERATE

USUBJID	AETERM	AESEV
01-701-101	Headache	MILD
01-701-102	Nausea	MODERATE

USUBJID	AE_TERM_SEV
01-701-101	Headache:MILD
01-701-102	Nausea:MODERATE

6. Cleaning Column Names with clean_names()

Input Table:

# Create the input table for AE dataset with inconsistent column names
ae <- data.frame(
  `AE Term/Severity` = c("Headache:MILD", "Nausea:MODERATE"),
  `AE Serious?` = c("N", "N")
)

# Display the input table
print(ae)

AE Term/Severity	AE Serious?
Headache:MILD	N
Nausea:MODERATE	N

R Code:

library(janitor)
ae %>% clean_names()

ae %>% pipes the dataset to clean_names().
clean_names():
- Converts all column names to lowercase.
- Replaces spaces and special characters with underscores.
- Makes names easier to reference in code.

Output Table:

ae_term_severity	ae_serious
Headache:MILD	N
Nausea:MODERATE	N

Explanation:
- Cleaned column names are easier to use in code and less error-prone.

7. Practical Tips

Always check your separator—colons, dashes, or other characters can affect results.
Clean up column names after separating for consistency (e.g., use clean_names() from janitor).
Use separate_rows() if you want to split a column into multiple rows instead of columns (e.g., multiple concomitant medications in one field).
Use unite() to merge columns for reporting or export.

8. Conclusion

Separating and merging columns is essential for tidy, analysis-ready SDTM data.
Use separate() from tidyr to split columns based on a delimiter or position.
Use unite() to merge columns when needed.
Use clean_names() to standardize column names for easier coding.
Combine these with other tidyr and dplyr functions for powerful, readable data wrangling workflows.

Resource download links

2.3.4.-Separating-merging-cleaning-Columns.zip

⁂