contact@a2zlearners.com

2.3.4. Separating, merging, cleaning Columns

1. Introduction

In SDTM and clinical data, it's common to encounter columns that combine multiple pieces of information (e.g., AE term and severity, lab parameter and value). For tidy data and easier analysis, it's best to split these into separate columns. R's tidyr package provides separate() for splitting and unite() for merging columns. Cleaning column names is also essential for efficient coding.


2. Why Separate Columns?

  • Ensures each variable has its own column (tidy data principle).
  • Makes filtering, summarizing, and reporting easier.
  • Enables more flexible and accurate SDTM/clinical analysis.
  • Helps with data cleaning and preparation for regulatory submissions.

3. Example: SDTM AE Dataset with Combined Columns

Suppose you have an SDTM Adverse Events (AE) dataset where the AE term and severity are combined in a single column.

Input Table:

# Create the input table for AE dataset
ae <- data.frame(
  USUBJID = c("01-701-101", "01-701-102", "01-701-103", "01-701-104"),
  AE_TERM_SEV = c("Headache:MILD", "Nausea:MODERATE", "Vomiting:SEVERE", "Dizziness:MILD"),
  AESER = c("N", "N", "Y", "N")
)

# Display the input table
print(ae)
USUBJID AE_TERM_SEV AESER
01-701-101 Headache:MILD N
01-701-102 Nausea:MODERATE N
01-701-103 Vomiting:SEVERE Y
01-701-104 Dizziness:MILD N

4. Separating a Combined Column with separate()

Use separate() to split a column into two or more columns based on a separator.

R Code:

library(tidyr)
ae %>%
  separate(AE_TERM_SEV, into = c("AETERM", "AESEV"), sep = ":")

Output Table:

USUBJID AETERM AESEV AESER
01-701-101 Headache MILD N
01-701-102 Nausea MODERATE N
01-701-103 Vomiting SEVERE Y
01-701-104 Dizziness MILD N
  • Explanation:
    • Splits AE_TERM_SEV into AETERM and AESEV at the colon.
    • Each variable is now in its own column for easier SDTM analysis.

5. Beyond the Basics: Advanced SDTM Examples

  • Separate into more than two columns:
    If your SDTM column contains more than two pieces of information, specify more column names in into.

    Input Table:

    # Create the input table for AE_COMBINED dataset
    ae <- data.frame(
      USUBJID = c("01-701-101", "01-701-102"),
      AE_COMBINED = c("Headache:MILD:2024-01-01", "Nausea:MODERATE:2024-01-02")
    )
    
    # Display the input table
    print(ae)
    
    USUBJID AE_COMBINED
    01-701-101 Headache:MILD:2024-01-01
    01-701-102 Nausea:MODERATE:2024-01-02

    R Code:

    ae %>% separate(AE_COMBINED, into = c("AETERM", "AESEV", "AESTDTC"), sep = ":")
    
    • ae %>% passes the AE dataset to the next function using the pipe.
    • separate(AE_COMBINED, into = c("AETERM", "AESEV", "AESTDTC"), sep = ":"):
      • Splits the AE_COMBINED column into three new columns: AETERM, AESEV, and AESTDTC.
      • The sep = ":" argument tells R to split at each colon.
      • Each part is assigned to the corresponding new column.

    Output Table:

    USUBJID AETERM AESEV AESTDTC
    01-701-101 Headache MILD 2024-01-01
    01-701-102 Nausea MODERATE 2024-01-02

  • Split by position:
    Use sep = c(8) to split after the 8th character, for fixed-width SDTM fields.

    Input Table:

    # Create the input table for AE_FIXED dataset
    ae <- data.frame(
      USUBJID = c("01-701-101", "01-701-102"),
      AE_FIXED = c("HeadacheMILD", "Nausea  MODERATE")
    )
    
    # Display the input table
    print(ae)
    
    USUBJID AE_FIXED
    01-701-101 HeadacheMILD
    01-701-102 Nausea MODERATE

    R Code:

    ae %>% separate(AE_FIXED, into = c("AETERM", "AESEV"), sep = c(8))
    
    • ae %>% pipes the dataset to separate().
    • separate(AE_FIXED, into = c("AETERM", "AESEV"), sep = c(8)):
      • Splits the AE_FIXED column into AETERM and AESEV after the 8th character.
      • Useful for fixed-width fields where the split position is known.

    Output Table:

    USUBJID AETERM AESEV
    01-701-101 Headache MILD
    01-701-102 Nausea MODERATE

  • Convert types automatically:
    Use convert = TRUE to automatically convert separated columns to numeric, integer, etc.

    Input Table:

    # Create the input table for LB_COMBINED dataset
    lb <- data.frame(
      USUBJID = c("01-701-101", "01-701-102"),
      LB_COMBINED = c("HGB:13.5", "HGB:12.8")
    )
    
    # Display the input table
    print(lb)
    
    USUBJID LB_COMBINED
    01-701-101 HGB:13.5
    01-701-102 HGB:12.8

    R Code:

    lb %>% separate(LB_COMBINED, into = c("PARAM", "VALUE"), sep = ":", convert = TRUE)
    
    • lb %>% pipes the lab dataset to separate().
    • separate(LB_COMBINED, into = c("PARAM", "VALUE"), sep = ":", convert = TRUE):
      • Splits LB_COMBINED into PARAM and VALUE at the colon.
      • convert = TRUE automatically converts VALUE to numeric.

    Output Table:

    USUBJID PARAM VALUE
    01-701-101 HGB 13.5
    01-701-102 HGB 12.8

  • Keep original column:
    Use remove = FALSE to keep the original column after separation.

    Input Table:

    # Create the input table for AE_TERM_SEV dataset
    ae <- data.frame(
      USUBJID = c("01-701-101"),
      AE_TERM_SEV = c("Headache:MILD")
    )
    
    # Display the input table
    print(ae)
    
    USUBJID AE_TERM_SEV
    01-701-101 Headache:MILD

    R Code:

    ae %>% separate(AE_TERM_SEV, into = c("AETERM", "AESEV"), sep = ":", remove = FALSE)
    
    • ae %>% pipes the dataset to separate().
    • separate(AE_TERM_SEV, into = c("AETERM", "AESEV"), sep = ":", remove = FALSE):
      • Splits AE_TERM_SEV into AETERM and AESEV at the colon.
      • remove = FALSE keeps the original AE_TERM_SEV column in the output.

    Output Table:

    USUBJID AE_TERM_SEV AETERM AESEV
    01-701-101 Headache:MILD Headache MILD

  • Unite columns:
    Use unite() to combine multiple columns into one (the reverse of separate()).

    Input Table:

    # Create the input table for AETERM and AESEV dataset
    ae <- data.frame(
      USUBJID = c("01-701-101", "01-701-102"),
      AETERM = c("Headache", "Nausea"),
      AESEV = c("MILD", "MODERATE")
    )
    
    # Display the input table
    print(ae)
    
    USUBJID AETERM AESEV
    01-701-101 Headache MILD
    01-701-102 Nausea MODERATE

    R Code:

    ae %>% unite("AE_TERM_SEV", AETERM, AESEV, sep = ":")
    
    • ae %>% pipes the dataset to unite().
    • unite("AE_TERM_SEV", AETERM, AESEV, sep = ":"):
      • Combines AETERM and AESEV into a single column AE_TERM_SEV with a colon separator.
      • Useful for creating a combined field for export or reporting.

    Output Table:

    USUBJID AE_TERM_SEV
    01-701-101 Headache:MILD
    01-701-102 Nausea:MODERATE

6. Cleaning Column Names with clean_names()

Input Table:

# Create the input table for AE dataset with inconsistent column names
ae <- data.frame(
  `AE Term/Severity` = c("Headache:MILD", "Nausea:MODERATE"),
  `AE Serious?` = c("N", "N")
)

# Display the input table
print(ae)
AE Term/Severity AE Serious?
Headache:MILD N
Nausea:MODERATE N

R Code:

library(janitor)
ae %>% clean_names()
  • ae %>% pipes the dataset to clean_names().
  • clean_names():
    • Converts all column names to lowercase.
    • Replaces spaces and special characters with underscores.
    • Makes names easier to reference in code.

Output Table:

ae_term_severity ae_serious
Headache:MILD N
Nausea:MODERATE N
  • Explanation:
    • Cleaned column names are easier to use in code and less error-prone.

7. Practical Tips

  • Always check your separator—colons, dashes, or other characters can affect results.
  • Clean up column names after separating for consistency (e.g., use clean_names() from janitor).
  • Use separate_rows() if you want to split a column into multiple rows instead of columns (e.g., multiple concomitant medications in one field).
  • Use unite() to merge columns for reporting or export.

8. Conclusion

  • Separating and merging columns is essential for tidy, analysis-ready SDTM data.
  • Use separate() from tidyr to split columns based on a delimiter or position.
  • Use unite() to merge columns when needed.
  • Use clean_names() to standardize column names for easier coding.
  • Combine these with other tidyr and dplyr functions for powerful, readable data wrangling workflows.

**Resource download links**

2.3.4.-Separating-merging-cleaning-Columns.zip