2.3.4. Separating, merging, cleaning Columns
1. Introduction
In SDTM and clinical data, it's common to encounter columns that combine multiple pieces of information (e.g., AE term and severity, lab parameter and value). For tidy data and easier analysis, it's best to split these into separate columns. R's tidyr package provides separate() for splitting and unite() for merging columns. Cleaning column names is also essential for efficient coding.
2. Why Separate Columns?
- Ensures each variable has its own column (tidy data principle).
- Makes filtering, summarizing, and reporting easier.
- Enables more flexible and accurate SDTM/clinical analysis.
- Helps with data cleaning and preparation for regulatory submissions.
3. Example: SDTM AE Dataset with Combined Columns
Suppose you have an SDTM Adverse Events (AE) dataset where the AE term and severity are combined in a single column.
Input Table:
# Create the input table for AE dataset
ae <- data.frame(
USUBJID = c("01-701-101", "01-701-102", "01-701-103", "01-701-104"),
AE_TERM_SEV = c("Headache:MILD", "Nausea:MODERATE", "Vomiting:SEVERE", "Dizziness:MILD"),
AESER = c("N", "N", "Y", "N")
)
# Display the input table
print(ae)
| USUBJID | AE_TERM_SEV | AESER |
|---|---|---|
| 01-701-101 | Headache:MILD | N |
| 01-701-102 | Nausea:MODERATE | N |
| 01-701-103 | Vomiting:SEVERE | Y |
| 01-701-104 | Dizziness:MILD | N |
4. Separating a Combined Column with separate()
Use separate() to split a column into two or more columns based on a separator.
R Code:
library(tidyr)
ae %>%
separate(AE_TERM_SEV, into = c("AETERM", "AESEV"), sep = ":")
Output Table:
| USUBJID | AETERM | AESEV | AESER |
|---|---|---|---|
| 01-701-101 | Headache | MILD | N |
| 01-701-102 | Nausea | MODERATE | N |
| 01-701-103 | Vomiting | SEVERE | Y |
| 01-701-104 | Dizziness | MILD | N |
- Explanation:
- Splits
AE_TERM_SEVintoAETERMandAESEVat the colon. - Each variable is now in its own column for easier SDTM analysis.
- Splits
5. Beyond the Basics: Advanced SDTM Examples
Separate into more than two columns:
If your SDTM column contains more than two pieces of information, specify more column names ininto.Input Table:
# Create the input table for AE_COMBINED dataset ae <- data.frame( USUBJID = c("01-701-101", "01-701-102"), AE_COMBINED = c("Headache:MILD:2024-01-01", "Nausea:MODERATE:2024-01-02") ) # Display the input table print(ae)USUBJID AE_COMBINED 01-701-101 Headache:MILD:2024-01-01 01-701-102 Nausea:MODERATE:2024-01-02 R Code:
ae %>% separate(AE_COMBINED, into = c("AETERM", "AESEV", "AESTDTC"), sep = ":")ae %>%passes the AE dataset to the next function using the pipe.separate(AE_COMBINED, into = c("AETERM", "AESEV", "AESTDTC"), sep = ":"):- Splits the
AE_COMBINEDcolumn into three new columns:AETERM,AESEV, andAESTDTC. - The
sep = ":"argument tells R to split at each colon. - Each part is assigned to the corresponding new column.
- Splits the
Output Table:
USUBJID AETERM AESEV AESTDTC 01-701-101 Headache MILD 2024-01-01 01-701-102 Nausea MODERATE 2024-01-02
Split by position:
Usesep = c(8)to split after the 8th character, for fixed-width SDTM fields.Input Table:
# Create the input table for AE_FIXED dataset ae <- data.frame( USUBJID = c("01-701-101", "01-701-102"), AE_FIXED = c("HeadacheMILD", "Nausea MODERATE") ) # Display the input table print(ae)USUBJID AE_FIXED 01-701-101 HeadacheMILD 01-701-102 Nausea MODERATE R Code:
ae %>% separate(AE_FIXED, into = c("AETERM", "AESEV"), sep = c(8))ae %>%pipes the dataset toseparate().separate(AE_FIXED, into = c("AETERM", "AESEV"), sep = c(8)):- Splits the
AE_FIXEDcolumn intoAETERMandAESEVafter the 8th character. - Useful for fixed-width fields where the split position is known.
- Splits the
Output Table:
USUBJID AETERM AESEV 01-701-101 Headache MILD 01-701-102 Nausea MODERATE
Convert types automatically:
Useconvert = TRUEto automatically convert separated columns to numeric, integer, etc.Input Table:
# Create the input table for LB_COMBINED dataset lb <- data.frame( USUBJID = c("01-701-101", "01-701-102"), LB_COMBINED = c("HGB:13.5", "HGB:12.8") ) # Display the input table print(lb)USUBJID LB_COMBINED 01-701-101 HGB:13.5 01-701-102 HGB:12.8 R Code:
lb %>% separate(LB_COMBINED, into = c("PARAM", "VALUE"), sep = ":", convert = TRUE)lb %>%pipes the lab dataset toseparate().separate(LB_COMBINED, into = c("PARAM", "VALUE"), sep = ":", convert = TRUE):- Splits
LB_COMBINEDintoPARAMandVALUEat the colon. convert = TRUEautomatically convertsVALUEto numeric.
- Splits
Output Table:
USUBJID PARAM VALUE 01-701-101 HGB 13.5 01-701-102 HGB 12.8
Keep original column:
Useremove = FALSEto keep the original column after separation.Input Table:
# Create the input table for AE_TERM_SEV dataset ae <- data.frame( USUBJID = c("01-701-101"), AE_TERM_SEV = c("Headache:MILD") ) # Display the input table print(ae)USUBJID AE_TERM_SEV 01-701-101 Headache:MILD R Code:
ae %>% separate(AE_TERM_SEV, into = c("AETERM", "AESEV"), sep = ":", remove = FALSE)ae %>%pipes the dataset toseparate().separate(AE_TERM_SEV, into = c("AETERM", "AESEV"), sep = ":", remove = FALSE):- Splits
AE_TERM_SEVintoAETERMandAESEVat the colon. remove = FALSEkeeps the originalAE_TERM_SEVcolumn in the output.
- Splits
Output Table:
USUBJID AE_TERM_SEV AETERM AESEV 01-701-101 Headache:MILD Headache MILD
Unite columns:
Useunite()to combine multiple columns into one (the reverse ofseparate()).Input Table:
# Create the input table for AETERM and AESEV dataset ae <- data.frame( USUBJID = c("01-701-101", "01-701-102"), AETERM = c("Headache", "Nausea"), AESEV = c("MILD", "MODERATE") ) # Display the input table print(ae)USUBJID AETERM AESEV 01-701-101 Headache MILD 01-701-102 Nausea MODERATE R Code:
ae %>% unite("AE_TERM_SEV", AETERM, AESEV, sep = ":")ae %>%pipes the dataset tounite().unite("AE_TERM_SEV", AETERM, AESEV, sep = ":"):- Combines
AETERMandAESEVinto a single columnAE_TERM_SEVwith a colon separator. - Useful for creating a combined field for export or reporting.
- Combines
Output Table:
USUBJID AE_TERM_SEV 01-701-101 Headache:MILD 01-701-102 Nausea:MODERATE
6. Cleaning Column Names with clean_names()
Input Table:
# Create the input table for AE dataset with inconsistent column names
ae <- data.frame(
`AE Term/Severity` = c("Headache:MILD", "Nausea:MODERATE"),
`AE Serious?` = c("N", "N")
)
# Display the input table
print(ae)
| AE Term/Severity | AE Serious? |
|---|---|
| Headache:MILD | N |
| Nausea:MODERATE | N |
R Code:
library(janitor)
ae %>% clean_names()
ae %>%pipes the dataset toclean_names().clean_names():- Converts all column names to lowercase.
- Replaces spaces and special characters with underscores.
- Makes names easier to reference in code.
Output Table:
| ae_term_severity | ae_serious |
|---|---|
| Headache:MILD | N |
| Nausea:MODERATE | N |
- Explanation:
- Cleaned column names are easier to use in code and less error-prone.
7. Practical Tips
- Always check your separator—colons, dashes, or other characters can affect results.
- Clean up column names after separating for consistency (e.g., use
clean_names()from janitor). - Use
separate_rows()if you want to split a column into multiple rows instead of columns (e.g., multiple concomitant medications in one field). - Use
unite()to merge columns for reporting or export.
8. Conclusion
- Separating and merging columns is essential for tidy, analysis-ready SDTM data.
- Use
separate()from tidyr to split columns based on a delimiter or position. - Use
unite()to merge columns when needed. - Use
clean_names()to standardize column names for easier coding. - Combine these with other tidyr and dplyr functions for powerful, readable data wrangling workflows.
**Resource download links**
2.3.4.-Separating-merging-cleaning-Columns.zip