2.4.1. Factors
1. Introduction
Factors are R's data structure for representing categorical data. In SDTM and clinical data, categorical variables typically have a fixed set of possible values (levels), such as VISIT names, laboratory test names (LBTEST), or vital sign test names (VSTEST). Understanding and working with factors is essential for data analysis, visualization, and regulatory reporting in R.
2. Why Use Factors?
- Ensure SDTM categorical variables (e.g., VISIT, LBTEST, VSTEST) have a defined set and order of possible values.
- Enable correct sorting, grouping, and plotting of categorical data.
- Improve clarity and reproducibility in SDTM data analysis.
- Let statistical modeling functions in R (for example, `lm()` and `glm()`) treat a variable as categorical, with the first level as the default reference category.
3. Creating and Understanding Factors
- Factors have two components: the observed values (stored internally as integer codes) and the set of possible levels.
- You can create a factor from a character vector and specify the levels.
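As a minimal sketch of these two components, base R lets you inspect the levels and the stored codes separately. The visit values below are illustrative only and are not taken from a real dataset.

```r
# Illustrative values only; "WEEK 2" is a possible level that is not observed
x <- factor(c("WEEK 1", "BASELINE", "WEEK 1"),
            levels = c("BASELINE", "WEEK 1", "WEEK 2"))

levels(x)       # the possible values: "BASELINE" "WEEK 1" "WEEK 2"
as.integer(x)   # the stored codes, i.e. positions in levels(x): 2 1 2
nlevels(x)      # 3, even though only two levels are observed
table(x)        # unobserved levels are still tabulated, with a count of 0
```

Keeping unobserved levels is often what you want in clinical summaries, because a visit with zero records still appears in the table.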
Input Table:
| USUBJID | VISIT |
|---|---|
| 01-701-101 | SCREENING |
| 01-701-102 | WEEK 1 |
| 01-701-103 | BASELINE |
| 01-701-104 | WEEK 2 |
| 01-701-105 | WEEK 1 |
R Code:
# All possible visits
all_visits <- c("SCREENING", "BASELINE", "WEEK 1", "WEEK 2")
# Observed data
visit_vals <- c("SCREENING", "WEEK 1", "BASELINE", "WEEK 2", "WEEK 1")
sort(visit_vals)
- `all_visits` defines all possible visit names.
- `visit_vals` is a character vector of observed visits.
- `sort(visit_vals)` sorts the character vector alphabetically.
Output:
> print(sort(visit_vals))
[1] "BASELINE" "SCREENING" "WEEK 1" "WEEK 1" "WEEK 2"
- Explanation:
- Sorting a character vector results in alphabetical order, not protocol-defined visit order.
4. Creating a Factor with Levels
R Code:
visit_factor <- factor(visit_vals, levels = all_visits)
visit_factor
sort(visit_factor)
- `factor(visit_vals, levels = all_visits)` creates a factor with specified levels (protocol order).
- `sort(visit_factor)` sorts the factor according to the defined levels.
Output:
> print(visit_factor)
[1] SCREENING WEEK 1 BASELINE WEEK 2 WEEK 1
Levels: SCREENING BASELINE WEEK 1 WEEK 2
> print(sort(visit_factor))
[1] SCREENING BASELINE WEEK 1 WEEK 1 WEEK 2
Levels: SCREENING BASELINE WEEK 1 WEEK 2
- Explanation:
- The factor respects the protocol-defined order for sorting.
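One consequence of fixing the levels up front is that any value not listed in `levels` is converted to NA without a warning. The sketch below uses a deliberately inconsistent visit name (a made-up value for illustration) to show a simple check you might run before converting.

```r
all_visits <- c("SCREENING", "BASELINE", "WEEK 1", "WEEK 2")
# "Week 1" (mixed case) is a hypothetical data-entry inconsistency
raw_visits <- c("SCREENING", "Week 1", "BASELINE")

# Values that would be lost when converting to a factor
unexpected <- setdiff(unique(raw_visits), all_visits)
unexpected

# The unmatched value becomes NA without a warning
visit_checked <- factor(raw_visits, levels = all_visits)
is.na(visit_checked)
```

Running a check like `setdiff()` first makes the mismatch visible instead of letting it disappear into missing values.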
5. Working with Factors Using forcats
- The `forcats` package provides many useful functions for manipulating factors in SDTM and clinical data.
Example 1: Reordering by Frequency (VSTEST)
Suppose you have a vital signs dataset:
| USUBJID | VSTEST |
|---|---|
| 01-701-101 | PULSE |
| 01-701-102 | SYSBP |
| 01-701-103 | DIABP |
| 01-701-104 | PULSE |
| 01-701-105 | SYSBP |
R Code:
library(forcats)
vs_factor <- factor(c("PULSE", "SYSBP", "DIABP", "PULSE", "SYSBP"))
vs_factor_freq <- fct_infreq(vs_factor)
levels(vs_factor_freq)
- `fct_infreq(vs_factor)` reorders VSTEST levels by decreasing frequency.
Output:
> vs_factor_freq
[1] PULSE SYSBP DIABP PULSE SYSBP
Levels: PULSE SYSBP DIABP
> print(levels(vs_factor_freq))
[1] "PULSE" "SYSBP" "DIABP"
- Explanation:
- The most frequent vital sign tests come first. PULSE and SYSBP are tied at two records each, and DIABP (one record) comes last.
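For comparison, roughly the same ordering can be obtained in base R by tabulating the values and sorting the counts. This is only a sketch of the idea, not a description of how `fct_infreq()` is implemented, and it makes no claim about the relative order of tied levels (PULSE and SYSBP here).

```r
vs_vals <- c("PULSE", "SYSBP", "DIABP", "PULSE", "SYSBP")

# Count each test, then sort the counts from most to least frequent
counts <- sort(table(vs_vals), decreasing = TRUE)
counts

# Use the sorted names as the level order
vs_by_freq <- factor(vs_vals, levels = names(counts))
levels(vs_by_freq)
```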
Example 2: Lumping Rare Categories (LBTESTCD)
Suppose you have a lab test code variable:
| USUBJID | LBTESTCD |
|---|---|
| 01-701-101 | HGB |
| 01-701-102 | ALT |
| 01-701-103 | AST |
| 01-701-104 | HGB |
| 01-701-105 | ALP |
R Code:
lb_factor <- factor(c("HGB", "ALT", "AST", "HGB", "ALP"))
lb_lumped <- fct_lump(lb_factor, n = 2)
table(lb_lumped)
- `fct_lump(lb_factor, n = 2)` keeps the two most frequent tests and lumps the remaining ones into "Other", provided that ties do not prevent it.
Output:
> table(lb_lumped)
lb_lumped
ALP ALT AST HGB
1 1 1 2
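- Explanation:
- HGB (2 records) is the most frequent test, while ALT, AST, and ALP (1 record each) tie for second place. With the default `ties.method = "min"`, all tied levels are kept, so nothing is lumped here and no "Other" level appears in the output.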
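To see lumping actually take effect on this data, the sketch below lowers the threshold to a single level. It assumes forcats 0.5.0 or later, where `fct_lump_n()` is available; with older versions, `fct_lump(lb_factor, n = 1)` should behave the same way.

```r
library(forcats)

lb_factor <- factor(c("HGB", "ALT", "AST", "HGB", "ALP"))

# Keep only the single most frequent test; HGB has no tie for first place
lb_top1 <- fct_lump_n(lb_factor, n = 1)
table(lb_top1)
```

Here ALT, AST, and ALP are combined into "Other" (3 records), and HGB remains a level of its own (2 records).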
Example 3: Reordering by Median Value (VSTEST and VSORRES)
Suppose you want to order vital sign tests by their median result, which is useful for plotting or reporting (e.g., boxplots where you want the lowest-median test on the left and the highest on the right).
Input Table:
| USUBJID | VSTEST | VSORRES |
|---|---|---|
| 01-701-101 | PULSE | 70 |
| 01-701-102 | SYSBP | 120 |
| 01-701-103 | DIABP | 80 |
| 01-701-104 | PULSE | 75 |
| 01-701-105 | SYSBP | 130 |
R Code:
library(dplyr)
library(forcats)
vs <- data.frame(
  USUBJID = c("01-701-101", "01-701-102", "01-701-103", "01-701-104", "01-701-105"),
  VSTEST = c("PULSE", "SYSBP", "DIABP", "PULSE", "SYSBP"),
  VSORRES = c(70, 120, 80, 75, 130)
)
# Reorder VSTEST by median VSORRES
vs$VSTEST <- fct_reorder(vs$VSTEST, vs$VSORRES, .fun = median)
levels(vs$VSTEST)
# To see the effect, sort by the new factor order
vs[order(vs$VSTEST), ]
- `fct_reorder(VSTEST, VSORRES, .fun = median)` computes the median VSORRES for each VSTEST and orders the factor levels accordingly.
- This is especially useful for visualizations, as it ensures the plot order reflects the central tendency of each test.
Output Table (levels):
| Ordered VSTEST Levels |
|---|
| PULSE |
| DIABP |
| SYSBP |
Output Table (data sorted by new factor order):
| USUBJID | VSTEST | VSORRES |
|---|---|---|
| 01-701-101 | PULSE | 70 |
| 01-701-104 | PULSE | 75 |
| 01-701-103 | DIABP | 80 |
| 01-701-102 | SYSBP | 120 |
| 01-701-105 | SYSBP | 130 |
- Explanation:
- The median for PULSE is 72.5, for DIABP is 80, and for SYSBP is 125.
- The factor levels are ordered from lowest to highest median (PULSE, DIABP, SYSBP).
- This ordering is reflected in plots (e.g., boxplots) and summaries, making comparisons more intuitive.
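As a quick cross-check of the medians quoted above, and a sketch of the reverse ordering, the block below uses base R's `tapply()` alongside the `.desc` argument of `fct_reorder()`.

```r
library(forcats)

vs <- data.frame(
  VSTEST = c("PULSE", "SYSBP", "DIABP", "PULSE", "SYSBP"),
  VSORRES = c(70, 120, 80, 75, 130)
)

# Median result per test, computed independently of forcats
medians <- tapply(vs$VSORRES, vs$VSTEST, median)
medians

# Highest median first, e.g. for a plot that reads from high to low
vstest_desc <- fct_reorder(vs$VSTEST, vs$VSORRES, .fun = median, .desc = TRUE)
levels(vstest_desc)
```

With `.desc = TRUE` the levels come out as SYSBP, DIABP, PULSE, the reverse of the ascending order shown earlier.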
Example 4: Collapsing Categories (VSTEST)
Suppose you want to group related vital signs:
| VSTEST |
|---|
| PULSE |
| SYSBP |
| DIABP |
| PULSE |
| SYSBP |
R Code:
vs_factor <- factor(c("PULSE", "SYSBP", "DIABP", "PULSE", "SYSBP"))
vs_collapsed <- fct_collapse(vs_factor, BP = c("SYSBP", "DIABP"), HR = "PULSE")
table(vs_collapsed)
- `fct_collapse()` groups SYSBP and DIABP as "BP" and PULSE as "HR".
Output:
> print(table(vs_collapsed))
vs_collapsed
BP HR
3 2
- Explanation:
- Useful for summarizing or simplifying categories.
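Base R offers a comparable idiom: assigning a named list to `levels()` merges the listed levels under each name. It is shown here as an alternative sketch, not as a replacement for `fct_collapse()`.

```r
vs_factor <- factor(c("PULSE", "SYSBP", "DIABP", "PULSE", "SYSBP"))

vs_grouped <- vs_factor
# Each list name becomes a new level made up of the old levels it lists
levels(vs_grouped) <- list(BP = c("SYSBP", "DIABP"), HR = "PULSE")
table(vs_grouped)
```

Working on a copy (`vs_grouped`) leaves the original factor available for other summaries.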
6. Beyond the Basics
Manually Set Level Order
Suppose you want to set a custom order for your VISIT levels, for example, to display "WEEK 2" first, followed by "BASELINE", "WEEK 1", and "SCREENING".
Input Table:
| visit_factor |
|---|
| SCREENING |
| WEEK 1 |
| BASELINE |
| WEEK 2 |
| WEEK 1 |
R Code:
visit_factor <- factor(c("SCREENING", "WEEK 1", "BASELINE", "WEEK 2", "WEEK 1"),
levels = c("SCREENING", "BASELINE", "WEEK 1", "WEEK 2"))
# Manually set custom order
visit_factor_custom <- fct_relevel(visit_factor, "WEEK 2", "BASELINE", "WEEK 1", "SCREENING")
levels(visit_factor_custom)
sort(visit_factor_custom)
- `fct_relevel(visit_factor, ...)` sets the order of levels as specified.
- `sort(visit_factor_custom)` sorts the factor according to the new custom order.
Output:
> print(levels(visit_factor_custom))
[1] "WEEK 2" "BASELINE" "WEEK 1" "SCREENING"
> print(sort(visit_factor_custom))
[1] WEEK 2 BASELINE WEEK 1 WEEK 1 SCREENING
Levels: WEEK 2 BASELINE WEEK 1 SCREENING
- Explanation:
- The levels are now in the custom order you specified, which is useful for custom reporting or plotting.
Move Levels to the Front
visit_relevel <- fct_relevel(visit_factor, "WEEK 1", "WEEK 2", after = 0)
sort(visit_relevel)
- Moves "WEEK 1" and "WEEK 2" to the front.
Output:
> sort(visit_relevel)
[1] WEEK 1 WEEK 1 WEEK 2 SCREENING BASELINE
Levels: WEEK 1 WEEK 2 SCREENING BASELINE
- Explanation:
- Levels are reordered as specified.
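The `after` argument also works in the other direction. As a small sketch, `after = Inf` moves a level to the end, which can be handy when one category should always be listed last.

```r
library(forcats)

visit_factor <- factor(
  c("SCREENING", "WEEK 1", "BASELINE", "WEEK 2", "WEEK 1"),
  levels = c("SCREENING", "BASELINE", "WEEK 1", "WEEK 2")
)

# after = Inf places the named level(s) after all others
visit_scr_last <- fct_relevel(visit_factor, "SCREENING", after = Inf)
levels(visit_scr_last)
```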
Rename Levels
visit_recoded <- fct_recode(visit_factor, SCR = "SCREENING", BL = "BASELINE", WK1 = "WEEK 1", WK2 = "WEEK 2")
Output:
> print(levels(visit_recoded))
[1] "SCR" "BL" "WK1" "WK2"
- Renames levels for brevity. The result is stored in a new object so that visit_factor keeps its original level names for the steps that follow.
Collapse Levels
visit_collapsed <- fct_collapse(visit_factor, EARLY = c("SCREENING", "BASELINE"))
- Groups "SCREENING" and "BASELINE" into "EARLY".
Output:
> print(levels(visit_collapsed))
[1] "EARLY" "WEEK 1" "WEEK 2"
- Explanation:
- "SCREENING" and "BASELINE" are merged into a single "EARLY" level; "WEEK 1" and "WEEK 2" are unchanged.
Drop Unused Levels
visit_factor <- fct_drop(visit_factor)
- Removes levels not present in the data.
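Dropping only has a visible effect when some level is no longer observed. A sketch of such a case is a subset that excludes one visit (the subset below is illustrative).

```r
library(forcats)

visit_factor <- factor(
  c("SCREENING", "WEEK 1", "BASELINE", "WEEK 2", "WEEK 1"),
  levels = c("SCREENING", "BASELINE", "WEEK 1", "WEEK 2")
)

# Subsetting keeps the full set of levels, including the now-unused "WEEK 2"
no_week2 <- visit_factor[visit_factor != "WEEK 2"]
levels(no_week2)

# fct_drop() removes the unused level; base R's droplevels() does the same here
levels(fct_drop(no_week2))
levels(droplevels(no_week2))
```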
Count Occurrences
fct_count(visit_factor)
- Returns a tibble with counts for each level.
Output:
> print(fct_count(visit_factor))
# A tibble: 4 × 2
f n
<fct> <int>
1 SCREENING 1
2 BASELINE 1
3 WEEK 1 2
4 WEEK 2 1
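`fct_count()` can also sort the result and add proportions. The sketch below uses its `sort` and `prop` arguments on the same visit data.

```r
library(forcats)

visit_factor <- factor(
  c("SCREENING", "WEEK 1", "BASELINE", "WEEK 2", "WEEK 1"),
  levels = c("SCREENING", "BASELINE", "WEEK 1", "WEEK 2")
)

# sort = TRUE puts the most frequent level first;
# prop = TRUE adds a column p with each level's share of the records
visit_counts <- fct_count(visit_factor, sort = TRUE, prop = TRUE)
visit_counts
```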
Keep Order of Appearance
visit_inorder <- fct_inorder(c("BASELINE", "WEEK 1", "WEEK 2", "BASELINE"))
sort(visit_inorder)
- Levels remain in the order they first appear.
Output:
> print(sort(visit_inorder))
[1] BASELINE BASELINE WEEK 1 WEEK 2
Levels: BASELINE WEEK 1 WEEK 2
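- Explanation:
- Levels follow the order of first appearance. In this example that order happens to match alphabetical order, so the result looks the same as a plain `factor()` call; the two differ whenever the first observed value is not alphabetically first.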
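The difference becomes visible when the data do not start with the alphabetically first value. The sketch below uses a made-up visit sequence and the base R counterpart of `fct_inorder()`, which is to pass `unique()` values as the levels.

```r
visits <- c("WEEK 2", "BASELINE", "WEEK 1", "BASELINE")

# Default: levels are sorted alphabetically
levels(factor(visits))

# Order of first appearance, the base R counterpart of fct_inorder()
levels(factor(visits, levels = unique(visits)))
```

The first call gives BASELINE, WEEK 1, WEEK 2; the second gives WEEK 2, BASELINE, WEEK 1.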
7. Advanced Factoring Example: LBTEST in Laboratory Data
Suppose you have a laboratory dataset:
| USUBJID | LBTEST |
|---|---|
| 01-701-101 | HGB |
| 01-701-102 | ALT |
| 01-701-103 | HGB |
| 01-701-104 | AST |
| 01-701-105 | ALT |
Tabulate Frequencies
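library(janitor)
# Build the example data shown in the table above
lb <- data.frame(LBTEST = c("HGB", "ALT", "HGB", "AST", "ALT"))
tabyl(lb$LBTEST)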
- Shows frequency and proportion of each test.
Output Table:
| LBTEST | n | percent |
|---|---|---|
| ALT | 2 | 0.4 |
| AST | 1 | 0.2 |
| HGB | 2 | 0.4 |
Order by Frequency
library(forcats)
lb$LBTEST <- fct_infreq(lb$LBTEST)
levels(lb$LBTEST)
- Reorders LBTEST levels by frequency.
Output:
> print(levels(lb$LBTEST))
[1] "ALT" "HGB" "AST"
Reverse Order
fct_rev(fct_infreq(lb$LBTEST)) %>% head()
- Reverses the order of factor levels.
Output:
> print(fct_rev(fct_infreq(lb$LBTEST)))
[1] HGB ALT HGB AST ALT
Levels: AST HGB ALT
Order by Numeric Mapping (e.g., for ranking)
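lb %>%
  mutate(LBTEST_ord = fct_reorder(LBTEST, as.numeric(LBTEST))) %>%
  arrange(LBTEST_ord)
- Reorders LBTEST by a numeric vector. Assuming LBTEST is already a factor (as it is after the previous step), `as.numeric(LBTEST)` returns the integer codes of its existing levels, so the level order is unchanged here. In practice, supply a meaningful numeric variable (for example, a rank or a result value) as the second argument.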
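A sketch of the same call with an actual numeric variable follows. The LBSTRESN values and the `lb_demo` data frame are invented for illustration and are not part of the dataset above.

```r
library(forcats)

lb_demo <- data.frame(
  LBTEST = c("HGB", "ALT", "HGB", "AST", "ALT"),
  LBSTRESN = c(14, 30, 13, 25, 40)  # hypothetical result values
)

# Order tests by their median result (median is the default summary in fct_reorder)
lb_demo$LBTEST_ord <- fct_reorder(lb_demo$LBTEST, lb_demo$LBSTRESN)
levels(lb_demo$LBTEST_ord)
```

With these made-up values the medians are 13.5 (HGB), 25 (AST), and 35 (ALT), so the levels come out in that order.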
Group and Rename Levels
lb %>%
  mutate(LBTEST_group = fct_recode(LBTEST,
    "LIVER" = "ALT",
    "LIVER" = "AST",
    "BLOOD" = "HGB"
  )) %>%
  tabyl(LBTEST_group)
- Combines and renames levels.
Output:
 LBTEST_group n percent
        LIVER 3     0.6
        BLOOD 2     0.4
Convert to Binary Factor
lb %>%
  mutate(LBTEST_BIN = ifelse(LBTEST == "HGB", "Blood", "Liver"),
         LBTEST_BIN = factor(LBTEST_BIN)) %>%
  tabyl(LBTEST_BIN)
- Converts LBTEST to a binary factor.
Output:
 LBTEST_BIN n percent
      Blood 2     0.4
      Liver 3     0.6
- Explanation:
- Each R code block is explained with input and output tables, showing how factors are created, manipulated, and summarized for SDTM categorical data.
8. Conclusion
- Factors are essential for handling categorical data in R and SDTM.
- Always define levels explicitly for correct ordering and analysis.
- Use the `forcats` package for advanced factor manipulation.
- Proper use of factors improves data quality, analysis, and regulatory reporting.