2.4.1. Factors

1. Introduction

Factors are R's data structure for representing categorical data. In SDTM and clinical data, categorical variables have a fixed set of possible values (levels), such as VISIT names, laboratory test names (LBTEST), or vital sign test names (VSTEST). Understanding and working with factors is essential for data analysis, visualization, and regulatory reporting in R.

2. Why Use Factors?

- Ensure SDTM categorical variables (e.g., VISIT, LBTEST, VSTEST) have a defined set and order of possible values.
- Enable correct sorting, grouping, and plotting of categorical data.
- Improve clarity and reproducibility in SDTM data analysis.
- Provide the categorical encoding that many statistical modeling and reporting functions in R expect.

3. Creating and Understanding Factors

A factor has two components: the observed values, stored internally as integer codes, and the set of allowed levels those codes refer to.
You can create a factor from a character vector and specify the levels explicitly.

Input Table:

USUBJID	VISIT
01-701-101	SCREENING
01-701-102	WEEK 1
01-701-103	BASELINE
01-701-104	WEEK 2
01-701-105	WEEK 1

R Code:

# All possible visits
all_visits <- c("SCREENING", "BASELINE", "WEEK 1", "WEEK 2")
# Observed data
visit_vals <- c("SCREENING", "WEEK 1", "BASELINE", "WEEK 2", "WEEK 1")
sort(visit_vals)

all_visits defines all possible visit names.
visit_vals is a character vector of observed visits.
sort(visit_vals) sorts the character vector alphabetically.

Output:

> print(sort(visit_vals))
[1] "BASELINE"  "SCREENING" "WEEK 1"    "WEEK 1"    "WEEK 2"

Explanation:
- Sorting a character vector results in alphabetical order, not protocol-defined visit order.

4. Creating a Factor with Levels

R Code:

visit_factor <- factor(visit_vals, levels = all_visits)
visit_factor
sort(visit_factor)

factor(visit_vals, levels = all_visits) creates a factor with specified levels (protocol order).
sort(visit_factor) sorts the factor according to the defined levels.

Output:

> print(visit_factor)
[1] SCREENING WEEK 1    BASELINE  WEEK 2    WEEK 1   
Levels: SCREENING BASELINE WEEK 1 WEEK 2
> print(sort(visit_factor))
[1] SCREENING BASELINE  WEEK 1    WEEK 1    WEEK 2   
Levels: SCREENING BASELINE WEEK 1 WEEK 2

Explanation:
- The factor respects the protocol-defined order for sorting.
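
To make the two components of a factor concrete, the following base R sketch inspects the levels and the underlying integer codes of the factor built above, and shows what happens to a value that is not among the declared levels. The value "WEEK 3" is a hypothetical entry added purely for illustration.

```r
# Rebuild the factor from this section
all_visits <- c("SCREENING", "BASELINE", "WEEK 1", "WEEK 2")
visit_vals <- c("SCREENING", "WEEK 1", "BASELINE", "WEEK 2", "WEEK 1")
visit_factor <- factor(visit_vals, levels = all_visits)

# Component 1: the allowed levels, in protocol order
levels(visit_factor)

# Component 2: each observation stored as the position of its level
visit_codes <- as.integer(visit_factor)
visit_codes  # 1 3 2 4 3

# A value outside the declared levels ("WEEK 3" is hypothetical) becomes NA
with_unknown <- factor(c("SCREENING", "WEEK 3"), levels = all_visits)
is.na(with_unknown)  # FALSE TRUE
```

Checking for NA right after the conversion is a cheap way to catch visit names that are misspelled or missing from the level list.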

5. Working with Factors Using forcats

The forcats package provides many useful functions for manipulating factors in SDTM and clinical data.

Example 1: Reordering by Frequency (VSTEST)

Suppose you have a vital signs dataset:

USUBJID	VSTEST
01-701-101	PULSE
01-701-102	SYSBP
01-701-103	DIABP
01-701-104	PULSE
01-701-105	SYSBP

R Code:

library(forcats)
vs_factor <- factor(c("PULSE", "SYSBP", "DIABP", "PULSE", "SYSBP"))
vs_factor_freq <- fct_infreq(vs_factor)
levels(vs_factor_freq)

fct_infreq(vs_factor) reorders VSTEST levels by frequency.

Output:

> vs_factor_freq
[1] PULSE SYSBP DIABP PULSE SYSBP
Levels: PULSE SYSBP DIABP
> print(levels(vs_factor_freq))
[1] "PULSE" "SYSBP" "DIABP"

Explanation:
- Most frequent vital sign tests come first.
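
In this data PULSE and SYSBP each occur twice, so the order between them is decided by a tie. The base R sketch below reproduces the same level order by counting and sorting by hand. It is only an illustration of the idea, not how forcats implements fct_infreq(); the object names are assumptions made for this example.

```r
vs_factor <- factor(c("PULSE", "SYSBP", "DIABP", "PULSE", "SYSBP"))

# Count records per level (levels start out alphabetical: DIABP, PULSE, SYSBP)
counts <- table(vs_factor)

# order() keeps tied entries in their existing order, so PULSE stays ahead of SYSBP
levels_by_freq <- names(counts)[order(-as.integer(counts))]

# Rebuild the factor with the frequency-based level order
vs_by_freq <- factor(vs_factor, levels = levels_by_freq)
levels(vs_by_freq)  # "PULSE" "SYSBP" "DIABP"
```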

Example 2: Lumping Rare Categories (LBTESTCD)

Suppose you have a lab test code variable:

USUBJID	LBTESTCD
01-701-101	HGB
01-701-102	ALT
01-701-103	AST
01-701-104	HGB
01-701-105	ALP

R Code:

lb_factor <- factor(c("HGB", "ALT", "AST", "HGB", "ALP"))
lb_lumped <- fct_lump(lb_factor, n = 2)
table(lb_lumped)

fct_lump(lb_factor, n = 2) asks for the two most frequent tests to be kept as separate levels and the rest to be lumped into "Other"; levels that tie at the cutoff are all kept.

Output:

> table(lb_lumped)
lb_lumped
ALP ALT AST HGB 
  1   1   1   2

Explanation:
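- HGB is the most frequent test, while ALT, AST, and ALP tie for second place with one record each. Because all three tied levels are kept, nothing is lumped into "Other" in this small dataset.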
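
The table above still shows four levels because of ties in the counts. The base R sketch below mimics the ranking idea with a hypothetical helper, lump_top_n(), to show both the tied case and a case where lumping does occur. It is a simplification written for this example, not the forcats implementation, and the second dataset is invented for illustration.

```r
# Hypothetical helper: keep the n most frequent values, lump the rest
lump_top_n <- function(x, n, other = "Other") {
  counts <- table(x)
  # Rank 1 = most frequent; tied values share the lowest rank of the tie
  rnk <- rank(-as.integer(counts), ties.method = "min")
  keep <- names(counts)[rnk <= n]
  out <- ifelse(x %in% keep, x, other)
  # Only add the "Other" level if something was actually lumped
  factor(out, levels = c(keep, if (any(out == other)) other))
}

# Data from the example: ALT, AST, and ALP tie at rank 2, so nothing is lumped
lb_codes <- c("HGB", "ALT", "AST", "HGB", "ALP")
tied_result <- lump_top_n(lb_codes, n = 2)
table(tied_result)

# Hypothetical data with clearer frequency differences
lb_codes_more <- c("HGB", "HGB", "HGB", "ALT", "ALT", "AST", "ALP")
lumped_result <- lump_top_n(lb_codes_more, n = 2)
table(lumped_result)  # ALT 2, HGB 3, Other 2
```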

Example 3: Reordering by Median Value (VSTEST and VSORRES)

Suppose you want to order vital sign tests by their median result, which is useful for plotting or reporting (e.g., boxplots where you want the lowest-median test on the left and the highest on the right).

Input Table:

USUBJID	VSTEST	VSORRES
01-701-101	PULSE	70
01-701-102	SYSBP	120
01-701-103	DIABP	80
01-701-104	PULSE	75
01-701-105	SYSBP	130

R Code:

library(dplyr)
library(forcats)
vs <- data.frame(
  USUBJID = c("01-701-101", "01-701-102", "01-701-103", "01-701-104", "01-701-105"),
  VSTEST = c("PULSE", "SYSBP", "DIABP", "PULSE", "SYSBP"),
  VSORRES = c(70, 120, 80, 75, 130)
)
# Reorder VSTEST by median VSORRES
vs$VSTEST <- fct_reorder(vs$VSTEST, vs$VSORRES, .fun = median)
levels(vs$VSTEST)
# To see the effect, sort by the new factor order
vs[order(vs$VSTEST), ]

fct_reorder(VSTEST, VSORRES, .fun = median) computes the median VSORRES for each VSTEST and orders the factor levels accordingly.
This is especially useful for visualizations, as it ensures the plot order reflects the central tendency of each test.

Output Table (levels):

Ordered VSTEST Levels
PULSE
DIABP
SYSBP

Output Table (data sorted by new factor order):

USUBJID	VSTEST	VSORRES
01-701-101	PULSE	70
01-701-104	PULSE	75
01-701-103	DIABP	80
01-701-102	SYSBP	120
01-701-105	SYSBP	130

Explanation:
- The median for PULSE is 72.5, for DIABP is 80, and for SYSBP is 125.
- The factor levels are ordered from lowest to highest median (PULSE, DIABP, SYSBP).
- This ordering is reflected in plots (e.g., boxplots) and summaries, making comparisons more intuitive.
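
The medians quoted above can be checked directly. This base R sketch computes the per-test medians with tapply() and derives the level order from them; it illustrates the calculation and is not a description of fct_reorder()'s internals. The object names are assumptions made for this example.

```r
vs <- data.frame(
  VSTEST = c("PULSE", "SYSBP", "DIABP", "PULSE", "SYSBP"),
  VSORRES = c(70, 120, 80, 75, 130)
)

# Median result per test (names come back alphabetically)
test_medians <- tapply(vs$VSORRES, vs$VSTEST, median)
test_medians  # DIABP 80, PULSE 72.5, SYSBP 125

# Level order from lowest to highest median
levels_by_median <- names(test_medians)[order(test_medians)]
levels_by_median  # "PULSE" "DIABP" "SYSBP"

# Factor with levels in median order
vstest_ordered <- factor(vs$VSTEST, levels = levels_by_median)
```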

Example 4: Collapsing Categories (VSTEST)

Suppose you want to group related vital signs:

VSTEST
PULSE
SYSBP
DIABP
PULSE
SYSBP

R Code:

vs_factor <- factor(c("PULSE", "SYSBP", "DIABP", "PULSE", "SYSBP"))
vs_collapsed <- fct_collapse(vs_factor, BP = c("SYSBP", "DIABP"), HR = "PULSE")
table(vs_collapsed)

fct_collapse() groups SYSBP and DIABP as "BP", PULSE as "HR".

Output:

> print(table(vs_collapsed))
vs_collapsed
BP HR 
 3  2

Explanation:
- Useful for summarizing or simplifying categories.
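
For comparison, base R can express the same grouping by assigning a named list to levels(). The sketch below is an alternative shown for illustration, with vs_base as an assumed object name. One difference to be aware of: with the list form, any existing level that is not mentioned in the list becomes NA, so every level should be listed.

```r
vs_factor <- factor(c("PULSE", "SYSBP", "DIABP", "PULSE", "SYSBP"))

# Work on a copy so the original factor is left untouched
vs_base <- vs_factor

# Each list name is a new level; its value lists the old levels it absorbs
levels(vs_base) <- list(BP = c("SYSBP", "DIABP"), HR = "PULSE")

table(vs_base)  # BP 3, HR 2
```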

6. Beyond the Basics

Manually Set Level Order

Suppose you want to set a custom order for your VISIT levels, for example, to display "WEEK 2" first, followed by "BASELINE", "WEEK 1", and "SCREENING".

Input Table:

visit_factor
SCREENING
WEEK 1
BASELINE
WEEK 2
WEEK 1

R Code:

visit_factor <- factor(c("SCREENING", "WEEK 1", "BASELINE", "WEEK 2", "WEEK 1"),
                       levels = c("SCREENING", "BASELINE", "WEEK 1", "WEEK 2"))
# Manually set custom order
visit_factor_custom <- fct_relevel(visit_factor, "WEEK 2", "BASELINE", "WEEK 1", "SCREENING")
levels(visit_factor_custom)
sort(visit_factor_custom)

fct_relevel(visit_factor, ...) sets the order of levels as specified.
sort(visit_factor_custom) sorts the factor according to the new custom order.

Output:

> print(levels(visit_factor_custom))
[1] "WEEK 2"    "BASELINE"  "WEEK 1"    "SCREENING"
> print(sort(visit_factor_custom))
[1] WEEK 2    BASELINE  WEEK 1    WEEK 1    SCREENING
Levels: WEEK 2 BASELINE WEEK 1 SCREENING

Explanation:
- The levels are now in the custom order you specified, which is useful for custom reporting or plotting.

Move Levels to the Front

visit_relevel <- fct_relevel(visit_factor, "WEEK 1", "WEEK 2", after = 0)
sort(visit_relevel)

Moves "WEEK 1" and "WEEK 2" to the front.

Output:

> sort(visit_relevel)
[1] WEEK 1    WEEK 1    WEEK 2    SCREENING BASELINE 
Levels: WEEK 1 WEEK 2 SCREENING BASELINE

Explanation:
- Levels are reordered as specified.
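
The after argument of fct_relevel() controls where the named levels are placed. The sketch below, which assumes forcats is installed, moves a level to the end with after = Inf and into second position with after = 1. The object names are chosen for this example only.

```r
library(forcats)

visit_factor <- factor(c("SCREENING", "WEEK 1", "BASELINE", "WEEK 2", "WEEK 1"),
                       levels = c("SCREENING", "BASELINE", "WEEK 1", "WEEK 2"))

# after = Inf moves the named level to the end
visit_scr_last <- fct_relevel(visit_factor, "SCREENING", after = Inf)
levels(visit_scr_last)  # "BASELINE" "WEEK 1" "WEEK 2" "SCREENING"

# after = 1 places the named level directly after the first remaining level
visit_wk2_second <- fct_relevel(visit_factor, "WEEK 2", after = 1)
levels(visit_wk2_second)  # "SCREENING" "WEEK 2" "BASELINE" "WEEK 1"
```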

Rename Levels

visit_recoded <- fct_recode(visit_factor, SCR = "SCREENING", BL = "BASELINE", WK1 = "WEEK 1", WK2 = "WEEK 2")
levels(visit_recoded)

Output:

> print(levels(visit_recoded))
[1] "SCR" "BL"  "WK1" "WK2"

Renames each level to a shorter label. The result is stored in a new object so that visit_factor keeps its original levels for the steps that follow.

Collapse Levels

visit_collapsed <- fct_collapse(visit_factor, EARLY = c("SCREENING", "BASELINE"))
levels(visit_collapsed)

Groups "SCREENING" and "BASELINE" into "EARLY".

Output:

> print(levels(visit_collapsed))
[1] "EARLY"  "WEEK 1" "WEEK 2"

Explanation:
- "SCREENING" and "BASELINE" are merged into the single level "EARLY"; "WEEK 1" and "WEEK 2" are unchanged.

Drop Unused Levels

visit_factor <- fct_drop(visit_factor)

Removes levels not present in the data.
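
Unused levels typically appear after subsetting, because a subset keeps the full level list of the original factor. The sketch below assumes forcats is installed and uses a hypothetical subset that excludes the WEEK 2 records.

```r
library(forcats)

visit_factor <- factor(c("SCREENING", "WEEK 1", "BASELINE", "WEEK 2", "WEEK 1"),
                       levels = c("SCREENING", "BASELINE", "WEEK 1", "WEEK 2"))

# Hypothetical subset: drop the WEEK 2 records
visit_subset <- visit_factor[visit_factor != "WEEK 2"]

# The level is still declared, so it shows up with a count of zero
levels(visit_subset)
table(visit_subset)

# fct_drop() removes levels that no longer occur in the data
visit_dropped <- fct_drop(visit_subset)
levels(visit_dropped)  # "SCREENING" "BASELINE" "WEEK 1"
```

Whether to drop is a reporting decision: a summary table may need the zero-count row, in which case the unused level should be kept.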

Count Occurrences

fct_count(visit_factor)

Returns a tibble with counts for each level.

Output:

> print(fct_count(visit_factor))
# A tibble: 4 × 2
  f             n
  <fct>     <int>
1 SCREENING     1
2 BASELINE      1
3 WEEK 1        2
4 WEEK 2        1

Keep Order of Appearance

visit_inorder <- fct_inorder(c("BASELINE", "WEEK 1", "WEEK 2", "BASELINE"))
sort(visit_inorder)

Levels remain in the order they first appear.

Output:

> print(sort(visit_inorder))
[1] BASELINE BASELINE WEEK 1   WEEK 2  
Levels: BASELINE WEEK 1 WEEK 2

Explanation:
- Levels remain in the order of appearance.
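
In the example above, the order of first appearance happens to match alphabetical order, so the effect is hard to see. The base R sketch below uses a hypothetical vector in which the two orders differ. It relies on the idiom factor(x, levels = unique(x)), which for a character vector should give the same level order as fct_inorder().

```r
# Hypothetical visits where first appearance differs from alphabetical order
visit_seen <- c("WEEK 1", "BASELINE", "WEEK 2", "BASELINE")

# Default factor(): levels are sorted alphabetically
levels(factor(visit_seen))  # "BASELINE" "WEEK 1" "WEEK 2"

# Levels in order of first appearance
visit_first_seen <- factor(visit_seen, levels = unique(visit_seen))
levels(visit_first_seen)  # "WEEK 1" "BASELINE" "WEEK 2"
```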

7. Advanced Factoring Example: LBTEST in Laboratory Data

Suppose you have a laboratory dataset:

USUBJID	LBTEST
01-701-101	HGB
01-701-102	ALT
01-701-103	HGB
01-701-104	AST
01-701-105	ALT

Tabulate Frequencies

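library(janitor)
# Build the lb data frame shown in the table above
lb <- data.frame(USUBJID = paste0("01-701-10", 1:5), LBTEST = c("HGB", "ALT", "HGB", "AST", "ALT"))
tabyl(lb$LBTEST)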

Shows frequency and proportion of each test.

Output Table:

LBTEST	n	percent
ALT	2	0.4
AST	1	0.2
HGB	2	0.4

Order by Frequency

library(forcats)
lb$LBTEST <- fct_infreq(lb$LBTEST)
levels(lb$LBTEST)

Reorders LBTEST levels by frequency.

Output:

> print(levels(lb$LBTEST))
[1] "ALT" "HGB" "AST"

Reverse Order

fct_rev(fct_infreq(lb$LBTEST)) %>% head()

Reverses the order of factor levels.

Output:

> print(fct_rev(fct_infreq(lb$LBTEST)))
[1] HGB ALT HGB AST ALT
Levels: AST HGB ALT

Order by Numeric Mapping (e.g., for ranking)

lb %>%
  mutate(LBTEST_ord = fct_reorder(LBTEST, as.numeric(LBTEST))) %>%
  arrange(LBTEST_ord)

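Reorders LBTEST by a numeric vector. Because LBTEST is already a factor at this point, as.numeric(LBTEST) returns its integer level codes, so this particular call leaves the current level order unchanged. Substitute a genuine numeric variable (for example, a ranking) to change the order.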
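
The base R sketch below shows what as.numeric() returns for a factor and how an explicit ranking can drive the level order instead. The ranking values in test_rank are hypothetical and exist only to illustrate the pattern.

```r
# LBTEST as a factor with the frequency-based level order from above
lbtest <- factor(c("HGB", "ALT", "HGB", "AST", "ALT"),
                 levels = c("ALT", "HGB", "AST"))

# as.numeric() on a factor gives the level positions, not a measurement
lbtest_codes <- as.numeric(lbtest)
lbtest_codes  # 2 1 2 3 1

# Hypothetical ranking: lower number = listed first
test_rank <- c(HGB = 1, ALT = 2, AST = 3)

# Use the names of the sorted ranking as the new level order
lbtest_ranked <- factor(lbtest, levels = names(sort(test_rank)))
levels(lbtest_ranked)  # "HGB" "ALT" "AST"
```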

Group and Rename Levels

# Assign the result back to lb so the new column can be tabulated
lb <- lb %>%
  mutate(LBTEST_group = fct_recode(LBTEST,
    "LIVER" = "ALT",
    "LIVER" = "AST",
    "BLOOD" = "HGB"
  ))
tabyl(lb$LBTEST_group)

Combines and renames levels.

Output:

> print(tabyl(lb$LBTEST_group))
lb$LBTEST_group n percent
          LIVER 3     0.6
          BLOOD 2     0.4

Convert to Binary Factor

# Assign the result back to lb so the new column can be tabulated
lb <- lb %>%
  mutate(LBTEST_BIN = ifelse(LBTEST == "HGB", "Blood", "Liver"),
         LBTEST_BIN = factor(LBTEST_BIN))
tabyl(lb$LBTEST_BIN)

Converts LBTEST to a binary factor.

Output:

> print(tabyl(lb$LBTEST_BIN))
lb$LBTEST_BIN n percent
        Blood 2     0.4
        Liver 3     0.6

Explanation:
- Together, these steps show how a single SDTM categorical variable can be tabulated, reordered, regrouped, and recoded using factors.

8. Conclusion

- Factors are essential for handling categorical data in R and SDTM.
- Always define levels explicitly for correct ordering and analysis.
- Use the forcats package for advanced factor manipulation.
- Proper use of factors improves data quality, analysis, and regulatory reporting.

Resource download links

2.4.1.-Factors.zip
