
2.4.1. Factors

1. Introduction

Factors are R's data structure for representing categorical data. In SDTM and clinical data, categorical variables have a fixed set of possible values (levels), such as VISIT names, laboratory test names (LBTEST) and their short codes (LBTESTCD), or vital sign test names (VSTEST). Working confidently with factors is essential for data analysis, visualization, and regulatory reporting in R.


2. Why Use Factors?

  • Ensure SDTM categorical variables (e.g., VISIT, LBTEST, VSTEST) have a defined set and order of possible values.
  • Enable correct sorting, grouping, and plotting of categorical data.
  • Improve clarity and reproducibility in SDTM data analysis.
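  • Needed by many statistical modeling and reporting functions in R: modeling functions such as lm() and glm() treat factors as categorical predictors, and tables and plots follow the level order.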

3. Creating and Understanding Factors

  • Factors have two components: the observed values (stored internally as integer codes) and the set of possible levels those codes point to.
  • You can create a factor from a character vector and specify the levels.
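
To make the two components concrete, here is a minimal base-R sketch. The object name f is illustrative only; the visit names are the ones used throughout this section.

```r
# Build a factor with an explicit, protocol-style level set
f <- factor(c("SCREENING", "WEEK 1", "BASELINE"),
            levels = c("SCREENING", "BASELINE", "WEEK 1", "WEEK 2"))

levels(f)      # the allowed values, in their defined order
as.integer(f)  # the position of each observation within levels(f)
nlevels(f)     # number of levels, including the unobserved "WEEK 2"
```

Note that "WEEK 2" is a valid level even though no observation uses it.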

Input Table:

USUBJID VISIT
01-701-101 SCREENING
01-701-102 WEEK 1
01-701-103 BASELINE
01-701-104 WEEK 2
01-701-105 WEEK 1

R Code:

# All possible visits
all_visits <- c("SCREENING", "BASELINE", "WEEK 1", "WEEK 2")
# Observed data
visit_vals <- c("SCREENING", "WEEK 1", "BASELINE", "WEEK 2", "WEEK 1")
sort(visit_vals)
  • all_visits defines all possible visit names.
  • visit_vals is a character vector of observed visits.
  • sort(visit_vals) sorts the character vector alphabetically.

Output:

> print(sort(visit_vals))
[1] "BASELINE"  "SCREENING" "WEEK 1"    "WEEK 1"    "WEEK 2"   
  • Explanation:
    • Sorting a character vector results in alphabetical order, not protocol-defined visit order.
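
The mismatch becomes more visible once visit numbers reach two digits. The sketch below uses a hypothetical "WEEK 10" visit (not part of the input table) and sorts with method = "radix", which always uses the C locale, so the result does not depend on the session's locale settings.

```r
# Hypothetical visit names, including a two-digit week
visits <- c("WEEK 2", "WEEK 10", "WEEK 1")

# Character sorting compares text position by position, so "WEEK 10" lands before "WEEK 2"
alpha_sorted <- sort(visits, method = "radix")
alpha_sorted
```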

4. Creating a Factor with Levels

R Code:

visit_factor <- factor(visit_vals, levels = all_visits)
visit_factor
sort(visit_factor)
  • factor(visit_vals, levels = all_visits) creates a factor with specified levels (protocol order).
  • sort(visit_factor) sorts the factor according to the defined levels.

Output:

> print(visit_factor)
[1] SCREENING WEEK 1    BASELINE  WEEK 2    WEEK 1   
Levels: SCREENING BASELINE WEEK 1 WEEK 2
> print(sort(visit_factor))
[1] SCREENING BASELINE  WEEK 1    WEEK 1    WEEK 2   
Levels: SCREENING BASELINE WEEK 1 WEEK 2
  • Explanation:
    • The factor respects the protocol-defined order for sorting.
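
One consequence of fixing the levels is worth checking in practice: any value that is not in the level set becomes NA without a warning. The following base-R sketch assumes two hypothetical problem values, a mixed-case "Week 1" and an "UNSCHEDULED" visit, and shows one way to list them before they are lost.

```r
all_visits <- c("SCREENING", "BASELINE", "WEEK 1", "WEEK 2")
# "Week 1" and "UNSCHEDULED" are hypothetical values outside the level set
raw_visits <- c("SCREENING", "Week 1", "UNSCHEDULED", "WEEK 2")

visit_checked <- factor(raw_visits, levels = all_visits)
visit_checked                    # unmatched values are now NA

# List the offending values so they can be reviewed or mapped
unexpected <- setdiff(raw_visits, all_visits)
unexpected
```

A check like this is a reasonable guard before converting raw VISIT values to a factor.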

5. Working with Factors Using forcats

  • The forcats package provides many useful functions for manipulating factors in SDTM and clinical data.

Example 1: Reordering by Frequency (VSTEST)

Suppose you have a vital signs dataset:

USUBJID VSTEST
01-701-101 PULSE
01-701-102 SYSBP
01-701-103 DIABP
01-701-104 PULSE
01-701-105 SYSBP

R Code:

library(forcats)
vs_factor <- factor(c("PULSE", "SYSBP", "DIABP", "PULSE", "SYSBP"))
vs_factor_freq <- fct_infreq(vs_factor)
levels(vs_factor_freq)
  • fct_infreq(vs_factor) reorders VSTEST levels by frequency.

Output:

> vs_factor_freq
[1] PULSE SYSBP DIABP PULSE SYSBP
Levels: PULSE SYSBP DIABP
> print(levels(vs_factor_freq))
[1] "PULSE" "SYSBP" "DIABP"
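  • Explanation:
    • PULSE and SYSBP each occur twice and DIABP once, so DIABP moves last. Tied levels keep their previous relative order (here the alphabetical order assigned by factor()).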
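
For readers who want to see the mechanics, the same ordering can be sketched in base R. This is an illustration of the idea, not how forcats implements fct_infreq(); the object names are illustrative.

```r
vstest <- c("PULSE", "SYSBP", "DIABP", "PULSE", "SYSBP")

counts <- table(vstest)   # counted in alphabetical order: DIABP, PULSE, SYSBP
# order() leaves ties in their existing order, so PULSE stays ahead of SYSBP
by_freq <- names(counts)[order(as.vector(counts), decreasing = TRUE)]

vs_by_freq <- factor(vstest, levels = by_freq)
levels(vs_by_freq)
```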

Example 2: Lumping Rare Categories (LBTESTCD)

Suppose you have a lab test code variable:

USUBJID LBTESTCD
01-701-101 HGB
01-701-102 ALT
01-701-103 AST
01-701-104 HGB
01-701-105 ALP

R Code:

lb_factor <- factor(c("HGB", "ALT", "AST", "HGB", "ALP"))
lb_lumped <- fct_lump(lb_factor, n = 2)
table(lb_lumped)
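  • fct_lump(lb_factor, n = 2) is intended to keep the two most frequent tests and lump the rest into "Other". With the default ties.method = "min", however, ALT, AST, and ALP all tie for rank 2, so every level is kept.

Output:

> table(lb_lumped)
lb_lumped
ALP ALT AST HGB 
  1   1   1   2 
  • Explanation:
    • HGB occurs twice and the other three tests once each. Because the three-way tie falls inside the top two ranks, no "Other" level is created. Lumping takes effect only when fewer levels qualify, for example when the counts differ or when ties are broken with fct_lump_n(lb_factor, n = 2, ties.method = "first").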
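
To show what lumping looks like once ties are resolved, here is a base-R sketch on the same data. It breaks ties by alphabetical position, which is an arbitrary choice made for illustration and not a description of forcats internals; the object names are illustrative.

```r
lbtestcd <- c("HGB", "ALT", "AST", "HGB", "ALP")

counts <- table(lbtestcd)   # ALP 1, ALT 1, AST 1, HGB 2
ranked <- names(counts)[order(as.vector(counts), decreasing = TRUE)]
keep   <- ranked[1:2]       # ties resolved by alphabetical position: "HGB", "ALP"

# Everything outside the kept set is relabelled "Other"
lumped <- ifelse(lbtestcd %in% keep, lbtestcd, "Other")
lb_lumped_base <- factor(lumped, levels = c(sort(keep), "Other"))
table(lb_lumped_base)
```

In real lab data the counts usually differ enough that the tie-breaking rule matters less, but it is worth confirming which levels were kept.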

Example 3: Reordering by Median Value (VSTEST and VSORRES)

Suppose you want to order vital sign tests by their median result, which is useful for plotting or reporting (e.g., boxplots where you want the lowest-median test on the left and the highest on the right).

Input Table:

USUBJID VSTEST VSORRES
01-701-101 PULSE 70
01-701-102 SYSBP 120
01-701-103 DIABP 80
01-701-104 PULSE 75
01-701-105 SYSBP 130

R Code:

library(dplyr)
library(forcats)
vs <- data.frame(
  USUBJID = c("01-701-101", "01-701-102", "01-701-103", "01-701-104", "01-701-105"),
  VSTEST = c("PULSE", "SYSBP", "DIABP", "PULSE", "SYSBP"),
  VSORRES = c(70, 120, 80, 75, 130)
)
# Reorder VSTEST by median VSORRES
vs$VSTEST <- fct_reorder(vs$VSTEST, vs$VSORRES, .fun = median)
levels(vs$VSTEST)
# To see the effect, sort by the new factor order
vs[order(vs$VSTEST), ]
  • fct_reorder(VSTEST, VSORRES, .fun = median) computes the median VSORRES for each VSTEST and orders the factor levels accordingly.
  • This is especially useful for visualizations, as it ensures the plot order reflects the central tendency of each test.

Output Table (levels):

Ordered VSTEST Levels
PULSE
DIABP
SYSBP

Output Table (data sorted by new factor order):

USUBJID VSTEST VSORRES
01-701-101 PULSE 70
01-701-104 PULSE 75
01-701-103 DIABP 80
01-701-102 SYSBP 120
01-701-105 SYSBP 130
  • Explanation:
    • The median for PULSE is 72.5, for DIABP is 80, and for SYSBP is 125.
    • The factor levels are ordered from lowest to highest median (PULSE, DIABP, SYSBP).
    • This ordering is reflected in plots (e.g., boxplots) and summaries, making comparisons more intuitive.
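
The medians quoted above can be verified directly, and base R's reorder() gives the same level order. This is a sketch of the equivalent base-R approach; vs_demo and vstest_by_median are illustrative names.

```r
vs_demo <- data.frame(
  VSTEST  = c("PULSE", "SYSBP", "DIABP", "PULSE", "SYSBP"),
  VSORRES = c(70, 120, 80, 75, 130)
)

# Per-test medians, for reference
medians <- tapply(vs_demo$VSORRES, vs_demo$VSTEST, median)
medians

# reorder() sorts the levels by the summary statistic, lowest first
vstest_by_median <- reorder(factor(vs_demo$VSTEST), vs_demo$VSORRES, FUN = median)
levels(vstest_by_median)
```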

Example 4: Collapsing Categories (VSTEST)

Suppose you want to group related vital signs:

VSTEST
PULSE
SYSBP
DIABP
PULSE
SYSBP

R Code:

vs_factor <- factor(c("PULSE", "SYSBP", "DIABP", "PULSE", "SYSBP"))
vs_collapsed <- fct_collapse(vs_factor, BP = c("SYSBP", "DIABP"), HR = "PULSE")
table(vs_collapsed)
  • fct_collapse() groups SYSBP and DIABP as "BP", PULSE as "HR".

Output:

> print(table(vs_collapsed))
vs_collapsed
BP HR 
 3  2 
  • Explanation:
    • Useful for summarizing or simplifying categories.
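
Base R can express the same grouping by assigning a named list to levels(). The sketch below is an illustrative alternative, with vs_base as an assumed object name.

```r
vs_base <- factor(c("PULSE", "SYSBP", "DIABP", "PULSE", "SYSBP"))

# Each list element names a new level and lists the old levels it absorbs
levels(vs_base) <- list(BP = c("SYSBP", "DIABP"), HR = "PULSE")
table(vs_base)
```

With this list form, any old level that is not mentioned is turned into NA, so every level needs to be listed.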

6. Beyond the Basics

Manually Set Level Order

Suppose you want to set a custom order for your VISIT levels, for example, to display "WEEK 2" first, followed by "BASELINE", "WEEK 1", and "SCREENING".

Input Table:

visit_factor
SCREENING
WEEK 1
BASELINE
WEEK 2
WEEK 1

R Code:

visit_factor <- factor(c("SCREENING", "WEEK 1", "BASELINE", "WEEK 2", "WEEK 1"),
                       levels = c("SCREENING", "BASELINE", "WEEK 1", "WEEK 2"))
# Manually set custom order
visit_factor_custom <- fct_relevel(visit_factor, "WEEK 2", "BASELINE", "WEEK 1", "SCREENING")
levels(visit_factor_custom)
sort(visit_factor_custom)
  • fct_relevel(visit_factor, ...) sets the order of levels as specified.
  • sort(visit_factor_custom) sorts the factor according to the new custom order.

Output:

> print(levels(visit_factor_custom))
[1] "WEEK 2"    "BASELINE"  "WEEK 1"    "SCREENING"
> print(sort(visit_factor_custom))
[1] WEEK 2    BASELINE  WEEK 1    WEEK 1    SCREENING
Levels: WEEK 2 BASELINE WEEK 1 SCREENING
  • Explanation:
    • The levels are now in the custom order you specified, which is useful for custom reporting or plotting.

Move Levels to the Front

visit_relevel <- fct_relevel(visit_factor, "WEEK 1", "WEEK 2", after = 0)
sort(visit_relevel)
  • Moves "WEEK 1" and "WEEK 2" to the front.

Output:

> sort(visit_relevel)
[1] WEEK 1    WEEK 1    WEEK 2    SCREENING BASELINE 
Levels: WEEK 1 WEEK 2 SCREENING BASELINE
  • Explanation:
    • Levels are reordered as specified.
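
When only one level has to move to the front, base R's relevel() does the job. This is commonly used to choose the reference category for a model, since the first level is the reference under R's default treatment contrasts. The object names below are illustrative.

```r
visit_base <- factor(c("SCREENING", "WEEK 1", "BASELINE", "WEEK 2", "WEEK 1"),
                     levels = c("SCREENING", "BASELINE", "WEEK 1", "WEEK 2"))

# relevel() moves exactly one level to the front; the rest keep their order
visit_ref <- relevel(visit_base, ref = "BASELINE")
levels(visit_ref)
```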

Rename Levels

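visit_short <- fct_recode(visit_factor, SCR = "SCREENING", BL = "BASELINE", WK1 = "WEEK 1", WK2 = "WEEK 2")
levels(visit_short)
  • fct_recode() renames levels using new = "old" pairs. The result is stored in a new object so that visit_factor keeps its original level names for the examples that follow.

Output:

> print(levels(visit_short))
[1] "SCR" "BL"  "WK1" "WK2"
  • Explanation:
    • Each level now carries a short label; the level order is unchanged.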
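
Renaming can also be driven by a lookup table, which is convenient when the mapping comes from a specification or codelist. The base-R sketch below assumes a hypothetical mapping vector named short_names.

```r
visit_base <- factor(c("SCREENING", "WEEK 1", "BASELINE", "WEEK 2", "WEEK 1"),
                     levels = c("SCREENING", "BASELINE", "WEEK 1", "WEEK 2"))

# Lookup table: names are the old levels, values are the new labels
short_names <- c("SCREENING" = "SCR", "BASELINE" = "BL",
                 "WEEK 1" = "WK1", "WEEK 2" = "WK2")

visit_short_base <- visit_base
# Look up each existing level by name, so the mapping does not depend on level order
levels(visit_short_base) <- unname(short_names[levels(visit_base)])
levels(visit_short_base)
as.character(visit_short_base)
```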

Collapse Levels

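visit_collapsed <- fct_collapse(factor(visit_vals, levels = all_visits), EARLY = c("SCREENING", "BASELINE"))
levels(visit_collapsed)
  • Groups "SCREENING" and "BASELINE" into "EARLY". The factor is rebuilt from visit_vals and all_visits (Section 3) so that the original level names are guaranteed to be present.

Output:

> print(levels(visit_collapsed))
[1] "EARLY"  "WEEK 1" "WEEK 2"
  • Explanation:
    • "SCREENING" and "BASELINE" are merged into the single level "EARLY"; "WEEK 1" and "WEEK 2" are left as they are.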

Drop Unused Levels

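visit_factor <- fct_drop(visit_factor)
  • Removes levels that have no observations. Every level of visit_factor occurs in the data here, so the factor is unchanged; the function matters after filtering or subsetting.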
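
The effect is easier to see on a subset. In this base-R sketch (object names are illustrative), filtering to post-baseline visits leaves two empty levels behind, and droplevels(), the base-R counterpart, removes them.

```r
visit_base <- factor(c("SCREENING", "WEEK 1", "BASELINE", "WEEK 2", "WEEK 1"),
                     levels = c("SCREENING", "BASELINE", "WEEK 1", "WEEK 2"))

# Subsetting keeps the full level set, so absent visits show a count of 0
post_baseline <- visit_base[visit_base %in% c("WEEK 1", "WEEK 2")]
table(post_baseline)

# droplevels() removes the levels that no longer occur
table(droplevels(post_baseline))
```

Whether to drop is a reporting decision: zero-count rows are sometimes required in a summary table, in which case the unused levels should be kept.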

Count Occurrences

fct_count(visit_factor)
  • Returns a tibble with counts for each level.

Output:

> print(fct_count(visit_factor))
# A tibble: 4 × 2
  f             n
  <fct>     <int>
1 SCREENING     1
2 BASELINE      1
3 WEEK 1        2
4 WEEK 2        1

Keep Order of Appearance

visit_inorder <- fct_inorder(c("BASELINE", "WEEK 1", "WEEK 2", "BASELINE"))
sort(visit_inorder)
  • Levels remain in the order they first appear.

Output:

> print(sort(visit_inorder))
[1] BASELINE BASELINE WEEK 1   WEEK 2  
Levels: BASELINE WEEK 1 WEEK 2
  • Explanation:
    • Levels remain in the order of appearance.
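
The same result can be sketched in base R by using unique() to supply the levels. Keep in mind that order of appearance only matches protocol order if the data are already sorted chronologically. The object names are illustrative.

```r
visits_seen <- c("BASELINE", "WEEK 1", "WEEK 2", "BASELINE")

# unique() returns values in the order they first occur
visit_first_seen <- factor(visits_seen, levels = unique(visits_seen))
levels(visit_first_seen)
```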

7. Advanced Factoring Example: LBTEST in Laboratory Data

Suppose you have a laboratory dataset:

USUBJID LBTEST
01-701-101 HGB
01-701-102 ALT
01-701-103 HGB
01-701-104 AST
01-701-105 ALT

Tabulate Frequencies

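library(janitor)
lb <- data.frame(USUBJID = sprintf("01-701-%d", 101:105),
                 LBTEST = c("HGB", "ALT", "HGB", "AST", "ALT"))
tabyl(lb$LBTEST)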
  • Shows frequency and proportion of each test.

Output Table:

LBTEST n percent
ALT 2 0.4
AST 1 0.2
HGB 2 0.4
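
If janitor is not available, the same counts and proportions can be reproduced with table() and prop.table(). This is a base-R sketch with illustrative object names, not a replacement for tabyl()'s formatting features.

```r
lbtest <- c("HGB", "ALT", "HGB", "AST", "ALT")

n <- table(lbtest)                         # counts per test, alphabetical order
freq <- data.frame(LBTEST  = names(n),
                   n       = as.vector(n),
                   percent = as.vector(prop.table(n)))
freq
```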

Order by Frequency

library(forcats)
lb$LBTEST <- fct_infreq(lb$LBTEST)
levels(lb$LBTEST)
  • Reorders LBTEST levels by frequency.

    Output:

    > print(levels(lb$LBTEST))
    [1] "ALT" "HGB" "AST"
    

Reverse Order

fct_rev(fct_infreq(lb$LBTEST)) %>% head()
  • Reverses the order of factor levels.

    Output:

    > print(fct_rev(fct_infreq(lb$LBTEST)))
    [1] HGB ALT HGB AST ALT
    Levels: AST HGB ALT
    

Order by Numeric Mapping (e.g., for ranking)

lb %>%
  mutate(LBTEST_ord = fct_reorder(LBTEST, as.numeric(LBTEST))) %>%
  arrange(LBTEST_ord)
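  • Reorders LBTEST by a numeric vector. Note that as.numeric(LBTEST) returns the factor's integer codes (or NA with a warning if LBTEST is still a character column), so this particular call reproduces the existing level order. For a meaningful ranking, pass a real numeric variable, such as a result column, as the second argument.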
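
Because as.numeric() on a factor is a frequent source of confusion, here is a short base-R sketch of what it returns. The dose_f factor is a hypothetical example that does not come from the dataset above.

```r
lbtest_f <- factor(c("HGB", "ALT", "HGB", "AST", "ALT"),
                   levels = c("ALT", "HGB", "AST"))
as.numeric(lbtest_f)   # integer codes: positions within the levels, not measurements

# Hypothetical factor whose labels look like numbers
dose_f <- factor(c("10", "5", "20"))
as.numeric(dose_f)                 # codes again, not the doses
as.numeric(as.character(dose_f))   # convert via character to recover the numbers
```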

Group and Rename Levels

lb <- lb %>%
  mutate(LBTEST_group = fct_recode(LBTEST,
    "LIVER" = "ALT",
    "LIVER" = "AST",
    "BLOOD" = "HGB"
  ))
tabyl(lb$LBTEST_group)
  • Combines and renames levels.

    Output:

    > print(tabyl(lb$LBTEST_group))
    lb$LBTEST_group n percent
              LIVER 3     0.6
              BLOOD 2     0.4
    

Convert to Binary Factor

lb <- lb %>%
  mutate(LBTEST_BIN = ifelse(LBTEST == "HGB", "Blood", "Liver"),
         LBTEST_BIN = factor(LBTEST_BIN))
tabyl(lb$LBTEST_BIN)
  • Converts LBTEST to a binary factor.

    Output:

    > print(tabyl(lb$LBTEST_BIN))
    lb$LBTEST_BIN n percent
            Blood 2     0.4
            Liver 3     0.6
    
  • Explanation:

    • Each R code block is explained with input and output tables, showing how factors are created, manipulated, and summarized for SDTM categorical data.

8. Conclusion

  • Factors are essential for handling categorical data in R and SDTM.
  • Always define levels explicitly for correct ordering and analysis.
  • Use the forcats package for advanced factor manipulation.
  • Proper use of factors improves data quality, analysis, and regulatory reporting.
