1.4. Migrating from SAS to R: A Skill Conversion Guide
1.4.16. The apply Family Functions in R vs SAS
1. Introduction
The apply family of functions in R (apply, lapply, sapply, vapply, mapply, tapply, etc.) provides a powerful and concise way to perform operations on data structures without explicit loops. These functions are essential for efficient data manipulation, especially in clinical trial programming, and offer a flexible alternative to SAS's array processing and summary procedures.
2. SAS vs R: Program Comparison
| Task | SAS Approach | R Approach (apply family) |
|---|---|---|
| Row/column summary | PROC MEANS, DATA step loops | apply() |
| List-wise operation | Array processing, macros | lapply(), sapply() |
| Grouped summary | PROC MEANS, BY statement | tapply(), aggregate(), by() |
| Parallel operation | Macro loops | mapply() |
2A. Table: Common Use Cases for apply Family Functions
| Function | Typical Use Cases |
|---|---|
| apply | - Row/column means, sums, min, max, sd; - Summarize repeated measures (labs, vitals); - Custom row/col ops |
| lapply | - Apply function to each column (e.g., uppercase, trim); - List element summaries; - Convert types |
| sapply | - Simplified summaries (length, class); - Quick checks per column/element; - Vectorized transformations |
| tapply | - Grouped summaries (mean/sum by group); - AE by severity, labs by visit; - Counts by group |
| mapply | - Row-wise operations across columns; - Concatenate/combine columns; - Custom flags/labels |
| replicate | - Simulations, bootstrapping; - Generate mock data; - Repeated random sampling |
| vapply | - Type-safe version of sapply; - Enforce output type for summaries; - Production code safety |
3. apply(): Row/Column Operations on Matrices and Data Frames
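Description: apply() applies a function to the rows or columns of a matrix or an all-numeric data frame. A data frame is coerced to a matrix first, so select only the numeric columns before calling it; otherwise every value is converted to character. It is especially useful for summarizing repeated measures (e.g., labs, vitals) for each subject.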
Possible Uses:
- Calculate row or column means, sums, min, max, standard deviation, etc.
- Summarize repeated measures for each subject (e.g., multiple lab results, vital signs).
- Apply custom functions to each row or column.
Input Table (Lab Results - LB):
| USUBJID | LBTESTCD | LBORRES1 | LBORRES2 | LBORRES3 |
|---|---|---|---|---|
| 01-001 | HGB | 13.2 | 13.5 | 13.1 |
| 01-002 | HGB | 12.8 | 13.0 | 12.9 |
| 01-003 | HGB | 14.0 | 13.8 | 14.2 |
The second argument of apply(), MARGIN, determines whether the function is applied to rows or to columns:
- MARGIN = 1 → apply the function to each row (row-wise)
- MARGIN = 2 → apply the function to each column (column-wise)
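As a quick illustration of the two MARGIN values, the sketch below uses a small made-up matrix (the name `m` and its values are illustrative only, not study data):

```r
# 2 rows x 3 columns, filled column by column:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
m <- matrix(1:6, nrow = 2)

row_sums <- apply(m, 1, sum)  # one value per row:    9 12
col_sums <- apply(m, 2, sum)  # one value per column: 3 7 11
```

The result always has one element per row (MARGIN = 1) or per column (MARGIN = 2) when the function returns a single value.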
R Row Example:
# Dummy LB data
lb <- data.frame(
USUBJID = c("01-001", "01-002", "01-003"),
LBTESTCD = c("HGB", "HGB", "HGB"),
LBORRES1 = c(13.2, 12.8, 14.0),
LBORRES2 = c(13.5, 13.0, 13.8),
LBORRES3 = c(13.1, 12.9, 14.2)
)
lb$row_mean <- apply(lb[, c("LBORRES1", "LBORRES2", "LBORRES3")], 1, mean)
lb$row_sd <- apply(lb[, c("LBORRES1", "LBORRES2", "LBORRES3")], 1, sd)
lb$row_min <- apply(lb[, c("LBORRES1", "LBORRES2", "LBORRES3")], 1, min)
lb$row_max <- apply(lb[, c("LBORRES1", "LBORRES2", "LBORRES3")], 1, max)
Output Table:
| USUBJID | LBTESTCD | LBORRES1 | LBORRES2 | LBORRES3 | row_mean | row_sd | row_min | row_max |
|---|---|---|---|---|---|---|---|---|
| 01-001 | HGB | 13.2 | 13.5 | 13.1 | 13.27 | 0.208 | 13.1 | 13.5 |
| 01-002 | HGB | 12.8 | 13.0 | 12.9 | 12.90 | 0.100 | 12.8 | 13.0 |
| 01-003 | HGB | 14.0 | 13.8 | 14.2 | 14.00 | 0.200 | 13.8 | 14.2 |
R Column Example: We’ll now apply a function column-wise to the three lab result columns.
apply(lb[, c("LBORRES1", "LBORRES2", "LBORRES3")], 2, mean)
Explanation
- lb[, c("LBORRES1", "LBORRES2", "LBORRES3")]: selects the numeric lab result columns.
- 2: tells apply() to operate column-wise.
- mean: calculates the mean of each column.
Output
LBORRES1 LBORRES2 LBORRES3
13.33 13.43 13.40
This tells us:
- The average of LBORRES1 across all subjects is 13.33
- The average of LBORRES2 is 13.43
- The average of LBORRES3 is 13.40
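The "custom functions" use case listed above can be sketched with an anonymous function. The example below rebuilds a reduced copy of the LB data so it runs on its own; the column name `row_range` and the helper vector `res_cols` are illustrative choices, not part of any standard.

```r
lb <- data.frame(
  USUBJID  = c("01-001", "01-002", "01-003"),
  LBORRES1 = c(13.2, 12.8, 14.0),
  LBORRES2 = c(13.5, 13.0, 13.8),
  LBORRES3 = c(13.1, 12.9, 14.2)
)
res_cols <- c("LBORRES1", "LBORRES2", "LBORRES3")

# Custom anonymous function: spread (max - min) of each subject's results
lb$row_range <- apply(lb[, res_cols], 1, function(x) max(x) - min(x))

# rowMeans() is a vectorized shortcut for apply(..., 1, mean)
lb$row_mean <- rowMeans(lb[, res_cols])
```

For plain means and sums, rowMeans(), rowSums(), colMeans(), and colSums() are usually faster than apply(); apply() earns its place when the per-row logic is something base R has no dedicated function for.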
4. lapply(): Element-wise Operations on Lists or Data Frame Columns
Description: lapply() applies a function to each element of a list or each column of a data frame, always returning a list. It is commonly used for cleaning or transforming all columns of a domain.
Possible Uses:
- Apply a function to each column of a data frame (e.g., uppercase, trim whitespace).
- Apply a function to each element of a list (e.g., summary, length, custom transformation).
- Convert all factors to characters or vice versa.
Input Table (Demographics - DM):
| USUBJID | SEX | RACE |
|---|---|---|
| 01-001 | F | Asian |
| 01-002 | M | White |
| 01-003 | F | Black |
R Example:
# Dummy DM data
dm <- data.frame(
USUBJID = c("01-001", "01-002", "01-003"),
SEX = c("F", "M", "F"),
RACE = c("Asian", "White", "Black"),
stringsAsFactors = FALSE
)
char_cols <- sapply(dm, is.character)
dm[char_cols] <- lapply(dm[char_cols], toupper)
Output Table:
| USUBJID | SEX | RACE |
|---|---|---|
| 01-001 | F | ASIAN |
| 01-002 | M | WHITE |
| 01-003 | F | BLACK |
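The other uses listed for lapply(), trimming whitespace and converting factors to characters, follow the same select-then-assign pattern. A minimal sketch, assuming a made-up data frame `dm_raw` with one factor column and some padded strings:

```r
dm_raw <- data.frame(
  USUBJID = c("01-001", "01-002"),
  SEX     = factor(c("F", "M")),
  RACE    = c(" Asian ", "White  "),
  stringsAsFactors = FALSE
)

# Convert every factor column to character
is_fct <- vapply(dm_raw, is.factor, logical(1))
dm_raw[is_fct] <- lapply(dm_raw[is_fct], as.character)

# Trim leading/trailing whitespace in every character column
is_chr <- vapply(dm_raw, is.character, logical(1))
dm_raw[is_chr] <- lapply(dm_raw[is_chr], trimws)
```

Assigning the list returned by lapply() back into `dm_raw[...]` keeps the object a data frame, which is roughly what a SAS array loop over character variables achieves in a DATA step.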
5. sapply(): Simplified Element-wise Operations
Description: sapply() is similar to lapply() but tries to simplify the result to a vector or matrix; when the results cannot be simplified, it falls back to returning a list. It is useful for getting summary statistics or properties for each element/column.
Possible Uses:
- Get the length, class, or summary of each column.
- Apply a function to each element and return a vector.
- Quick checks or summaries for reporting.
Input Table (Adverse Events - AE):
| AEDECOD |
|---|
| HEADACHE |
| NAUSEA |
| DIZZINESS |
R Example:
# Dummy AE data
ae <- data.frame(AEDECOD = c("HEADACHE", "NAUSEA", "DIZZINESS"))
sapply(ae$AEDECOD, nchar)
Output (a named integer vector, shown here as a table):
| AEDECOD | nchar |
|---|---|
| HEADACHE | 8 |
| NAUSEA | 6 |
| DIZZINESS | 9 |
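The use-case table describes vapply() as the type-safe version of sapply(). The sketch below contrasts the two on a small made-up AE data frame (the AESEV and AESEQ values are illustrative):

```r
ae <- data.frame(
  AEDECOD = c("HEADACHE", "NAUSEA", "DIZZINESS"),
  AESEV   = c("MILD", "MODERATE", "SEVERE"),
  AESEQ   = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# One class per column, simplified to a named character vector
col_classes <- sapply(ae, class)

# vapply() does the same kind of job but stops with an error if any
# result is not a single integer, instead of quietly changing shape
term_len <- vapply(ae$AEDECOD, nchar, integer(1))
```

Because the third argument (`integer(1)` here) declares the expected type and length of each result, vapply() is the safer choice in production programs where an unexpected return shape should fail loudly.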
6. tapply(): Grouped Summaries
Description: tapply() applies a function over subsets of a vector, defined by a factor (grouping variable). It is ideal for grouped summaries.
Possible Uses:
- Calculate mean, sum, min, max by group (e.g., mean AGE by SEX, mean lab by visit).
- Summarize adverse events by severity or seriousness.
- Count occurrences by group.
Input Table:
| USUBJID | SEX | AGE |
|---|---|---|
| 01-001 | F | 34 |
| 01-002 | M | 40 |
| 01-003 | F | 29 |
R Example:
dm <- data.frame(
USUBJID = c("01-001", "01-002", "01-003"),
SEX = c("F", "M", "F"),
AGE = c(34, 40, 29)
)
tapply(dm$AGE, dm$SEX, mean)
Output:
| SEX | Mean AGE |
|---|---|
| F | 31.5 |
| M | 40 |
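Counting occurrences by group, the last use listed above, works the same way by passing length as the function. A small sketch on the same DM data (the result names `n_by_sex` and `max_by_sex` are illustrative):

```r
dm <- data.frame(
  USUBJID = c("01-001", "01-002", "01-003"),
  SEX     = c("F", "M", "F"),
  AGE     = c(34, 40, 29)
)

n_by_sex   <- tapply(dm$USUBJID, dm$SEX, length)  # F = 2, M = 1
max_by_sex <- tapply(dm$AGE, dm$SEX, max)         # F = 34, M = 40
```

The result is a named array, so individual groups can be looked up by name, for example `n_by_sex[["F"]]`.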
Example: Group by Multiple Columns with tapply()
Passing a list of grouping variables to tapply() returns a matrix (or a higher-dimensional array) with one dimension per grouping variable. To turn that result into a long-format data frame, convert it with as.table() and then as.data.frame(). The SDTM-style VS domain example below shows both steps:
Example: tapply() → Data Frame
# Sample SDTM-style data
vs <- data.frame(
USUBJID = c("SUBJ001", "SUBJ001", "SUBJ002", "SUBJ002", "SUBJ003", "SUBJ003"),
VSTESTCD = c("SYSBP", "DIABP", "SYSBP", "DIABP", "SYSBP", "DIABP"),
VSSTRESN = c(120, 80, 130, 85, 125, 82),
VISIT = c("SCREENING", "SCREENING", "SCREENING", "SCREENING", "SCREENING", "SCREENING")
)
# Group by USUBJID and VSTESTCD, compute mean
result <- tapply(vs$VSSTRESN, list(vs$USUBJID, vs$VSTESTCD), mean)
# Convert to data frame
df_result <- as.data.frame(as.table(result))
colnames(df_result) <- c("USUBJID", "VSTESTCD", "MEAN_VSSTRESN")
df_result
Output:
| USUBJID | VSTESTCD | MEAN_VSSTRESN |
|---|---|---|
| SUBJ001 | DIABP | 80 |
| SUBJ002 | DIABP | 85 |
| SUBJ003 | DIABP | 82 |
| SUBJ001 | SYSBP | 120 |
| SUBJ002 | SYSBP | 130 |
| SUBJ003 | SYSBP | 125 |
Explanation
- tapply(...): computes the grouped mean.
- as.table(...): converts the array to a table object.
- as.data.frame(...): converts the table to a tidy data frame.
- colnames(...): renames the columns for clarity.
This is a clean and efficient way to get grouped summary statistics in a data frame format using only base R.
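The comparison table also lists aggregate() for grouped summaries. As a sketch of that alternative, its formula interface returns a data frame directly, so no conversion step is needed (the result name `agg` is illustrative):

```r
vs <- data.frame(
  USUBJID  = c("SUBJ001", "SUBJ001", "SUBJ002", "SUBJ002", "SUBJ003", "SUBJ003"),
  VSTESTCD = c("SYSBP", "DIABP", "SYSBP", "DIABP", "SYSBP", "DIABP"),
  VSSTRESN = c(120, 80, 130, 85, 125, 82)
)

# Mean of VSSTRESN for each USUBJID x VSTESTCD combination
agg <- aggregate(VSSTRESN ~ USUBJID + VSTESTCD, data = vs, FUN = mean)
```

One practical difference: tapply() fills group combinations that have no data with NA, while aggregate() returns rows only for combinations that actually occur.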
7. mapply(): Parallel Operations Across Multiple Vectors
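Description: mapply() applies a function over multiple arguments (vectors/lists) at once, passing the first elements of each together, then the second elements, and so on. "Parallel" here means element by element, not concurrent execution. It is useful for row-wise operations combining multiple columns.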
Possible Uses:
- Concatenate or combine values from multiple columns for reporting.
- Create custom flags or labels using multiple variables.
- Apply a function to corresponding elements of several vectors.
Input Table:
| AEDECOD | AESEV |
|---|---|
| HEADACHE | MILD |
| NAUSEA | MODERATE |
| DIZZINESS | SEVERE |
R Example:
ae <- data.frame(
AEDECOD = c("HEADACHE", "NAUSEA", "DIZZINESS"),
AESEV = c("MILD", "MODERATE", "SEVERE")
)
mapply(function(a, b) paste(a, b, sep = " - "), ae$AEDECOD, ae$AESEV)
Output (the combined strings, shown next to the input columns):
| AEDECOD | AESEV | Combined |
|---|---|---|
| HEADACHE | MILD | HEADACHE - MILD |
| NAUSEA | MODERATE | NAUSEA - MODERATE |
| DIZZINESS | SEVERE | DIZZINESS - SEVERE |
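Creating a custom flag from several variables, the second use listed above, might look like the sketch below. The AESER values, the flagging rule, and the names `flag_event`, `FLAG`, and `FLAG2` are all hypothetical and chosen only for illustration.

```r
ae <- data.frame(
  AEDECOD = c("HEADACHE", "NAUSEA", "DIZZINESS"),
  AESEV   = c("MILD", "MODERATE", "SEVERE"),
  AESER   = c("N", "N", "Y"),
  stringsAsFactors = FALSE
)

# Hypothetical rule: flag events that are severe or serious.
# The function works on one row's values at a time.
flag_event <- function(sev, ser) {
  if (sev == "SEVERE" || ser == "Y") "Y" else "N"
}

# mapply() feeds the i-th AESEV and i-th AESER to flag_event together
ae$FLAG <- mapply(flag_event, ae$AESEV, ae$AESER, USE.NAMES = FALSE)

# Vectorized equivalent for comparison
ae$FLAG2 <- ifelse(ae$AESEV == "SEVERE" | ae$AESER == "Y", "Y", "N")
```

When the logic can be written with vectorized operators, as in `FLAG2`, that form is usually shorter and faster; mapply() is most helpful when the function only accepts scalar inputs.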
8. replicate(): Simulations and Repeated Calculations
Description: replicate() repeats an expression multiple times, useful for simulations or generating mock data.
Possible Uses:
- Simulate random subject ages or lab values.
- Bootstrap resampling.
- Generate mock data for testing.
R Example:
set.seed(123)
replicate(5, sample(20:70, 1))
Output:
A vector of 5 random ages, e.g., [38 70 57 21 67]. The exact values depend on the seed and on the R version's default sampling algorithm.
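Bootstrap resampling, also listed above, follows the same pattern: the expression passed to replicate() is re-evaluated on every repetition. A minimal sketch with made-up ages (the vector `age` and the 1000 repetitions are arbitrary choices):

```r
set.seed(123)
age <- c(34, 40, 29, 51, 46, 38)  # made-up ages, illustration only

# Resample the ages with replacement 1000 times and keep each mean
boot_means <- replicate(1000, mean(sample(age, replace = TRUE)))

# Rough percentile interval for the mean age
boot_ci <- quantile(boot_means, c(0.025, 0.975))
```

Each call to sample() draws a fresh resample, so `boot_means` holds 1000 different means whose spread approximates the sampling variability of the mean.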
9. Calling a Custom Function
We'll use a simulated VS (Vital Signs) domain and apply a custom function using the apply family.
Step 1: Simulate an SDTM-like VS Dataset
vs <- data.frame(
USUBJID = rep(c("SUBJ001", "SUBJ002", "SUBJ003"), each = 4),
VSTESTCD = rep(c("SYSBP", "DIABP"), times = 6),
# Each subject has both tests at both visits
VISIT = rep(rep(c("SCREENING", "WEEK1"), each = 2), times = 3),
VSSTRESN = c(120, 80, 130, 85, 125, 82, 118, 78, 135, 88, 140, 90)
)
Step 2: Define a Custom Function
# Coefficient of variation: standard deviation relative to the mean
cv <- function(x) {
if (mean(x) == 0) return(NA)
return(sd(x) / mean(x))
}
Step 3: Apply the Custom Function Using tapply()
tapply(vs$VSSTRESN, list(vs$VSTESTCD, vs$VISIT), cv)
Output (rounded to four decimal places):
SCREENING WEEK1
DIABP 0.0500 0.0715
SYSBP 0.0603 0.0852
Step 4: Convert to Data Frame (Optional)
cv_result <- as.data.frame(as.table(
tapply(vs$VSSTRESN, list(vs$VSTESTCD, vs$VISIT), cv)
))
colnames(cv_result) <- c("VSTESTCD", "VISIT", "CV")
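A custom function can also be passed to the other members of the family. The sketch below is a hypothetical variant, `cv_safe`, that tolerates missing values and returns NA_real_ (a numeric NA) so that it satisfies vapply()'s type check; the list `results` holds made-up values.

```r
# Hypothetical variant of the coefficient of variation that
# ignores missing values and always returns a numeric result
cv_safe <- function(x, na.rm = TRUE) {
  m <- mean(x, na.rm = na.rm)
  if (is.na(m) || m == 0) return(NA_real_)
  sd(x, na.rm = na.rm) / m
}

results <- list(
  SYSBP = c(120, 125, NA, 135),
  DIABP = c(80, 82, 88),
  ZERO  = c(0, 0, 0)
)

# One CV per list element; numeric(1) enforces a single number each
cv_by_test <- vapply(results, cv_safe, numeric(1))
```

Returning a plain NA (which is logical) would make vapply() stop with a type error, which is why the sketch uses NA_real_ instead.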