2.3.6. Summarizing Data
1. Introduction
Summarizing data is a key part of data wrangling and analysis. Whether for reporting, checking data quality, or exploring your dataset, R provides powerful tools to quickly summarize and understand your data. This chapter covers essential summary functions from dplyr, janitor, and skimr.
2. Why Summarize Data?
- Quickly understand the structure and content of your dataset.
- Check for missing values, duplicates, or outliers.
- Produce summary statistics for reports or presentations.
- Aggregate data for modeling or visualization.
- Validate results after data wrangling steps.
3. Summarizing with summarize()
- The summarize() (or summarise()) function from dplyr creates summary statistics for your data.
- Use it with or without group_by() for overall or grouped summaries.
Input Table:
| USUBJID | AGE | SEX | ARM |
|---|---|---|---|
| 01-701-101 | 34 | M | Placebo |
| 01-701-102 | 29 | F | Drug X |
| 01-701-103 | 67 | F | Placebo |
| 01-701-104 | 54 | M | Drug X |
R Code:
dm %>%
  summarize(N = n())
- summarize(N = n()) counts the number of rows in the dataset and returns it as column N.
Output Table:
| N |
|---|
| 4 |
- Explanation:
- Counts the number of rows in the dataset.
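To run the example end to end, the input table can be rebuilt by hand. The following is a minimal sketch, assuming `dm` is an ordinary data frame with the four rows shown above; the extra `n_subjects` column is an illustrative addition, not part of the original example.

```r
library(dplyr)

# Rebuild the four-row demographics table (values copied from the input table).
dm <- data.frame(
  USUBJID = c("01-701-101", "01-701-102", "01-701-103", "01-701-104"),
  AGE     = c(34, 29, 67, 54),
  SEX     = c("M", "F", "F", "M"),
  ARM     = c("Placebo", "Drug X", "Placebo", "Drug X"),
  stringsAsFactors = FALSE
)

# n() counts rows; n_distinct() counts unique values, which gives a quick
# check that each subject appears only once.
overall <- dm %>%
  summarize(N = n(), n_subjects = n_distinct(USUBJID))

print(overall)
```

If `N` and `n_subjects` differ, at least one USUBJID appears more than once.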
4. Grouped Summaries with group_by() and summarize()
- Use group_by() to summarize within categories.
R Code:
dm %>%
  group_by(ARM) %>%
  summarize(N = n())
- group_by(ARM) groups the data by the ARM column.
- summarize(N = n()) counts the number of rows in each group.
Output Table:
| ARM | N |
|---|---|
| Drug X | 2 |
| Placebo | 2 |
- Explanation:
- Counts the number of subjects in each treatment arm.
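Recent dplyr releases also accept a per-call grouping argument. The sketch below assumes dplyr 1.1.0 or later, where `summarize()` has a `.by` argument; it groups for that one call only and returns an ungrouped result. The `dm` data frame is rebuilt from the input table.

```r
library(dplyr)

dm <- data.frame(
  USUBJID = c("01-701-101", "01-701-102", "01-701-103", "01-701-104"),
  AGE     = c(34, 29, 67, 54),
  SEX     = c("M", "F", "F", "M"),
  ARM     = c("Placebo", "Drug X", "Placebo", "Drug X"),
  stringsAsFactors = FALSE
)

# Assumes dplyr >= 1.1.0: .by groups for this call only, so no ungroup() is needed.
by_arm <- dm %>%
  summarize(N = n(), .by = ARM)

print(by_arm)
```

Unlike group_by(), `.by` does not sort the grouping keys, so the row order may differ from the output table above.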
5. Summarizing with Other Functions
- You can use functions like mean(), median(), min(), max(), and sum() inside summarize().
R Code:
dm %>%
  group_by(ARM) %>%
  summarize(
    N = n(),
    mean_age = mean(AGE),
    min_age = min(AGE),
    max_age = max(AGE)
  )
- mean_age = mean(AGE) calculates the mean age for each group.
- min_age = min(AGE) and max_age = max(AGE) find the minimum and maximum age in each group.
Output Table:
| ARM | N | mean_age | min_age | max_age |
|---|---|---|---|---|
| Drug X | 2 | 41.5 | 29 | 54 |
| Placebo | 2 | 50.5 | 34 | 67 |
- Explanation:
- Provides count, mean, minimum, and maximum age for each treatment arm.
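These functions return NA as soon as a group contains a missing value. The following sketch uses a hypothetical copy of the data, `dm_na`, with one age blanked out, to show how `na.rm = TRUE` and an explicit missing-value count can be combined.

```r
library(dplyr)

dm <- data.frame(
  USUBJID = c("01-701-101", "01-701-102", "01-701-103", "01-701-104"),
  AGE     = c(34, 29, 67, 54),
  SEX     = c("M", "F", "F", "M"),
  ARM     = c("Placebo", "Drug X", "Placebo", "Drug X"),
  stringsAsFactors = FALSE
)

# Hypothetical variant of dm: subject 01-701-102 has a missing age.
dm_na <- dm
dm_na$AGE[2] <- NA

age_by_arm <- dm_na %>%
  group_by(ARM) %>%
  summarize(
    N         = n(),                      # rows, including those with missing AGE
    n_missing = sum(is.na(AGE)),          # how many ages are missing
    mean_age  = mean(AGE, na.rm = TRUE)   # mean over the non-missing ages only
  )

print(age_by_arm)
```

Reporting `n_missing` next to the mean makes it clear how many values the statistic is actually based on.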
6. Quick Counts with tally() and count()
- tally() counts rows or sums a numeric column.
- count() counts the rows for each unique value (or combination of values) of one or more columns.
R Code:
dm %>% tally()
- tally() returns the total number of rows in the dataset.
Output Table:
| n |
|---|
| 4 |
R Code:
dm %>% count(SEX)
- count(SEX) counts the number of occurrences for each unique value in the SEX column.
Output Table:
| SEX | n |
|---|---|
| F | 2 |
| M | 2 |
R Code:
dm %>% count(SEX, ARM)
- count(SEX, ARM) counts the number of occurrences for each combination of SEX and ARM.
Output Table:
| SEX | ARM | n |
|---|---|---|
| F | Drug X | 1 |
| F | Placebo | 1 |
| M | Drug X | 1 |
| M | Placebo | 1 |
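The "sums a numeric column" behavior comes from the `wt` argument. The sketch below, run on the same four-row `dm`, sums AGE with `tally()` and `count()`; summing ages is only an illustration of the mechanics, and the column name `total_age` is an arbitrary choice.

```r
library(dplyr)

dm <- data.frame(
  USUBJID = c("01-701-101", "01-701-102", "01-701-103", "01-701-104"),
  AGE     = c(34, 29, 67, 54),
  SEX     = c("M", "F", "F", "M"),
  ARM     = c("Placebo", "Drug X", "Placebo", "Drug X"),
  stringsAsFactors = FALSE
)

# With wt, tally() sums the column instead of counting rows.
total_age <- dm %>% tally(wt = AGE)

# count() accepts the same wt argument; name sets the output column name.
age_by_arm <- dm %>% count(ARM, wt = AGE, name = "total_age")

# sort = TRUE orders the result from the largest count to the smallest.
sex_sorted <- dm %>% count(SEX, sort = TRUE)

print(total_age)
print(age_by_arm)
```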
7. Adding Counts to Data with add_tally() and add_count()
- add_tally() adds a column with the total row count to each row.
- add_count() adds a column with the group count to each row.
R Code:
dm %>% add_tally()
- add_tally() adds a column n to every row, showing the total number of rows in the dataset.
Output Table:
| USUBJID | AGE | SEX | ARM | n |
|---|---|---|---|---|
| 01-701-101 | 34 | M | Placebo | 4 |
| 01-701-102 | 29 | F | Drug X | 4 |
| 01-701-103 | 67 | F | Placebo | 4 |
| 01-701-104 | 54 | M | Drug X | 4 |
R Code:
dm %>% add_count(ARM)
- add_count(ARM) adds a column n to every row, showing the count for that row's ARM group.
Output Table:
| USUBJID | AGE | SEX | ARM | n |
|---|---|---|---|---|
| 01-701-101 | 34 | M | Placebo | 2 |
| 01-701-102 | 29 | F | Drug X | 2 |
| 01-701-103 | 67 | F | Placebo | 2 |
| 01-701-104 | 54 | M | Drug X | 2 |
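One way to read add_count() is as shorthand for a grouped mutate(). The sketch below spells out that longer form and compares the resulting counts; it illustrates the idea and is not a description of how dplyr implements the function internally.

```r
library(dplyr)

dm <- data.frame(
  USUBJID = c("01-701-101", "01-701-102", "01-701-103", "01-701-104"),
  AGE     = c(34, 29, 67, 54),
  SEX     = c("M", "F", "F", "M"),
  ARM     = c("Placebo", "Drug X", "Placebo", "Drug X"),
  stringsAsFactors = FALSE
)

# Long form: group, attach the group size to every row, then ungroup.
manual <- dm %>%
  group_by(ARM) %>%
  mutate(n = n()) %>%
  ungroup()

# Short form.
shortcut <- dm %>% add_count(ARM)

# Both approaches attach the same per-row counts.
same_counts <- identical(manual$n, shortcut$n)
print(same_counts)
```

Because every row is kept, the new column can feed a filter directly, for example `filter(n > 1)` to keep only arms with more than one subject.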
8. Summarizing Categorical Data with tabyl()
- The tabyl() function from janitor quickly summarizes categorical variables.
R Code:
library(janitor)
dm %>% tabyl(ARM)
- tabyl(ARM) returns a frequency table for the ARM column, including counts and proportions.
Output Table:
| ARM | n | percent |
|---|---|---|
| Drug X | 2 | 0.5 |
| Placebo | 2 | 0.5 |
- Explanation:
- Shows counts and proportions for each category.
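tabyl() also accepts a second variable, which produces a cross-tabulation. The sketch below assumes the janitor package is installed and uses its adorn_totals() helper to append a totals row; the data are the same four rows as before.

```r
library(dplyr)
library(janitor)

dm <- data.frame(
  USUBJID = c("01-701-101", "01-701-102", "01-701-103", "01-701-104"),
  AGE     = c(34, 29, 67, 54),
  SEX     = c("M", "F", "F", "M"),
  ARM     = c("Placebo", "Drug X", "Placebo", "Drug X"),
  stringsAsFactors = FALSE
)

# Two-way table: one row per SEX, one column per ARM, plus a "Total" row.
cross <- dm %>%
  tabyl(SEX, ARM) %>%
  adorn_totals("row")

print(cross)
```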
9. Summarizing Numeric Data with summary() and skim()
- summary() (base R) gives the min, max, mean, median, and quartiles for numeric columns.
- skim() from skimr provides a detailed summary for all columns.
R Code:
summary(dm$AGE)
- summary(dm$AGE) returns min, 1st quartile, median, mean, 3rd quartile, and max for AGE.
Output Table:
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|
| 29 | 32.75 | 44 | 46 | 57.25 | 67 |
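The quartiles come from linear interpolation between the sorted values, which is the default method (type 7) of base R's quantile(). The sketch below reproduces the first and third quartiles of the four ages by hand; the variable names are illustrative.

```r
# The four ages from the input table.
age <- c(34, 29, 67, 54)

# Default quantile method (type 7), the same one summary() relies on.
q <- quantile(age, probs = c(0.25, 0.75))

# Manual check: with n = 4 sorted values, the quantile for probability p
# sits at position 1 + (n - 1) * p in the sorted vector.
sorted <- sort(age)                                   # 29 34 54 67
q1_manual <- sorted[1] + 0.75 * (sorted[2] - sorted[1])  # position 1.75
q3_manual <- sorted[3] + 0.25 * (sorted[4] - sorted[3])  # position 3.25

print(q)
print(c(q1_manual, q3_manual))
```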
R Code:
library(skimr)
skim(dm)
- skim(dm) provides a comprehensive summary for each variable, including missingness, mean, sd, and quantiles.
Output Table:
── Data Summary ────────────────────────
Values
Name dm
Number of rows 4
Number of columns 4
_______________________
Column type frequency:
character 3
numeric 1
________________________
Group variables None
── Variable type: character ──────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate min max empty n_unique whitespace
1 USUBJID 0 1 10 10 0 4 0
2 SEX 0 1 1 1 0 2 0
3 ARM 0 1 6 7 0 2 0
── Variable type: numeric ────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
1 AGE 0 1 46 17.7 29 32.8 44 57.2 67 ▇▁▁▃▃
- Explanation:
- skim() provides a comprehensive summary for each variable, including missingness and distribution.
10. Finding Duplicates with get_dupes()
- The get_dupes() function from janitor helps identify duplicate records.
Input Table:
| USUBJID | SEX | ARM |
|---|---|---|
| 01-701-101 | M | Placebo |
| 01-701-101 | M | Placebo |
| 01-701-102 | F | Drug X |
R Code:
library(janitor)
dm %>% get_dupes(USUBJID)
- get_dupes(USUBJID) finds rows where USUBJID is duplicated and adds a dupe_count column.
Output Table:
| USUBJID | dupe_count | SEX | ARM |
|---|---|---|---|
| 01-701-101 | 2 | M | Placebo |
| 01-701-101 | 2 | M | Placebo |
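If janitor is not available, a similar check can be written with dplyr alone. The following is a sketch under that assumption, using a hypothetical `dm_dup` data frame built from the three-row input table; it is an alternative technique, not what get_dupes() does internally.

```r
library(dplyr)

# Hypothetical copy of the three-row input table with a repeated subject.
dm_dup <- data.frame(
  USUBJID = c("01-701-101", "01-701-101", "01-701-102"),
  SEX     = c("M", "M", "F"),
  ARM     = c("Placebo", "Placebo", "Drug X"),
  stringsAsFactors = FALSE
)

# Keep only rows whose USUBJID occurs more than once.
dupes <- dm_dup %>%
  group_by(USUBJID) %>%
  filter(n() > 1) %>%
  ungroup()

# Drop rows that are exact copies of an earlier row.
deduped <- dm_dup %>% distinct()

print(dupes)
print(deduped)
```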
11. Exploring More: Advanced Summaries
Below are advanced ways to summarize your data, with detailed explanations, input, and output examples.
Multiple summaries at once with across()
You can use across() inside summarize() to apply several summary functions to one or more columns at once.
Input Table:
| USUBJID | AGE |
|---|---|
| 01-701-101 | 34 |
| 01-701-102 | 29 |
| 01-701-103 | 67 |
| 01-701-104 | 54 |
R Code:
dm %>%
  summarize(across(AGE, list(mean = mean, sd = sd, min = min, max = max)))
- across(AGE, list(mean = mean, sd = sd, min = min, max = max)) applies each function to the AGE column.
- The result is a single-row summary with columns for mean, sd, min, and max.
Output Table:
| AGE_mean | AGE_sd | AGE_min | AGE_max |
|---|---|---|---|
| 46 | 17.7 | 29 | 67 |
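The output column names follow a `{column}_{function}` pattern by default. The sketch below uses the `.names` argument of across() to flip that pattern; the naming scheme chosen here is only an example.

```r
library(dplyr)

dm <- data.frame(
  USUBJID = c("01-701-101", "01-701-102", "01-701-103", "01-701-104"),
  AGE     = c(34, 29, 67, 54),
  stringsAsFactors = FALSE
)

# .names is a glue-style template: {.fn} is the function name from the list,
# {.col} is the column being summarized.
age_stats <- dm %>%
  summarize(across(AGE, list(mean = mean, sd = sd), .names = "{.fn}_{.col}"))

print(age_stats)
```

Here sd() is the sample standard deviation (denominator n - 1).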
Summarize after grouping by multiple variables
You can group by more than one variable to get summaries for each combination.
Input Table:
| USUBJID | AGE | SEX | ARM |
|---|---|---|---|
| 01-701-101 | 34 | M | Placebo |
| 01-701-102 | 29 | F | Drug X |
| 01-701-103 | 67 | F | Placebo |
| 01-701-104 | 54 | M | Drug X |
R Code:
dm %>%
  group_by(SEX, ARM) %>%
  summarize(N = n(), mean_age = mean(AGE))
- group_by(SEX, ARM) groups the data by both SEX and ARM.
- summarize(N = n(), mean_age = mean(AGE)) gives the count and mean age for each group.
Output Table:
| SEX | ARM | N | mean_age |
|---|---|---|---|
| F | Drug X | 1 | 29 |
| F | Placebo | 1 | 67 |
| M | Drug X | 1 | 54 |
| M | Placebo | 1 | 34 |
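When grouping by several variables, summarize() by default removes only the last grouping level, so the result above is still grouped by SEX. A minimal sketch, assuming an ungrouped result is wanted, passes `.groups = "drop"`:

```r
library(dplyr)

dm <- data.frame(
  USUBJID = c("01-701-101", "01-701-102", "01-701-103", "01-701-104"),
  AGE     = c(34, 29, 67, 54),
  SEX     = c("M", "F", "F", "M"),
  ARM     = c("Placebo", "Drug X", "Placebo", "Drug X"),
  stringsAsFactors = FALSE
)

# .groups = "drop" returns a plain, ungrouped result and silences the
# message about the remaining grouping.
by_sex_arm <- dm %>%
  group_by(SEX, ARM) %>%
  summarize(N = n(), mean_age = mean(AGE), .groups = "drop")

print(by_sex_arm)
print(is_grouped_df(by_sex_arm))
```

Dropping the groups avoids surprises when a later mutate() or filter() would otherwise run within each SEX group.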
Summarize with custom functions
You can use any function that returns a single value inside summarize(), such as median().
Input Table:
| USUBJID | AGE |
|---|---|
| 01-701-101 | 34 |
| 01-701-102 | 29 |
| 01-701-103 | 67 |
| 01-701-104 | 54 |
R Code:
dm %>%
  summarize(median_age = median(AGE, na.rm = TRUE))
- Calculates the median age, ignoring missing values.
Output Table:
| median_age |
|---|
| 44 |
Summarize with weighted means
Sometimes you want to calculate a mean that gives different weights to different rows.
Input Table:
| USUBJID | AGE | WEIGHT |
|---|---|---|
| 01-701-101 | 34 | 1 |
| 01-701-102 | 29 | 2 |
| 01-701-103 | 67 | 1 |
| 01-701-104 | 54 | 1 |
R Code:
dm %>%
  summarize(weighted_mean = weighted.mean(AGE, w = WEIGHT))
- weighted.mean(AGE, w = WEIGHT) computes the mean of AGE, weighting each row by the WEIGHT column.
Output Table:
| weighted_mean |
|---|
| 42.6 |
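The weighted mean is the sum of value times weight divided by the sum of the weights: (34×1 + 29×2 + 67×1 + 54×1) / (1 + 2 + 1 + 1) = 213 / 5 = 42.6. The sketch below checks that arithmetic against weighted.mean(); the data frame name `dm_w` is hypothetical.

```r
library(dplyr)

# Hypothetical copy of the input table with the WEIGHT column.
dm_w <- data.frame(
  USUBJID = c("01-701-101", "01-701-102", "01-701-103", "01-701-104"),
  AGE     = c(34, 29, 67, 54),
  WEIGHT  = c(1, 2, 1, 1),
  stringsAsFactors = FALSE
)

res <- dm_w %>%
  summarize(
    builtin = weighted.mean(AGE, w = WEIGHT),      # base R helper
    manual  = sum(AGE * WEIGHT) / sum(WEIGHT)      # same formula written out
  )

print(res)
```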
Summarize with proportions
You can calculate the proportion of each group using count() and mutate().
Input Table:
| USUBJID | ARM |
|---|---|
| 01-701-101 | Placebo |
| 01-701-102 | Drug X |
| 01-701-103 | Placebo |
| 01-701-104 | Drug X |
R Code:
dm %>%
  count(ARM) %>%
  mutate(prop = n / sum(n))
- count(ARM) counts the number of rows for each ARM.
- mutate(prop = n / sum(n)) calculates the proportion for each ARM.
Output Table:
| ARM | n | prop |
|---|---|---|
| Drug X | 2 | 0.5 |
| Placebo | 2 | 0.5 |
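The same pattern extends to proportions within a group, for example the share of each sex inside each treatment arm. The sketch below shows one way to do this, by regrouping the counts before dividing:

```r
library(dplyr)

dm <- data.frame(
  USUBJID = c("01-701-101", "01-701-102", "01-701-103", "01-701-104"),
  SEX     = c("M", "F", "F", "M"),
  ARM     = c("Placebo", "Drug X", "Placebo", "Drug X"),
  stringsAsFactors = FALSE
)

# Count each ARM/SEX combination, then divide by the total within each ARM.
sex_within_arm <- dm %>%
  count(ARM, SEX) %>%
  group_by(ARM) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

print(sex_within_arm)
```

Because the division happens inside group_by(ARM), the proportions sum to 1 within each arm, not across the whole table.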
- Explanation for all above:
- These advanced summaries allow you to quickly get deeper insights into your data.
- You can combine grouping, multiple summary functions, custom logic, and proportions for flexible reporting and analysis.
- Input and output tables help you see exactly what each code block does.
12. Conclusion
- Summarizing data is essential for understanding, validating, and reporting your data.
- Use summarize(), count(), tabyl(), skim(), and related functions for flexible, powerful summaries.
- Combine with group_by() for grouped summaries.
- Explore advanced summaries for deeper insights and better data quality.