2.3.6. Summarizing Data

1. Introduction

Summarizing data is a key part of data wrangling and analysis. Whether you are reporting, checking data quality, or exploring a dataset, R provides tools to summarize and understand your data quickly. This chapter covers essential summary functions from dplyr, janitor, and skimr, along with base R's summary().

2. Why Summarize Data?

- Quickly understand the structure and content of your dataset.
- Check for missing values, duplicates, or outliers.
- Produce summary statistics for reports or presentations.
- Aggregate data for modeling or visualization.
- Validate results after data wrangling steps.

3. Summarizing with summarize()

The summarize() (or summarise()) function from dplyr collapses a data frame into summary statistics.
Use it without group_by() for a single overall row, or after group_by() for one row per group.

Input Table:

USUBJID	AGE	SEX	ARM
01-701-101	34	M	Placebo
01-701-102	29	F	Drug X
01-701-103	67	F	Placebo
01-701-104	54	M	Drug X

R Code:

library(dplyr)

dm %>%
  summarize(N = n())

summarize(N = n()) counts the number of rows in the dataset and returns it as column N.

Output Table:

N
4

Explanation:
- Counts the number of rows in the dataset.
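
The example above can be reproduced end to end with the following sketch. It assumes dm is an ordinary data frame typed in by hand from the input table; in practice dm would usually be read from a file. The name overall is used here only for illustration.

```r
library(dplyr)

# Demographics data matching the input table (built by hand for illustration)
dm <- data.frame(
  USUBJID = c("01-701-101", "01-701-102", "01-701-103", "01-701-104"),
  AGE     = c(34, 29, 67, 54),
  SEX     = c("M", "F", "F", "M"),
  ARM     = c("Placebo", "Drug X", "Placebo", "Drug X"),
  stringsAsFactors = FALSE
)

# One-row summary holding the total number of records
overall <- dm %>%
  summarize(N = n())

overall
```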

4. Grouped Summaries with group_by() and summarize()

Use group_by() to summarize within categories.

R Code:

dm %>%
  group_by(ARM) %>%
  summarize(N = n())

group_by(ARM) groups the data by the ARM column.
summarize(N = n()) counts the number of rows in each group.

Output Table:

ARM	N
Drug X	2
Placebo	2

Explanation:
- Counts the number of subjects in each treatment arm.

5. Summarizing with Other Functions

You can use functions like mean(), median(), min(), max(), and sum() inside summarize().

R Code:

dm %>%
  group_by(ARM) %>%
  summarize(
    N = n(),
    mean_age = mean(AGE),
    min_age = min(AGE),
    max_age = max(AGE)
  )

mean_age = mean(AGE) calculates the mean age for each group.
min_age = min(AGE) and max_age = max(AGE) find the minimum and maximum age in each group.

Output Table:

ARM	N	mean_age	min_age	max_age
Drug X	2	41.5	29	54
Placebo	2	50.5	34	67

Explanation:
- Provides count, mean, minimum, and maximum age for each treatment arm.
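
Functions such as mean(), min(), and max() return NA when the column contains a missing value, unless na.rm = TRUE is passed. The sketch below assumes a hypothetical copy of the data (dm_na) in which one AGE has been set to NA purely for illustration, and counts the missing values alongside the summaries.

```r
library(dplyr)

# Hypothetical variant of dm: the second subject's AGE is missing
dm_na <- data.frame(
  USUBJID = c("01-701-101", "01-701-102", "01-701-103", "01-701-104"),
  AGE     = c(34, NA, 67, 54),
  ARM     = c("Placebo", "Drug X", "Placebo", "Drug X"),
  stringsAsFactors = FALSE
)

age_summary <- dm_na %>%
  group_by(ARM) %>%
  summarize(
    N         = n(),                      # rows per arm, missing or not
    n_missing = sum(is.na(AGE)),          # how many AGE values are NA
    mean_age  = mean(AGE, na.rm = TRUE),  # mean of the non-missing ages
    mean_raw  = mean(AGE)                 # NA if any AGE in the arm is NA
  )

age_summary
```

Reporting n_missing next to the statistic makes it clear how many records each mean is actually based on.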

6. Quick Counts with tally() and count()

tally() counts rows, or sums a numeric column when given the wt argument.
count() counts the rows for each unique value (or combination of values) of one or more columns.

R Code:

dm %>% tally()

tally() returns the total number of rows in the dataset.

Output Table:

n
4

R Code:

dm %>% count(SEX)

count(SEX) counts the number of occurrences for each unique value in the SEX column.

Output Table:

SEX	n
F	2
M	2

R Code:

dm %>% count(SEX, ARM)

count(SEX, ARM) counts the number of occurrences for each combination of SEX and ARM.

Output Table:

SEX	ARM	n
F	Drug X	1
F	Placebo	1
M	Drug X	1
M	Placebo	1
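
Both functions also accept a wt argument, which sums a column instead of counting rows, and count() can sort its result and rename the count column. A small sketch, assuming the same four-row dm as above; the column name total_age is chosen here only for illustration.

```r
library(dplyr)

dm <- data.frame(
  USUBJID = c("01-701-101", "01-701-102", "01-701-103", "01-701-104"),
  AGE     = c(34, 29, 67, 54),
  SEX     = c("M", "F", "F", "M"),
  ARM     = c("Placebo", "Drug X", "Placebo", "Drug X"),
  stringsAsFactors = FALSE
)

# Sum of AGE over all rows (instead of a row count)
age_total <- dm %>% tally(wt = AGE)

# Sum of AGE within each arm, stored in a column called total_age
age_by_arm <- dm %>% count(ARM, wt = AGE, name = "total_age")

# Row counts per arm, largest group first
arm_sorted <- dm %>% count(ARM, sort = TRUE)

age_total
age_by_arm
arm_sorted
```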

7. Adding Counts to Data with add_tally() and add_count()

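add_tally() adds a column with the total row count to each row (or the group size, if the data is already grouped).
add_count() adds a column with the count for the given variable(s) to each row, without collapsing the data.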

R Code:

dm %>% add_tally()

add_tally() adds a column n to every row, showing the total number of rows in the dataset.

Output Table:

USUBJID	AGE	SEX	ARM	n
01-701-101	34	M	Placebo	4
01-701-102	29	F	Drug X	4
01-701-103	67	F	Placebo	4
01-701-104	54	M	Drug X	4

R Code:

dm %>% add_count(ARM)

add_count(ARM) adds a column n to every row, showing the count for that row's ARM group.

Output Table:

USUBJID	AGE	SEX	ARM	n
01-701-101	34	M	Placebo	2
01-701-102	29	F	Drug X	2
01-701-103	67	F	Placebo	2
01-701-104	54	M	Drug X	2
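
One way to think about add_count(ARM) is as shorthand for grouping, adding n() with mutate(), and ungrouping again. The sketch below compares the two on the same dm, and then uses the added count to flag small cells; the names n_arm, n_cell, and small_cells are illustrative only.

```r
library(dplyr)

dm <- data.frame(
  USUBJID = c("01-701-101", "01-701-102", "01-701-103", "01-701-104"),
  AGE     = c(34, 29, 67, 54),
  SEX     = c("M", "F", "F", "M"),
  ARM     = c("Placebo", "Drug X", "Placebo", "Drug X"),
  stringsAsFactors = FALSE
)

# Count per ARM attached to every row
with_counts <- dm %>% add_count(ARM, name = "n_arm")

# The same result written out with group_by() and mutate()
manual <- dm %>%
  group_by(ARM) %>%
  mutate(n_arm = n()) %>%
  ungroup()

# Keep the rows that belong to SEX-by-ARM cells with fewer than 2 subjects
small_cells <- dm %>%
  add_count(SEX, ARM, name = "n_cell") %>%
  filter(n_cell < 2)

with_counts
small_cells
```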

8. Summarizing Categorical Data with tabyl()

The tabyl() function from janitor quickly summarizes categorical variables.

R Code:

library(janitor)
dm %>% tabyl(ARM)

tabyl(ARM) returns a frequency table for the ARM column, including counts and proportions.

Output Table:

ARM	n	percent
Drug X	2	0.5
Placebo	2	0.5

Explanation:
- Shows counts and proportions for each category.
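
tabyl() can also cross-tabulate two variables, and janitor's adorn_* helpers add totals or format the proportions. A minimal sketch, assuming the same dm as in the earlier sections and that janitor is installed:

```r
library(dplyr)
library(janitor)

dm <- data.frame(
  USUBJID = c("01-701-101", "01-701-102", "01-701-103", "01-701-104"),
  AGE     = c(34, 29, 67, 54),
  SEX     = c("M", "F", "F", "M"),
  ARM     = c("Placebo", "Drug X", "Placebo", "Drug X"),
  stringsAsFactors = FALSE
)

# Two-way table of SEX by ARM with a totals row appended
arm_by_sex <- dm %>%
  tabyl(SEX, ARM) %>%
  adorn_totals("row")

# One-way table with the proportions shown as formatted percentages
arm_pct <- dm %>%
  tabyl(ARM) %>%
  adorn_pct_formatting(digits = 1)

arm_by_sex
arm_pct
```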

9. Summarizing Numeric Data with summary() and skim()

summary() (base R) gives the minimum, first quartile, median, mean, third quartile, and maximum for numeric columns.
skim() from skimr provides a detailed summary for all columns.

R Code:

summary(dm$AGE)

summary(dm$AGE) returns min, 1st quartile, median, mean, 3rd quartile, and max for AGE.

Output Table:

Min.	1st Qu.	Median	Mean	3rd Qu.	Max.
29	32.75	44	46	57.25	67
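
The quartiles reported by summary() come from quantile() with its default interpolation rule (type 7), which interpolates linearly between the two nearest sorted values. The sketch below works through the first quartile by hand for the four ages; the variable names are illustrative only.

```r
age <- c(34, 29, 67, 54)

# Quartiles as computed by quantile() with its default type = 7
q <- quantile(age, probs = c(0.25, 0.5, 0.75))

# Manual version of the type-7 rule for the first quartile
sorted <- sort(age)                        # 29 34 54 67
h      <- (length(sorted) - 1) * 0.25 + 1  # fractional position in the sorted values
lower  <- floor(h)                         # index of the value just below that position
q1_manual <- sorted[lower] +
  (h - lower) * (sorted[lower + 1] - sorted[lower])

q
q1_manual
```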

R Code:

library(skimr)
skim(dm)

skim(dm) provides a comprehensive summary for each variable, including missingness, mean, sd, and quantiles.

Output Table:

── Data Summary ────────────────────────
                           Values
Name                       dm    
Number of rows             4     
Number of columns          4     
_______________________          
Column type frequency:           
  character                3     
  numeric                  1     
________________________         
Group variables            None  

── Variable type: character ──────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate min max empty n_unique whitespace
1 USUBJID               0             1  10  10     0        4          0
2 SEX                   0             1   1   1     0        2          0
3 ARM                   0             1   6   7     0        2          0

── Variable type: numeric ────────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate mean   sd p0  p25 p50  p75 p100 hist 
1 AGE                   0             1   46 17.7 29 32.8  44 57.2   67 ▇▁▁▃▃

Explanation:
- Character columns report missingness, string lengths, and the number of unique values; numeric columns report missingness, mean, sd, quantiles, and a small inline histogram.
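
When skimr is not available, the missingness and uniqueness parts of this summary can be approximated with dplyr alone. This is only a rough sketch of that idea, not a replacement for skim(); it assumes the same dm as above.

```r
library(dplyr)

dm <- data.frame(
  USUBJID = c("01-701-101", "01-701-102", "01-701-103", "01-701-104"),
  AGE     = c(34, 29, 67, 54),
  SEX     = c("M", "F", "F", "M"),
  ARM     = c("Placebo", "Drug X", "Placebo", "Drug X"),
  stringsAsFactors = FALSE
)

# Number of missing values in every column
missing_counts <- dm %>%
  summarize(across(everything(), ~ sum(is.na(.x))))

# Number of distinct values in every column
unique_counts <- dm %>%
  summarize(across(everything(), n_distinct))

missing_counts
unique_counts
```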

10. Finding Duplicates with get_dupes()

The get_dupes() function from janitor helps identify duplicate records.

Input Table:

USUBJID	SEX	ARM
01-701-101	M	Placebo
01-701-101	M	Placebo
01-701-102	F	Drug X

R Code:

library(janitor)
dm %>% get_dupes(USUBJID)

get_dupes(USUBJID) finds rows where USUBJID is duplicated and adds a dupe_count column.

Output Table:

USUBJID	dupe_count	SEX	ARM
01-701-101	2	M	Placebo
01-701-101	2	M	Placebo
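
The same check can be cross-validated with dplyr, and fully identical rows can then be dropped with distinct(). A sketch, assuming the three-row input table above is stored in a data frame called dm_dup (a name used here only for illustration):

```r
library(dplyr)
library(janitor)

# Input table from this section, with one record entered twice
dm_dup <- data.frame(
  USUBJID = c("01-701-101", "01-701-101", "01-701-102"),
  SEX     = c("M", "M", "F"),
  ARM     = c("Placebo", "Placebo", "Drug X"),
  stringsAsFactors = FALSE
)

# Rows whose USUBJID appears more than once, with a dupe_count column
dupes <- dm_dup %>% get_dupes(USUBJID)

# dplyr-only cross-check: subject IDs that occur more than once
dupe_ids <- dm_dup %>%
  count(USUBJID) %>%
  filter(n > 1)

# Drop rows that are identical in every column
dm_unique <- dm_dup %>% distinct()

dupes
dupe_ids
dm_unique
```

Note that distinct() only removes rows that match in every column; records that share a USUBJID but differ elsewhere need to be reviewed before deciding which one to keep.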

11. Exploring More: Advanced Summaries

Below are advanced ways to summarize your data, with detailed explanations, input, and output examples.

Multiple summaries at once with across()

You can use across() inside summarize() to apply several summary functions to one or more columns at once.

Input Table:

USUBJID	AGE
01-701-101	34
01-701-102	29
01-701-103	67
01-701-104	54

R Code:
```
dm %>%
  summarize(across(AGE, list(mean = mean, sd = sd, min = min, max = max)))
```
- across(AGE, list(mean = mean, sd = sd, min = min, max = max)) applies each function to the AGE column.
- The result is a single-row summary with one column per function, named AGE_mean, AGE_sd, AGE_min, and AGE_max.

Output Table:

AGE_mean	AGE_sd	AGE_min	AGE_max
46	17.7	29	67
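
across() can also select columns by type and pass extra arguments such as na.rm = TRUE through a formula. The sketch below is one possible variation on the example above, assuming dm holds only the USUBJID and AGE columns shown in the input table; the .names pattern spells out the default naming scheme.

```r
library(dplyr)

dm <- data.frame(
  USUBJID = c("01-701-101", "01-701-102", "01-701-103", "01-701-104"),
  AGE     = c(34, 29, 67, 54),
  stringsAsFactors = FALSE
)

age_stats <- dm %>%
  summarize(across(
    where(is.numeric),                       # every numeric column (here: AGE)
    list(
      mean = ~ mean(.x, na.rm = TRUE),       # ignore missing values
      sd   = ~ sd(.x, na.rm = TRUE)
    ),
    .names = "{.col}_{.fn}"                  # e.g. AGE_mean, AGE_sd
  ))

age_stats
```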

Summarize after grouping by multiple variables

You can group by more than one variable to get summaries for each combination.

Input Table:

USUBJID	AGE	SEX	ARM
01-701-101	34	M	Placebo
01-701-102	29	F	Drug X
01-701-103	67	F	Placebo
01-701-104	54	M	Drug X

R Code:
```
dm %>%
  group_by(SEX, ARM) %>%
  summarize(N = n(), mean_age = mean(AGE))
```
- group_by(SEX, ARM) groups the data by both SEX and ARM.
- summarize(N = n(), mean_age = mean(AGE)) gives the count and mean age for each group.

Output Table:

SEX	ARM	N	mean_age
F	Drug X	1	29
F	Placebo	1	67
M	Drug X	1	54
M	Placebo	1	34
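
After grouping by several variables, summarize() by default removes only the last grouping level, so the result above is still grouped by SEX (and dplyr prints a message saying so). Passing .groups = "drop" returns an ungrouped table instead. A short sketch of the difference, using the same dm:

```r
library(dplyr)

dm <- data.frame(
  USUBJID = c("01-701-101", "01-701-102", "01-701-103", "01-701-104"),
  AGE     = c(34, 29, 67, 54),
  SEX     = c("M", "F", "F", "M"),
  ARM     = c("Placebo", "Drug X", "Placebo", "Drug X"),
  stringsAsFactors = FALSE
)

# Default behaviour: the result stays grouped by the first variable (SEX)
default_groups <- dm %>%
  group_by(SEX, ARM) %>%
  summarize(N = n(), mean_age = mean(AGE))

# Explicitly drop all grouping from the result
by_sex_arm <- dm %>%
  group_by(SEX, ARM) %>%
  summarize(N = n(), mean_age = mean(AGE), .groups = "drop")

group_vars(default_groups)
group_vars(by_sex_arm)
```

Dropping the groups avoids surprises when later steps such as mutate() or filter() would otherwise run separately within each SEX.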

Summarize with custom functions

You can use any function that returns a single value inside summarize(), such as median().

Input Table:

USUBJID	AGE
01-701-101	34
01-701-102	29
01-701-103	67
01-701-104	54

R Code:
```
dm %>%
  summarize(median_age = median(AGE, na.rm = TRUE))
```
- Calculates the median age, ignoring missing values.

Output Table:

median_age
44
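
The same pattern works for functions you write yourself. In the sketch below, cv() is a hypothetical helper (not part of dplyr or base R) that returns the coefficient of variation; it is used alongside the built-in median() and IQR().

```r
library(dplyr)

dm <- data.frame(
  USUBJID = c("01-701-101", "01-701-102", "01-701-103", "01-701-104"),
  AGE     = c(34, 29, 67, 54),
  stringsAsFactors = FALSE
)

# Hypothetical helper: coefficient of variation (sd divided by mean)
cv <- function(x, na.rm = TRUE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}

age_spread <- dm %>%
  summarize(
    median_age = median(AGE, na.rm = TRUE),
    iqr_age    = IQR(AGE, na.rm = TRUE),   # third quartile minus first quartile
    cv_age     = cv(AGE)                   # user-defined summary function
  )

age_spread
```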

Summarize with weighted means

Sometimes you want to calculate a mean that gives different weights to different rows.

Input Table:

USUBJID	AGE	WEIGHT
01-701-101	34	1
01-701-102	29	2
01-701-103	67	1
01-701-104	54	1

R Code:
```
dm %>%
  summarize(weighted_mean = weighted.mean(AGE, w = WEIGHT))
```
- weighted.mean(AGE, w = WEIGHT) computes the mean of AGE, weighting each row by the WEIGHT column.

Output Table:

weighted_mean
42.6
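
The weighted mean is the sum of AGE times WEIGHT divided by the sum of the weights: (34×1 + 29×2 + 67×1 + 54×1) / (1 + 2 + 1 + 1) = 213 / 5 = 42.6. The sketch below checks that arithmetic against weighted.mean(), using the input table above; the variable names manual and via_fn are illustrative.

```r
library(dplyr)

dm <- data.frame(
  USUBJID = c("01-701-101", "01-701-102", "01-701-103", "01-701-104"),
  AGE     = c(34, 29, 67, 54),
  WEIGHT  = c(1, 2, 1, 1),
  stringsAsFactors = FALSE
)

# Weighted mean written out by hand
manual <- sum(dm$AGE * dm$WEIGHT) / sum(dm$WEIGHT)

# Same calculation through summarize() and weighted.mean()
via_fn <- dm %>%
  summarize(weighted_mean = weighted.mean(AGE, w = WEIGHT))

manual
via_fn
```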

Summarize with proportions

You can calculate the proportion of each group using count() and mutate().

Input Table:

USUBJID	ARM
01-701-101	Placebo
01-701-102	Drug X
01-701-103	Placebo
01-701-104	Drug X

R Code:
```
dm %>%
  count(ARM) %>%
  mutate(prop = n / sum(n))
```
- count(ARM) counts the number of rows for each ARM.
- mutate(prop = n / sum(n)) calculates the proportion for each ARM.

Output Table:

ARM	n	prop
Drug X	2	0.5
Placebo	2	0.5
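
To get proportions within a subgroup rather than over the whole dataset, group the counts before dividing. This sketch assumes dm also carries the SEX column from the earlier sections and computes the share of each ARM within each SEX:

```r
library(dplyr)

dm <- data.frame(
  USUBJID = c("01-701-101", "01-701-102", "01-701-103", "01-701-104"),
  SEX     = c("M", "F", "F", "M"),
  ARM     = c("Placebo", "Drug X", "Placebo", "Drug X"),
  stringsAsFactors = FALSE
)

within_sex <- dm %>%
  count(SEX, ARM) %>%            # one row per SEX-by-ARM combination
  group_by(SEX) %>%              # sum(n) is now taken within each SEX
  mutate(prop = n / sum(n)) %>%
  ungroup()

within_sex
```

Within each SEX the prop values add up to 1, whereas the ungrouped version above adds up to 1 across the whole table.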

Explanation for all above:
- These advanced summaries allow you to quickly get deeper insights into your data.
- You can combine grouping, multiple summary functions, custom logic, and proportions for flexible reporting and analysis.
- Input and output tables help you see exactly what each code block does.

12. Conclusion

Summarizing data is essential for understanding, validating, and reporting your data.
Use summarize(), count(), tabyl(), skim(), and related functions for flexible, powerful summaries.
Combine with group_by() for grouped summaries.
Explore advanced summaries for deeper insights and better data quality.

