2.3.6. Summarizing Data

1. Introduction

Summarizing data is a key part of data wrangling and analysis. Whether you are preparing a report, checking data quality, or exploring a dataset, R provides tools to summarize data quickly. This chapter covers summary functions from dplyr, janitor, and skimr, along with base R's summary().


2. Why Summarize Data?

  • Quickly understand the structure and content of your dataset.
  • Check for missing values, duplicates, or outliers.
  • Produce summary statistics for reports or presentations.
  • Aggregate data for modeling or visualization.
  • Validate results after data wrangling steps.

3. Summarizing with summarize()

  • The summarize() (or summarise()) function from dplyr creates summary statistics for your data.
  • Use with or without group_by() for overall or grouped summaries.

Input Table:

USUBJID AGE SEX ARM
01-701-101 34 M Placebo
01-701-102 29 F Drug X
01-701-103 67 F Placebo
01-701-104 54 M Drug X

R Code:

library(dplyr)

dm %>%
  summarize(N = n())
  • summarize(N = n()) counts the number of rows in the dataset and returns it as column N.

Output Table:

N
4
  • Explanation:
    • Counts the number of rows in the dataset.
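
The examples in this chapter assume a data frame named dm already exists. As a minimal sketch, the input table above can be rebuilt by hand as shown here; the object name overall and the extra n_subjects column are illustrative additions, not part of the original example.

```r
library(dplyr)

# Rebuild the input table shown above as a data frame
dm <- data.frame(
  USUBJID = c("01-701-101", "01-701-102", "01-701-103", "01-701-104"),
  AGE     = c(34, 29, 67, 54),
  SEX     = c("M", "F", "F", "M"),
  ARM     = c("Placebo", "Drug X", "Placebo", "Drug X")
)

overall <- dm %>%
  summarize(
    N          = n(),                 # number of rows
    n_subjects = n_distinct(USUBJID)  # number of distinct subject IDs
  )

print(overall)
```

Comparing N with n_subjects is a quick way to check whether each subject appears only once.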

4. Grouped Summaries with group_by() and summarize()

  • Use group_by() to summarize within categories.

R Code:

dm %>%
  group_by(ARM) %>%
  summarize(N = n())
  • group_by(ARM) groups the data by the ARM column.
  • summarize(N = n()) counts the number of rows in each group.

Output Table:

ARM N
Drug X 2
Placebo 2
  • Explanation:
    • Counts the number of subjects in each treatment arm.

5. Summarizing with Other Functions

  • You can use functions like mean(), median(), min(), max(), and sum() inside summarize().

R Code:

dm %>%
  group_by(ARM) %>%
  summarize(
    N = n(),
    mean_age = mean(AGE),
    min_age = min(AGE),
    max_age = max(AGE)
  )
  • mean_age = mean(AGE) calculates the mean age for each group.
  • min_age = min(AGE) and max_age = max(AGE) find the minimum and maximum age in each group.

Output Table:

ARM N mean_age min_age max_age
Drug X 2 41.5 29 54
Placebo 2 50.5 34 67
  • Explanation:
    • Provides count, mean, minimum, and maximum age for each treatment arm.
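
Functions such as mean(), min(), and max() return NA when the column contains missing values. The sketch below uses a hypothetical variant of the data (dm_na, with one AGE set to NA) to show how na.rm = TRUE changes the result; the column names n_missing, mean_raw, and mean_age are chosen for illustration.

```r
library(dplyr)

# Hypothetical variant of dm: the last subject's AGE is missing
dm_na <- data.frame(
  ARM = c("Placebo", "Drug X", "Placebo", "Drug X"),
  AGE = c(34, 29, 67, NA)
)

by_arm <- dm_na %>%
  group_by(ARM) %>%
  summarize(
    N         = n(),                     # rows per arm, including missing AGE
    n_missing = sum(is.na(AGE)),         # how many AGE values are missing
    mean_raw  = mean(AGE),               # NA if any AGE in the group is missing
    mean_age  = mean(AGE, na.rm = TRUE)  # mean of the non-missing values only
  )

print(by_arm)
```

Reporting n_missing next to the statistic makes it clear how many values the mean is actually based on.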

6. Quick Counts with tally() and count()

  • tally() counts rows; with the wt argument it sums a numeric column instead.
  • count() counts the rows for each unique value (or combination of values) of one or more columns.

R Code:

dm %>% tally()
  • tally() returns the total number of rows in the dataset.

Output Table:

n
4

R Code:

dm %>% count(SEX)
  • count(SEX) counts the number of occurrences for each unique value in the SEX column.

Output Table:

SEX n
F 2
M 2

R Code:

dm %>% count(SEX, ARM)
  • count(SEX, ARM) counts the number of occurrences for each combination of SEX and ARM.

Output Table:

SEX ARM n
F Drug X 1
F Placebo 1
M Drug X 1
M Placebo 1
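
The summing behaviour of tally() and count() is controlled by the wt argument. The following sketch sums AGE purely to illustrate the mechanics (a sum of ages is rarely meaningful in practice); the object names total_age and age_by_arm are assumptions made for this example.

```r
library(dplyr)

dm <- data.frame(
  USUBJID = c("01-701-101", "01-701-102", "01-701-103", "01-701-104"),
  AGE     = c(34, 29, 67, 54),
  SEX     = c("M", "F", "F", "M"),
  ARM     = c("Placebo", "Drug X", "Placebo", "Drug X")
)

# With wt, n holds the sum of AGE instead of the row count
total_age <- dm %>% tally(wt = AGE)

# The same idea per group: n is the sum of AGE within each ARM
age_by_arm <- dm %>% count(ARM, wt = AGE)

print(total_age)
print(age_by_arm)
```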

7. Adding Counts to Data with add_tally() and add_count()

  • add_tally() adds a column with the total row count to each row.
  • add_count() adds a column with the group count to each row.

R Code:

dm %>% add_tally()
  • add_tally() adds a column n to every row, showing the total number of rows in the dataset.

Output Table:

USUBJID AGE SEX ARM n
01-701-101 34 M Placebo 4
01-701-102 29 F Drug X 4
01-701-103 67 F Placebo 4
01-701-104 54 M Drug X 4

R Code:

dm %>% add_count(ARM)
  • add_count(ARM) adds a column n to every row, showing the count for that row's ARM group.

Output Table:

USUBJID AGE SEX ARM n
01-701-101 34 M Placebo 2
01-701-102 29 F Drug X 2
01-701-103 67 F Placebo 2
01-701-104 54 M Drug X 2
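
One way to read add_count(ARM) is as shorthand for grouping, adding n() with mutate(), and ungrouping. The sketch below compares the two approaches; the object names with_add_count and with_mutate are illustrative.

```r
library(dplyr)

dm <- data.frame(
  USUBJID = c("01-701-101", "01-701-102", "01-701-103", "01-701-104"),
  AGE     = c(34, 29, 67, 54),
  SEX     = c("M", "F", "F", "M"),
  ARM     = c("Placebo", "Drug X", "Placebo", "Drug X")
)

# Shorthand: add the per-ARM count to every row
with_add_count <- dm %>% add_count(ARM)

# Longer equivalent: group, count with n(), then remove the grouping
with_mutate <- dm %>%
  group_by(ARM) %>%
  mutate(n = n()) %>%
  ungroup()

# Both approaches produce the same n column
print(identical(with_add_count$n, with_mutate$n))
```

Because the original rows are kept, the added n column can be used in a later filter(), for example to keep only groups above a minimum size.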

8. Summarizing Categorical Data with tabyl()

  • The tabyl() function from janitor quickly summarizes categorical variables.

R Code:

library(janitor)
dm %>% tabyl(ARM)
  • tabyl(ARM) returns a frequency table for the ARM column, including counts and proportions.

Output Table:

ARM n percent
Drug X 2 0.5
Placebo 2 0.5
  • Explanation:
    • Shows counts and proportions for each category.
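
tabyl() output can be passed to janitor's adorn_* helpers for presentation. As a sketch under the assumption that janitor is installed, the example below formats the proportions as percentages and builds a two-way table with a totals row; the object names arm_tab and sex_by_arm are illustrative.

```r
library(dplyr)
library(janitor)

dm <- data.frame(
  USUBJID = c("01-701-101", "01-701-102", "01-701-103", "01-701-104"),
  AGE     = c(34, 29, 67, 54),
  SEX     = c("M", "F", "F", "M"),
  ARM     = c("Placebo", "Drug X", "Placebo", "Drug X")
)

# One-way table with the percent column formatted as text, e.g. "50.0%"
arm_tab <- dm %>%
  tabyl(ARM) %>%
  adorn_pct_formatting(digits = 1)

# Two-way table (SEX in rows, ARM in columns) with a "Total" row appended
sex_by_arm <- dm %>%
  tabyl(SEX, ARM) %>%
  adorn_totals("row")

print(arm_tab)
print(sex_by_arm)
```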

9. Summarizing Numeric Data with summary() and skim()

  • summary() (base R) gives min, max, mean, quartiles for numeric columns.
  • skim() from skimr provides a detailed summary for all columns.

R Code:

summary(dm$AGE)
  • summary(dm$AGE) returns min, 1st quartile, median, mean, 3rd quartile, and max for AGE.

Output Table:

Min. 1st Qu. Median Mean 3rd Qu. Max.
29 32.75 44 46 57.25 67

R Code:

library(skimr)
skim(dm)
  • skim(dm) provides a comprehensive summary for each variable, including missingness, mean, sd, and quantiles.

Output Table:

── Data Summary ────────────────────────
                           Values
Name                       dm    
Number of rows             4     
Number of columns          4     
_______________________          
Column type frequency:           
  character                3     
  numeric                  1     
________________________         
Group variables            None  

── Variable type: character ──────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate min max empty n_unique whitespace
1 USUBJID               0             1  10  10     0        4          0
2 SEX                   0             1   1   1     0        2          0
3 ARM                   0             1   6   7     0        2          0

── Variable type: numeric ────────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate mean   sd p0  p25 p50  p75 p100 hist 
1 AGE                   0             1   46 17.7 29 32.8  44 57.2   67 ▇▁▁▃▃
  • Explanation:
    • skim() provides a comprehensive summary for each variable, including missingness and distribution.
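
The quartiles reported by summary() come from quantile() with its default method (type 7, linear interpolation between sorted values). The sketch below reproduces the first quartile of AGE by hand; the variable names age, h, and q1_manual are illustrative.

```r
# AGE values from the input table
age <- c(34, 29, 67, 54)

# Default quantile method (type = 7), the same one summary() uses
q <- quantile(age, probs = c(0.25, 0.5, 0.75))
print(q)

# Manual first quartile: interpolate at position h in the sorted values
sorted    <- sort(age)                   # 29 34 54 67
h         <- (length(age) - 1) * 0.25 + 1  # position 1.75
lower     <- sorted[floor(h)]
upper     <- sorted[ceiling(h)]
q1_manual <- lower + (h - floor(h)) * (upper - lower)
print(q1_manual)  # 32.75

# Mean and sample standard deviation, as reported by skim()
print(mean(age))
print(round(sd(age), 1))
```

The values 32.75 and 57.25 match the p25 and p75 columns in the skim() output, which displays them rounded.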

10. Finding Duplicates with get_dupes()

  • The get_dupes() function from janitor helps identify duplicate records.

Input Table:

USUBJID SEX ARM
01-701-101 M Placebo
01-701-101 M Placebo
01-701-102 F Drug X

R Code:

library(janitor)
dm %>% get_dupes(USUBJID)
  • get_dupes(USUBJID) finds rows where USUBJID is duplicated and adds a dupe_count column.

Output Table:

USUBJID dupe_count SEX ARM
01-701-101 2 M Placebo
01-701-101 2 M Placebo
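
If janitor is not available, a similar check can be sketched with dplyr alone: keep the rows whose key occurs more than once, then use distinct() to drop exact duplicate rows. The names dm_dup, dupes, and deduped are assumptions made for this example.

```r
library(dplyr)

# Input table from this section, with one duplicated subject
dm_dup <- data.frame(
  USUBJID = c("01-701-101", "01-701-101", "01-701-102"),
  SEX     = c("M", "M", "F"),
  ARM     = c("Placebo", "Placebo", "Drug X")
)

# Rows whose USUBJID appears more than once
dupes <- dm_dup %>%
  group_by(USUBJID) %>%
  filter(n() > 1) %>%
  ungroup()

# Remove rows that are identical across all columns
deduped <- dm_dup %>% distinct()

print(dupes)
print(deduped)
```

Note that distinct() removes only rows that match on every column; records that share a USUBJID but differ elsewhere need a manual review.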

11. Exploring More: Advanced Summaries

Below are advanced ways to summarize your data, with detailed explanations, input, and output examples.


  • Multiple summaries at once with across()

    You can use across() inside summarize() to apply several summary functions to one or more columns at once.

    Input Table:

    USUBJID AGE
    01-701-101 34
    01-701-102 29
    01-701-103 67
    01-701-104 54

    R Code:

    dm %>%
      summarize(across(AGE, list(mean = mean, sd = sd, min = min, max = max)))
    
    • across(AGE, list(mean = mean, sd = sd, min = min, max = max)) applies each function to the AGE column.
    • The result is a single-row summary with one column per function, named AGE_mean, AGE_sd, AGE_min, and AGE_max.

    Output Table:

    AGE_mean AGE_sd AGE_min AGE_max
    46 17.7 29 67
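
across() can also select columns by type and control the output names. The following sketch applies two functions to every numeric column and uses the .names argument to put the function name first; the object name num_summary is illustrative.

```r
library(dplyr)

dm <- data.frame(
  USUBJID = c("01-701-101", "01-701-102", "01-701-103", "01-701-104"),
  AGE     = c(34, 29, 67, 54)
)

num_summary <- dm %>%
  summarize(
    across(
      where(is.numeric),               # every numeric column (here only AGE)
      list(mean = mean, sd = sd),      # functions to apply
      .names = "{.fn}_{.col}"          # name pattern: mean_AGE, sd_AGE
    )
  )

print(num_summary)
```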

  • Summarize after grouping by multiple variables

    You can group by more than one variable to get summaries for each combination.

    Input Table:

    USUBJID AGE SEX ARM
    01-701-101 34 M Placebo
    01-701-102 29 F Drug X
    01-701-103 67 F Placebo
    01-701-104 54 M Drug X

    R Code:

    dm %>%
      group_by(SEX, ARM) %>%
      summarize(N = n(), mean_age = mean(AGE))
    
    • group_by(SEX, ARM) groups the data by both SEX and ARM.
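    • summarize(N = n(), mean_age = mean(AGE)) gives the count and mean age for each group. By default the result stays grouped by SEX, because summarize() drops only the last grouping variable.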

    Output Table:

    SEX ARM N mean_age
    F Drug X 1 29
    F Placebo 1 67
    M Drug X 1 54
    M Placebo 1 34
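
When the summary should come back with no grouping at all, the .groups argument of summarize() can be set explicitly. This sketch contrasts the default with .groups = "drop"; the object names default_result and ungrouped_result are illustrative.

```r
library(dplyr)

dm <- data.frame(
  USUBJID = c("01-701-101", "01-701-102", "01-701-103", "01-701-104"),
  AGE     = c(34, 29, 67, 54),
  SEX     = c("M", "F", "F", "M"),
  ARM     = c("Placebo", "Drug X", "Placebo", "Drug X")
)

# Default: only the last grouping variable (ARM) is dropped
default_result <- dm %>%
  group_by(SEX, ARM) %>%
  summarize(N = n(), mean_age = mean(AGE))

# Explicit: return a plain, ungrouped result
ungrouped_result <- dm %>%
  group_by(SEX, ARM) %>%
  summarize(N = n(), mean_age = mean(AGE), .groups = "drop")

print(group_vars(default_result))    # remaining grouping variables
print(group_vars(ungrouped_result))  # no grouping variables left
```

Dropping the groups avoids surprises when a later mutate() or filter() would otherwise be evaluated per group.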

  • Summarize with custom functions

    You can use any function that returns a single value per group inside summarize(), including built-ins such as median() and functions you write yourself.

    Input Table:

    USUBJID AGE
    01-701-101 34
    01-701-102 29
    01-701-103 67
    01-701-104 54

    R Code:

    dm %>%
      summarize(median_age = median(AGE, na.rm = TRUE))
    
    • Calculates the median age, ignoring missing values.

    Output Table:

    median_age
    44
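
A user-defined function works the same way as a built-in one, provided it returns a single value. The sketch below defines a hypothetical helper, range_width(), that is not part of any package and is used here only for illustration.

```r
library(dplyr)

dm <- data.frame(
  USUBJID = c("01-701-101", "01-701-102", "01-701-103", "01-701-104"),
  AGE     = c(34, 29, 67, 54)
)

# Hypothetical helper: distance between the largest and smallest value
range_width <- function(x) {
  max(x, na.rm = TRUE) - min(x, na.rm = TRUE)
}

custom_summary <- dm %>%
  summarize(
    median_age = median(AGE, na.rm = TRUE),
    age_span   = range_width(AGE)   # custom function used like any other
  )

print(custom_summary)
```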

  • Summarize with weighted means

    Sometimes you want to calculate a mean that gives different weights to different rows.

    Input Table:

    USUBJID AGE WEIGHT
    01-701-101 34 1
    01-701-102 29 2
    01-701-103 67 1
    01-701-104 54 1

    R Code:

    dm %>%
      summarize(weighted_mean = weighted.mean(AGE, w = WEIGHT))
    
    • weighted.mean(AGE, w = WEIGHT) computes the mean of AGE, weighting each row by the WEIGHT column.

    Output Table:

    weighted_mean
    42.6
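
The weighted mean is the sum of each value times its weight, divided by the sum of the weights. As a quick check, the sketch below computes it both with weighted.mean() and by hand; the column name manual is illustrative.

```r
library(dplyr)

# Input table from this example, including the WEIGHT column
dm <- data.frame(
  USUBJID = c("01-701-101", "01-701-102", "01-701-103", "01-701-104"),
  AGE     = c(34, 29, 67, 54),
  WEIGHT  = c(1, 2, 1, 1)
)

wm <- dm %>%
  summarize(
    weighted_mean = weighted.mean(AGE, w = WEIGHT),
    manual        = sum(AGE * WEIGHT) / sum(WEIGHT)  # (34 + 58 + 67 + 54) / 5
  )

print(wm)
```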

  • Summarize with proportions

    You can calculate the proportion of each group using count() and mutate().

    Input Table:

    USUBJID ARM
    01-701-101 Placebo
    01-701-102 Drug X
    01-701-103 Placebo
    01-701-104 Drug X

    R Code:

    dm %>%
      count(ARM) %>%
      mutate(prop = n / sum(n))
    
    • count(ARM) counts the number of rows for each ARM.
    • mutate(prop = n / sum(n)) calculates the proportion for each ARM.

    Output Table:

    ARM n prop
    Drug X 2 0.5
    Placebo 2 0.5
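
The same pattern extends to proportions within a subgroup: count the combinations, then group before dividing. This sketch assumes the full dm table with a SEX column (the input table above shows only USUBJID and ARM) and computes the share of each ARM within each SEX; the object name prop_within_sex is illustrative.

```r
library(dplyr)

dm <- data.frame(
  USUBJID = c("01-701-101", "01-701-102", "01-701-103", "01-701-104"),
  SEX     = c("M", "F", "F", "M"),
  ARM     = c("Placebo", "Drug X", "Placebo", "Drug X")
)

prop_within_sex <- dm %>%
  count(SEX, ARM) %>%            # rows per SEX/ARM combination
  group_by(SEX) %>%              # denominators are computed per SEX
  mutate(prop = n / sum(n)) %>%  # share of each ARM within its SEX
  ungroup()

print(prop_within_sex)
```

Without the group_by(SEX) step, sum(n) would be the overall total and each proportion would be relative to the whole dataset.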

  • Explanation for all above:
    • These advanced summaries allow you to quickly get deeper insights into your data.
    • You can combine grouping, multiple summary functions, custom logic, and proportions for flexible reporting and analysis.
    • Input and output tables help you see exactly what each code block does.

12. Conclusion

  • Summarizing data is essential for understanding, validating, and reporting your data.
  • Use summarize(), count(), tabyl(), skim(), and related functions for flexible, powerful summaries.
  • Combine with group_by() for grouped summaries.
  • Explore advanced summaries for deeper insights and better data quality.
