1.5. SAS PROCEDURES -> Equivalent in R

1.5.3. PROC MEANS - Compute Descriptive Statistics in SAS and R

**Example Input Table:** `scores`

STUDENT  SUBJECT  SCORE
A001     Math     78
A002     Math     85
A003     Math     91
A004     Math     62
A005     Math     74
A006     Science  83
A007     Science  77
A008     Science  65
A009     Science  90
A010     Science  NA
A011     History  69
A012     History  74
A013     History  80
A014     History  85
A015     History  77

**Goal**

Compute descriptive statistics for SCORE overall and within each SUBJECT:

  • Count of non-missing values (n)

  • Mean, standard deviation

  • 25th and 75th percentiles (q1, q3)

  • Number of missing values


1. Overall Summary

SAS Code

proc means data=scores noprint;
  var score;
  output n=n mean=mean std=sd q1=q1 q3=q3 nmiss=nmissing out=overall_stats;
run;

Explanation:

  • proc means: Starts the summary procedure.

  • noprint: Suppresses printing the output to the Results window.

  • var score;: Specifies the variable to summarize.

  • output ... out=...: Writes the requested statistics (n, mean, std, q1, q3, nmiss) to a new dataset named overall_stats. The N statistic counts non-missing values only.

R Code

# Dummy data for execution
library(dplyr)
scores <- tibble::tibble(
  STUDENT = c("A001","A002","A003","A004","A005","A006","A007","A008","A009","A010","A011","A012","A013","A014","A015"),
  SUBJECT = c("Math","Math","Math","Math","Math","Science","Science","Science","Science","Science","History","History","History","History","History"),
  SCORE = c(78,85,91,62,74,83,77,65,90,NA,69,74,80,85,77)
)
overall_stats <- scores %>%
  dplyr::summarize(
    n = sum(!is.na(SCORE)),                    # non-missing count, like SAS N
    mean = mean(SCORE, na.rm = TRUE),
    sd = sd(SCORE, na.rm = TRUE),
    q1 = quantile(SCORE, 0.25, na.rm = TRUE),
    q3 = quantile(SCORE, 0.75, na.rm = TRUE),
    nmissing = sum(is.na(SCORE))               # missing count, like SAS NMISS
  )

Explanation:

  • summarize(): Reduces the entire data frame to a single-row summary.

  • sum(!is.na(SCORE)): Counts non-missing values, matching the SAS N statistic. Using n() here would count all 15 rows, including the row with a missing score.

  • mean(), sd(): Mean and standard deviation; na.rm = TRUE excludes NAs.

  • quantile(..., 0.25 / 0.75): Computes the 25th and 75th percentiles.

  • sum(is.na()): Counts how many values are missing in the column.
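The difference between counting rows and counting non-missing values is easy to check. The following is a minimal sketch using the same SCORE values; the name `counts` and its column names are only for illustration.

```r
library(dplyr)

# Same SCORE values as the example table (one missing value)
scores <- tibble::tibble(
  SCORE = c(78, 85, 91, 62, 74, 83, 77, 65, 90, NA, 69, 74, 80, 85, 77)
)

counts <- scores %>%
  summarize(
    n_rows    = n(),                 # all rows, including the NA row
    n_nonmiss = sum(!is.na(SCORE)),  # non-missing values only (like SAS N)
    nmissing  = sum(is.na(SCORE))    # missing values (like SAS NMISS)
  )

counts$n_rows     # 15
counts$n_nonmiss  # 14
counts$nmissing   # 1
```

Only the non-missing count lines up with what PROC MEANS reports as N.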

Output Table

n   mean   sd    q1    q3    nmissing
14  77.86  8.68  74.0  84.5  1
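Quartiles from SAS and R can differ slightly because the two use different default percentile definitions. R's quantile() defaults to type = 7, while PROC MEANS defaults to QNTLDEF=5, which is generally taken to correspond to type = 2 in R. The sketch below compares the two on the non-missing SCORE values; treat the type = 2 mapping as an assumption to confirm against your own SAS output.

```r
# SCORE values from the example table, with the missing value removed
score <- c(78, 85, 91, 62, 74, 83, 77, 65, 90, 69, 74, 80, 85, 77)

# R's default definition (type = 7) interpolates between order statistics
q_default <- quantile(score, c(0.25, 0.75), names = FALSE)

# type = 2 averages at discontinuities; assumed here to match the
# SAS default percentile definition (QNTLDEF=5)
q_type2 <- quantile(score, c(0.25, 0.75), type = 2, names = FALSE)

q_default  # 74.0 84.5
q_type2    # 74 85
```

If the R results need to match SAS exactly, passing type = 2 to quantile() inside summarize() is one option to try.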

2. Summary by SUBJECT

SAS Code

proc sort data=scores;
  by subject;
run;

proc means data=scores noprint;
  by subject;
  var score;
  output n=n mean=mean std=sd q1=q1 q3=q3 nmiss=nmissing out=by_subject_stats;
run;

Explanation:

  • proc sort ... by subject;: BY-group processing expects the data to be sorted (or indexed) by the BY variable, so sort first.

  • by subject;: Tells SAS to compute statistics separately for each group.

  • output ... out=by_subject_stats;: Saves the results for each SUBJECT to a new dataset.

R Code

by_subject_stats <- scores %>%
  group_by(SUBJECT) %>%
  dplyr::summarize(
    n = sum(!is.na(SCORE)),                    # non-missing count per subject, like SAS N
    mean = mean(SCORE, na.rm = TRUE),
    sd = sd(SCORE, na.rm = TRUE),
    q1 = quantile(SCORE, 0.25, na.rm = TRUE),
    q3 = quantile(SCORE, 0.75, na.rm = TRUE),
    nmissing = sum(is.na(SCORE))               # missing count per subject
  )

Explanation:

  • group_by(SUBJECT): Groups the data frame by subject category.

  • summarize(...): Applies each summary function to the grouped values.
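To see what "applying a function to the grouped values" means, the same split-and-apply step can be sketched in base R with tapply(). This is only a cross-check, not a replacement for the dplyr code; the names `group_means` and `group_n` are illustrative.

```r
# Example data: SUBJECT and SCORE columns from the scores table
scores <- data.frame(
  SUBJECT = rep(c("Math", "Science", "History"), each = 5),
  SCORE   = c(78, 85, 91, 62, 74, 83, 77, 65, 90, NA, 69, 74, 80, 85, 77)
)

# tapply() splits SCORE by SUBJECT and applies the function to each piece
group_means <- tapply(scores$SCORE, scores$SUBJECT, mean, na.rm = TRUE)
group_n     <- tapply(scores$SCORE, scores$SUBJECT, function(x) sum(!is.na(x)))

group_means  # History 77.00, Math 78.00, Science 78.75
group_n      # History 5, Math 5, Science 4
```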

Output Table

SUBJECT  n  mean   sd     q1    q3     nmissing
History  5  77.00  6.04   74.0  80.00  0
Math     5  78.00  11.07  74.0  85.00  0
Science  4  78.75  10.59  74.0  84.75  1

3. Alternatives to dplyr::summarize()

Package     Function           Description
psych       describe()         One-liner for all basic stats
skimr       skim()             Readable per-variable summaries
data.table  DT[, .(…), by=]    Fast grouped summaries on large datasets
Hmisc       describe()         Summary plus metadata/label info
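As one illustration of the alternatives above, here is a minimal data.table sketch of the by-subject summary. It assumes the data.table package is installed; `DT` and `by_subject_dt` are illustrative names.

```r
library(data.table)

# Example data: SUBJECT and SCORE columns from the scores table
DT <- data.table(
  SUBJECT = rep(c("Math", "Science", "History"), each = 5),
  SCORE   = c(78, 85, 91, 62, 74, 83, 77, 65, 90, NA, 69, 74, 80, 85, 77)
)

# .() builds the summary columns; keyby groups by SUBJECT and sorts the result
by_subject_dt <- DT[, .(
  n        = sum(!is.na(SCORE)),
  mean     = mean(SCORE, na.rm = TRUE),
  sd       = sd(SCORE, na.rm = TRUE),
  nmissing = sum(is.na(SCORE))
), keyby = SUBJECT]

by_subject_dt
```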

**What More Can You Explore**

1. Add More Summary Metrics

  • min(SCORE), max(SCORE), IQR(), median(), var()
  • Useful for spotting outliers and checking spread
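A minimal sketch of how these extra metrics could be added to the overall summary follows. The name `extra_stats` is illustrative, and IQR() uses R's default quantile definition.

```r
library(dplyr)

# Same SCORE values as the example table (one missing value)
scores <- tibble::tibble(
  SCORE = c(78, 85, 91, 62, 74, 83, 77, 65, 90, NA, 69, 74, 80, 85, 77)
)

extra_stats <- scores %>%
  summarize(
    min    = min(SCORE, na.rm = TRUE),
    max    = max(SCORE, na.rm = TRUE),
    median = median(SCORE, na.rm = TRUE),
    iqr    = IQR(SCORE, na.rm = TRUE),   # q3 - q1
    var    = var(SCORE, na.rm = TRUE)
  )

extra_stats$min     # 62
extra_stats$max     # 91
extra_stats$median  # 77.5
extra_stats$iqr     # 10.5
```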

2. Multi-Grouping

  • Add group_by(SUBJECT, GENDER) to replicate BY across multiple class variables.
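The scores table has no GENDER column, so the sketch below uses a small made-up data set (`scores_g`, with hypothetical GENDER values) purely to show the shape of a two-variable grouping.

```r
library(dplyr)

# Hypothetical data: GENDER is invented for illustration only
scores_g <- tibble::tibble(
  SUBJECT = c("Math", "Math", "Math", "Math",
              "Science", "Science", "Science", "Science"),
  GENDER  = c("F", "M", "F", "M", "F", "M", "F", "M"),
  SCORE   = c(78, 85, 91, 62, 83, 77, 65, NA)
)

# One output row per SUBJECT x GENDER combination
by_subject_gender <- scores_g %>%
  group_by(SUBJECT, GENDER) %>%
  summarize(
    n        = sum(!is.na(SCORE)),
    mean     = mean(SCORE, na.rm = TRUE),
    nmissing = sum(is.na(SCORE)),
    .groups  = "drop"   # return an ungrouped tibble
  )

by_subject_gender
```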

3. Visualize Results

  • Boxplots, histograms, or violin plots for score distributions
  • Plot mean ± sd error bars by subject
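A boxplot by subject can be sketched with base R graphics alone; ggplot2 would work equally well. The titles and the name `bp` are illustrative choices.

```r
# Example data: SUBJECT and SCORE columns from the scores table
scores <- data.frame(
  SUBJECT = rep(c("Math", "Science", "History"), each = 5),
  SCORE   = c(78, 85, 91, 62, 74, 83, 77, 65, 90, NA, 69, 74, 80, 85, 77)
)

# One box per subject; missing scores are ignored when the box
# statistics are computed
bp <- boxplot(
  SCORE ~ SUBJECT,
  data = scores,
  main = "Score distribution by subject",
  xlab = "Subject",
  ylab = "Score"
)

bp$names  # group labels, in plotting order
bp$n      # number of non-missing scores behind each box
```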

4. Export to Reports

  • Use gt::gt(), flextable, or kable() for presentation-ready summary tables
  • Save as Excel or HTML
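As a rough sketch of the export step, the code below renders a small by-subject summary as an HTML table with knitr::kable() and also writes a CSV file, which Excel can open. It assumes the knitr package is installed; writing a native .xlsx file would need an additional package and is not shown. The file names are placeholders.

```r
# Example data: SUBJECT and SCORE columns from the scores table
scores <- data.frame(
  SUBJECT = rep(c("Math", "Science", "History"), each = 5),
  SCORE   = c(78, 85, 91, 62, 74, 83, 77, 65, 90, NA, 69, 74, 80, 85, 77)
)

# Small summary table built with base R
by_subject_stats <- data.frame(
  SUBJECT = sort(unique(scores$SUBJECT)),
  n       = as.vector(tapply(scores$SCORE, scores$SUBJECT,
                             function(x) sum(!is.na(x)))),
  mean    = as.vector(tapply(scores$SCORE, scores$SUBJECT, mean, na.rm = TRUE))
)

# HTML table, written to a temporary file (placeholder name)
html_table <- knitr::kable(by_subject_stats, format = "html", digits = 2)
html_file  <- file.path(tempdir(), "by_subject_stats.html")
writeLines(as.character(html_table), html_file)

# CSV file that spreadsheet software can open (placeholder name)
csv_file <- file.path(tempdir(), "by_subject_stats.csv")
write.csv(by_subject_stats, csv_file, row.names = FALSE)
```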

5. Build a Custom summary_table() Function

  • Reusable for any dataset:

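# Requires library(dplyr)
# var:   name of the column to summarize, as a string
# group: optional character vector of grouping column names
summary_table <- function(data, var, group = NULL) {
  if (!is.null(group)) data <- data %>% group_by(across(all_of(group)))
  data %>%
    dplyr::summarize(
      n = sum(!is.na(.data[[var]])),            # non-missing count
      mean = mean(.data[[var]], na.rm = TRUE),
      sd = sd(.data[[var]], na.rm = TRUE),
      q1 = quantile(.data[[var]], 0.25, na.rm = TRUE),
      q3 = quantile(.data[[var]], 0.75, na.rm = TRUE),
      nmissing = sum(is.na(.data[[var]])),      # missing count
      .groups = "drop"
    )
}

# Overall summary
overall_stats <- summary_table(scores, "SCORE")

# Summary by subject
by_subject_stats <- summary_table(scores, "SCORE", group = "SUBJECT")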
