1.5. SAS PROCEDURES -> Equivalent in R
1.5.3. PROC MEANS - Compute Descriptive Statistics in SAS and R
**Example Input Table:** `scores`
| STUDENT | SUBJECT | SCORE |
|---|---|---|
| A001 | Math | 78 |
| A002 | Math | 85 |
| A003 | Math | 91 |
| A004 | Math | 62 |
| A005 | Math | 74 |
| A006 | Science | 83 |
| A007 | Science | 77 |
| A008 | Science | 65 |
| A009 | Science | 90 |
| A010 | Science | NA |
| A011 | History | 69 |
| A012 | History | 74 |
| A013 | History | 80 |
| A014 | History | 85 |
| A015 | History | 77 |
**Goal**
Compute descriptive statistics for SCORE overall and within each SUBJECT:
- Count (`n`)
- Mean, standard deviation
- 25th and 75th percentiles (`q1`, `q3`)
- Number of missing values
1. Overall Summary
SAS Code
proc means data=scores noprint;
  var score;
  output n=n mean=mean std=sd q1=q1 q3=q3 nmiss=nmissing out=overall_stats;
run;
Explanation:
- `proc means`: Starts the summary procedure.
- `noprint`: Suppresses printed output in the Results window.
- `var score;`: Specifies the variable to summarize.
- `output ... out=...`: Stores the summary results (`n`, `mean`, `std`, etc.) in a new dataset named `overall_stats`.
R Code
# Dummy data for execution
library(dplyr)
scores <- tibble::tibble(
  STUDENT = c("A001","A002","A003","A004","A005","A006","A007","A008","A009","A010","A011","A012","A013","A014","A015"),
  SUBJECT = c("Math","Math","Math","Math","Math","Science","Science","Science","Science","Science","History","History","History","History","History"),
  SCORE = c(78,85,91,62,74,83,77,65,90,NA,69,74,80,85,77)
)
overall_stats <- scores %>%
  dplyr::summarize(
    n = sum(!is.na(SCORE)),  # non-missing count, matching SAS N
    mean = mean(SCORE, na.rm = TRUE),
    sd = sd(SCORE, na.rm = TRUE),
    q1 = quantile(SCORE, 0.25, na.rm = TRUE),
    q3 = quantile(SCORE, 0.75, na.rm = TRUE),
    nmissing = sum(is.na(SCORE))
  )
Explanation:
- `summarize()`: Reduces the entire data frame to a single-row summary.
- `sum(!is.na(SCORE))`: Counts non-missing values, which is what SAS `N` reports. `n()` would count all 15 rows, including the one with a missing score.
- `mean()`, `sd()`: Basic statistics; `na.rm = TRUE` excludes NAs.
- `quantile(..., 0.25 / 0.75)`: Computes the 25th and 75th percentiles.
- `sum(is.na())`: Counts how many values are missing in the column.
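SAS's `N` statistic counts only non-missing values, whereas dplyr's `n()` counts every row in the current group. The sketch below shows the difference on the same SCORE values; the column names `n_rows`, `n_valid`, and `n_missing` are illustrative only.

```r
library(dplyr)

# Same SCORE values as the example table (one missing value)
scores <- tibble::tibble(
  SCORE = c(78, 85, 91, 62, 74, 83, 77, 65, 90, NA, 69, 74, 80, 85, 77)
)

counts <- scores %>%
  summarize(
    n_rows    = n(),                 # every row, including the NA
    n_valid   = sum(!is.na(SCORE)),  # non-missing only, like SAS N
    n_missing = sum(is.na(SCORE))    # like SAS NMISS
  )

print(counts)
```

Here `n_rows` is 15 while `n_valid` is 14, so only the second matches what `proc means` stores in `n`.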
Output Table
| n | mean | sd | q1 | q3 | nmissing |
|---|---|---|---|---|---|
| 14 | 77.86 | 8.68 | 74.0 | 84.5 | 1 |
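Quartiles are one place where SAS and R can disagree on the same data, because their default percentile definitions differ. R's `quantile()` defaults to `type = 7` (linear interpolation), while `type = 2` uses the same definition as the SAS default (`QNTLDEF=5`). A small sketch comparing the two on the SCORE values; the variable names `q_r` and `q_sas` are illustrative only.

```r
# SCORE values from the example table
score <- c(78, 85, 91, 62, 74, 83, 77, 65, 90, NA, 69, 74, 80, 85, 77)

# R default (type = 7): linear interpolation between order statistics
q_r <- quantile(score, c(0.25, 0.75), na.rm = TRUE)

# type = 2: empirical distribution function with averaging,
# the definition SAS uses by default (QNTLDEF=5)
q_sas <- quantile(score, c(0.25, 0.75), na.rm = TRUE, type = 2)

print(q_r)    # 25% = 74, 75% = 84.5
print(q_sas)  # 25% = 74, 75% = 85
```

If the R results need to match SAS output exactly, passing `type = 2` inside `summarize()` is one way to align them.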
2. Summary by SUBJECT
SAS Code
proc sort data=scores;
  by subject;
run;
proc means data=scores noprint;
  by subject;
  var score;
  output n=n mean=mean std=sd q1=q1 q3=q3 nmiss=nmissing out=by_subject_stats;
run;
Explanation:
- `proc sort ... by subject;`: Required step before using `BY` in SAS.
- `BY SUBJECT`: Tells SAS to compute statistics separately for each group.
- `output ... out=by_subject_stats;`: Saves the results for each SUBJECT to a new dataset.
R Code
by_subject_stats <- scores %>%
  group_by(SUBJECT) %>%
  dplyr::summarize(
    n = sum(!is.na(SCORE)),  # non-missing count per subject, matching SAS N
    mean = mean(SCORE, na.rm = TRUE),
    sd = sd(SCORE, na.rm = TRUE),
    q1 = quantile(SCORE, 0.25, na.rm = TRUE),
    q3 = quantile(SCORE, 0.75, na.rm = TRUE),
    nmissing = sum(is.na(SCORE))
  )
Explanation:
- `group_by(SUBJECT)`: Groups the data frame by subject category.
- `summarize(...)`: Applies each summary function to the grouped values.
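Unlike a SAS `BY` statement, `group_by()` does not need the data to be sorted first. The sketch below checks this by summarizing the rows in their original order and again after reordering them by score, which interleaves the subjects. The helper name `summarise_by_subject` is hypothetical and used only for this illustration.

```r
library(dplyr)

scores <- tibble::tibble(
  SUBJECT = rep(c("Math", "Science", "History"), each = 5),
  SCORE   = c(78, 85, 91, 62, 74, 83, 77, 65, 90, NA, 69, 74, 80, 85, 77)
)

# Hypothetical helper: per-subject count and mean
summarise_by_subject <- function(data) {
  data %>%
    group_by(SUBJECT) %>%
    summarize(
      n    = sum(!is.na(SCORE)),
      mean = mean(SCORE, na.rm = TRUE),
      .groups = "drop"
    )
}

# Ordering by SCORE mixes the subjects together
shuffled <- arrange(scores, SCORE)

original_result <- summarise_by_subject(scores)
shuffled_result <- summarise_by_subject(shuffled)

print(original_result)
print(shuffled_result)
```

Both calls return the same three rows, with the groups listed in sorted order of `SUBJECT`.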
Output Table
| SUBJECT | n | mean | sd | q1 | q3 | nmissing |
|---|---|---|---|---|---|---|
| History | 5 | 77.0 | 6.04 | 74.0 | 80.0 | 0 |
| Math | 5 | 78.0 | 11.07 | 74.0 | 85.0 | 0 |
| Science | 4 | 78.75 | 10.59 | 74.0 | 84.75 | 1 |
3. Alternatives to dplyr::summarize()
| Package | Function | Description |
|---|---|---|
| psych | `describe()` | One-liner for all basic stats |
| skimr | `skim()` | Pretty summaries grouped by variable |
| data.table | `DT[, .(…), by=]` | Fast grouped summaries on large datasets |
| Hmisc | `describe()` | Summary plus metadata/label info |
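As an illustration of the data.table row above, here is a minimal sketch of the by-subject summary written in `DT[, .(...), by = ]` form. It assumes the data.table package is installed; the object names `DT` and `by_subject_dt` are illustrative only.

```r
library(data.table)

DT <- data.table(
  SUBJECT = rep(c("Math", "Science", "History"), each = 5),
  SCORE   = c(78, 85, 91, 62, 74, 83, 77, 65, 90, NA, 69, 74, 80, 85, 77)
)

# j computes the statistics, by = defines the groups
by_subject_dt <- DT[, .(
  n        = sum(!is.na(SCORE)),
  mean     = mean(SCORE, na.rm = TRUE),
  sd       = sd(SCORE, na.rm = TRUE),
  nmissing = sum(is.na(SCORE))
), by = SUBJECT]

print(by_subject_dt)
```

Note that `by =` keeps groups in order of first appearance (Math, Science, History) rather than sorting them as `group_by()` does.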
**What More Can You Explore**
1. Add More Summary Metrics
- `min(SCORE)`, `max(SCORE)`, `IQR()`, `median()`, `var()`
- Helps check for outliers and assess spread
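A minimal sketch of how these extra metrics might be added to the by-subject summary. The output column names (`min_score`, `median_score`, and so on) are illustrative choices, not part of the original example.

```r
library(dplyr)

scores <- tibble::tibble(
  SUBJECT = rep(c("Math", "Science", "History"), each = 5),
  SCORE   = c(78, 85, 91, 62, 74, 83, 77, 65, 90, NA, 69, 74, 80, 85, 77)
)

extra_stats <- scores %>%
  group_by(SUBJECT) %>%
  summarize(
    min_score    = min(SCORE, na.rm = TRUE),
    median_score = median(SCORE, na.rm = TRUE),
    max_score    = max(SCORE, na.rm = TRUE),
    iqr_score    = IQR(SCORE, na.rm = TRUE),   # q3 - q1 (default quantile type)
    var_score    = var(SCORE, na.rm = TRUE),   # sample variance
    .groups = "drop"
  )

print(extra_stats)
```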
2. Multi-Grouping
- Add `group_by(SUBJECT, GENDER)` to replicate `BY` across multiple class variables.
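The example table has no `GENDER` column, so the sketch below invents one purely for illustration (values simply alternate "F" and "M"). It shows that grouping by two variables yields one summary row per SUBJECT and GENDER combination.

```r
library(dplyr)

scores <- tibble::tibble(
  SUBJECT = rep(c("Math", "Science", "History"), each = 5),
  # Hypothetical column, made up for this illustration only
  GENDER  = rep(c("F", "M"), length.out = 15),
  SCORE   = c(78, 85, 91, 62, 74, 83, 77, 65, 90, NA, 69, 74, 80, 85, 77)
)

by_subject_gender <- scores %>%
  group_by(SUBJECT, GENDER) %>%
  summarize(
    n    = sum(!is.na(SCORE)),
    mean = mean(SCORE, na.rm = TRUE),
    .groups = "drop"   # return an ungrouped result
  )

print(by_subject_gender)
```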
3. Visualize Results
- Boxplots, histograms, or violin plots for score distributions
- Plot mean ± sd error bars by subject
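One possible way to draw the mean ± sd plot is sketched below with ggplot2, which is an assumption here since the text does not name a plotting package. The names `plot_data`, `mean_score`, `sd_score`, and `p` are illustrative.

```r
library(dplyr)
library(ggplot2)

scores <- tibble::tibble(
  SUBJECT = rep(c("Math", "Science", "History"), each = 5),
  SCORE   = c(78, 85, 91, 62, 74, 83, 77, 65, 90, NA, 69, 74, 80, 85, 77)
)

# Per-subject mean and standard deviation
plot_data <- scores %>%
  group_by(SUBJECT) %>%
  summarize(
    mean_score = mean(SCORE, na.rm = TRUE),
    sd_score   = sd(SCORE, na.rm = TRUE),
    .groups = "drop"
  )

# Points at the mean, error bars spanning one sd either side
p <- ggplot(plot_data, aes(x = SUBJECT, y = mean_score)) +
  geom_point(size = 3) +
  geom_errorbar(
    aes(ymin = mean_score - sd_score, ymax = mean_score + sd_score),
    width = 0.2
  ) +
  labs(x = "Subject", y = "Score (mean +/- sd)")

print(p)
```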
4. Export to Reports
- Use `gt::gt()`, `flextable`, or `kable()` for presentation-ready summary tables
- Save as Excel or HTML
5. Build a Custom summary_table() Function
Reusable for any dataset:
summary_table <- function(data, var, group = NULL) {
  # Group only when one or more grouping column names are supplied
  if (!is.null(group)) data <- data %>% group_by(across(all_of(group)))
  data %>%
    dplyr::summarize(
      n = sum(!is.na(.data[[var]])),
      mean = mean(.data[[var]], na.rm = TRUE),
      sd = sd(.data[[var]], na.rm = TRUE),
      q1 = quantile(.data[[var]], 0.25, na.rm = TRUE),
      q3 = quantile(.data[[var]], 0.75, na.rm = TRUE),
      nmiss = sum(is.na(.data[[var]])),
      .groups = "drop"
    )
}
# Overall summary, then the same summary within each SUBJECT
overall_stats <- summary_table(scores, "SCORE")
by_subject_stats <- summary_table(scores, "SCORE", group = "SUBJECT")