1.5. SAS PROCEDURES -> Equivalent in R
1.5.1 PROC CONTENTS - R equivalent derivation and functions
1. SAS: Metadata Extraction Using PROC CONTENTS
Input
data work.people;
input ID Age Gender $;
label
ID = "Participant ID"
Age = "Age in Years"
Gender = "Self-reported Gender";
datalines;
1 25 M
2 30 F
3 28 M
;
run;
proc contents data=work.people out=work.meta noprint;
run;
Output
| NAME | TYPE | LENGTH | FORMAT | INFORMAT | LABEL | VARNUM |
|---|---|---|---|---|---|---|
| ID | 1 | 8 | | | Participant ID | 1 |
| AGE | 1 | 8 | | | Age in Years | 2 |
| GENDER | 2 | 8 | | | Self-reported Gender | 3 |
Explanation
- NAME: Variable name.
- TYPE: 1 = numeric, 2 = character.
- LENGTH: Byte size allocated for storage (default = 8).
- FORMAT / INFORMAT: Display/input formats (blank if none assigned).
- LABEL: The user-defined label assigned in the `LABEL` statement.
- VARNUM: Position of the variable in the original dataset.
2. R: Metadata Extraction Using Custom Code - equivalent to PROC CONTENTS
Input
# Step 1: Create the data frame with NA and blank string values
people <- data.frame(
  ID = c(1, NA, 3),
  Age = c(25, 30, NA),
  Gender = c("M", "", NA), # One empty string and one NA
  stringsAsFactors = FALSE
)

# Step 2: Add variable-level labels
attr(people$ID, "label") <- "Participant ID"
attr(people$Age, "label") <- "Age in Years"
attr(people$Gender, "label") <- "Self-reported Gender"

# Step 3: Extract metadata (one row per column of `people`)
meta <- data.frame(
  Name = names(people),
  Class = sapply(people, class),
  Type = sapply(people, typeof),
  Length = sapply(people, length),
  NA_Count = sapply(people, function(x) sum(is.na(x))),
  # Blank strings are only meaningful for character columns
  Null_String_Count = sapply(people, function(x) {
    if (is.character(x)) sum(trimws(x) == "", na.rm = TRUE) else NA
  }),
  # Return NA when a column has no "label" attribute
  Label = sapply(people, function(x) {
    label <- attr(x, "label")
    if (!is.null(label)) label else NA
  }),
  stringsAsFactors = FALSE
)

# Step 4: Print metadata summary
print(meta)
Output
| Name | Class | Type | Length | NA_Count | Null_String_Count | Label |
|---|---|---|---|---|---|---|
| ID | numeric | double | 3 | 1 | NA | Participant ID |
| Age | numeric | double | 3 | 1 | NA | Age in Years |
| Gender | character | character | 3 | 1 | 1 | Self-reported Gender |
Explanation
This code creates a simple 3-row data.frame called people, which mimics a small SAS dataset. Each metadata attribute is then computed systematically:
- `names()` extracts the variable names.
- `class()` returns the broad object class, such as `"numeric"` or `"character"`, useful for modeling and method dispatch.
- `typeof()` gives the internal storage type, like `"double"` or `"integer"`, allowing for low-level validation or memory profiling.
- `length()` captures the number of elements in each vector (for a data frame column, this equals the number of rows).
- `sum(is.na(x))` counts how many values in each column are explicitly marked as `NA`, a key quality-check metric.
- `sum(trimws(x) == "")` (for character columns) flags blank string entries, often used to detect incomplete user input not recorded as `NA`.
By compiling all of this into a single meta object, the code creates a neatly tabulated metadata summary. This is ideal for exploratory analysis, structural audits, or turning into a reusable function for data validation within R pipelines. Adding column labels using attr() enhances documentation and traceability across the workflow.
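The idea of wrapping the summary in a reusable function could be sketched as follows. This is a minimal illustration, not a definitive port of PROC CONTENTS: the function name `contents_like()` and its output columns are assumptions made for this example, chosen to mirror the NAME, TYPE, LABEL, and VARNUM columns of the SAS output.

```r
# Hypothetical helper (not part of base R): summarise a data frame in a
# layout loosely modelled on the PROC CONTENTS OUT= dataset.
contents_like <- function(data) {
  stopifnot(is.data.frame(data))
  data.frame(
    NAME = names(data),
    # SAS-style type code: 1 = numeric, 2 = character, NA for anything else
    TYPE = vapply(data, function(x) {
      if (is.numeric(x)) 1L else if (is.character(x)) 2L else NA_integer_
    }, integer(1)),
    # Label stored with attr(x, "label"); NA when no label is set
    LABEL = vapply(data, function(x) {
      label <- attr(x, "label", exact = TRUE)
      if (is.null(label)) NA_character_ else as.character(label)[1]
    }, character(1)),
    # Explicit column position, analogous to VARNUM
    VARNUM = seq_along(data),
    NA_Count = vapply(data, function(x) sum(is.na(x)), integer(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

people <- data.frame(
  ID = c(1, NA, 3),
  Age = c(25, 30, NA),
  Gender = c("M", "", NA),
  stringsAsFactors = FALSE
)
attr(people$ID, "label") <- "Participant ID"

meta <- contents_like(people)
print(meta)
```

Using `vapply()` instead of `sapply()` here is a design choice: it fails loudly if a column yields a value of an unexpected type or length, which tends to be safer inside a function that will be reused on unfamiliar data.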
3. Helpful R Functions to extract metadata details
# Dummy data for execution (base R only; no extra packages are needed)
df <- data.frame(
ID = c(1, 2, 3),
Age = c(25, 30, 28),
Gender = c("M", "F", "M"),
stringsAsFactors = FALSE
)
**1.** str(df)
Input:
str(df)
Output:
'data.frame': 3 obs. of 3 variables:
$ ID : num 1 2 3
$ Age : num 25 30 28
$ Gender: chr "M" "F" "M"
Explanation:
Quick structural overview including types and sample values. Often used for debugging and sanity checks in pipelines.
**2.** attributes(df)
Input:
attributes(df)
Output:
$names
[1] "ID" "Age" "Gender"
$class
[1] "data.frame"
$row.names
[1] 1 2 3
Explanation:
Displays meta-properties like column names, row indices, and class. Vital for detecting extra attributes from import tools.
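The same function also works on individual columns, which is where labels and other import-tool metadata usually live. The snippet below is a small sketch using a made-up data frame; the variable name `extra_attrs` is only illustrative.

```r
df <- data.frame(ID = c(1, 2, 3), Age = c(25, 30, 28))
attr(df$ID, "label") <- "Participant ID"

# A plain numeric column carries no attributes, so this returns NULL
attributes(df$Age)

# A labelled column returns a list containing the label
attributes(df$ID)

# List the attribute names attached to each column
extra_attrs <- lapply(df, function(x) names(attributes(x)))
print(extra_attrs)
```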
**3.** names(df) / colnames(df)
Input:
names(df)
Output:
[1] "ID" "Age" "Gender"
Explanation:
Lists all column names—useful for name validation, renaming, or aligning columns for joins.
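As a rough illustration of name validation, the sketch below compares the actual column names with an expected set. The vector `expected_cols` and the extra column `Visit` are hypothetical and exist only to show a mismatch.

```r
df <- data.frame(
  ID = c(1, 2, 3),
  Age = c(25, 30, 28),
  Gender = c("M", "F", "M"),
  stringsAsFactors = FALSE
)

# Hypothetical specification of the columns we expect to receive
expected_cols <- c("ID", "Age", "Gender", "Visit")

# Columns required by the specification but absent from the data
missing_cols <- setdiff(expected_cols, names(df))

# Columns present in the data but not in the specification
unexpected_cols <- setdiff(names(df), expected_cols)

print(missing_cols)
print(unexpected_cols)
```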
**4.** sapply(df, class) and sapply(df, typeof)
Input:
sapply(df, class)
sapply(df, typeof)
Output:
# Result of sapply(df, class)
ID Age Gender
"numeric" "numeric" "character"
# Result of sapply(df, typeof)
ID Age Gender
"double" "double" "character"
Explanation:
- `class`: how R represents the variable in modeling.
- `typeof`: physical representation in memory.
Essential for type safety checks in modeling or when bridging to other tools.
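The difference between the two is easier to see on columns where they disagree. The following sketch uses a made-up data frame with a factor, a date, and an integer column; the column names are illustrative only.

```r
df <- data.frame(
  Group = factor(c("A", "B", "A")),
  Visit = as.Date(c("2024-01-01", "2024-01-15", "2024-02-01")),
  Count = c(1L, 2L, 3L)
)

types <- data.frame(
  # class() can return several values, so keep only the first one
  Class = vapply(df, function(x) class(x)[1], character(1)),
  Type = vapply(df, typeof, character(1)),
  stringsAsFactors = FALSE
)
print(types)
```

A factor is stored as integer codes and a Date as a double, so `typeof()` alone would not reveal that either column needs special handling.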
**5.** length(), nrow(), ncol()
Input:
length(df) # Returns the number of columns in the data frame.
nrow(df) # Returns the number of rows (observations) in the data frame.
ncol(df) # Also returns the number of columns in the data frame.
Output:
[1] 3
[1] 3
[1] 3
Explanation:
Measures structure—e.g., use these to trigger alerts if expected structure (rows/columns) is broken.
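One way such an alert might look is sketched below. The function `check_structure()` and its arguments are hypothetical names invented for this example; it simply collects messages rather than stopping, so the caller can decide how to react.

```r
# Hypothetical helper: return a character vector of structural problems
# (empty when the data frame matches expectations).
check_structure <- function(data, expected_cols, min_rows = 1) {
  problems <- character(0)
  if (ncol(data) != expected_cols) {
    problems <- c(problems, sprintf(
      "Expected %s columns, found %s", expected_cols, ncol(data)
    ))
  }
  if (nrow(data) < min_rows) {
    problems <- c(problems, sprintf(
      "Expected at least %s rows, found %s", min_rows, nrow(data)
    ))
  }
  problems
}

df <- data.frame(
  ID = c(1, 2, 3),
  Age = c(25, 30, 28),
  Gender = c("M", "F", "M"),
  stringsAsFactors = FALSE
)

ok <- check_structure(df, expected_cols = 3)
bad <- check_structure(df, expected_cols = 4, min_rows = 5)
print(bad)
```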
**6.** summary(df)
Input:
summary(df)
Output:
ID Age Gender
Min. :1.0 Min. :25.00 Length:3
1st Qu.:1.5 1st Qu.:26.50 Class :character
Median :2.0 Median :28.00 Mode :character
Mean :2.0 Mean :27.67
3rd Qu.:2.5 3rd Qu.:29.00
Max. :3.0 Max. :30.00
Explanation:
Exploratory summary of each variable. You’d use this early in a pipeline to flag anomalies and get statistical context.
**7.** sapply(df, function(x) sum(is.na(x)))
Input:
sapply(df, function(x) sum(is.na(x)))
Output:
ID Age Gender
0 0 0
Explanation:
Shows column-wise missing data counts. Crucial for pre-modeling QC and reporting.
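Because the dummy data above has no missing values, the sketch below uses a separate, made-up data frame (`people_na`, a name chosen for this example) to show non-zero counts. It also shows `colSums(is.na())`, a base R alternative that gives the same counts, plus a percentage view.

```r
people_na <- data.frame(
  ID = c(1, NA, 3),
  Age = c(25, 30, NA),
  Gender = c("M", "", NA),
  stringsAsFactors = FALSE
)

# is.na() on a data frame returns a logical matrix, so column sums
# give the number of NA values per column
na_count <- colSums(is.na(people_na))

# Column means of the same matrix give the share of missing values
na_pct <- round(100 * colMeans(is.na(people_na)), 1)

print(na_count)
print(na_pct)
```

Note that the empty string in `Gender` is not counted here, since `""` is a valid character value rather than `NA`.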
**8.** Hmisc::describe(df)
Input:
# Dummy data for execution
# install.packages("Hmisc") # run once if the package is not installed
library(Hmisc)
df <- data.frame(
ID = c(1, 2, 3),
Age = c(25, 30, 28),
Gender = c("M", "F", "M"),
stringsAsFactors = FALSE
)
describe(df)
Output (abridged):
| Variable | n | missing | distinct | Info | Mean | Gmd |
|---|---|---|---|---|---|---|
| ID | 3 | 0 | 3 | 1 | 2 | 1.333 |
Explanation:
Provides exhaustive metadata for all variables—labels, uniqueness, missing, and distributions. Ideal for documentation and structured audits.
**4. Comparison Table: Function and Purpose**
| Function | Purpose | Output Type |
|---|---|---|
| `str()` | Compact structure summary | Console output |
| `attributes()` | Internal metadata (names, class) | List |
| `names()` | Column name vector | Character vector |
| `class()` / `typeof()` | Variable type vs storage type | Named vector |
| `length()`, `nrow()`, `ncol()` | Dimensions | Integer |
| `summary()` | Descriptive statistics | Multi-line console |
| `sapply()` | Batch-wise metadata extraction | Named vector |
| `describe()` (Hmisc) | Detailed metadata, including frequency/info | Console + object |
**5. SAS vs R Comparison Snapshot**
| Feature | SAS (PROC CONTENTS) | R (Custom Script) |
|---|---|---|
| Built-in Metadata | Yes | No (requires scripting) |
| Output Format | Structured dataset | Custom data frame |
| Variable Order | Preserved via `VARNUM` | Preserved, not explicitly indexed |
| Data Types | Numeric/Character | More granular (e.g., double, factor) |
| Labels/Formats | Automatically stored | Must be manually added |
| Missing Count | Not shown by default | Easily computed with `sapply()` |