1.5.1. PROC CONTENTS - R equivalent derivation and functions

1. SAS: Metadata Extraction Using `PROC CONTENTS`

Input

data work.people;
  input ID Age Gender $;
  label 
    ID     = "Participant ID"
    Age    = "Age in Years"
    Gender = "Self-reported Gender";
  datalines;
1 25 M
2 30 F
3 28 M
;
run;

proc contents data=work.people out=work.meta noprint;
run;

Output (selected columns of work.meta, shown in VARNUM order)

NAME	TYPE	LENGTH	LABEL	VARNUM
ID	1	8	Participant ID	1
Age	1	8	Age in Years	2
Gender	2	8	Self-reported Gender	3

Explanation

NAME: Variable name.
TYPE: 1 = numeric, 2 = character.
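LENGTH: Storage length in bytes (8 for numeric variables and, by default, for character variables read with list input).
FORMAT / INFORMAT: Display and input formats. These columns are also present in work.meta but are omitted from the table above (blank when none is assigned).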
LABEL: The user-defined label assigned in the LABEL statement.
VARNUM: Position of the variable in the original dataset.

2. R: Metadata Extraction Using Custom Code - equivalent to `PROC CONTENTS`

Input

# Step 1: Create the data frame with NA and blank string values
people <- data.frame(
  ID = c(1, NA, 3),
  Age = c(25, 30, NA),
  Gender = c("M", "", NA),  # One empty string and one NA
  stringsAsFactors = FALSE
)

# Step 2: Add variable-level labels
attr(people$ID, "label") <- "Participant ID"
attr(people$Age, "label") <- "Age in Years"
attr(people$Gender, "label") <- "Self-reported Gender"

# Step 3: Extract metadata
meta <- data.frame(
  Name = names(people),
  Class = sapply(people, class),
  Type = sapply(people, typeof),
  Length = sapply(people, length),
  NA_Count = sapply(people, function(x) sum(is.na(x))),
  Null_String_Count = sapply(people, function(x) {
    if (is.character(x)) sum(trimws(x) == "", na.rm = TRUE) else NA
  }),
  Label = sapply(people, function(x) {
    label <- attr(x, "label")
    if (!is.null(label)) label else NA
  }),
  stringsAsFactors = FALSE
)

# Step 4: Print metadata summary
print(meta)

Output

Name	Class	Type	Length	NA_Count	Null_String_Count	Label
ID	numeric	double	3	1	NA	Participant ID
Age	numeric	double	3	1	NA	Age in Years
Gender	character	character	3	1	1	Self-reported Gender

Explanation

This code builds a 3-row data.frame called people with the same variables as the SAS dataset, but with NA and blank values added deliberately so that the missing-value counts have something to report. Each metadata column is computed as follows:

names() extracts the variable names.
class() returns the object class, such as "numeric" or "character", which determines method dispatch.
typeof() gives the internal storage type, such as "double" or "integer".
length() returns the number of elements in each column, which equals the number of rows. This is not the same quantity as the SAS LENGTH, which is a storage size in bytes.
sum(is.na(x)) counts the NA values in each column.
sum(trimws(x) == "", na.rm = TRUE) counts blank or whitespace-only strings in character columns, which often represent missing input that was never recorded as NA. Non-character columns get NA.
attr(x, "label") retrieves the label set in Step 2, or NA when a column has none.

Collecting these into the single meta data frame gives one row per variable, comparable to the OUT= dataset from PROC CONTENTS. The same code can be wrapped in a function and reused for structural audits or data validation in R pipelines. Labels attached with attr() keep the variable documentation together with the data.
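The metadata steps described above can be wrapped in a reusable function. What follows is a minimal sketch under stated assumptions, not a definitive implementation: the name describe_columns() is hypothetical, vapply() replaces sapply() so each column of the result has a guaranteed type, and only the first element of class() is kept so that multi-class columns still yield one value.

```r
# Hypothetical helper: returns one metadata row per column of a data frame.
describe_columns <- function(data) {
  stopifnot(is.data.frame(data))
  data.frame(
    Name = names(data),
    # class() can return several values (e.g. for POSIXct); keep the first
    Class = vapply(data, function(x) class(x)[1], character(1)),
    Type = vapply(data, typeof, character(1)),
    Length = vapply(data, length, integer(1)),
    NA_Count = vapply(data, function(x) sum(is.na(x)), integer(1)),
    Null_String_Count = vapply(data, function(x) {
      if (is.character(x)) sum(trimws(x) == "", na.rm = TRUE) else NA_integer_
    }, integer(1)),
    Label = vapply(data, function(x) {
      label <- attr(x, "label", exact = TRUE)
      if (is.null(label)) NA_character_ else as.character(label)[1]
    }, character(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Usage with a small example data frame
people <- data.frame(ID = c(1, NA, 3), Gender = c("M", "", NA),
                     stringsAsFactors = FALSE)
attr(people$ID, "label") <- "Participant ID"
meta <- describe_columns(people)
print(meta)
```

Passing row.names = NULL gives the result plain integer row names, so the variable names appear only in the Name column.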

3. Helpful R Functions to extract metadata details

# Dummy data for execution (base R only; no packages required)
df <- data.frame(
  ID = c(1, 2, 3),
  Age = c(25, 30, 28),
  Gender = c("M", "F", "M"),
  stringsAsFactors = FALSE
)

1. `str(df)`

Input:

str(df)

Output:

'data.frame':	3 obs. of  3 variables:
 $ ID    : num  1 2 3
 $ Age   : num  25 30 28
 $ Gender: chr  "M" "F" "M"

Explanation:
Quick structural overview including types and sample values. Often used for debugging and sanity checks in pipelines.

2. `attributes(df)`

Input:

attributes(df)

Output:

$names
[1] "ID" "Age" "Gender"

$class
[1] "data.frame"

$row.names
[1] 1 2 3

Explanation:
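Displays the attributes of the data frame itself: column names, row names, and class. Useful for spotting extra attributes added by import tools, although those are often attached to individual columns (for example, the variable labels that haven stores), in which case they appear in attributes(df$Age) rather than here.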
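Because column-level attributes do not appear in attributes(df), it can help to inspect every column as well. The sketch below sets a label by hand as a stand-in for what an import tool might attach; the data and label are illustrative only.

```r
df <- data.frame(ID = c(1, 2, 3), Age = c(25, 30, 28))
attr(df$Age, "label") <- "Age in Years"  # stand-in for an imported label

# Data-frame-level attributes: only names, class and row.names
print(names(attributes(df)))

# Column-level attributes: NULL for ID, a list holding "label" for Age
col_attrs <- lapply(df, attributes)
print(col_attrs)
```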

3. `names(df)` / `colnames(df)`

Input:

names(df)

Output:

[1] "ID" "Age" "Gender"

Explanation:
Lists all column names—useful for name validation, renaming, or aligning columns for joins.

4. `sapply(df, class)` and `sapply(df, typeof)`

Input:

sapply(df, class)
sapply(df, typeof)

Output:

# sapply(df, class)
         ID         Age      Gender 
  "numeric"   "numeric" "character" 

# sapply(df, typeof)
         ID         Age      Gender 
   "double"    "double" "character" 

Explanation:

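class: the object class, which determines how functions and models treat the variable (method dispatch).
typeof: the internal storage type.

The two can differ (a factor has class "factor" but storage type "integer"), so checking both helps with type validation before modeling or when exchanging data with other tools.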
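The difference between class and typeof is easiest to see on columns where the two disagree. A minimal sketch with illustrative data (the column names are arbitrary):

```r
x <- data.frame(
  num = c(1.5, 2.5),
  int = c(1L, 2L),
  fct = factor(c("M", "F")),
  day = as.Date(c("2024-01-01", "2024-01-02"))
)

# One row per column: first class element next to the storage type
types <- data.frame(
  class  = vapply(x, function(col) class(col)[1], character(1)),
  typeof = vapply(x, typeof, character(1))
)
print(types)
```

Here the factor is stored as integer codes and the Date as a double, even though neither has class "integer" or "numeric".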

5. `length()`, `nrow()`, `ncol()`

Input:

length(df) # Returns the number of columns in the data frame.
nrow(df) # Returns the number of rows (observations) in the data frame.
ncol(df) # Also returns the number of columns in the data frame.

Output:

[1] 3
[1] 3
[1] 3

Explanation:
Measures structure. For example, use these to raise an alert when the expected number of rows or columns is not met.
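One way such an alert could look is sketched below. The function check_structure() and the expectations passed to it are hypothetical; a real pipeline would choose its own rules and its own way of reporting problems.

```r
df <- data.frame(ID = c(1, 2, 3), Age = c(25, 30, 28),
                 Gender = c("M", "F", "M"), stringsAsFactors = FALSE)

# Hypothetical check: returns a character vector of problems (empty if none)
check_structure <- function(data, expected_cols, min_rows) {
  problems <- character(0)
  missing_cols <- setdiff(expected_cols, names(data))
  if (length(missing_cols) > 0) {
    problems <- c(problems,
                  paste("missing columns:", paste(missing_cols, collapse = ", ")))
  }
  if (nrow(data) < min_rows) {
    problems <- c(problems,
                  paste("expected at least", min_rows, "rows, found", nrow(data)))
  }
  problems
}

expected_cols <- c("ID", "Age", "Gender")
problems_ok  <- check_structure(df, expected_cols, min_rows = 1)
# A deliberately broken input: no rows and no Gender column
problems_bad <- check_structure(df[0, c("ID", "Age")], expected_cols, min_rows = 1)
if (length(problems_bad) > 0) message(paste(problems_bad, collapse = "; "))
```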

6. `summary(df)`

Input:

summary(df)

Output:

       ID           Age           Gender
 Min.   :1.0   Min.   :25.00   Length:3
 1st Qu.:1.5   1st Qu.:26.50   Class :character
 Median :2.0   Median :28.00   Mode  :character
 Mean   :2.0   Mean   :27.67
 3rd Qu.:2.5   3rd Qu.:29.00
 Max.   :3.0   Max.   :30.00

Explanation:
Exploratory summary of each variable. You’d use this early in a pipeline to flag anomalies and get statistical context.

7. `sapply(df, function(x) sum(is.na(x)))`

Input:

sapply(df, function(x) sum(is.na(x)))

Output:

    ID    Age Gender 
     0      0      0 

Explanation:
Shows column-wise missing data counts. Crucial for pre-modeling QC and reporting.
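SAS treats a blank character value as missing, whereas R distinguishes "" from NA, so a QC step may want both counts. The sketch below uses illustrative data containing NA and blank values; colSums(is.na(df)) is shown as a compact alternative to the sapply() call.

```r
df <- data.frame(
  ID = c(1, NA, 3),
  Gender = c("M", " ", NA),
  stringsAsFactors = FALSE
)

# NA values per column (same counts as the sapply() version)
na_counts <- colSums(is.na(df))

# Blank or whitespace-only strings per column; 0 for non-character columns
blank_counts <- vapply(df, function(x) {
  if (is.character(x)) sum(!is.na(x) & trimws(x) == "") else 0L
}, integer(1))

print(rbind(NA_Count = na_counts, Blank_Count = blank_counts))
```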

8. `Hmisc::describe(df)`

Input:

# Dummy data for execution
install.packages("Hmisc")  # once
library(Hmisc)
df <- data.frame(
  ID = c(1, 2, 3),
  Age = c(25, 30, 28),
  Gender = c("M", "F", "M"),
  stringsAsFactors = FALSE
)
describe(df)

Output (abridged):

	n	missing	distinct	Info	Mean	Gmd
ID	3	0	3	1	2	1.333

Explanation:
Provides a detailed per-variable summary: counts of non-missing, missing, and distinct values, plus distribution statistics, and variable labels when they have been set. Well suited to documentation and structured audits.

4. Comparison Table: Function and Purpose

Function	Purpose	Output Type
`str()`	Compact structure summary	Console output
`attributes()`	Internal metadata (names, class)	List
`names()`	Column name vector	Character vector
`sapply(df, class)` / `sapply(df, typeof)`	Object class vs storage type per column	Named character vector
`length()`, `nrow()`, `ncol()`	Dimensions	Integer
`summary()`	Descriptive statistics	Table object (printed to console)
`sapply(df, function(x) sum(is.na(x)))`	Missing-value count per column	Named integer vector
`describe()` (Hmisc)	Detailed metadata, including frequency/info	Console + object

5. SAS vs R Comparison Snapshot

Feature	SAS (`PROC CONTENTS`)	R (Custom Script)
Built-in Metadata	Yes	No (requires scripting)
Output Format	Structured dataset	Custom data frame
Variable Order	Preserved via `VARNUM`	Preserved, not explicitly indexed
Data Types	Numeric/Character	More granular (e.g., double, factor)
Labels/Formats	Stored in the dataset and reported automatically	Stored as attributes (set manually or by an import package); not reported by base R
Missing Count	Not shown by default	Easily computed with `sapply()`
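To narrow the gaps listed in the table, the R metadata can be laid out with the same column names as the PROC CONTENTS OUT= dataset. This is a rough sketch, not an exact equivalent: the TYPE rule (numeric columns coded 1, everything else coded 2) is an assumption that would need refining for dates, factors, and logicals, and the byte LENGTH has no direct counterpart here.

```r
people <- data.frame(ID = c(1, 2, 3), Age = c(25, 30, 28),
                     Gender = c("M", "F", "M"), stringsAsFactors = FALSE)
attr(people$ID, "label") <- "Participant ID"
attr(people$Age, "label") <- "Age in Years"
attr(people$Gender, "label") <- "Self-reported Gender"

sas_like <- data.frame(
  NAME = names(people),
  # Assumed mapping to the SAS coding: 1 = numeric, 2 = anything else
  TYPE = vapply(people, function(x) if (is.numeric(x)) 1L else 2L, integer(1)),
  LABEL = vapply(people, function(x) {
    label <- attr(x, "label", exact = TRUE)
    if (is.null(label)) "" else as.character(label)[1]
  }, character(1)),
  # Explicit position index, playing the role of VARNUM
  VARNUM = seq_along(people),
  row.names = NULL,
  stringsAsFactors = FALSE
)
print(sas_like)
```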

Resource download links

1.5.1.-PROC-CONTENTS-R-equivalent-derivation-and-functions.zip
