
1.5. SAS PROCEDURES -> Equivalent in R

1.5.1 PROC CONTENTS - R equivalent derivation and functions

1. SAS: Metadata Extraction Using PROC CONTENTS

Input

data work.people;
  input ID Age Gender $;
  label 
    ID     = "Participant ID"
    Age    = "Age in Years"
    Gender = "Self-reported Gender";
  datalines;
1 25 M
2 30 F
3 28 M
;
run;

proc contents data=work.people out=work.meta noprint;
run;

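Output (selected columns of work.meta)

| NAME | TYPE | LENGTH | FORMAT | INFORMAT | LABEL | VARNUM |
| --- | --- | --- | --- | --- | --- | --- |
| ID | 1 | 8 | | | Participant ID | 1 |
| Age | 1 | 8 | | | Age in Years | 2 |
| Gender | 2 | 8 | | | Self-reported Gender | 3 |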

Explanation

  • NAME: Variable name.
  • TYPE: 1 = numeric, 2 = character.
  • LENGTH: Byte size allocated for storage (default = 8).
  • FORMAT / INFORMAT: Display/input formats (blank if none assigned).
  • LABEL: The user-defined label assigned in the LABEL statement.
  • VARNUM: Position of the variable in the original dataset.

2. R: Metadata Extraction Using Custom Code - equivalent to PROC CONTENTS

Input

# Step 1: Create the data frame with NA and blank string values
people <- data.frame(
  ID = c(1, NA, 3),
  Age = c(25, 30, NA),
  Gender = c("M", "", NA),  # One empty string and one NA
  stringsAsFactors = FALSE
)

# Step 2: Add variable-level labels
attr(people$ID, "label") <- "Participant ID"
attr(people$Age, "label") <- "Age in Years"
attr(people$Gender, "label") <- "Self-reported Gender"

# Step 3: Extract metadata
meta <- data.frame(
  Name = names(people),
  Class = sapply(people, class),
  Type = sapply(people, typeof),
  Length = sapply(people, length),
  NA_Count = sapply(people, function(x) sum(is.na(x))),
  Null_String_Count = sapply(people, function(x) {
    if (is.character(x)) sum(trimws(x) == "", na.rm = TRUE) else NA
  }),
  Label = sapply(people, function(x) {
    label <- attr(x, "label")
    if (!is.null(label)) label else NA
  }),
  stringsAsFactors = FALSE
)

# Step 4: Print metadata summary
print(meta)

Output

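| Name | Class | Type | Length | NA_Count | Null_String_Count | Label |
| --- | --- | --- | --- | --- | --- | --- |
| ID | numeric | double | 3 | 1 | NA | Participant ID |
| Age | numeric | double | 3 | 1 | NA | Age in Years |
| Gender | character | character | 3 | 1 | 1 | Self-reported Gender |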

Explanation

This code creates a simple 3-row data.frame called people, which mimics a small SAS dataset. Each metadata attribute is then computed systematically:

  • names() extracts the variable names.

  • class() returns the broad object class, such as "numeric" or "character", useful for modeling and method dispatch.

  • typeof() gives the internal storage type, like "double" or "integer", allowing for low-level validation or memory profiling.

  • length() captures the number of elements in each vector (typically equal to the number of rows in the data frame).

  • sum(is.na(x)) counts how many values in each column are explicitly marked as NA, a key quality-check metric.

  • sum(trimws(x) == "", na.rm = TRUE) (character columns only) counts blank or whitespace-only strings, which often represent incomplete user input that was never recorded as NA. Non-character columns get NA for this metric.
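The blank-string check above can be taken one step further by recoding blanks as NA, so that a single is.na() test covers both kinds of missingness. This is a minimal sketch: the vector `gender` and the names `is_blank` and `gender_clean` are illustrative and do not appear in the original script.

```r
# Illustrative vector mirroring the Gender column: a blank, an NA, and a whitespace-only entry
gender <- c("M", "", NA, "  ")

# Flag blank or whitespace-only entries; the !is.na() guard keeps NA out of the logical index
is_blank <- !is.na(gender) & trimws(gender) == ""

# Recode blanks as NA on a copy, leaving the original vector untouched
gender_clean <- gender
gender_clean[is_blank] <- NA_character_

sum(is_blank)             # number of blank entries found
sum(is.na(gender_clean))  # missing values after recoding
```

Whether blanks should be treated as missing is a data-management decision, so the recoding is done on a copy here.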

Compiling these results into a single meta object gives a tabulated metadata summary that suits exploratory analysis and structural audits, and the same code can be wrapped in a reusable function for data validation within R pipelines. Column labels added with attr() document each variable, but base R operations that rebuild a vector (for example, subsetting with [) drop such attributes, so labels should be re-checked after transformations.
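One way to turn the metadata extraction into a reusable function is sketched below. The function name `contents_r` and the demo data frame are hypothetical. It uses vapply() rather than sapply() so each column of the result has a guaranteed type, and it keeps only the first class of each variable, which is an assumption that matters for multi-class columns such as date-times.

```r
# Hypothetical helper: returns one metadata row per column of `df`
contents_r <- function(df) {
  stopifnot(is.data.frame(df))
  data.frame(
    Name     = names(df),
    # class() can return several values; keep the first so the result stays a vector
    Class    = vapply(df, function(x) class(x)[1], character(1)),
    Type     = vapply(df, typeof, character(1)),
    NA_Count = vapply(df, function(x) sum(is.na(x)), integer(1)),
    Label    = vapply(df, function(x) {
      label <- attr(x, "label", exact = TRUE)
      if (is.null(label)) NA_character_ else as.character(label)[1]
    }, character(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Small demo data frame (illustrative values)
demo <- data.frame(
  ID = c(1, NA, 3),
  Gender = c("M", "", NA),
  stringsAsFactors = FALSE
)
attr(demo$ID, "label") <- "Participant ID"

meta_demo <- contents_r(demo)
print(meta_demo)
```

Calling `contents_r()` at the start and end of a pipeline gives two snapshots that can be compared to detect dropped labels or changed types.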


3. Helpful R Functions to extract metadata details

# Dummy data for execution (base R only; no packages required)
df <- data.frame(
  ID = c(1, 2, 3),
  Age = c(25, 30, 28),
  Gender = c("M", "F", "M"),
  stringsAsFactors = FALSE
)

**1.** str(df)

Input:

str(df)

Output:

'data.frame':	3 obs. of  3 variables:
 $ ID    : num  1 2 3
 $ Age   : num  25 30 28
 $ Gender: chr  "M" "F" "M"

Explanation:
Quick structural overview including types and sample values. Often used for debugging and sanity checks in pipelines.


**2.** attributes(df)

Input:

attributes(df)

Output:

$names
[1] "ID" "Age" "Gender"

$class
[1] "data.frame"

$row.names
[1] 1 2 3

Explanation:
Displays frame-level attributes: column names, class, and row names. Useful for spotting extra attributes added by import tools.
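attributes(df) reports only what is attached to the data frame itself. Attributes stored on individual columns, such as a "label", need a per-column look. A small sketch, using an illustrative data frame `df_attr` that is not part of the original example:

```r
# Illustrative data frame with a label on one column only
df_attr <- data.frame(ID = c(1, 2, 3), Age = c(25, 30, 28))
attr(df_attr$ID, "label") <- "Participant ID"

# Frame-level attributes: the column label does not appear here
frame_attr_names <- names(attributes(df_attr))
print(frame_attr_names)

# Column-level attributes have to be inspected one column at a time
col_attrs <- lapply(df_attr, attributes)
print(col_attrs$ID)   # a list holding the "label" attribute
print(col_attrs$Age)  # NULL: a plain numeric vector carries no attributes
```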


**3.** names(df) / colnames(df)

Input:

names(df)

Output:

[1] "ID" "Age" "Gender"

Explanation:
Lists all column names (names() and colnames() return the same result for a data frame). Useful for name validation, renaming, or aligning columns before joins.
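Name validation of the kind mentioned above can be sketched with setdiff(). The data frame `df_names` and the `expected` vector are made up for illustration, so substitute your own required column list.

```r
# Illustrative data frame where one column arrived under a different name
df_names <- data.frame(
  ID  = c(1, 2, 3),
  Age = c(25, 30, 28),
  Sex = c("M", "F", "M"),
  stringsAsFactors = FALSE
)
expected <- c("ID", "Age", "Gender")  # hypothetical required columns

# Columns that are required but absent, and columns that are present but not expected
missing_cols    <- setdiff(expected, names(df_names))
unexpected_cols <- setdiff(names(df_names), expected)

# Rename the stray column so downstream joins line up
names(df_names)[names(df_names) == "Sex"] <- "Gender"
```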


**4.** sapply(df, class) and sapply(df, typeof)

Input:

sapply(df, class)
sapply(df, typeof)

Output:

         ID         Age      Gender 
  "numeric"   "numeric" "character" 
         ID         Age      Gender 
   "double"    "double" "character" 

Explanation:

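  • class: the high-level class R uses for method dispatch (for example, which print() or summary() method runs).

  • typeof: the internal storage type of the underlying vector.

The two can differ (a factor has class "factor" but is stored as "integer"), so check both before modeling or when passing data to other tools.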
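The difference between class and typeof is easiest to see on columns where the two disagree. The sketch below uses an illustrative data frame `examples` with an integer, a factor, and a Date column.

```r
# Illustrative columns whose class and storage type differ
examples <- data.frame(
  count = c(1L, 2L, 3L),
  group = factor(c("a", "b", "a")),
  visit = as.Date(c("2024-01-01", "2024-01-02", "2024-01-03"))
)

# First class of each column, and its internal storage type
col_class  <- vapply(examples, function(x) class(x)[1], character(1))
col_typeof <- vapply(examples, typeof, character(1))

# Side-by-side view: factors are stored as integers, Dates as doubles
print(data.frame(class = col_class, typeof = col_typeof))
```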


**5.** length(), nrow(), ncol()

Input:

length(df) # Returns the number of columns in the data frame.
nrow(df) # Returns the number of rows (observations) in the data frame.
ncol(df) # Also returns the number of columns in the data frame.

Output:

[1] 3
[1] 3
[1] 3

Explanation:
Reports the dimensions of the data frame. Use these to raise an alert when the expected number of rows or columns changes.
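A structural alert along these lines could look like the following sketch. The helper `check_shape` and its arguments are hypothetical. It returns a character vector of problems, leaving the caller to decide whether to stop or only warn.

```r
# Hypothetical helper: collects structural problems instead of failing on the first one
check_shape <- function(df, expected_cols, min_rows = 1) {
  problems <- character(0)
  if (ncol(df) != expected_cols) {
    problems <- c(problems,
                  sprintf("expected %d columns, found %d", expected_cols, ncol(df)))
  }
  if (nrow(df) < min_rows) {
    problems <- c(problems,
                  sprintf("expected at least %d rows, found %d", min_rows, nrow(df)))
  }
  problems
}

df_shape <- data.frame(ID = c(1, 2, 3), Age = c(25, 30, 28))

ok_result  <- check_shape(df_shape, expected_cols = 2)                 # no problems
bad_result <- check_shape(df_shape, expected_cols = 3, min_rows = 10)  # two problems
print(bad_result)
```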


**6.** summary(df)

Input:

summary(df)

Output:

       ID           Age           Gender         
 Min.   :1.0   Min.   :25.00   Length:3          
 1st Qu.:1.5   1st Qu.:26.50   Class :character  
 Median :2.0   Median :28.00   Mode  :character  
 Mean   :2.0   Mean   :27.67                     
 3rd Qu.:2.5   3rd Qu.:29.00                     
 Max.   :3.0   Max.   :30.00                     

Explanation:
Exploratory summary of each variable. You’d use this early in a pipeline to flag anomalies and get statistical context.
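For a character column such as Gender, summary() reports only length, class, and mode. Converting the column to a factor first yields per-level counts instead. A short sketch with an illustrative data frame `df_fac`:

```r
# Illustrative data: Gender stored as a factor rather than character
df_fac <- data.frame(
  Age    = c(25, 30, 28),
  Gender = factor(c("M", "F", "M"))
)

# summary() on a factor returns a named integer vector of counts per level
gender_counts <- summary(df_fac$Gender)
print(gender_counts)

# summary() on the whole data frame now shows counts in the Gender column
print(summary(df_fac))
```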


**7.** sapply(df, function(x) sum(is.na(x)))

Input:

sapply(df, function(x) sum(is.na(x)))

Output:

    ID    Age Gender 
     0      0      0 

Explanation:
Shows column-wise missing data counts. Crucial for pre-modeling QC and reporting.
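Raw counts are harder to compare across datasets of different sizes, so a QC report often shows the percentage missing as well. The sketch below uses colSums() and colMeans() on is.na(). The data frame `df_na` and the 50% threshold are illustrative assumptions.

```r
# Illustrative data with some missing values
df_na <- data.frame(
  ID     = c(1, NA, 3),
  Age    = c(25, 30, NA),
  Gender = c("M", NA, NA),
  stringsAsFactors = FALSE
)

# is.na() on a data frame returns a logical matrix, so column sums and means work directly
na_count <- colSums(is.na(df_na))
na_pct   <- round(100 * colMeans(is.na(df_na)), 1)

# Columns exceeding a hypothetical 50% missingness threshold
high_missing <- names(na_pct)[na_pct > 50]

print(data.frame(na_count, na_pct))
```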


**8.** Hmisc::describe(df)

Input:

# Dummy data for execution
# install.packages("Hmisc")  # run once if the package is not installed
library(Hmisc)
df <- data.frame(
  ID = c(1, 2, 3),
  Age = c(25, 30, 28),
  Gender = c("M", "F", "M"),
  stringsAsFactors = FALSE
)
describe(df)

Output (abridged):

n missing distinct Info Mean Gmd
ID 3 0 3 1 2 1

Explanation:
Provides a detailed per-variable summary: labels (when present), number of distinct values, missing counts, and distribution statistics. Well suited to documentation and structured audits.


**4. Comparison Table: Function and Purpose**

| Function | Purpose | Output Type |
| --- | --- | --- |
| str() | Compact structure summary | Console output |
| attributes() | Internal metadata (names, class) | List |
| names() | Column name vector | Character vector |
| class() / typeof() | Variable type vs storage type | Named vector |
| length(), nrow(), ncol() | Dimensions | Integer |
| summary() | Descriptive statistics | Multi-line console |
| sapply() | Batch-wise metadata extraction | Named vector |
| describe() (Hmisc) | Detailed metadata, including frequency/info | Console + object |

**5. SAS vs R Comparison Snapshot**

| Feature | SAS (PROC CONTENTS) | R (Custom Script) |
| --- | --- | --- |
| Built-in Metadata | Yes | No (requires scripting) |
| Output Format | Structured dataset | Custom data frame |
| Variable Order | Preserved via VARNUM | Preserved, not explicitly indexed |
| Data Types | Numeric/Character | More granular (e.g., double, factor) |
| Labels/Formats | Automatically stored | Must be manually added |
| Missing Count | Not shown by default | Easily computed with sapply() |
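Two of the gaps in the comparison, the explicit variable index and the two-value type code, can be approximated in R. This is a minimal sketch, assuming that anything is.numeric() rejects should be coded as character (TYPE = 2), which includes factors and logicals. The names `people_demo` and `sas_like` are illustrative, and LENGTH is omitted because R vectors have no fixed per-variable byte length comparable to the SAS one.

```r
# Illustrative data frame matching the SAS example
people_demo <- data.frame(
  ID     = c(1, 2, 3),
  Age    = c(25, 30, 28),
  Gender = c("M", "F", "M"),
  stringsAsFactors = FALSE
)

# SAS-style layout: TYPE 1 = numeric, 2 = everything else; VARNUM = column position
sas_like <- data.frame(
  NAME   = names(people_demo),
  TYPE   = ifelse(vapply(people_demo, is.numeric, logical(1)), 1L, 2L),
  VARNUM = seq_along(people_demo),
  row.names = NULL,
  stringsAsFactors = FALSE
)

print(sas_like)
```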
