contact@a2zlearners.com

1.5. SAS PROCEDURES -> Equivalent in R

1.5.11. PROC Import CSV file in SAS vs R equivalent

1. Basic CSV Import in SAS vs R

# Dummy data creation and export to CSV
my_data <- data.frame(
  ID = c(1, 2, 3, NA),
  Name = c("John", "Jane", "Bob", "Maria"),
  Age = c(25, 30, NA, 28),
  Date = c("2023-01-01", "2023-02-15", "2023-03-20", "2023-04-01"),
  Comment = c("First entry, with comma", "Second entry", "Missing age", "Missing ID"),
  stringsAsFactors = FALSE
)
write.csv(my_data, "myfile.csv", row.names = FALSE, na = "")
Capability       | SAS (PROC IMPORT)     | R (Base and readr)
Basic import     | PROC IMPORT DBMS=CSV  | read.csv() or readr::read_csv()
Header handling  | GETNAMES=YES/NO       | header = TRUE/FALSE or col_names = TRUE/FALSE
Delimiter        | DELIMITER=','         | sep = "," or detected automatically
Missing values   | Default handling      | na.strings = "" or na = ""
Output           | SAS dataset           | Data frame or tibble

SAS Example

PROC IMPORT DATAFILE="myfile.csv"
    OUT=my_data
    DBMS=CSV
    REPLACE;
    GETNAMES=YES;
RUN;

Explanation (SAS):

  • Uses PROC IMPORT with DBMS=CSV to read CSV files
  • GETNAMES=YES treats the first row as column headers
  • Automatically detects delimiters (usually commas)
  • Outputs a SAS dataset named my_data

R Base Example

my_data <- read.csv("myfile.csv", header = TRUE, sep = ",", na.strings = "")
head(my_data)

Output:

> head(my_data)
  ID  Name Age       Date                 Comment
1  1  John  25 2023-01-01 First entry, with comma
2  2  Jane  30 2023-02-15            Second entry
3  3   Bob  NA 2023-03-20             Missing age
4 NA Maria  28 2023-04-01              Missing ID

Explanation (R Base):

  • Uses base R's read.csv() function
  • header = TRUE treats the first row as column names
  • sep = "," specifies comma delimiter
  • na.strings = "" treats empty strings as NA values
  • Returns a data frame stored in my_data
  • Date is read as character; convert with as.Date() if needed

R readr Example

library(readr)
my_data <- read_csv("myfile.csv")
head(my_data)

Output:

> head(my_data)
# A tibble: 4 × 5
     ID Name    Age Date       Comment                
  <dbl> <chr> <dbl> <date>     <chr>                  
1     1 John     25 2023-01-01 First entry, with comma
2     2 Jane     30 2023-02-15 Second entry           
3     3 Bob      NA 2023-03-20 Missing age            
4    NA Maria    28 2023-04-01 Missing ID 

Explanation (R readr):

  • Uses tidyverse's read_csv() function
  • More efficient than base R for large files
  • Automatically handles headers and common delimiters
  • Returns a tibble (modern data frame)

2. Handling Different Delimiters

Capability          | SAS                         | R
Comma-separated     | DELIMITER=',' (default)     | read.csv() or read_csv()
Tab-separated       | DELIMITER='09'x             | read.delim() or read_tsv()
Semicolon-separated | DELIMITER=';'               | read.csv2() or read_csv2()
Custom delimiter    | DELIMITER='|' with DBMS=DLM | read.delim(sep = ...) or read_delim(delim = ...)

SAS Example with Custom Delimiter

PROC IMPORT DATAFILE="myfile.txt"
    OUT=pipe_data
    DBMS=DLM
    REPLACE;
    DELIMITER='|';
    GETNAMES=YES;
RUN;

Explanation (SAS):

  • Uses DBMS=DLM for general delimited files
  • DELIMITER='|' specifies pipe as the delimiter
  • Works with any custom delimiter character

R Example with Custom Delimiter

# Dummy data creation and export to pipe-delimited file
pipe_data <- data.frame(
  ID = c(1, 2, 3, NA),
  Name = c("John", "Jane", "Bob", "Maria"),
  Age = c(25, 30, NA, 28),
  Date = c("2023-01-01", "2023-02-15", "2023-03-20", "2023-04-01"),
  Comment = c("First entry, with pipe|char", "Second entry", "Missing age", "Missing ID"),
  stringsAsFactors = FALSE
)
write.table(pipe_data, "myfile.txt", sep = "|", row.names = FALSE, na = "", quote = TRUE)

# Base R
pipe_data <- read.delim("myfile.txt", sep = "|", header = TRUE)

# Using readr
library(readr)
pipe_data <- read_delim("myfile.txt", delim = "|")

Output:

> pipe_data
  ID  Name Age       Date                     Comment
1  1  John  25 2023-01-01 First entry, with pipe|char
2  2  Jane  30 2023-02-15                Second entry
3  3   Bob  NA 2023-03-20                 Missing age
4 NA Maria  28 2023-04-01                  Missing ID

Explanation (R):

  • Base R: read.delim() with custom sep parameter
  • readr: read_delim() with explicit delim parameter
  • Both handle any single-character delimiter

Example with European Format (Semicolon and Decimal Comma)

# Dummy data creation and export to European format CSV
euro_data <- data.frame(
  ID = c(1, 2, 3, NA),
  Name = c("John", "Jane", "Bob", "Maria"),
  Amount = c(1234.56, 567.89, 98.76, 1234.56),
  stringsAsFactors = FALSE
)
write.csv2(euro_data, "european_data.csv", row.names = FALSE, na = "")

# Base R
euro_data <- read.csv2("european_data.csv")

# Using readr
library(readr)
euro_data <- read_csv2("european_data.csv")

Output:

> euro_data
# A tibble: 4 × 3
     ID Name  Amount
  <dbl> <chr>  <dbl>
1     1 John  1235. 
2     2 Jane   568. 
3     3 Bob     98.8
4    NA Maria 1235.

Explanation (R European Format):

  • read.csv2() and read_csv2() are specialized for European format
  • Uses semicolons (;) as field separators
  • Uses commas (,) as decimal points
  • Common in countries that use comma as decimal separator
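Under the hood, read.csv2() is just a preset: in base R the same behaviour comes from read.csv()'s general sep and dec arguments, which also cover less common separator/decimal combinations. A minimal base-R sketch (the file contents are illustrative):

```r
# read.csv2(path) is equivalent to read.csv(path, sep = ";", dec = ",")
path <- tempfile(fileext = ".csv")
writeLines(c("ID;Name;Amount",
             "1;John;1234,56",
             "2;Jane;567,89"), path)

# Semicolon fields, decimal comma -> proper numeric values
euro <- read.csv(path, sep = ";", dec = ",")
euro$Amount   # numeric: 1234.56 567.89
```

The same two arguments generalize to any locale-specific pairing, without needing a dedicated helper function.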

3. Handling Column Types and Missing Values

Capability        | SAS                           | R
Column types      | Post-import with FORMAT/INPUT | colClasses or col_types
Missing values    | Default or with DBSASTYPE     | na.strings or na
Handling bad data | Limited at import time        | problems() function (readr)

SAS Example

PROC IMPORT DATAFILE="myfile.csv"
    OUT=typed_data
    DBMS=CSV
    REPLACE;
    GETNAMES=YES;
RUN;

DATA typed_data;
    SET typed_data;
    FORMAT date_col MMDDYY10.;
    numeric_col = INPUT(char_num, 8.);
RUN;

Explanation (SAS):

  • SAS typically handles type conversion after import
  • Uses DATA step to apply formats or convert types
  • Less control during the initial import process

R Example

# Dummy data creation and export to CSV
typed_data <- data.frame(
  col1 = c("A", "B", "C"),
  col2 = c(1.1, 2.2, 3.3),
  col3 = as.Date(c("2023-01-01", "2023-02-01", "2023-03-01")),
  stringsAsFactors = FALSE
)
write.csv(typed_data, "myfile.csv", row.names = FALSE, na = "")

# Base R
typed_data <- read.csv("myfile.csv", 
                       colClasses = c("character", "numeric", "Date"),
                       na.strings = c("", "NA", "N/A"))

# Using readr
library(readr)
typed_data <- read_csv("myfile.csv",
                       col_types = cols(
                           col1 = col_character(),
                           col2 = col_double(),
                           col3 = col_date()
                       ),
                       na = c("", "NA", "N/A"))

# Check for parsing problems
problems(typed_data)

Output:

> problems(typed_data)
# A tibble: 0 × 5
# ℹ 5 variables: row <int>, col <int>, expected <chr>, actual <chr>, file <chr>
> typed_data
# A tibble: 3 × 3
  col1   col2 col3      
  <chr> <dbl> <date>    
1 A       1.1 2023-01-01
2 B       2.2 2023-02-01
3 C       3.3 2023-03-01

Explanation (R):

  • Base R: Uses colClasses to specify column types
  • readr: More flexible col_types specification
  • Both allow multiple NA value definitions
  • readr provides problems() to identify parsing issues
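The example above reports zero problems because the file is clean. To see problems() earn its keep, force a type mismatch; a small sketch, assuming readr 2.0+ (for the I() literal-text helper):

```r
library(readr)

# A value that cannot be parsed as a double triggers a recorded problem
bad <- read_csv(I("x\n1\noops\n3\n"),
                col_types = cols(x = col_double()))

bad$x           # 1 NA 3 -- the unparseable cell becomes NA
problems(bad)   # reports the row, the expected type, and the actual value "oops"
```

This is the key difference from base R, which silently coerces or errors without a structured report of what went wrong where.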

4. Skipping Rows and Managing Headers

Capability          | SAS                  | R
Skip rows           | DATAROW=n            | skip = n, or skip = n-1 with header = TRUE
No column names     | GETNAMES=NO          | header = FALSE or col_names = FALSE
Custom column names | Post-import renaming | col.names or col_names vector

SAS Example

PROC IMPORT DATAFILE="myfile.csv"
    OUT=data_with_headers
    DBMS=CSV
    REPLACE;
    DATAROW=3;
    GETNAMES=NO;
RUN;

DATA data_with_headers;
    SET data_with_headers;
    RENAME VAR1=ID VAR2=Name VAR3=Value;
RUN;

Explanation (SAS):

  • DATAROW=3 starts reading from the third row
  • GETNAMES=NO doesn't use any row for headers
  • Post-import renaming with DATA step

R Example

# Dummy data creation and export to CSV
data_with_headers <- data.frame(
  ID = c(1, 2, 3),
  Name = c("John", "Jane", "Bob"),
  Value = c(100, 200, 300),
  stringsAsFactors = FALSE
)
write.csv(data_with_headers, "myfile.csv", row.names = FALSE, na = "")

# Base R
data_with_headers <- read.csv("myfile.csv", 
                             skip = 2, 
                             header = FALSE,
                             col.names = c("ID", "Name", "Value"))

# Using readr
library(readr)
data_with_headers <- read_csv("myfile.csv",
                              skip = 2,
                              col_names = c("ID", "Name", "Value"))
s <- spec(data_with_headers)

Output:

> data_with_headers
# A tibble: 2 × 3
     ID Name  Value
  <dbl> <chr> <dbl>
1     2 Jane    200
2     3 Bob     300
> s
cols(
  ID = col_double(),
  Name = col_character(),
  Value = col_double()
)

Explanation (R):

  • skip = 2 ignores the first two rows
  • Setting custom column names with col.names or col_names
  • More straightforward naming during import compared to SAS
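The other combination from the table, skip = n with header = TRUE, is useful when metadata lines sit above a real header row. A base-R sketch with an illustrative file:

```r
# A file with two metadata lines above the actual header
path <- tempfile(fileext = ".csv")
writeLines(c("# Export from system X",
             "# Generated 2023-05-01",
             "ID,Name,Value",
             "1,John,100",
             "2,Jane,200"), path)

# skip = 2 discards the metadata; the next line is then used as the header
meta_data <- read.csv(path, skip = 2, header = TRUE)
meta_data
#   ID Name Value
# 1  1 John   100
# 2  2 Jane   200
```

Note that read.csv() does not treat "#" lines as comments by default (comment.char = ""), so skip is needed here.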

5. Input/Output Examples

Input CSV Example:

ID,Name,Age,Date,Comment
1,John,25,2023-01-01,"First entry, with comma"
2,Jane,30,2023-02-15,Second entry
3,Bob,,2023-03-20,Missing age
,Maria,28,2023-04-01,Missing ID

Expected Output in SAS:

ID  Name   Age  Date        Comment
1   John   25   2023-01-01  First entry, with comma
2   Jane   30   2023-02-15  Second entry
3   Bob    .    2023-03-20  Missing age
.   Maria  28   2023-04-01  Missing ID

Expected Output in R:

ID  Name   Age  Date        Comment
1   John   25   2023-01-01  First entry, with comma
2   Jane   30   2023-02-15  Second entry
3   Bob    NA   2023-03-20  Missing age
NA  Maria  28   2023-04-01  Missing ID
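The expected R behaviour above can be verified directly from the CSV text using base R's text = argument (a quick self-check, not part of the normal import workflow):

```r
# Read the example CSV from an in-line string instead of a file
csv_text <- 'ID,Name,Age,Date,Comment
1,John,25,2023-01-01,"First entry, with comma"
2,Jane,30,2023-02-15,Second entry
3,Bob,,2023-03-20,Missing age
,Maria,28,2023-04-01,Missing ID'

check <- read.csv(text = csv_text, na.strings = "")
is.na(check$Age)   # FALSE FALSE  TRUE FALSE  -- Bob's missing age
is.na(check$ID)    # FALSE FALSE FALSE  TRUE  -- Maria's missing ID
```

The quoted comma in row 1 stays inside the Comment field, matching SAS's DSD behaviour.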

6. Beyond Basics: Performance Optimization

Capability          | SAS                          | R
Large file handling | BUFSIZE= option              | data.table::fread() or vroom::vroom()
Chunked reading     | Custom DATA step with INFILE | readr::read_csv_chunked() with callbacks
Parallel processing | Limited native support       | parallel package or multidplyr

SAS Advanced Example

/* Efficient reading of large CSV */
DATA large_data;
    INFILE "large_file.csv" DSD FIRSTOBS=2 BUFSIZE=32768;
    INPUT id name :$20. age date :YYMMDD10. comment :$100.;
RUN;

Explanation (SAS Advanced):

  • Uses direct INFILE statement instead of PROC IMPORT
  • BUFSIZE=32768 increases buffer size for better performance
  • Explicit column definitions for more control
  • Better for large files than standard PROC IMPORT

R Advanced Example

# Load necessary libraries
library(data.table)   # For fast fread/fwrite

# 1. Create dummy large data
set.seed(123) # For reproducibility
large_data <- data.frame(
  id = 1:10000,
  name = paste0("Name", 1:10000),
  age = sample(18:80, 10000, replace = TRUE),
  comment = sample(c("A", "B", "C"), 10000, replace = TRUE),
  stringsAsFactors = FALSE
)

# 2. Export to CSV quickly (fwrite is MUCH faster than write.csv)
fwrite(large_data, "large_file.csv", na = "")

# 3. Read file efficiently, selecting only needed columns (id, name, age)
large_data_read <- fread(
  "large_file.csv",
  nThread = 4,                        # Use 4 threads (adjust as per your CPU)
  select = c("id", "name", "age")     # Only these columns will be read
  # drop = "comment"                  # Alternatively, use drop to ignore columns
)

# 4. Display first few rows as a quick check
print(head(large_data_read))

Output:

> print(head(large_data_read))
      id   name   age
   <int> <char> <int>
1:     1  Name1    48
2:     2  Name2    32
3:     3  Name3    68
4:     4  Name4    31
5:     5  Name5    20
6:     6  Name6    59

Explanation (R Advanced):

  • data.table::fread() is optimized for large files
  • nThread = 4 enables parallel processing
  • Selective column reading improves performance
  • Can be 10-20x faster than base R for large files
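The chunked-reading capability in the table can be sketched with readr::read_csv_chunked(), which feeds each chunk to a callback so the whole file never sits in memory at once. This assumes the readr package is installed; a small dummy file stands in for a genuinely large one:

```r
library(readr)

# Create a dummy CSV to stream
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:10000,
                     age = sample(18:80, 10000, replace = TRUE)),
          path, row.names = FALSE)

# Summarize each chunk as it is read; results are row-bound automatically
chunk_summary <- read_csv_chunked(
  path,
  callback = DataFrameCallback$new(function(chunk, pos) {
    data.frame(start_row = pos, rows = nrow(chunk), mean_age = mean(chunk$age))
  }),
  chunk_size = 2500,   # rows read per chunk
  col_types = cols()
)
chunk_summary          # 4 rows: one summary per 2,500-row chunk
```

This pattern keeps memory usage bounded by chunk_size, which is the practical alternative when even fread() cannot hold the full file.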

7. Beyond Basics: Advanced Processing

Capability               | SAS                      | R
Processing during import | Custom INFILE statements | readr with transforming functions
Column filtering         | Limited at import time   | col_select during import (readr)
Encoding                 | ENCODING= option         | fileEncoding (base) or locale(encoding = ...) (readr)
Validation during import | Limited                  | col_types specification

SAS Advanced Example with Validation

/* Advanced data validation during import */
DATA validated_data;
    INFILE "input.csv" DSD DELIMITER=';' FIRSTOBS=2 MISSOVER ENCODING="latin1";
    
    INPUT id age name :$30.;
    
    /* Validation checks */
    IF id <= 0 THEN DELETE;
    IF age < 18 OR age > 100 THEN age = .;
    IF name = "" THEN name = "Unknown";
RUN;

Explanation (SAS Advanced Validation):

  • Uses DATA step for more control over import process
  • Applies validation rules during import
  • Handles special cases like negative IDs or invalid ages

R Advanced Example with Validation

# Dummy data creation and export to CSV
validated_data <- data.frame(
  id = c(1, 2, -1, 4),
  age = c(25, 17, 101, 40),
  name = c("Alice", "", "Charlie", "David"),
  stringsAsFactors = FALSE
)
write.csv(validated_data, "input.csv", row.names = FALSE, na = "")

library(readr)
library(dplyr)

validated_data <- read_csv("input.csv", 
    col_types = cols(
        id = col_integer(),
        age = col_integer(),
        name = col_character()
    ),
    col_select = c(id, age, name),  # Only these columns
    name_repair = "unique"          # Fix duplicate names
)

# Post-import validation
validated_data <- validated_data %>%
    filter(id > 0) %>%
    mutate(
        age = if_else(age < 18 | age > 100, NA_integer_, age),
        # readr reads "" as NA by default, so check is.na() as well
        name = if_else(is.na(name) | name == "", "Unknown", name)
    )

Output:

> validated_data
# A tibble: 3 × 3
     id   age name   
  <int> <int> <chr>  
1     1    25 Alice  
2     2    NA Unknown
3     4    40 David  

Explanation (R Advanced Validation):

  • Uses col_types for explicit type checking
  • col_select to only import needed columns
  • name_repair handles duplicate column names
  • Post-import processing with tidyverse functions
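The encoding row from the capability table can be sketched in base R with fileEncoding (readr users would pass locale = locale(encoding = "latin1") instead); the file and names here are illustrative:

```r
# Write a Latin-1 encoded CSV, then read it back with the right declaration
path <- tempfile(fileext = ".csv")
con <- file(path, open = "w", encoding = "latin1")
writeLines(c("id,name", "1,José", "2,André"), con)
close(con)

# fileEncoding tells read.csv how the bytes on disk are encoded
latin_data <- read.csv(path, fileEncoding = "latin1")
latin_data$name   # "José" "André" -- accented characters decoded correctly
```

Omitting the encoding declaration on a non-UTF-8 file typically produces mojibake in character columns, so it is worth setting explicitly whenever files come from legacy systems.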

8. Beyond Basics: Input/Output Examples

SAS Code for Advanced Processing:

/* Advanced processing for complex CSV */
FILENAME MYSRC "messy_data.csv";

DATA cleaned_data;
    INFILE MYSRC DSD DELIMITER=';' FIRSTOBS=2 MISSOVER ENCODING="latin1";
    
    INPUT Customer_ID :$10. 
          Name :$50. 
          Purchase_Date :$10. 
          Amount_Raw :$20. 
          Notes :$100.;
    
    /* Date conversion from European format (dd/mm/yyyy) */
    IF Purchase_Date NE "" THEN DO;
        day = SCAN(Purchase_Date, 1, '/');
        month = SCAN(Purchase_Date, 2, '/');
        year = SCAN(Purchase_Date, 3, '/');
        Purchase_Dt = MDY(INPUT(month, 8.), INPUT(day, 8.), INPUT(year, 8.));
    END;
    FORMAT Purchase_Dt DATE9.;
    
    /* Amount conversion from European format */
    Amount_Raw = COMPRESS(Amount_Raw, '.');      /* remove thousand separators */
    Amount_Raw = TRANWRD(Amount_Raw, ',', '.');  /* decimal comma to point */
    Amount = INPUT(Amount_Raw, BEST.);
    
    /* Derived column */
    IF FIND(Notes, "priority") > 0 THEN Priority_Customer = "Yes";
    ELSE Priority_Customer = "No";
    
    DROP Amount_Raw Purchase_Date day month year;
RUN;

R Code for Advanced Processing:

# Load necessary libraries
library(readr)
library(dplyr)
library(lubridate)

# 1. Create and write messy European CSV
messy_data <- data.frame(
  `Customer ID` = c("001", "002", "003", NA),
  Name = c("John Smith", "Jane Doe", "Bob Jones", "Maria Garcia"),
  `Purchase Date` = c("01/04/2023", "15/04/2023", "", "30/04/2023"),
  `Amount (€)` = c("1,234.56", "567,89", "98,76", "1.234,56"),
  Notes = c("First purchase, priority customer", "Recurring", "Missing date", "European format, ID missing"),
  stringsAsFactors = FALSE,
  check.names = FALSE  # keep the messy header names exactly as written
)
write.table(
  messy_data,
  "messy_data.csv",
  sep = ";",
  row.names = FALSE,
  na = "",
  quote = TRUE,
  fileEncoding = "UTF-8"
)

# 2. Function to convert various European number formats to numeric
convert_amount <- function(x) {
  x <- gsub("\\.", "", x)     # Remove thousand separators (dot)
  x <- gsub(",", ".", x)      # Convert decimal comma to decimal point
  as.numeric(x)
}

# 3. Read and process the "messy" CSV file
raw_data <- read_delim(
  "messy_data.csv",
  delim = ";",
  locale = locale(decimal_mark = ","),
  col_types = cols(.default = col_character())
)

# 4. Clean column names: strip quotes, then trim whitespace, make syntactically valid
colnames(raw_data) <- gsub('^"|"$', '', colnames(raw_data)) # strip quotes
colnames(raw_data) <- trimws(colnames(raw_data))            # trim whitespace
colnames(raw_data) <- make.names(colnames(raw_data))        # ensure valid column names

# (Optional) Print for debugging
print(colnames(raw_data))

# 5. Rename with the column names as they now appear
# Since make.names replaces spaces with dots, do the same in rename:
# Customer ID        --> Customer.ID
# Purchase Date      --> Purchase.Date
# Amount (€)         --> Amount....

cleaned_data <- raw_data %>%
  rename(
    customer_id      = Customer.ID,
    name             = Name,
    purchase_date_raw = Purchase.Date,
    amount_raw       = Amount....,
    notes            = Notes
  ) %>%
  mutate(
    purchase_date    = suppressWarnings(dmy(purchase_date_raw)),
    amount           = convert_amount(amount_raw),
    priority_customer = grepl("priority", notes, ignore.case = TRUE)
  ) %>%
  select(customer_id, name, purchase_date, amount, priority_customer)

print(cleaned_data)

Output:

> print(cleaned_data)
# A tibble: 4 × 5
  customer_id name         purchase_date  amount priority_customer
  <chr>       <chr>        <date>          <dbl> <lgl>            
1 001         John Smith   2023-04-01       1.23 TRUE             
2 002         Jane Doe     2023-04-15     568.   FALSE            
3 003         Bob Jones    NA              98.8  FALSE            
4 NA          Maria Garcia 2023-04-30    1235.   FALSE 

Explanation (R Advanced Processing):

  • Uses read_delim() for custom delimiter handling
  • Applies lubridate for date parsing
  • Implements a custom function for numeric conversion
  • Utilizes dplyr for data manipulation and cleaning

**Resource download links**

1.5.11.-PROC-Import-CSV-file-in-SAS-vs-R-equivalent.zip