contact@a2zlearners.com

1.5. SAS PROCEDURES -> Equivalent in R

1.5.11. PROC Import CSV file in SAS vs R equivalent

1. Basic CSV Import in SAS vs R

# Dummy data creation and export to CSV
my_data <- data.frame(
  ID = c(1, 2, 3, NA),
  Name = c("John", "Jane", "Bob", "Maria"),
  Age = c(25, 30, NA, 28),
  Date = c("2023-01-01", "2023-02-15", "2023-03-20", "2023-04-01"),
  Comment = c("First entry, with comma", "Second entry", "Missing age", "Missing ID"),
  stringsAsFactors = FALSE
)
write.csv(my_data, "myfile.csv", row.names = FALSE, na = "")
Capability       | SAS (PROC IMPORT)     | R (Base and readr)
Basic import     | PROC IMPORT DBMS=CSV  | read.csv() or readr::read_csv()
Header handling  | GETNAMES=YES/NO       | header = TRUE/FALSE or col_names = TRUE/FALSE
Delimiter        | DELIMITER=','         | sep = "," or detected automatically
Missing values   | Default handling      | na.strings = "" or na = ""
Output           | SAS dataset           | Data frame or tibble

SAS Example

PROC IMPORT DATAFILE="myfile.csv"
    OUT=my_data
    DBMS=CSV
    REPLACE;
    GETNAMES=YES;
RUN;

Explanation (SAS):

  • Uses PROC IMPORT with DBMS=CSV to read CSV files
  • GETNAMES=YES treats the first row as column headers
  • Automatically detects delimiters (usually commas)
  • Outputs a SAS dataset named my_data

R Base Example

my_data <- read.csv("myfile.csv", header = TRUE, sep = ",", na.strings = "")
head(my_data)

Output:

> head(my_data)
  ID  Name Age       Date                 Comment
1  1  John  25 2023-01-01 First entry, with comma
2  2  Jane  30 2023-02-15            Second entry
3  3   Bob  NA 2023-03-20             Missing age
4 NA Maria  28 2023-04-01              Missing ID

Explanation (R Base):

  • Uses base R's read.csv() function
  • header = TRUE treats the first row as column names
  • sep = "," specifies comma delimiter
  • na.strings = "" treats empty strings as NA values
  • Returns a data frame stored in my_data
  • Date is read as character; convert with as.Date() if needed

R readr Example

library(readr)
my_data <- read_csv("myfile.csv")
head(my_data)

Output:

> head(my_data)
# A tibble: 4 × 5
     ID Name    Age Date       Comment                
  <dbl> <chr> <dbl> <date>     <chr>                  
1     1 John     25 2023-01-01 First entry, with comma
2     2 Jane     30 2023-02-15 Second entry           
3     3 Bob      NA 2023-03-20 Missing age            
4    NA Maria    28 2023-04-01 Missing ID 

Explanation (R readr):

  • Uses tidyverse's read_csv() function
  • More efficient than base R for large files
  • Automatically handles headers and common delimiters
  • Returns a tibble (modern data frame)

2. Handling Different Delimiters

Capability          | SAS                         | R
Comma-separated     | DELIMITER=',' (default)     | read.csv() or read_csv()
Tab-separated       | DELIMITER='09'x             | read.delim() or read_tsv()
Semicolon-separated | DELIMITER=';'               | read.csv2() or read_csv2()
Custom delimiter    | DELIMITER='|' with DBMS=DLM | read.delim(sep = ...) or read_delim(delim = ...)

SAS Example with Custom Delimiter

PROC IMPORT DATAFILE="myfile.txt"
    OUT=pipe_data
    DBMS=DLM
    REPLACE;
    DELIMITER='|';
    GETNAMES=YES;
RUN;

Explanation (SAS):

  • Uses DBMS=DLM for general delimited files
  • DELIMITER='|' specifies pipe as the delimiter
  • Works with any custom delimiter character

R Example with Custom Delimiter

# Dummy data creation and export to pipe-delimited file
pipe_data <- data.frame(
  ID = c(1, 2, 3, NA),
  Name = c("John", "Jane", "Bob", "Maria"),
  Age = c(25, 30, NA, 28),
  Date = c("2023-01-01", "2023-02-15", "2023-03-20", "2023-04-01"),
  Comment = c("First entry, with pipe|char", "Second entry", "Missing age", "Missing ID"),
  stringsAsFactors = FALSE
)
write.table(pipe_data, "myfile.txt", sep = "|", row.names = FALSE, na = "", quote = TRUE)

# Base R
pipe_data <- read.delim("myfile.txt", sep = "|", header = TRUE)

# Using readr
library(readr)
pipe_data <- read_delim("myfile.txt", delim = "|")

Output:

> pipe_data
  ID  Name Age       Date                     Comment
1  1  John  25 2023-01-01 First entry, with pipe|char
2  2  Jane  30 2023-02-15                Second entry
3  3   Bob  NA 2023-03-20                 Missing age
4 NA Maria  28 2023-04-01                  Missing ID

Explanation (R):

  • Base R: read.delim() with custom sep parameter
  • readr: read_delim() with explicit delim parameter
  • Both handle any single-character delimiter

Example with European Format (Semicolon and Decimal Comma)

# Dummy data creation and export to European format CSV
euro_data <- data.frame(
  ID = c(1, 2, 3, NA),
  Name = c("John", "Jane", "Bob", "Maria"),
  Amount = c(1234.56, 567.89, 98.76, 1234.56),
  stringsAsFactors = FALSE
)
write.csv2(euro_data, "european_data.csv", row.names = FALSE, na = "")

# Base R
euro_data <- read.csv2("european_data.csv")

# Using readr
library(readr)
euro_data <- read_csv2("european_data.csv")

Output:

> euro_data
# A tibble: 4 × 3
     ID Name  Amount
  <dbl> <chr>  <dbl>
1     1 John  1235. 
2     2 Jane   568. 
3     3 Bob     98.8
4    NA Maria 1235.

Explanation (R European Format):

  • read.csv2() and read_csv2() are specialized for European format
  • Uses semicolons (;) as field separators
  • Uses commas (,) as decimal points
  • Common in countries that use comma as decimal separator
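Under the hood, read.csv2() is just a preset: in base R the same behaviour comes from read.csv()'s general sep and dec arguments, which also cover less common separator/decimal combinations. A minimal base-R sketch (the file contents are illustrative):

```r
# read.csv2(path) is equivalent to read.csv(path, sep = ";", dec = ",")
path <- tempfile(fileext = ".csv")
writeLines(c("ID;Name;Amount",
             "1;John;1234,56",
             "2;Jane;567,89"), path)

# Semicolon fields, decimal comma -> proper numeric values
euro <- read.csv(path, sep = ";", dec = ",")
euro$Amount   # numeric: 1234.56 567.89
```

The same two arguments generalize to any locale-specific pairing, without needing a dedicated helper function.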

3. Handling Column Types and Missing Values

Capability        | SAS                           | R
Column types      | Post-import with FORMAT/INPUT | colClasses or col_types
Missing values    | Default or with DBSASTYPE     | na.strings or na
Handling bad data | Limited at import time        | problems() function (readr)

SAS Example

PROC IMPORT DATAFILE="myfile.csv"
    OUT=typed_data
    DBMS=CSV
    REPLACE;
    GETNAMES=YES;
RUN;

DATA typed_data;
    SET typed_data;
    FORMAT date_col MMDDYY10.;
    numeric_col = INPUT(char_num, 8.);
RUN;

Explanation (SAS):

  • SAS typically handles type conversion after import
  • Uses DATA step to apply formats or convert types
  • Less control during the initial import process

R Example

# Dummy data creation and export to CSV
typed_data <- data.frame(
  col1 = c("A", "B", "C"),
  col2 = c(1.1, 2.2, 3.3),
  col3 = as.Date(c("2023-01-01", "2023-02-01", "2023-03-01")),
  stringsAsFactors = FALSE
)
write.csv(typed_data, "myfile.csv", row.names = FALSE, na = "")

# Base R
typed_data <- read.csv("myfile.csv", 
                       colClasses = c("character", "numeric", "Date"),
                       na.strings = c("", "NA", "N/A"))

# Using readr
library(readr)
typed_data <- read_csv("myfile.csv",
                       col_types = cols(
                           col1 = col_character(),
                           col2 = col_double(),
                           col3 = col_date()
                       ),
                       na = c("", "NA", "N/A"))

# Check for parsing problems
problems(typed_data)

Output:

> problems(typed_data)
# A tibble: 0 × 5
# ℹ 5 variables: row <int>, col <int>, expected <chr>, actual <chr>, file <chr>
> typed_data
# A tibble: 3 × 3
  col1   col2 col3      
  <chr> <dbl> <date>    
1 A       1.1 2023-01-01
2 B       2.2 2023-02-01
3 C       3.3 2023-03-01

Explanation (R):

  • Base R: Uses colClasses to specify column types
  • readr: More flexible col_types specification
  • Both allow multiple NA value definitions
  • readr provides problems() to identify parsing issues
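The example above reports zero problems because the file is clean. To see problems() earn its keep, force a type mismatch; a small sketch, assuming readr 2.0+ (for the I() literal-text helper):

```r
library(readr)

# A value that cannot be parsed as a double triggers a recorded problem
bad <- read_csv(I("x\n1\noops\n3\n"),
                col_types = cols(x = col_double()))

bad$x           # 1 NA 3 -- the unparseable cell becomes NA
problems(bad)   # reports the row, the expected type, and the actual value "oops"
```

This is the key difference from base R, which silently coerces or errors without a structured report of what went wrong where.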

4. Skipping Rows and Managing Headers

Capability          | SAS                  | R
Skip rows           | DATAROW=n            | skip = n, or skip = n-1 with header = TRUE
No column names     | GETNAMES=NO          | header = FALSE or col_names = FALSE
Custom column names | Post-import renaming | col.names or col_names vector

SAS Example

PROC IMPORT DATAFILE="myfile.csv"
    OUT=data_with_headers
    DBMS=CSV
    REPLACE;
    DATAROW=3;
    GETNAMES=NO;
RUN;

DATA data_with_headers;
    SET data_with_headers;
    RENAME VAR1=ID VAR2=Name VAR3=Value;
RUN;

Explanation (SAS):

  • DATAROW=3 starts reading from the third row
  • GETNAMES=NO doesn't use any row for headers
  • Post-import renaming with DATA step

R Example

# Dummy data creation and export to CSV
data_with_headers <- data.frame(
  ID = c(1, 2, 3),
  Name = c("John", "Jane", "Bob"),
  Value = c(100, 200, 300),
  stringsAsFactors = FALSE
)
write.csv(data_with_headers, "myfile.csv", row.names = FALSE, na = "")

# Base R
data_with_headers <- read.csv("myfile.csv", 
                             skip = 2, 
                             header = FALSE,
                             col.names = c("ID", "Name", "Value"))

# Using readr
library(readr)
data_with_headers <- read_csv("myfile.csv",
                              skip = 2,
                              col_names = c("ID", "Name", "Value"))
s <- spec(data_with_headers)

Output:

> data_with_headers
# A tibble: 2 × 3
     ID Name  Value
  <dbl> <chr> <dbl>
1     2 Jane    200
2     3 Bob     300
> s
cols(
  ID = col_double(),
  Name = col_character(),
  Value = col_double()
)

Explanation (R):

  • skip = 2 ignores the first two rows
  • Setting custom column names with col.names or col_names
  • More straightforward naming during import compared to SAS
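The other combination from the table, skip = n with header = TRUE, is useful when metadata lines sit above a real header row. A base-R sketch with an illustrative file:

```r
# A file with two metadata lines above the actual header
path <- tempfile(fileext = ".csv")
writeLines(c("# Export from system X",
             "# Generated 2023-05-01",
             "ID,Name,Value",
             "1,John,100",
             "2,Jane,200"), path)

# skip = 2 discards the metadata; the next line is then used as the header
meta_data <- read.csv(path, skip = 2, header = TRUE)
meta_data
#   ID Name Value
# 1  1 John   100
# 2  2 Jane   200
```

Note that read.csv() does not treat "#" lines as comments by default (comment.char = ""), so skip is needed here.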

5. Input/Output Examples

Input CSV Example:

ID,Name,Age,Date,Comment
1,John,25,2023-01-01,"First entry, with comma"
2,Jane,30,2023-02-15,Second entry
3,Bob,,2023-03-20,Missing age
,Maria,28,2023-04-01,Missing ID

Expected Output in SAS:

ID  Name   Age  Date        Comment
1   John   25   2023-01-01  First entry, with comma
2   Jane   30   2023-02-15  Second entry
3   Bob    .    2023-03-20  Missing age
.   Maria  28   2023-04-01  Missing ID

Expected Output in R:

ID  Name   Age  Date        Comment
1   John   25   2023-01-01  First entry, with comma
2   Jane   30   2023-02-15  Second entry
3   Bob    NA   2023-03-20  Missing age
NA  Maria  28   2023-04-01  Missing ID
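The expected R behaviour above can be verified directly from the CSV text using base R's text = argument (a quick self-check, not part of the normal import workflow):

```r
# Read the example CSV from an in-line string instead of a file
csv_text <- 'ID,Name,Age,Date,Comment
1,John,25,2023-01-01,"First entry, with comma"
2,Jane,30,2023-02-15,Second entry
3,Bob,,2023-03-20,Missing age
,Maria,28,2023-04-01,Missing ID'

check <- read.csv(text = csv_text, na.strings = "")
is.na(check$Age)   # FALSE FALSE  TRUE FALSE  -- Bob's missing age
is.na(check$ID)    # FALSE FALSE FALSE  TRUE  -- Maria's missing ID
```

The quoted comma in row 1 stays inside the Comment field, matching SAS's DSD behaviour.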

6. Beyond Basics: Performance Optimization

Capability          | SAS                          | R
Large file handling | BUFSIZE= option              | data.table::fread() or vroom::vroom()
Chunked reading     | Custom DATA step with INFILE | readr::read_csv_chunked() with callbacks
Parallel processing | Limited native support       | parallel package or multidplyr

SAS Advanced Example

/* Efficient reading of large CSV */
DATA large_data;
    INFILE "large_file.csv" DSD FIRSTOBS=2 BUFSIZE=32768;
    INPUT id name :$20. age date :YYMMDD10. comment :$100.;
RUN;

Explanation (SAS Advanced):

  • Uses direct INFILE statement instead of PROC IMPORT
  • BUFSIZE=32768 increases buffer size for better performance
  • Explicit column definitions for more control
  • Better for large files than standard PROC IMPORT

R Advanced Example

# Load necessary libraries
library(data.table)   # For fast fread/fwrite

# 1. Create dummy large data
set.seed(123) # For reproducibility
large_data <- data.frame(
  id = 1:10000,
  name = paste0("Name", 1:10000),
  age = sample(18:80, 10000, replace = TRUE),
  comment = sample(c("A", "B", "C"), 10000, replace = TRUE),
  stringsAsFactors = FALSE
)

# 2. Export to CSV quickly (fwrite is MUCH faster than write.csv)
fwrite(large_data, "large_file.csv", na = "")

# 3. Read file efficiently, selecting only needed columns (id, name, age)
large_data_read <- fread(
  "large_file.csv",
  nThread = 4,                        # Use 4 threads (adjust as per your CPU)
  select = c("id", "name", "age")     # Only these columns will be read
  # drop = "comment"                  # Alternatively, use drop to ignore columns
)

# 4. Display first few rows as a quick check
print(head(large_data_read))

Output:

> print(head(large_data_read))
      id   name   age
   <int> <char> <int>
1:     1  Name1    48
2:     2  Name2    32
3:     3  Name3    68
4:     4  Name4    31
5:     5  Name5    20
6:     6  Name6    59

Explanation (R Advanced):

  • data.table::fread() is optimized for large files
  • nThread = 4 enables parallel processing
  • Selective column reading improves performance
  • Can be 10-20x faster than base R for large files
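The chunked-reading capability in the table can be sketched with readr::read_csv_chunked(), which feeds each chunk to a callback so the whole file never sits in memory at once. This assumes the readr package is installed; a small dummy file stands in for a genuinely large one:

```r
library(readr)

# Create a dummy CSV to stream
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:10000,
                     age = sample(18:80, 10000, replace = TRUE)),
          path, row.names = FALSE)

# Summarize each chunk as it is read; results are row-bound automatically
chunk_summary <- read_csv_chunked(
  path,
  callback = DataFrameCallback$new(function(chunk, pos) {
    data.frame(start_row = pos, rows = nrow(chunk), mean_age = mean(chunk$age))
  }),
  chunk_size = 2500,   # rows read per chunk
  col_types = cols()
)
chunk_summary          # 4 rows: one summary per 2,500-row chunk
```

This pattern keeps memory usage bounded by chunk_size, which is the practical alternative when even fread() cannot hold the full file.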

7. Beyond Basics: Advanced Processing

Capability               | SAS                      | R
Processing during import | Custom INFILE statements | readr with transforming functions
Column filtering         | Limited at import time   | col_select during import (readr)
Encoding                 | ENCODING= option         | fileEncoding (base) or locale(encoding = ...) (readr)
Validation during import | Limited                  | col_types specification

SAS Advanced Example with Validation

/* Advanced data validation during import */
DATA validated_data;
    INFILE "input.csv" DSD DELIMITER=';' FIRSTOBS=2 MISSOVER ENCODING="latin1";
    
    INPUT id age name :$30.;
    
    /* Validation checks */
    IF id <= 0 THEN DELETE;
    IF age < 18 OR age > 100 THEN age = .;
    IF name = "" THEN name = "Unknown";
RUN;

Explanation (SAS Advanced Validation):

  • Uses DATA step for more control over import process
  • Applies validation rules during import
  • Handles special cases like negative IDs or invalid ages

R Advanced Example with Validation

# Dummy data creation and export to CSV
validated_data <- data.frame(
  id = c(1, 2, -1, 4),
  age = c(25, 17, 101, 40),
  name = c("Alice", "", "Charlie", "David"),
  stringsAsFactors = FALSE
)
write.csv(validated_data, "input.csv", row.names = FALSE, na = "")

library(readr)
library(dplyr)

validated_data <- read_csv("input.csv", 
    col_types = cols(
        id = col_integer(),
        age = col_integer(),
        name = col_character()
    ),
    col_select = c(id, age, name),  # Only these columns
    name_repair = "unique"          # Fix duplicate names
)

# Post-import validation
validated_data <- validated_data %>%
    filter(id > 0) %>%
    mutate(
        age = if_else(age < 18 | age > 100, NA_integer_, age),
        # readr reads "" as NA by default, so check is.na() as well
        name = if_else(is.na(name) | name == "", "Unknown", name)
    )

Output:

> validated_data
# A tibble: 3 × 3
     id   age name   
  <int> <int> <chr>  
1     1    25 Alice  
2     2    NA Unknown
3     4    40 David  

Explanation (R Advanced Validation):

  • Uses col_types for explicit type checking
  • col_select to only import needed columns
  • name_repair handles duplicate column names
  • Post-import processing with tidyverse functions
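The encoding row from the capability table can be sketched in base R with fileEncoding (readr users would pass locale = locale(encoding = "latin1") instead); the file and names here are illustrative:

```r
# Write a Latin-1 encoded CSV, then read it back with the right declaration
path <- tempfile(fileext = ".csv")
con <- file(path, open = "w", encoding = "latin1")
writeLines(c("id,name", "1,José", "2,André"), con)
close(con)

# fileEncoding tells read.csv how the bytes on disk are encoded
latin_data <- read.csv(path, fileEncoding = "latin1")
latin_data$name   # "José" "André" -- accented characters decoded correctly
```

Omitting the encoding declaration on a non-UTF-8 file typically produces mojibake in character columns, so it is worth setting explicitly whenever files come from legacy systems.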

8. Beyond Basics: Input/Output Examples

SAS Code for Advanced Processing:

/* Advanced processing for complex CSV */
FILENAME MYSRC "messy_data.csv";

DATA cleaned_data;
    INFILE MYSRC DSD DELIMITER=';' FIRSTOBS=2 MISSOVER ENCODING="latin1";
    
    INPUT Customer_ID :$10. 
          Name :$50. 
          Purchase_Date :$10. 
          Amount_Raw :$20. 
          Notes :$100.;
    
    /* Date conversion from European format (dd/mm/yyyy) */
    IF Purchase_Date NE "" THEN DO;
        day = SCAN(Purchase_Date, 1, '/');
        month = SCAN(Purchase_Date, 2, '/');
        year = SCAN(Purchase_Date, 3, '/');
        Purchase_Dt = MDY(INPUT(month, 8.), INPUT(day, 8.), INPUT(year, 8.));
    END;
    FORMAT Purchase_Dt DATE9.;
    
    /* Amount conversion from European format */
    Amount_Raw = COMPRESS(Amount_Raw, '.');      /* remove thousand separators */
    Amount_Raw = TRANWRD(Amount_Raw, ',', '.');  /* decimal comma to point */
    Amount = INPUT(Amount_Raw, BEST.);
    
    /* Derived column */
    IF FIND(Notes, "priority") > 0 THEN Priority_Customer = "Yes";
    ELSE Priority_Customer = "No";
    
    DROP Amount_Raw Purchase_Date day month year;
RUN;

R Code for Advanced Processing:

# Load necessary libraries
library(readr)
library(dplyr)
library(lubridate)

# 1. Create and write messy European CSV
messy_data <- data.frame(
  `Customer ID` = c("001", "002", "003", NA),
  Name = c("John Smith", "Jane Doe", "Bob Jones", "Maria Garcia"),
  `Purchase Date` = c("01/04/2023", "15/04/2023", "", "30/04/2023"),
  `Amount (€)` = c("1,234.56", "567,89", "98,76", "1.234,56"),
  Notes = c("First purchase, priority customer", "Recurring", "Missing date", "European format, ID missing"),
  stringsAsFactors = FALSE,
  check.names = FALSE  # keep the messy header names exactly as written
)
write.table(
  messy_data,
  "messy_data.csv",
  sep = ";",
  row.names = FALSE,
  na = "",
  quote = TRUE,
  fileEncoding = "UTF-8"
)

# 2. Function to convert various European number formats to numeric
convert_amount <- function(x) {
  x <- gsub("\\.", "", x)     # Remove thousand separators (dot)
  x <- gsub(",", ".", x)      # Convert decimal comma to decimal point
  as.numeric(x)
}

# 3. Read and process the "messy" CSV file
raw_data <- read_delim(
  "messy_data.csv",
  delim = ";",
  locale = locale(decimal_mark = ","),
  col_types = cols(.default = col_character())
)

# 4. Clean column names: strip quotes, then trim whitespace, make syntactically valid
colnames(raw_data) <- gsub('^"|"$', '', colnames(raw_data)) # strip quotes
colnames(raw_data) <- trimws(colnames(raw_data))            # trim whitespace
colnames(raw_data) <- make.names(colnames(raw_data))        # ensure valid column names

# (Optional) Print for debugging
print(colnames(raw_data))

# 5. Rename with the column names as they now appear
# Since make.names replaces spaces with dots, do the same in rename:
# Customer ID        --> Customer.ID
# Purchase Date      --> Purchase.Date
# Amount (€)         --> Amount....

cleaned_data <- raw_data %>%
  rename(
    customer_id      = Customer.ID,
    name             = Name,
    purchase_date_raw = Purchase.Date,
    amount_raw       = Amount....,
    notes            = Notes
  ) %>%
  mutate(
    purchase_date    = suppressWarnings(dmy(purchase_date_raw)),
    amount           = convert_amount(amount_raw),
    priority_customer = grepl("priority", notes, ignore.case = TRUE)
  ) %>%
  select(customer_id, name, purchase_date, amount, priority_customer)

print(cleaned_data)

Output:

> print(cleaned_data)
# A tibble: 4 × 5
  customer_id name         purchase_date  amount priority_customer
  <chr>       <chr>        <date>          <dbl> <lgl>            
1 001         John Smith   2023-04-01       1.23 TRUE             
2 002         Jane Doe     2023-04-15     568.   FALSE            
3 003         Bob Jones    NA              98.8  FALSE            
4 NA          Maria Garcia 2023-04-30    1235.   FALSE 

Explanation (R Advanced Processing):

  • Uses read_delim() for custom delimiter handling
  • Applies lubridate for date parsing
  • Implements a custom function for numeric conversion
  • Utilizes dplyr for data manipulation and cleaning

**Resource download links**

1.5.11.-PROC-Import-CSV-file-in-SAS-vs-R-equivalent.zip