1.5. SAS PROCEDURES -> Equivalent in R
1.5.11. PROC Import CSV file in SAS vs R equivalent
1. Basic CSV Import in SAS vs R
# Dummy data creation and export to CSV
my_data <- data.frame(
ID = c(1, 2, 3, NA),
Name = c("John", "Jane", "Bob", "Maria"),
Age = c(25, 30, NA, 28),
Date = c("2023-01-01", "2023-02-15", "2023-03-20", "2023-04-01"),
Comment = c("First entry, with comma", "Second entry", "Missing age", "Missing ID"),
stringsAsFactors = FALSE
)
write.csv(my_data, "myfile.csv", row.names = FALSE, na = "")
| Capability | SAS (PROC IMPORT) | R (base and readr) |
|---|---|---|
| Basic import | PROC IMPORT DBMS=CSV | read.csv() or readr::read_csv() |
| Header handling | GETNAMES=YES/NO | header = TRUE/FALSE or col_names = TRUE/FALSE |
| Delimiter | DELIMITER=',' | sep = "," or detected automatically |
| Missing values | Default handling | na.strings = "" or na = "" |
| Output | SAS dataset | Data frame or tibble |
SAS Example
PROC IMPORT DATAFILE="myfile.csv"
OUT=my_data
DBMS=CSV
REPLACE;
GETNAMES=YES;
RUN;
Explanation (SAS):
- Uses PROC IMPORT with DBMS=CSV to read CSV files
- GETNAMES=YES treats the first row as column headers
- Automatically detects delimiters (usually commas)
- Outputs a SAS dataset named my_data
R Base Example
my_data <- read.csv("myfile.csv", header = TRUE, sep = ",", na.strings = "")
head(my_data)
Output:
> head(my_data)
  ID  Name Age       Date                 Comment
1  1  John  25 2023-01-01 First entry, with comma
2  2  Jane  30 2023-02-15            Second entry
3  3   Bob  NA 2023-03-20             Missing age
4 NA Maria  28 2023-04-01              Missing ID
Explanation (R Base):
- Uses base R's read.csv() function
- header = TRUE treats the first row as column names
- sep = "," specifies the comma delimiter
- na.strings = "" treats empty strings as NA values
- Returns a data frame stored in my_data
R readr Example
library(readr)
my_data <- read_csv("myfile.csv")
head(my_data)
Output:
> head(my_data)
# A tibble: 4 × 5
ID Name Age Date Comment
<dbl> <chr> <dbl> <date> <chr>
1 1 John 25 2023-01-01 First entry, with comma
2 2 Jane 30 2023-02-15 Second entry
3 3 Bob NA 2023-03-20 Missing age
4 NA Maria 28 2023-04-01 Missing ID
Explanation (R readr):
- Uses tidyverse's read_csv() function
- More efficient than base R for large files
- Automatically handles headers and common delimiters
- Returns a tibble (modern data frame)
2. Handling Different Delimiters
| Capability | SAS | R |
|---|---|---|
| Comma-separated | delimiter = ',' (default) | read.csv() or read_csv() |
| Tab-separated | delimiter = '09'x | read.delim() or read_tsv() |
| Semicolon-separated | delimiter = ';' | read.csv2() or read_csv2() |
| Custom delimiter | delimiter = '\|' | read.delim(sep = ...) or read_delim(delim = ...) |
SAS Example with Custom Delimiter
PROC IMPORT DATAFILE="myfile.txt"
OUT=pipe_data
DBMS=DLM
REPLACE;
DELIMITER='|';
GETNAMES=YES;
RUN;
Explanation (SAS):
- Uses DBMS=DLM for general delimited files
- DELIMITER='|' specifies the pipe as the delimiter
- Works with any custom delimiter character
R Example with Custom Delimiter
# Dummy data creation and export to pipe-delimited file
pipe_data <- data.frame(
ID = c(1, 2, 3, NA),
Name = c("John", "Jane", "Bob", "Maria"),
Age = c(25, 30, NA, 28),
Date = c("2023-01-01", "2023-02-15", "2023-03-20", "2023-04-01"),
Comment = c("First entry, with pipe|char", "Second entry", "Missing age", "Missing ID"),
stringsAsFactors = FALSE
)
write.table(pipe_data, "myfile.txt", sep = "|", row.names = FALSE, na = "", quote = TRUE)
# Base R
pipe_data <- read.delim("myfile.txt", sep = "|", header = TRUE)
# Using readr
library(readr)
pipe_data <- read_delim("myfile.txt", delim = "|")
Output:
> pipe_data
ID Name Age Date Comment
1 1 John 25 2023-01-01 First entry, with pipe|char
2 2 Jane 30 2023-02-15 Second entry
3 3 Bob NA 2023-03-20 Missing age
4 NA Maria 28 2023-04-01 Missing ID
Explanation (R):
- Base R: read.delim() with a custom sep parameter
- readr: read_delim() with an explicit delim parameter
- Both handle any single-character delimiter
Example with European Format (Semicolon and Decimal Comma)
# Dummy data creation and export to European format CSV
euro_data <- data.frame(
ID = c(1, 2, 3, NA),
Name = c("John", "Jane", "Bob", "Maria"),
Amount = c(1234.56, 567.89, 98.76, 1234.56),
stringsAsFactors = FALSE
)
write.csv2(euro_data, "european_data.csv", row.names = FALSE, na = "")
# Base R
euro_data <- read.csv2("european_data.csv")
# Using readr
library(readr)
euro_data <- read_csv2("european_data.csv")
Output:
> euro_data
# A tibble: 4 × 3
ID Name Amount
<dbl> <chr> <dbl>
1 1 John 1235.
2 2 Jane 568.
3 3 Bob 98.8
4 NA Maria 1235.
Explanation (R European Format):
- read.csv2() and read_csv2() are specialized for the European format
- Uses semicolons (;) as field separators
- Uses commas (,) as decimal marks
- Common in countries that use the comma as the decimal separator
3. Handling Column Types and Missing Values
| Capability | SAS | R |
|---|---|---|
| Column types | Post-import with FORMAT statement | colClasses or col_types |
| Missing values | Default or with DBSASTYPE | na.strings or na |
| Handling bad data | Limited at import time | problems() function (readr) |
SAS Example
PROC IMPORT DATAFILE="myfile.csv"
OUT=typed_data
DBMS=CSV
REPLACE;
GETNAMES=YES;
RUN;
DATA typed_data;
SET typed_data;
FORMAT date_col MMDDYY10.;
numeric_col = INPUT(char_num, 8.);
RUN;
Explanation (SAS):
- SAS typically handles type conversion after import
- Uses DATA step to apply formats or convert types
- Less control during the initial import process
R Example
# Dummy data creation and export to CSV
typed_data <- data.frame(
col1 = c("A", "B", "C"),
col2 = c(1.1, 2.2, 3.3),
col3 = as.Date(c("2023-01-01", "2023-02-01", "2023-03-01")),
stringsAsFactors = FALSE
)
write.csv(typed_data, "myfile.csv", row.names = FALSE, na = "")
# Base R
typed_data <- read.csv("myfile.csv",
colClasses = c("character", "numeric", "Date"),
na.strings = c("", "NA", "N/A"))
# Using readr
library(readr)
typed_data <- read_csv("myfile.csv",
col_types = cols(
col1 = col_character(),
col2 = col_double(),
col3 = col_date()
),
na = c("", "NA", "N/A"))
# Check for parsing problems
problems(typed_data)
Output:
> problems(typed_data)
# A tibble: 0 × 5
# ℹ 5 variables: row <int>, col <int>, expected <chr>, actual <chr>, file <chr>
> typed_data
# A tibble: 3 × 3
col1 col2 col3
<chr> <dbl> <date>
1 A 1.1 2023-01-01
2 B 2.2 2023-02-01
3 C 3.3 2023-03-01
Explanation (R):
- Base R: uses colClasses to specify column types
- readr: a more flexible col_types specification
- Both allow multiple NA value definitions
- readr provides problems() to identify parsing issues
4. Skipping Rows and Managing Headers
| Capability | SAS | R |
|---|---|---|
| Skip rows | DATAROW=n | skip = n |
| No column names | GETNAMES=NO | header = FALSE or col_names = FALSE |
| Custom column names | Post-import renaming | col.names or col_names vector |
SAS Example
PROC IMPORT DATAFILE="myfile.csv"
OUT=data_with_headers
DBMS=CSV
REPLACE;
DATAROW=3;
GETNAMES=NO;
RUN;
DATA data_with_headers;
SET data_with_headers;
RENAME VAR1=ID VAR2=Name VAR3=Value;
RUN;
Explanation (SAS):
- DATAROW=3 starts reading from the third row
- GETNAMES=NO doesn't use any row for headers
- Columns are renamed after import with a DATA step
R Example
# Dummy data creation and export to CSV
data_with_headers <- data.frame(
ID = c(1, 2, 3),
Name = c("John", "Jane", "Bob"),
Value = c(100, 200, 300),
stringsAsFactors = FALSE
)
write.csv(data_with_headers, "myfile.csv", row.names = FALSE, na = "")
# Base R
data_with_headers <- read.csv("myfile.csv",
skip = 2,
header = FALSE,
col.names = c("ID", "Name", "Value"))
# Using readr
library(readr)
data_with_headers <- read_csv("myfile.csv",
skip = 2,
col_names = c("ID", "Name", "Value"))
s <- spec(data_with_headers)
Output:
> data_with_headers
# A tibble: 2 × 3
ID Name Value
<dbl> <chr> <dbl>
1 2 Jane 200
2 3 Bob 300
> s
cols(
ID = col_double(),
Name = col_character(),
Value = col_double()
)
Explanation (R):
- skip = 2 ignores the first two rows
- Custom column names are set with col.names or col_names
- Naming columns during import is more straightforward than in SAS
5. Input/Output Examples
Input CSV Example:
ID,Name,Age,Date,Comment
1,John,25,2023-01-01,"First entry, with comma"
2,Jane,30,2023-02-15,Second entry
3,Bob,,2023-03-20,Missing age
,Maria,28,2023-04-01,Missing ID
Expected Output in SAS:
| ID | Name | Age | Date | Comment |
|---|---|---|---|---|
| 1 | John | 25 | 2023-01-01 | First entry, with comma |
| 2 | Jane | 30 | 2023-02-15 | Second entry |
| 3 | Bob | . | 2023-03-20 | Missing age |
| . | Maria | 28 | 2023-04-01 | Missing ID |
Expected Output in R:
| ID | Name | Age | Date | Comment |
|---|---|---|---|---|
| 1 | John | 25 | 2023-01-01 | First entry, with comma |
| 2 | Jane | 30 | 2023-02-15 | Second entry |
| 3 | Bob | NA | 2023-03-20 | Missing age |
| NA | Maria | 28 | 2023-04-01 | Missing ID |
6. Beyond Basics: Performance Optimization
| Capability | SAS | R |
|---|---|---|
| Large file handling | BUFSIZE= option | data.table::fread() or vroom::vroom() |
| Chunked reading | Custom DATA step with INFILE | readr::read_csv_chunked() with callbacks |
| Parallel processing | Limited native support | parallel package or multidplyr |
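The chunked-reading row above corresponds to readr's read_csv_chunked(), which feeds the file to a callback one chunk at a time instead of loading it whole. A minimal sketch (the file name chunk_demo.csv and the 25-row chunk size are illustrative):

```r
library(readr)

# Create a small demo file (stands in for a genuinely large one)
write_csv(data.frame(id = 1:100, value = runif(100)), "chunk_demo.csv")

# Summarize the file 25 rows at a time; the callback receives each chunk
# plus its starting row, and DataFrameCallback row-binds the results
chunk_means <- read_csv_chunked(
  "chunk_demo.csv",
  callback = DataFrameCallback$new(function(chunk, pos) {
    data.frame(start_row = pos, mean_value = mean(chunk$value))
  }),
  chunk_size = 25,
  col_types = cols(id = col_integer(), value = col_double())
)
print(chunk_means)  # one summary row per 25-row chunk
```

Because only one chunk is in memory at a time, this pattern scales to files larger than RAM; SideEffectChunkCallback and AccumulateCallback cover other aggregation styles.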
SAS Advanced Example
/* Efficient reading of large CSV */
DATA large_data;
INFILE "large_file.csv" DSD FIRSTOBS=2 BUFSIZE=32768;
INPUT id name :$20. age date :YYMMDD10. comment :$100.;
RUN;
Explanation (SAS Advanced):
- Uses a direct INFILE statement instead of PROC IMPORT
- BUFSIZE=32768 increases the buffer size for better performance
- Explicit column definitions give more control
- Better for large files than standard PROC IMPORT
R Advanced Example
# Load necessary libraries
library(data.table) # For fast fread/fwrite
# 1. Create dummy large data
set.seed(123) # For reproducibility
large_data <- data.frame(
id = 1:10000,
name = paste0("Name", 1:10000),
age = sample(18:80, 10000, replace = TRUE),
comment = sample(c("A", "B", "C"), 10000, replace = TRUE),
stringsAsFactors = FALSE
)
# 2. Export to CSV quickly (fwrite is MUCH faster than write.csv)
fwrite(large_data, "large_file.csv", na = "")
# 3. Read file efficiently, selecting only needed columns (id, name, age)
large_data_read <- fread(
"large_file.csv",
nThread = 4, # Use 4 threads (adjust as per your CPU)
select = c("id", "name", "age") # Only these columns will be read
# drop = "comment" # Alternatively, use drop to ignore columns
)
# 4. Display first few rows as a quick check
print(head(large_data_read))
Output:
> print(head(large_data_read))
id name age
<int> <char> <int>
1: 1 Name1 48
2: 2 Name2 32
3: 3 Name3 68
4: 4 Name4 31
5: 5 Name5 20
6: 6 Name6 59
Explanation (R Advanced):
- data.table::fread() is optimized for large files
- nThread = 4 enables parallel reading
- Selective column reading improves performance
- Can be 10-20x faster than base R for large files
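vroom, listed in the table alongside fread(), takes a different approach: it indexes the file lazily, so unselected columns are never fully parsed. A minimal sketch (the file name vroom_demo.csv is illustrative):

```r
library(vroom)

# Create a small demo file (stands in for a genuinely large one)
vroom_write(
  data.frame(id = 1:1000,
             name = paste0("Name", 1:1000),
             age = sample(18:80, 1000, replace = TRUE),
             comment = "filler"),
  "vroom_demo.csv", delim = ","
)

# Lazy, indexed read: only the selected columns are materialized
demo <- vroom("vroom_demo.csv",
              col_select = c(id, name, age),
              show_col_types = FALSE)
print(dim(demo))  # 1000 rows, 3 columns
```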
7. Beyond Basics: Advanced Processing
| Capability | SAS | R |
|---|---|---|
| Processing during import | Custom INFILE statements | readr with transforming functions |
| Column filtering | Limited at import time | Select columns during import |
| Encoding | ENCODING= option | encoding parameter |
| Validation during import | Limited | col_types with specification |
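The encoding row above maps SAS's ENCODING= option to readr's locale(encoding = ...). A minimal sketch writing and re-reading a Latin-1 file (the file name latin1_demo.csv is illustrative):

```r
library(readr)

# Write a small Latin-1 encoded file containing accented characters
con <- file("latin1_demo.csv", open = "w", encoding = "latin1")
writeLines(c("id;name", "1;José", "2;Müller"), con)
close(con)

# Tell readr the source encoding; strings arrive re-encoded as UTF-8
enc_data <- read_delim("latin1_demo.csv", delim = ";",
                       locale = locale(encoding = "latin1"),
                       show_col_types = FALSE)
print(enc_data$name)
```

Without the locale argument readr assumes UTF-8, and non-ASCII characters in Latin-1 files come through garbled.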
SAS Advanced Example with Validation
/* Advanced data validation during import */
DATA validated_data;
INFILE "input.csv" DSD DELIMITER=';' FIRSTOBS=2 MISSOVER ENCODING="latin1";
INPUT id age name :$30.;
/* Validation checks */
IF id <= 0 THEN DELETE;
IF age < 18 OR age > 100 THEN age = .;
IF name = "" THEN name = "Unknown";
RUN;
Explanation (SAS Advanced Validation):
- Uses DATA step for more control over import process
- Applies validation rules during import
- Handles special cases like negative IDs or invalid ages
R Advanced Example with Validation
# Dummy data creation and export to CSV
validated_data <- data.frame(
id = c(1, 2, -1, 4),
age = c(25, 17, 101, 40),
name = c("Alice", "", "Charlie", "David"),
stringsAsFactors = FALSE
)
write.csv(validated_data, "input.csv", row.names = FALSE, na = "")
library(readr)
library(dplyr)
validated_data <- read_csv("input.csv",
col_types = cols(
id = col_integer(),
age = col_integer(),
name = col_character()
),
col_select = c(id, age, name), # Only these columns
name_repair = "unique" # Fix duplicate names
)
# Post-import validation
validated_data <- validated_data %>%
filter(id > 0) %>%
mutate(
age = if_else(age < 18 | age > 100, NA_integer_, age),
name = if_else(is.na(name) | name == "", "Unknown", name)
)
Output:
> validated_data
# A tibble: 3 × 3
id age name
<int> <int> <chr>
1 1 25 Alice
2 2 NA Unknown
3 4 40 David
Explanation (R Advanced Validation):
- Uses col_types for explicit type checking
- col_select imports only the needed columns
- name_repair handles duplicate column names
- Post-import processing with tidyverse functions
8. Beyond Basics: Input/Output Examples
SAS Code for Advanced Processing:
/* Advanced processing for complex CSV */
FILENAME MYSRC "messy_data.csv";
DATA cleaned_data;
INFILE MYSRC DSD DELIMITER=';' FIRSTOBS=2 MISSOVER ENCODING="latin1";
INPUT Customer_ID :$10.
Name :$50.
Purchase_Date :$10.
Amount_Raw :$20.
Notes :$100.;
/* Date conversion from European DD/MM/YYYY format */
IF Purchase_Date NE "" THEN DO;
day = SCAN(Purchase_Date, 1, '/');
month = SCAN(Purchase_Date, 2, '/');
year = SCAN(Purchase_Date, 3, '/');
Purchase_Dt = MDY(INPUT(month, 8.), INPUT(day, 8.), INPUT(year, 8.));
END;
FORMAT Purchase_Dt DATE9.;
/* Amount conversion from European format */
Amount_Raw = TRANWRD(Amount_Raw, '.', '');
Amount_Raw = TRANWRD(Amount_Raw, ',', '.');
Amount = INPUT(Amount_Raw, BEST.);
/* Derived column (SAS has no ternary operator; use IFC) */
Priority_Customer = IFC(FIND(Notes, "priority") > 0, "Yes", "No");
DROP Amount_Raw Purchase_Date day month year;
RUN;
R Code for Advanced Processing:
# Load necessary libraries
library(readr)
library(dplyr)
library(lubridate)
# 1. Create and write messy European CSV
messy_data <- data.frame(
`Customer ID` = c("001", "002", "003", NA),
Name = c("John Smith", "Jane Doe", "Bob Jones", "Maria Garcia"),
`Purchase Date` = c("01/04/2023", "15/04/2023", "", "30/04/2023"),
`Amount (€)` = c("1,234.56", "567,89", "98,76", "1.234,56"),
Notes = c("First purchase, priority customer", "Recurring", "Missing date", "European format, ID missing"),
stringsAsFactors = FALSE
)
write.table(
messy_data,
"messy_data.csv",
sep = ";",
row.names = FALSE,
na = "",
quote = TRUE,
fileEncoding = "UTF-8"
)
# 2. Function to convert various European number formats to numeric
convert_amount <- function(x) {
x <- gsub("\\.", "", x) # Remove thousand separators (dot)
x <- gsub(",", ".", x) # Convert decimal comma to decimal point
as.numeric(x)
}
# 3. Read and process the "messy" CSV file
raw_data <- read_delim(
"messy_data.csv",
delim = ";",
locale = locale(decimal_mark = ","),
col_types = cols(.default = col_character())
)
# 4. Clean column names: strip quotes, then trim whitespace, make syntactically valid
colnames(raw_data) <- gsub('^"|"$', '', colnames(raw_data)) # strip quotes
colnames(raw_data) <- trimws(colnames(raw_data)) # trim whitespace
colnames(raw_data) <- make.names(colnames(raw_data)) # ensure valid column names
# (Optional) Print for debugging
print(colnames(raw_data))
# 5. Rename with the column names as they now appear
# Since make.names replaces spaces with dots, do the same in rename:
# Customer ID --> Customer.ID
# Purchase Date --> Purchase.Date
# Amount (€) --> Amount....
cleaned_data <- raw_data %>%
rename(
customer_id = Customer.ID,
name = Name,
purchase_date_raw = Purchase.Date,
amount_raw = Amount....,
notes = Notes
) %>%
mutate(
purchase_date = suppressWarnings(dmy(purchase_date_raw)),
amount = convert_amount(amount_raw),
priority_customer = grepl("priority", notes, ignore.case = TRUE)
) %>%
select(customer_id, name, purchase_date, amount, priority_customer)
print(cleaned_data)
Output:
> print(cleaned_data)
# A tibble: 4 × 5
customer_id name purchase_date amount priority_customer
<chr> <chr> <date> <dbl> <lgl>
1 001 John Smith 2023-04-01 1.23 TRUE
2 002 Jane Doe 2023-04-15 568. FALSE
3 003 Bob Jones NA 98.8 FALSE
4 NA Maria Garcia 2023-04-30 1235. FALSE
Explanation (R Advanced Processing):
- Uses read_delim() for custom delimiter handling
- Applies lubridate for date parsing
- Implements a custom function for numeric conversion
- Uses dplyr for data manipulation and cleaning
**Resource download links**
1.5.11.-PROC-Import-CSV-file-in-SAS-vs-R-equivalent.zip