contact@a2zlearners.com

1.4. Migrating from SAS to R: A Skill Conversion Guide

1.4.5. KEEP, DROP, and RENAME in SAS and R Equivalent

1. DROP Concept

Purpose:
To remove unwanted columns from a dataset to reduce memory, improve clarity, or streamline processing.

Input Table:

id col1 col2 score
1 A X 10
2 B Y 20

SAS:

  • DROP removes variables at input (set/merge) or output (data) level.
  • Use the DROP= dataset option when you want to avoid pulling unnecessary columns into the Program Data Vector.
/* Drop during processing */
data trimmed;
  set raw;
  drop col1 col2;
run;

/* Drop while reading to save memory */
data trimmed;
  set raw(drop=col1 col2);
run;

R (dplyr):

  • Use select() with negative indexing.
  • Efficient for ad-hoc exclusion or working with dynamic patterns.
library(dplyr)
df <- data.frame(
  id = c(1, 2),
  col1 = c("A", "B"),
  col2 = c("X", "Y"),
  score = c(10, 20)
)
cleaned <- df %>% select(-col1, -col2)
print(cleaned)

R Code Explanation

  • select(-col1, -col2) removes the specified columns from the data frame.
  • This is the recommended way to drop columns in R using dplyr.
  • The result is a data frame with only the remaining columns.

Output Table:

id score
1 10
2 20

2. KEEP Concept

Purpose:
To retain only specific columns, ensuring your dataset includes just what's needed for analysis or output.

Input Table:

id col1 col2 score
1 A X 10
2 B Y 20

SAS:

data subset;
  set full(keep=id score);
run;

data subset;
  set full;
  keep id score;
run;

R:

library(dplyr)
df <- data.frame(
  id = c(1, 2),
  col1 = c("A", "B"),
  col2 = c("X", "Y"),
  score = c(10, 20)
)
selected <- df %>% select(id, score)
print(selected)

R Code Explanation

  • select(id, score) keeps only the specified columns in the data frame.
  • This is the standard approach for column selection in R with dplyr.
  • The result is a data frame containing just the columns you want.

Output Table:

id score
1 10
2 20

3. KEEP/DROP Positioning in SAS

Variables can be kept/dropped:

  • At input (SET): reduces memory usage during processing.
  • At output (DATA): allows temporary use of columns during transformation.

Input Table:

id score flag
1 10 Y
2 20 N
data final(drop=flag);
  set interim(keep=id score flag);
run;

R Equivalent (two-step):

library(dplyr)
df <- data.frame(
  id = c(1, 2),
  score = c(10, 20),
  flag = c("Y", "N")
)
df_temp <- df %>% select(id, score, flag)
final <- df_temp %>% select(-flag)
print(final)

R Code Explanation

  • You can chain select() calls to first keep columns you need for processing, then drop temporary columns before final output.
  • This mirrors the SAS approach of keeping columns at input and dropping at output.

Output Table:

id score
1 10
2 20

4. Dynamic KEEP Lists

Input Table:

id score age
1 10 25
2 20 30

SAS: Define a macro variable to pass variable names:

%let vars = id score;
data result;
  set source(keep=&vars);
run;

R: Use a character vector and all_of() for safe, non-standard variable names:

library(dplyr)
df <- data.frame(
  id = c(1, 2),
  score = c(10, 20),
  age = c(25, 30)
)
vars <- c("id", "score")
subset <- df %>% select(all_of(vars))
print(subset)

R Code Explanation

  • all_of(vars) allows you to select columns using a character vector of names.
  • This is robust to variable name changes and works with non-standard names.
  • Useful for dynamic or metadata-driven column selection.

Output Table:

id score
1 10
2 20

5. Pattern-Based Column Selection (Advanced Exploration in R)

Purpose:
To make column selection scalable, maintainable, and expressive by referencing patterns instead of literal names.

Input Table Example:

id Q1 Q2 age score_2022 score_2023 Sales_2023 KPI_A KPI_B
1 5 6 25 80 90 1000 0.8 0.9
2 7 8 30 85 95 1500 0.7 0.8

A. Select columns starting with "Q":

library(dplyr)
df <- data.frame(
  id = c(1, 2),
  Q1 = c(5, 7),
  Q2 = c(6, 8),
  age = c(25, 30),
  score_2022 = c(80, 85),
  score_2023 = c(90, 95),
  Sales_2023 = c(1000, 1500),
  KPI_A = c(0.8, 0.7),
  KPI_B = c(0.9, 0.8)
)
df %>% select(starts_with("Q"))

R Code Explanation

  • starts_with("Q") selects all columns whose names begin with "Q".
  • This is a tidyselect helper for pattern-based selection in dplyr.

Output Table:

Q1 Q2
5 6
7 8

B. Select columns ending with "_2023":

library(dplyr)
df <- data.frame(
  id = c(1, 2),
  Q1 = c(5, 7),
  Q2 = c(6, 8),
  age = c(25, 30),
  score_2022 = c(80, 85),
  score_2023 = c(90, 95),
  Sales_2023 = c(1000, 1500),
  KPI_A = c(0.8, 0.7),
  KPI_B = c(0.9, 0.8)
)
df %>% select(ends_with("_2023"))

R Code Explanation

  • ends_with("_2023") selects columns whose names end with "_2023".
  • This is useful for selecting year-specific or versioned columns.

Output Table:

score_2023 Sales_2023
90 1000
95 1500

C. Select columns containing "age":

library(dplyr)
df <- data.frame(
  id = c(1, 2),
  Q1 = c(5, 7),
  Q2 = c(6, 8),
  age = c(25, 30),
  score_2022 = c(80, 85),
  score_2023 = c(90, 95),
  Sales_2023 = c(1000, 1500),
  KPI_A = c(0.8, 0.7),
  KPI_B = c(0.9, 0.8)
)
df %>% select(contains("age"))

R Code Explanation

  • contains("age") selects columns with "age" anywhere in the name.
  • This is helpful for flexible pattern matching in column selection.

Output Table:

age
25
30

D. Select columns matching regex for "score_" followed by 4 digits:

library(dplyr)
df <- data.frame(
  id = c(1, 2),
  Q1 = c(5, 7),
  Q2 = c(6, 8),
  age = c(25, 30),
  score_2022 = c(80, 85),
  score_2023 = c(90, 95),
  Sales_2023 = c(1000, 1500),
  KPI_A = c(0.8, 0.7),
  KPI_B = c(0.9, 0.8)
)
df %>% select(matches("^score_\\d{4}$"))

R Code Explanation

  • matches("^score_\\d{4}$") selects columns whose names match the regular expression for "score_" followed by exactly four digits.
  • This is powerful for advanced pattern-based selection.

Output Table:

score_2022 score_2023
80 90
85 95

E. Move 'id' to the front, keep all columns:

library(dplyr)
df <- data.frame(
  id = c(1, 2),
  Q1 = c(5, 7),
  Q2 = c(6, 8),
  age = c(25, 30),
  score_2022 = c(80, 85),
  score_2023 = c(90, 95),
  Sales_2023 = c(1000, 1500),
  KPI_A = c(0.8, 0.7),
  KPI_B = c(0.9, 0.8)
)
df %>% select(id, everything())

R Code Explanation

  • select(id, everything()) moves the "id" column to the front, keeping all other columns in their original order.
  • everything() is a tidyselect helper for all remaining columns.

Output Table:

id Q1 Q2 age score_2022 score_2023 Sales_2023 KPI_A KPI_B
1 5 6 25 80 90 1000 0.8 0.9
2 7 8 30 85 95 1500 0.7 0.8

F. Rename a column inline using select():

library(dplyr)
df <- data.frame(
  Q1 = c(5, 7),
  Q2 = c(6, 8)
)
df %>% select(Q1 = starts_with("Q1"))

R Code Explanation

  • You can rename columns inline in select() by specifying new_name = old_name.
  • This is a concise way to rename and select columns in one step.

Output Table:

Q1
5
7

G. Aggregate all columns starting with "KPI":

library(dplyr)
df <- data.frame(
  KPI_A = c(0.8, 0.7),
  KPI_B = c(0.9, 0.8)
)
df %>% summarise(across(starts_with("KPI"), mean, na.rm = TRUE))

R Code Explanation

  • across(starts_with("KPI"), mean, na.rm = TRUE) applies the mean function to all columns starting with "KPI".
  • summarise() returns a single-row summary with the means.

Output Table:

KPI_A KPI_B
0.75 0.85

H. Select columns by type and pattern:

library(dplyr)
df <- data.frame(
  id = c(1, 2),
  Q1 = c(5, 7),
  Q2 = c(6, 8),
  age = c(25, 30),
  score_2022 = c(80, 85),
  score_2023 = c(90, 95),
  Sales_2023 = c(1000, 1500),
  KPI_A = c(0.8, 0.7),
  KPI_B = c(0.9, 0.8)
)
df %>% select(where(is.numeric), starts_with("KPI"))

R Code Explanation

  • where(is.numeric) selects all numeric columns.
  • You can combine type-based and pattern-based selection for flexible column filtering.

Output Table:

Q1 Q2 age score_2022 score_2023 Sales_2023 KPI_A KPI_B
5 6 25 80 90 1000 0.8 0.9
7 8 30 85 95 1500 0.7 0.8

Explanation:
Pattern-based selection in R using dplyr and tidyselect helpers allows you to flexibly choose columns based on their names or types, making code more maintainable and adaptable to changing data structures.


6. RENAME Concept

Input Table:

cust_id score
101 10
102 20

SAS: Renaming is done inline via dataset options.

data renamed;
  set raw(rename=(cust_id=ID));
run;

R:

library(dplyr)
df <- data.frame(
  cust_id = c(101, 102),
  score = c(10, 20)
)
# Inline rename
df %>% select(ID = cust_id)

# Explicit rename
df %>% rename(ID = cust_id)

R Code Explanation

  • select(ID = cust_id) renames "cust_id" to "ID" while selecting it.
  • rename(ID = cust_id) renames the column in the data frame.
  • Both approaches are valid; use rename() for renaming only, or select() to rename and subset columns at once.

Output Table:

ID score
101 10
102 20

7. Metadata-Driven Selections (Advanced)

Input Table:

id name age score
1 Ann 25 10
2 Bob 30 20

SAS: You can dynamically select variables based on their metadata using dictionary.columns:

proc sql noprint;
  select name into :char_vars separated by ' '
  from dictionary.columns
  where libname='WORK' and memname='DATASET' and type='char';
quit;

data only_chars;
  set dataset(keep=&char_vars);
run;

R:

library(dplyr)
df <- data.frame(
  id = c(1, 2),
  name = c("Ann", "Bob"),
  age = c(25, 30),
  score = c(10, 20)
)
only_chars <- df %>% select(where(is.character))
print(only_chars)

Output Table:

id name
1 Ann
2 Bob

8. Fuzzy Renaming and Cleanup (R)

Input Table:

First Name Last Name
Ann Smith
Bob Jones

Use janitor::clean_names() to standardize column names across messy sources:

library(janitor)
df <- data.frame(
  "First Name" = c("Ann", "Bob"),
  "Last Name" = c("Smith", "Jones")
)
df <- df %>% clean_names()  # Converts "First Name" to "first_name"
print(df)

Output Table:

first_name last_name
Ann Smith
Bob Jones

9. Labeling and Provenance

Input Table:

emp_id name
101 Ann
102 Bob

SAS: Use the label statement to track variable origin or context:

data labeled;
  set data(rename=(emp_id=id));
  label id = "Employee ID from HR system";
run;

R: Use attributes to tag metadata:

df <- data.frame(
  id = c(101, 102),
  name = c("Ann", "Bob")
)
attr(df$id, "label") <- "Customer ID from CRM import"
# Display the data frame with attributes
str(df)

R Code Explanation

  • You can attach metadata to columns using attributes, such as a "label" for provenance or documentation.
  • str(df) shows the structure and attached attributes.

Output:

> str(df)
'data.frame':	2 obs. of  2 variables:
 $ id  : num  101 102
  ..- attr(*, "label")= chr "Customer ID from CRM import"
 $ name: chr  "Ann" "Bob"

10. Column Grouping and Summaries (R)

Input Table:

region sales_2022 sales_2023
East 1000 1200
West 1500 1600
library(dplyr)
df <- data.frame(
  region = c("East", "West"),
  sales_2022 = c(1000, 1500),
  sales_2023 = c(1200, 1600)
)
df %>%
  group_by(region) %>%
  summarise(across(starts_with("sales_"), sum, na.rm = TRUE))

R Code Explanation

  • group_by(region) groups the data by region.
  • summarise(across(starts_with("sales_"), sum, na.rm = TRUE)) computes the sum of all columns starting with "sales_" for each group.
  • This is a powerful pattern for grouped summaries in R.

Output Table:

region sales_2022 sales_2023
East 1000 1200
West 1500 1600

**Final Reference Table**

Feature SAS Example R Example (dplyr)
Keep specific columns keep=var1 var2 select(var1, var2)
Drop specific columns drop=var1 var2 select(-var1, -var2)
Rename columns rename=(old=new) rename(new = old)
Pattern match columns Not directly supported starts_with(), contains(), matches()
Dynamic var list %let vars=...; keep=&vars vars <- c(...); select(all_of(vars))
Metadata-based selection dictionary.columns, proc contents select(where(is.numeric))
Column provenance label id = "..." attr(df$id, "label") <- "..."
Reusable selection %macro var_keep... function(df) { select(df, ...) }

11. Common Error/Warning Scenarios for KEEP, DROP, and RENAME in R Equivalent

11.1. Selecting Non-Existent Columns

Scenario:
Trying to select or drop a column that does not exist.

Input Table:

id score
1 10
2 20

R Code:

library(dplyr)
df <- data.frame(
  id = c(1, 2),
  score = c(10, 20)
)
df %>% select(id, age)   # 'age' does not exist

Output:
Error: Can't subset columns that don't exist.

Solution:
Use any_of() to ignore missing columns:

df %>% select(any_of(c("id", "age")))

Output Table:

id
1
2

11.2. Type Mismatch in Pattern Selection

Scenario:
Selecting columns by pattern that match different types.

Input Table:

id age age_group
1 25 "young"
2 30 "adult"

R Code:

library(dplyr)
df <- data.frame(
  id = c(1, 2),
  age = c(25, 30),
  age_group = c("young", "adult")
)
df %>% select(starts_with("age"))

Output Table:

age age_group
25 "young"
30 "adult"

Note:
If you expect only numeric columns, use:

df %>% select(where(is.numeric), starts_with("age"))

Output Table:

age
25
30

11.3. Using Non-Standard Column Names

Scenario:
Column names with spaces or special characters can cause errors if referenced without quotes or backticks.

Input Table:

First Name Score
Ann 10
Bob 20

Problem Example:

df %>% select(First Name)   # Error: unexpected symbol

Solution 1: Use backticks

df %>% select(`First.Name`)
#Note there is no space intead we have used a dot

Output Table:

First Name
Ann
Bob

Solution 2: Clean names for safe usage

library(janitor)
df_clean <- df %>% clean_names()
df_clean %>% select(first_name)

Output Table:

first_name
Ann
Bob

Explanation:
Use backticks to reference columns with spaces or special characters. For consistent and safe column names, use janitor::clean_names() to standardize them before selection.


11.4. Pattern Selection Returns No Columns

Scenario:
Pattern does not match any columns.

Input Table:

id score
1 10
2 20

R Code:

library(dplyr)
df <- data.frame(
  id = c(1, 2),
  score = c(10, 20)
)
df %>% select(starts_with("Q"))

Output Table:
Empty data frame.

Solution:
Check available column names with names(df).


11.5. Selecting Columns by Type with Pattern

Scenario:
Selecting numeric columns that match a pattern.

Input Table:

id Q1 Q2 age score
1 5 6 25 10
2 7 8 30 20

R Code:

library(dplyr)
df <- data.frame(
  id = c(1, 2),
  Q1 = c(5, 7),
  Q2 = c(6, 8),
  age = c(25, 30),
  score = c(10, 20)
)
df %>% select(where(is.numeric), starts_with("Q"))

Output Table:

Q1 Q2 age score
5 6 25 10
7 8 30 20

**Resource download links**

1.4.5.-KEEP-DROP-and-RENAME-in-SAS-and-R-Equivalent.zip