1.4.5. KEEP DROP and RENAME in SAS and R Equivalent

1.4. Migrating from SAS to R: A Skill Conversion Guide

1.4.5. KEEP, DROP, and RENAME in SAS and R Equivalent

1. DROP Concept

Purpose:
To remove unwanted columns from a dataset to reduce memory, improve clarity, or streamline processing.

Input Table:

id	col1	col2	score
1	A	X	10
2	B	Y	20

SAS:

DROP removes variables at input (set/merge) or output (data) level.
Use the DROP= dataset option when you want to avoid pulling unnecessary columns into the Program Data Vector.

/* Drop during processing */
data trimmed;
  set raw;
  drop col1 col2;
run;

/* Drop while reading to save memory */
data trimmed;
  set raw(drop=col1 col2);
run;

R (dplyr):

Use select() with negative indexing.
Efficient for ad-hoc exclusion or working with dynamic patterns.

library(dplyr)
df <- data.frame(
  id = c(1, 2),
  col1 = c("A", "B"),
  col2 = c("X", "Y"),
  score = c(10, 20)
)
cleaned <- df %>% select(-col1, -col2)
print(cleaned)

R Code Explanation

select(-col1, -col2) removes the specified columns from the data frame.
This is the recommended way to drop columns in R using dplyr.
The result is a data frame with only the remaining columns.

Output Table:

id	score
1	10
2	20

2. KEEP Concept

Purpose:
To retain only specific columns, ensuring your dataset includes just what's needed for analysis or output.

Input Table:

id	col1	col2	score
1	A	X	10
2	B	Y	20

SAS:

data subset;
  set full(keep=id score);
run;

data subset;
  set full;
  keep id score;
run;

library(dplyr)
df <- data.frame(
  id = c(1, 2),
  col1 = c("A", "B"),
  col2 = c("X", "Y"),
  score = c(10, 20)
)
selected <- df %>% select(id, score)
print(selected)

R Code Explanation

select(id, score) keeps only the specified columns in the data frame.
This is the standard approach for column selection in R with dplyr.
The result is a data frame containing just the columns you want.

Output Table:

id	score
1	10
2	20

3. KEEP/DROP Positioning in SAS

Variables can be kept/dropped:

At input (SET): reduces memory usage during processing.
At output (DATA): allows temporary use of columns during transformation.

Input Table:

id	score	flag
1	10	Y
2	20	N

data final(drop=flag);
  set interim(keep=id score flag);
run;

R Equivalent (two-step):

library(dplyr)
df <- data.frame(
  id = c(1, 2),
  score = c(10, 20),
  flag = c("Y", "N")
)
df_temp <- df %>% select(id, score, flag)
final <- df_temp %>% select(-flag)
print(final)

R Code Explanation

You can chain select() calls to first keep columns you need for processing, then drop temporary columns before final output.
This mirrors the SAS approach of keeping columns at input and dropping at output.

Output Table:

id	score
1	10
2	20

4. Dynamic KEEP Lists

Input Table:

id	score	age
1	10	25
2	20	30

SAS: Define a macro variable to pass variable names:

%let vars = id score;
data result;
  set source(keep=&vars);
run;

R: Use a character vector and all_of() for safe, non-standard variable names:

library(dplyr)
df <- data.frame(
  id = c(1, 2),
  score = c(10, 20),
  age = c(25, 30)
)
vars <- c("id", "score")
subset <- df %>% select(all_of(vars))
print(subset)

R Code Explanation

all_of(vars) allows you to select columns using a character vector of names.
This is robust to variable name changes and works with non-standard names.
Useful for dynamic or metadata-driven column selection.

Output Table:

id	score
1	10
2	20

5. Pattern-Based Column Selection (Advanced Exploration in R)

Purpose:
To make column selection scalable, maintainable, and expressive by referencing patterns instead of literal names.

Input Table Example:

id	Q1	Q2	age	score_2022	score_2023	Sales_2023	KPI_A	KPI_B
1	5	6	25	80	90	1000	0.8	0.9
2	7	8	30	85	95	1500	0.7	0.8

A. Select columns starting with "Q":

library(dplyr)
df <- data.frame(
  id = c(1, 2),
  Q1 = c(5, 7),
  Q2 = c(6, 8),
  age = c(25, 30),
  score_2022 = c(80, 85),
  score_2023 = c(90, 95),
  Sales_2023 = c(1000, 1500),
  KPI_A = c(0.8, 0.7),
  KPI_B = c(0.9, 0.8)
)
df %>% select(starts_with("Q"))

R Code Explanation

starts_with("Q") selects all columns whose names begin with "Q".
This is a tidyselect helper for pattern-based selection in dplyr.

Output Table:

Q1	Q2
5	6
7	8

B. Select columns ending with "_2023":

library(dplyr)
df <- data.frame(
  id = c(1, 2),
  Q1 = c(5, 7),
  Q2 = c(6, 8),
  age = c(25, 30),
  score_2022 = c(80, 85),
  score_2023 = c(90, 95),
  Sales_2023 = c(1000, 1500),
  KPI_A = c(0.8, 0.7),
  KPI_B = c(0.9, 0.8)
)
df %>% select(ends_with("_2023"))

R Code Explanation

ends_with("_2023") selects columns whose names end with "_2023".
This is useful for selecting year-specific or versioned columns.

Output Table:

score_2023	Sales_2023
90	1000
95	1500

C. Select columns containing "age":

library(dplyr)
df <- data.frame(
  id = c(1, 2),
  Q1 = c(5, 7),
  Q2 = c(6, 8),
  age = c(25, 30),
  score_2022 = c(80, 85),
  score_2023 = c(90, 95),
  Sales_2023 = c(1000, 1500),
  KPI_A = c(0.8, 0.7),
  KPI_B = c(0.9, 0.8)
)
df %>% select(contains("age"))

R Code Explanation

contains("age") selects columns with "age" anywhere in the name.
This is helpful for flexible pattern matching in column selection.

Output Table:

age
25
30

D. Select columns matching regex for "score_" followed by 4 digits:

library(dplyr)
df <- data.frame(
  id = c(1, 2),
  Q1 = c(5, 7),
  Q2 = c(6, 8),
  age = c(25, 30),
  score_2022 = c(80, 85),
  score_2023 = c(90, 95),
  Sales_2023 = c(1000, 1500),
  KPI_A = c(0.8, 0.7),
  KPI_B = c(0.9, 0.8)
)
df %>% select(matches("^score_\\d{4}$"))

R Code Explanation

matches("^score_\\d{4}$") selects columns whose names match the regular expression for "score_" followed by exactly four digits.
This is powerful for advanced pattern-based selection.

Output Table:

score_2022	score_2023
80	90
85	95

E. Move 'id' to the front, keep all columns:

library(dplyr)
df <- data.frame(
  id = c(1, 2),
  Q1 = c(5, 7),
  Q2 = c(6, 8),
  age = c(25, 30),
  score_2022 = c(80, 85),
  score_2023 = c(90, 95),
  Sales_2023 = c(1000, 1500),
  KPI_A = c(0.8, 0.7),
  KPI_B = c(0.9, 0.8)
)
df %>% select(id, everything())

R Code Explanation

select(id, everything()) moves the "id" column to the front, keeping all other columns in their original order.
everything() is a tidyselect helper for all remaining columns.

Output Table:

id	Q1	Q2	age	score_2022	score_2023	Sales_2023	KPI_A	KPI_B
1	5	6	25	80	90	1000	0.8	0.9
2	7	8	30	85	95	1500	0.7	0.8

F. Rename a column inline using select():

library(dplyr)
df <- data.frame(
  Q1 = c(5, 7),
  Q2 = c(6, 8)
)
df %>% select(Q1 = starts_with("Q1"))

R Code Explanation

You can rename columns inline in select() by specifying new_name = old_name.
This is a concise way to rename and select columns in one step.

Output Table:

Q1
5
7

G. Aggregate all columns starting with "KPI":

library(dplyr)
df <- data.frame(
  KPI_A = c(0.8, 0.7),
  KPI_B = c(0.9, 0.8)
)
df %>% summarise(across(starts_with("KPI"), mean, na.rm = TRUE))

R Code Explanation

across(starts_with("KPI"), mean, na.rm = TRUE) applies the mean function to all columns starting with "KPI".
summarise() returns a single-row summary with the means.

Output Table:

KPI_A	KPI_B
0.75	0.85

H. Select columns by type and pattern:

library(dplyr)
df <- data.frame(
  id = c(1, 2),
  Q1 = c(5, 7),
  Q2 = c(6, 8),
  age = c(25, 30),
  score_2022 = c(80, 85),
  score_2023 = c(90, 95),
  Sales_2023 = c(1000, 1500),
  KPI_A = c(0.8, 0.7),
  KPI_B = c(0.9, 0.8)
)
df %>% select(where(is.numeric), starts_with("KPI"))

R Code Explanation

where(is.numeric) selects all numeric columns.
You can combine type-based and pattern-based selection for flexible column filtering.

Output Table:

Q1	Q2	age	score_2022	score_2023	Sales_2023	KPI_A	KPI_B
5	6	25	80	90	1000	0.8	0.9
7	8	30	85	95	1500	0.7	0.8

Explanation:
Pattern-based selection in R using dplyr and tidyselect helpers allows you to flexibly choose columns based on their names or types, making code more maintainable and adaptable to changing data structures.

6. RENAME Concept

Input Table:

cust_id	score
101	10
102	20

SAS: Renaming is done inline via dataset options.

data renamed;
  set raw(rename=(cust_id=ID));
run;

library(dplyr)
df <- data.frame(
  cust_id = c(101, 102),
  score = c(10, 20)
)
# Inline rename
df %>% select(ID = cust_id)

# Explicit rename
df %>% rename(ID = cust_id)

R Code Explanation

select(ID = cust_id) renames "cust_id" to "ID" while selecting it.
rename(ID = cust_id) renames the column in the data frame.
Both approaches are valid; use rename() for renaming only, or select() to rename and subset columns at once.

Output Table:

ID	score
101	10
102	20

7. Metadata-Driven Selections (Advanced)

Input Table:

id	name	age	score
1	Ann	25	10
2	Bob	30	20

SAS: You can dynamically select variables based on their metadata using dictionary.columns:

proc sql noprint;
  select name into :char_vars separated by ' '
  from dictionary.columns
  where libname='WORK' and memname='DATASET' and type='char';
quit;

data only_chars;
  set dataset(keep=&char_vars);
run;

library(dplyr)
df <- data.frame(
  id = c(1, 2),
  name = c("Ann", "Bob"),
  age = c(25, 30),
  score = c(10, 20)
)
only_chars <- df %>% select(where(is.character))
print(only_chars)

Output Table:

id	name
1	Ann
2	Bob

8. Fuzzy Renaming and Cleanup (R)

Input Table:

First Name	Last Name
Ann	Smith
Bob	Jones

Use janitor::clean_names() to standardize column names across messy sources:

library(janitor)
df <- data.frame(
  "First Name" = c("Ann", "Bob"),
  "Last Name" = c("Smith", "Jones")
)
df <- df %>% clean_names()  # Converts "First Name" to "first_name"
print(df)

Output Table:

first_name	last_name
Ann	Smith
Bob	Jones

9. Labeling and Provenance

Input Table:

emp_id	name
101	Ann
102	Bob

SAS: Use the label statement to track variable origin or context:

data labeled;
  set data(rename=(emp_id=id));
  label id = "Employee ID from HR system";
run;

R: Use attributes to tag metadata:

df <- data.frame(
  id = c(101, 102),
  name = c("Ann", "Bob")
)
attr(df$id, "label") <- "Customer ID from CRM import"
# Display the data frame with attributes
str(df)

R Code Explanation

You can attach metadata to columns using attributes, such as a "label" for provenance or documentation.
str(df) shows the structure and attached attributes.

Output:

> str(df)
'data.frame':	2 obs. of  2 variables:
 $ id  : num  101 102
  ..- attr(*, "label")= chr "Customer ID from CRM import"
 $ name: chr  "Ann" "Bob"

10. Column Grouping and Summaries (R)

Input Table:

region	sales_2022	sales_2023
East	1000	1200
West	1500	1600

library(dplyr)
df <- data.frame(
  region = c("East", "West"),
  sales_2022 = c(1000, 1500),
  sales_2023 = c(1200, 1600)
)
df %>%
  group_by(region) %>%
  summarise(across(starts_with("sales_"), sum, na.rm = TRUE))

R Code Explanation

group_by(region) groups the data by region.
summarise(across(starts_with("sales_"), sum, na.rm = TRUE)) computes the sum of all columns starting with "sales_" for each group.
This is a powerful pattern for grouped summaries in R.

Output Table:

region	sales_2022	sales_2023
East	1000	1200
West	1500	1600

Final Reference Table

Feature	SAS Example	R Example (dplyr)
Keep specific columns	`keep=var1 var2`	`select(var1, var2)`
Drop specific columns	`drop=var1 var2`	`select(-var1, -var2)`
Rename columns	`rename=(old=new)`	`rename(new = old)`
Pattern match columns	Not directly supported	`starts_with()`, `contains()`, `matches()`
Dynamic var list	`%let vars=...; keep=&vars`	`vars <- c(...); select(all_of(vars))`
Metadata-based selection	`dictionary.columns`, `proc contents`	`select(where(is.numeric))`
Column provenance	`label id = "..."`	`attr(df$id, "label") <- "..."`
Reusable selection	`%macro var_keep...`	`function(df) { select(df, ...) }`

11. Common Error/Warning Scenarios for KEEP, DROP, and RENAME in R Equivalent

11.1. Selecting Non-Existent Columns

Scenario:
Trying to select or drop a column that does not exist.

Input Table:

id	score
1	10
2	20

R Code:

library(dplyr)
df <- data.frame(
  id = c(1, 2),
  score = c(10, 20)
)
df %>% select(id, age)   # 'age' does not exist

Output:
Error: Can't subset columns that don't exist.

Solution:
Use any_of() to ignore missing columns:

df %>% select(any_of(c("id", "age")))

Output Table:

id
1
2

11.2. Type Mismatch in Pattern Selection

Scenario:
Selecting columns by pattern that match different types.

Input Table:

id	age	age_group
1	25	"young"
2	30	"adult"

R Code:

library(dplyr)
df <- data.frame(
  id = c(1, 2),
  age = c(25, 30),
  age_group = c("young", "adult")
)
df %>% select(starts_with("age"))

Output Table:

age	age_group
25	"young"
30	"adult"

Note:
If you expect only numeric columns, use:

df %>% select(where(is.numeric), starts_with("age"))

Output Table:

age
25
30

11.3. Using Non-Standard Column Names

Scenario:
Column names with spaces or special characters can cause errors if referenced without quotes or backticks.

Input Table:

First Name	Score
Ann	10
Bob	20

Problem Example:

df %>% select(First Name)   # Error: unexpected symbol

Solution 1: Use backticks

df %>% select(`First.Name`)
#Note there is no space intead we have used a dot

Output Table:

First Name
Ann
Bob

Solution 2: Clean names for safe usage

library(janitor)
df_clean <- df %>% clean_names()
df_clean %>% select(first_name)

Output Table:

first_name
Ann
Bob

Explanation:
Use backticks to reference columns with spaces or special characters. For consistent and safe column names, use janitor::clean_names() to standardize them before selection.

11.4. Pattern Selection Returns No Columns

Scenario:
Pattern does not match any columns.

Input Table:

id	score
1	10
2	20

R Code:

library(dplyr)
df <- data.frame(
  id = c(1, 2),
  score = c(10, 20)
)
df %>% select(starts_with("Q"))

Output Table:
Empty data frame.

Solution:
Check available column names with names(df).

11.5. Selecting Columns by Type with Pattern

Scenario:
Selecting numeric columns that match a pattern.

Input Table:

id	Q1	Q2	age	score
1	5	6	25	10
2	7	8	30	20

R Code:

library(dplyr)
df <- data.frame(
  id = c(1, 2),
  Q1 = c(5, 7),
  Q2 = c(6, 8),
  age = c(25, 30),
  score = c(10, 20)
)
df %>% select(where(is.numeric), starts_with("Q"))

Output Table:

Q1	Q2	age	score
5	6	25	10
7	8	30	20

Resource download links

1.4.5.-KEEP-DROP-and-RENAME-in-SAS-and-R-Equivalent.zip

⁂

1.4. Migrating from SAS to R: A Skill Conversion Guide