1.4. Migrating from SAS to R: A Skill Conversion Guide
1.4.5. KEEP, DROP, and RENAME in SAS and R Equivalent
1. DROP Concept
Purpose:
To remove unwanted columns from a dataset to reduce memory, improve clarity, or streamline processing.
Input Table:
| id | col1 | col2 | score |
|---|---|---|---|
| 1 | A | X | 10 |
| 2 | B | Y | 20 |
SAS:
DROPremoves variables at input (set/merge) or output (data) level.- Use the
DROP=dataset option when you want to avoid pulling unnecessary columns into the Program Data Vector.
/* Drop during processing */
data trimmed;
set raw;
drop col1 col2;
run;
/* Drop while reading to save memory */
data trimmed;
set raw(drop=col1 col2);
run;
R (dplyr):
- Use
select()with negative indexing. - Efficient for ad-hoc exclusion or working with dynamic patterns.
library(dplyr)
df <- data.frame(
id = c(1, 2),
col1 = c("A", "B"),
col2 = c("X", "Y"),
score = c(10, 20)
)
cleaned <- df %>% select(-col1, -col2)
print(cleaned)
R Code Explanation
select(-col1, -col2)removes the specified columns from the data frame.- This is the recommended way to drop columns in R using dplyr.
- The result is a data frame with only the remaining columns.
Output Table:
| id | score |
|---|---|
| 1 | 10 |
| 2 | 20 |
2. KEEP Concept
Purpose:
To retain only specific columns, ensuring your dataset includes just what's needed for analysis or output.
Input Table:
| id | col1 | col2 | score |
|---|---|---|---|
| 1 | A | X | 10 |
| 2 | B | Y | 20 |
SAS:
data subset;
set full(keep=id score);
run;
data subset;
set full;
keep id score;
run;
R:
library(dplyr)
df <- data.frame(
id = c(1, 2),
col1 = c("A", "B"),
col2 = c("X", "Y"),
score = c(10, 20)
)
selected <- df %>% select(id, score)
print(selected)
R Code Explanation
select(id, score)keeps only the specified columns in the data frame.- This is the standard approach for column selection in R with dplyr.
- The result is a data frame containing just the columns you want.
Output Table:
| id | score |
|---|---|
| 1 | 10 |
| 2 | 20 |
3. KEEP/DROP Positioning in SAS
Variables can be kept/dropped:
- At input (
SET): reduces memory usage during processing. - At output (
DATA): allows temporary use of columns during transformation.
Input Table:
| id | score | flag |
|---|---|---|
| 1 | 10 | Y |
| 2 | 20 | N |
data final(drop=flag);
set interim(keep=id score flag);
run;
R Equivalent (two-step):
library(dplyr)
df <- data.frame(
id = c(1, 2),
score = c(10, 20),
flag = c("Y", "N")
)
df_temp <- df %>% select(id, score, flag)
final <- df_temp %>% select(-flag)
print(final)
R Code Explanation
- You can chain
select()calls to first keep columns you need for processing, then drop temporary columns before final output. - This mirrors the SAS approach of keeping columns at input and dropping at output.
Output Table:
| id | score |
|---|---|
| 1 | 10 |
| 2 | 20 |
4. Dynamic KEEP Lists
Input Table:
| id | score | age |
|---|---|---|
| 1 | 10 | 25 |
| 2 | 20 | 30 |
SAS: Define a macro variable to pass variable names:
%let vars = id score;
data result;
set source(keep=&vars);
run;
R:
Use a character vector and all_of() for safe, non-standard variable names:
library(dplyr)
df <- data.frame(
id = c(1, 2),
score = c(10, 20),
age = c(25, 30)
)
vars <- c("id", "score")
subset <- df %>% select(all_of(vars))
print(subset)
R Code Explanation
all_of(vars)allows you to select columns using a character vector of names.- This is robust to variable name changes and works with non-standard names.
- Useful for dynamic or metadata-driven column selection.
Output Table:
| id | score |
|---|---|
| 1 | 10 |
| 2 | 20 |
5. Pattern-Based Column Selection (Advanced Exploration in R)
Purpose:
To make column selection scalable, maintainable, and expressive by referencing patterns instead of literal names.
Input Table Example:
| id | Q1 | Q2 | age | score_2022 | score_2023 | Sales_2023 | KPI_A | KPI_B |
|---|---|---|---|---|---|---|---|---|
| 1 | 5 | 6 | 25 | 80 | 90 | 1000 | 0.8 | 0.9 |
| 2 | 7 | 8 | 30 | 85 | 95 | 1500 | 0.7 | 0.8 |
A. Select columns starting with "Q":
library(dplyr)
df <- data.frame(
id = c(1, 2),
Q1 = c(5, 7),
Q2 = c(6, 8),
age = c(25, 30),
score_2022 = c(80, 85),
score_2023 = c(90, 95),
Sales_2023 = c(1000, 1500),
KPI_A = c(0.8, 0.7),
KPI_B = c(0.9, 0.8)
)
df %>% select(starts_with("Q"))
R Code Explanation
starts_with("Q")selects all columns whose names begin with "Q".- This is a tidyselect helper for pattern-based selection in dplyr.
Output Table:
| Q1 | Q2 |
|---|---|
| 5 | 6 |
| 7 | 8 |
B. Select columns ending with "_2023":
library(dplyr)
df <- data.frame(
id = c(1, 2),
Q1 = c(5, 7),
Q2 = c(6, 8),
age = c(25, 30),
score_2022 = c(80, 85),
score_2023 = c(90, 95),
Sales_2023 = c(1000, 1500),
KPI_A = c(0.8, 0.7),
KPI_B = c(0.9, 0.8)
)
df %>% select(ends_with("_2023"))
R Code Explanation
ends_with("_2023")selects columns whose names end with "_2023".- This is useful for selecting year-specific or versioned columns.
Output Table:
| score_2023 | Sales_2023 |
|---|---|
| 90 | 1000 |
| 95 | 1500 |
C. Select columns containing "age":
library(dplyr)
df <- data.frame(
id = c(1, 2),
Q1 = c(5, 7),
Q2 = c(6, 8),
age = c(25, 30),
score_2022 = c(80, 85),
score_2023 = c(90, 95),
Sales_2023 = c(1000, 1500),
KPI_A = c(0.8, 0.7),
KPI_B = c(0.9, 0.8)
)
df %>% select(contains("age"))
R Code Explanation
contains("age")selects columns with "age" anywhere in the name.- This is helpful for flexible pattern matching in column selection.
Output Table:
| age |
|---|
| 25 |
| 30 |
D. Select columns matching regex for "score_" followed by 4 digits:
library(dplyr)
df <- data.frame(
id = c(1, 2),
Q1 = c(5, 7),
Q2 = c(6, 8),
age = c(25, 30),
score_2022 = c(80, 85),
score_2023 = c(90, 95),
Sales_2023 = c(1000, 1500),
KPI_A = c(0.8, 0.7),
KPI_B = c(0.9, 0.8)
)
df %>% select(matches("^score_\\d{4}$"))
R Code Explanation
matches("^score_\\d{4}$")selects columns whose names match the regular expression for "score_" followed by exactly four digits.- This is powerful for advanced pattern-based selection.
Output Table:
| score_2022 | score_2023 |
|---|---|
| 80 | 90 |
| 85 | 95 |
E. Move 'id' to the front, keep all columns:
library(dplyr)
df <- data.frame(
id = c(1, 2),
Q1 = c(5, 7),
Q2 = c(6, 8),
age = c(25, 30),
score_2022 = c(80, 85),
score_2023 = c(90, 95),
Sales_2023 = c(1000, 1500),
KPI_A = c(0.8, 0.7),
KPI_B = c(0.9, 0.8)
)
df %>% select(id, everything())
R Code Explanation
select(id, everything())moves the "id" column to the front, keeping all other columns in their original order.everything()is a tidyselect helper for all remaining columns.
Output Table:
| id | Q1 | Q2 | age | score_2022 | score_2023 | Sales_2023 | KPI_A | KPI_B |
|---|---|---|---|---|---|---|---|---|
| 1 | 5 | 6 | 25 | 80 | 90 | 1000 | 0.8 | 0.9 |
| 2 | 7 | 8 | 30 | 85 | 95 | 1500 | 0.7 | 0.8 |
F. Rename a column inline using select():
library(dplyr)
df <- data.frame(
Q1 = c(5, 7),
Q2 = c(6, 8)
)
df %>% select(Q1 = starts_with("Q1"))
R Code Explanation
- You can rename columns inline in
select()by specifyingnew_name = old_name. - This is a concise way to rename and select columns in one step.
Output Table:
| Q1 |
|---|
| 5 |
| 7 |
G. Aggregate all columns starting with "KPI":
library(dplyr)
df <- data.frame(
KPI_A = c(0.8, 0.7),
KPI_B = c(0.9, 0.8)
)
df %>% summarise(across(starts_with("KPI"), mean, na.rm = TRUE))
R Code Explanation
across(starts_with("KPI"), mean, na.rm = TRUE)applies the mean function to all columns starting with "KPI".summarise()returns a single-row summary with the means.
Output Table:
| KPI_A | KPI_B |
|---|---|
| 0.75 | 0.85 |
H. Select columns by type and pattern:
library(dplyr)
df <- data.frame(
id = c(1, 2),
Q1 = c(5, 7),
Q2 = c(6, 8),
age = c(25, 30),
score_2022 = c(80, 85),
score_2023 = c(90, 95),
Sales_2023 = c(1000, 1500),
KPI_A = c(0.8, 0.7),
KPI_B = c(0.9, 0.8)
)
df %>% select(where(is.numeric), starts_with("KPI"))
R Code Explanation
where(is.numeric)selects all numeric columns.- You can combine type-based and pattern-based selection for flexible column filtering.
Output Table:
| Q1 | Q2 | age | score_2022 | score_2023 | Sales_2023 | KPI_A | KPI_B |
|---|---|---|---|---|---|---|---|
| 5 | 6 | 25 | 80 | 90 | 1000 | 0.8 | 0.9 |
| 7 | 8 | 30 | 85 | 95 | 1500 | 0.7 | 0.8 |
Explanation:
Pattern-based selection in R using dplyr and tidyselect helpers allows you to flexibly choose columns based on their names or types, making code more maintainable and adaptable to changing data structures.
6. RENAME Concept
Input Table:
| cust_id | score |
|---|---|
| 101 | 10 |
| 102 | 20 |
SAS: Renaming is done inline via dataset options.
data renamed;
set raw(rename=(cust_id=ID));
run;
R:
library(dplyr)
df <- data.frame(
cust_id = c(101, 102),
score = c(10, 20)
)
# Inline rename
df %>% select(ID = cust_id)
# Explicit rename
df %>% rename(ID = cust_id)
R Code Explanation
select(ID = cust_id)renames "cust_id" to "ID" while selecting it.rename(ID = cust_id)renames the column in the data frame.- Both approaches are valid; use
rename()for renaming only, orselect()to rename and subset columns at once.
Output Table:
| ID | score |
|---|---|
| 101 | 10 |
| 102 | 20 |
7. Metadata-Driven Selections (Advanced)
Input Table:
| id | name | age | score |
|---|---|---|---|
| 1 | Ann | 25 | 10 |
| 2 | Bob | 30 | 20 |
SAS:
You can dynamically select variables based on their metadata using dictionary.columns:
proc sql noprint;
select name into :char_vars separated by ' '
from dictionary.columns
where libname='WORK' and memname='DATASET' and type='char';
quit;
data only_chars;
set dataset(keep=&char_vars);
run;
R:
library(dplyr)
df <- data.frame(
id = c(1, 2),
name = c("Ann", "Bob"),
age = c(25, 30),
score = c(10, 20)
)
only_chars <- df %>% select(where(is.character))
print(only_chars)
Output Table:
| id | name |
|---|---|
| 1 | Ann |
| 2 | Bob |
8. Fuzzy Renaming and Cleanup (R)
Input Table:
| First Name | Last Name |
|---|---|
| Ann | Smith |
| Bob | Jones |
Use janitor::clean_names() to standardize column names across messy sources:
library(janitor)
df <- data.frame(
"First Name" = c("Ann", "Bob"),
"Last Name" = c("Smith", "Jones")
)
df <- df %>% clean_names() # Converts "First Name" to "first_name"
print(df)
Output Table:
| first_name | last_name |
|---|---|
| Ann | Smith |
| Bob | Jones |
9. Labeling and Provenance
Input Table:
| emp_id | name |
|---|---|
| 101 | Ann |
| 102 | Bob |
SAS:
Use the label statement to track variable origin or context:
data labeled;
set data(rename=(emp_id=id));
label id = "Employee ID from HR system";
run;
R: Use attributes to tag metadata:
df <- data.frame(
id = c(101, 102),
name = c("Ann", "Bob")
)
attr(df$id, "label") <- "Customer ID from CRM import"
# Display the data frame with attributes
str(df)
R Code Explanation
- You can attach metadata to columns using attributes, such as a "label" for provenance or documentation.
str(df)shows the structure and attached attributes.
Output:
> str(df)
'data.frame': 2 obs. of 2 variables:
$ id : num 101 102
..- attr(*, "label")= chr "Customer ID from CRM import"
$ name: chr "Ann" "Bob"
10. Column Grouping and Summaries (R)
Input Table:
| region | sales_2022 | sales_2023 |
|---|---|---|
| East | 1000 | 1200 |
| West | 1500 | 1600 |
library(dplyr)
df <- data.frame(
region = c("East", "West"),
sales_2022 = c(1000, 1500),
sales_2023 = c(1200, 1600)
)
df %>%
group_by(region) %>%
summarise(across(starts_with("sales_"), sum, na.rm = TRUE))
R Code Explanation
group_by(region)groups the data by region.summarise(across(starts_with("sales_"), sum, na.rm = TRUE))computes the sum of all columns starting with "sales_" for each group.- This is a powerful pattern for grouped summaries in R.
Output Table:
| region | sales_2022 | sales_2023 |
|---|---|---|
| East | 1000 | 1200 |
| West | 1500 | 1600 |
**Final Reference Table**
| Feature | SAS Example | R Example (dplyr) |
|---|---|---|
| Keep specific columns | keep=var1 var2 |
select(var1, var2) |
| Drop specific columns | drop=var1 var2 |
select(-var1, -var2) |
| Rename columns | rename=(old=new) |
rename(new = old) |
| Pattern match columns | Not directly supported | starts_with(), contains(), matches() |
| Dynamic var list | %let vars=...; keep=&vars |
vars <- c(...); select(all_of(vars)) |
| Metadata-based selection | dictionary.columns, proc contents |
select(where(is.numeric)) |
| Column provenance | label id = "..." |
attr(df$id, "label") <- "..." |
| Reusable selection | %macro var_keep... |
function(df) { select(df, ...) } |
11. Common Error/Warning Scenarios for KEEP, DROP, and RENAME in R Equivalent
11.1. Selecting Non-Existent Columns
Scenario:
Trying to select or drop a column that does not exist.
Input Table:
| id | score |
|---|---|
| 1 | 10 |
| 2 | 20 |
R Code:
library(dplyr)
df <- data.frame(
id = c(1, 2),
score = c(10, 20)
)
df %>% select(id, age) # 'age' does not exist
Output:
Error: Can't subset columns that don't exist.
Solution:
Use any_of() to ignore missing columns:
df %>% select(any_of(c("id", "age")))
Output Table:
| id |
|---|
| 1 |
| 2 |
11.2. Type Mismatch in Pattern Selection
Scenario:
Selecting columns by pattern that match different types.
Input Table:
| id | age | age_group |
|---|---|---|
| 1 | 25 | "young" |
| 2 | 30 | "adult" |
R Code:
library(dplyr)
df <- data.frame(
id = c(1, 2),
age = c(25, 30),
age_group = c("young", "adult")
)
df %>% select(starts_with("age"))
Output Table:
| age | age_group |
|---|---|
| 25 | "young" |
| 30 | "adult" |
Note:
If you expect only numeric columns, use:
df %>% select(where(is.numeric), starts_with("age"))
Output Table:
| age |
|---|
| 25 |
| 30 |
11.3. Using Non-Standard Column Names
Scenario:
Column names with spaces or special characters can cause errors if referenced without quotes or backticks.
Input Table:
| First Name | Score |
|---|---|
| Ann | 10 |
| Bob | 20 |
Problem Example:
df %>% select(First Name) # Error: unexpected symbol
Solution 1: Use backticks
df %>% select(`First.Name`)
#Note there is no space intead we have used a dot
Output Table:
| First Name |
|---|
| Ann |
| Bob |
Solution 2: Clean names for safe usage
library(janitor)
df_clean <- df %>% clean_names()
df_clean %>% select(first_name)
Output Table:
| first_name |
|---|
| Ann |
| Bob |
Explanation:
Use backticks to reference columns with spaces or special characters. For consistent and safe column names, use janitor::clean_names() to standardize them before selection.
11.4. Pattern Selection Returns No Columns
Scenario:
Pattern does not match any columns.
Input Table:
| id | score |
|---|---|
| 1 | 10 |
| 2 | 20 |
R Code:
library(dplyr)
df <- data.frame(
id = c(1, 2),
score = c(10, 20)
)
df %>% select(starts_with("Q"))
Output Table:
Empty data frame.
Solution:
Check available column names with names(df).
11.5. Selecting Columns by Type with Pattern
Scenario:
Selecting numeric columns that match a pattern.
Input Table:
| id | Q1 | Q2 | age | score |
|---|---|---|---|---|
| 1 | 5 | 6 | 25 | 10 |
| 2 | 7 | 8 | 30 | 20 |
R Code:
library(dplyr)
df <- data.frame(
id = c(1, 2),
Q1 = c(5, 7),
Q2 = c(6, 8),
age = c(25, 30),
score = c(10, 20)
)
df %>% select(where(is.numeric), starts_with("Q"))
Output Table:
| Q1 | Q2 | age | score |
|---|---|---|---|
| 5 | 6 | 25 | 10 |
| 7 | 8 | 30 | 20 |
**Resource download links**
1.4.5.-KEEP-DROP-and-RENAME-in-SAS-and-R-Equivalent.zip