1.4. Migrating from SAS to R: A Skill Conversion Guide
1.4.6. SAS WHERE vs. R filter()
Purpose
Both WHERE (SAS) and filter() (R) are used to subset rows based on specific conditions—but they differ in when and how they’re applied.
1. Basic Usage
SAS WHERE (Statement or Dataset Option)
- Filters before data hits the PDV (Program Data Vector).
- Only accessible for existing dataset variables.
Example:
data filtered;
set sales;
where region = 'East';
run;
Explanation: This pulls only those rows from sales where region equals 'East', directly during read-in.
Alternative using dataset option:
data filtered;
set sales(where=(region = 'East'));
run;
R filter() from dplyr
- Pipe-compatible and evaluates logical expressions on the fly.
- Supports temporary/computed variables and chaining with
%>%.
Example:
library(dplyr)
sales <- data.frame(
region = c("East", "West", "East"),
amount = c(100, 200, 150)
)
filtered <- sales %>% filter(region == "East")
print(filtered)
R Code Explanation
filter(region == "East")selects rows where theregioncolumn equals "East".- The
%>%pipe passes the data frame tofilter(), making code readable and chainable. - This is the standard way to subset rows in R using dplyr.
2. Beyond Basics – Deeper Capabilities
**2.1. Computed Columns**
SAS:
data result;
set sales;
if price * quantity > 1000;
run;
Explanation: WHERE can’t use computed logic like price * quantity. This requires IF after loading rows.
R:
library(dplyr)
sales <- data.frame(
price = c(100, 200, 300),
quantity = c(5, 2, 1)
)
sales %>% filter(price * quantity > 1000)
R Code Explanation
filter(price * quantity > 1000)allows you to use computed expressions directly in the filter condition.- This enables flexible, on-the-fly filtering based on calculations, which is not possible with SAS
WHERE.
**2.2. Multiple Conditions**
SAS:
where age > 30 and city = 'Delhi';
R:
library(dplyr)
df <- data.frame(
age = c(25, 35, 40),
city = c("Delhi", "Mumbai", "Delhi")
)
filter(df, age > 30, city == 'Delhi')
R Code Explanation
- Multiple conditions in
filter()are combined with logical AND. - Only rows where both conditions are TRUE are returned.
**2.3. List Matching (IN or %in%)**
SAS:
where region in ('East', 'West');
R:
library(dplyr)
df <- data.frame(
region = c("East", "West", "North"),
value = c(1, 2, 3)
)
filter(df, region %in% c("East", "West"))
R Code Explanation
%in%checks if the value is in a set of options.- This is the R equivalent of SAS
INfor list-based filtering.
**2.4. Missing or Blank Strings**
SAS:
where name = '' or name is missing;
With space-trim for added safety:
where strip(name) = '';
R:
library(dplyr)
df <- data.frame(
name = c("Ann", "", NA, "Bob")
)
filter(df, trimws(name) == "" | is.na(name))
R Code Explanation
trimws(name) == ""checks for blank or whitespace-only strings.is.na(name)checks for missing values.- The filter returns rows where the name is blank or missing.
**2.5. Text Search (INDEX, FIND)**
SAS:
where index(name, 'Smith') > 0; /* Case-sensitive */
where find(name, 'smith', 'i') > 0; /* Case-insensitive */
R:
library(dplyr)
library(stringr)
df <- data.frame(
name = c("Smith", "smith", "Jones")
)
filter(df, str_detect(name, "Smith")) # Case-sensitive
filter(df, str_detect(name, regex("smith", ignore_case=TRUE))) # Case-insensitive
# Or with base R:
filter(df, grepl("smith", name, ignore.case = TRUE))
R Code Explanation
str_detect()from stringr checks for substring matches.- Use
regex(..., ignore_case=TRUE)for case-insensitive matching. grepl()is the base R alternative for pattern matching in filters.
**2.6. Grouped Filtering (Using Aggregates)**
SAS (PROC SQL):
proc sql;
create table result as
select dept, sum(sales) as total_sales
from sales
group by dept
having total_sales > 10000;
quit;
R:
library(dplyr)
sales <- data.frame(
dept = c("A", "A", "B", "B"),
sales = c(6000, 5000, 4000, 8000)
)
sales %>%
group_by(dept) %>%
summarise(total_sales = sum(sales), .groups = "drop") %>%
filter(total_sales > 10000)
R Code Explanation
group_by(dept)groups the data by department.summarise(total_sales = sum(sales))computes total sales per group.filter(total_sales > 10000)keeps only groups with total sales above 10,000.
**2.7. Date Range Filtering**
SAS:
proc sql;
select *
from patients
where age > 60
and diagnosis in ('CVD', 'Diabetes')
and admission_date between '01JAN2024'd and '31DEC2024'd;
R:
library(dplyr)
patients <- data.frame(
age = c(61, 59, 70),
diagnosis = c("CVD", "Diabetes", "Other"),
admission_date = as.Date(c("2024-02-01", "2024-06-15", "2023-12-31"))
)
patients %>%
filter(age > 60,
diagnosis %in% c("CVD", "Diabetes"),
admission_date >= as.Date("2024-01-01"),
admission_date <= as.Date("2024-12-31"))
R Code Explanation
filter()can handle multiple conditions, including date comparisons.- Use
as.Date()to ensure date columns are properly compared. - This approach is concise and readable for complex filters.
**Summary of SAS `as.Date()` `WHERE` vs R `filder`**
| Feature | SAS WHERE |
R filter() (dplyr) |
|---|---|---|
| Language Base | Procedural (SAS) | Functional (Tidyverse R) |
| Computed filters | IF needed post-read |
Inline expressions supported |
| Temporary variables | Not supported | Supported |
| Group-based filtering | PROC SQL or macros | group_by() + summarise() + filter() |
| Null/blank string filter | name = '' or is missing |
`trimws(name) == "" |
| Case-insensitive text search | find(..., 'i') > 0 |
regex(..., ignore_case=TRUE) via str_detect() |
| Date handling | Native SAS date literals | Must convert via as.Date() |
| Syntax readability | Verbose, segmented | Pipe-chained, concise |
3. Common Errors and Warnings with R filter() and Solutions
1. Column Not Found
Error:Error in filter(): object 'region' not found
Cause:
The column name is misspelled or does not exist in the dataframe.
Solution:
Check column names with names(df) and use correct spelling.
Use backticks for non-standard names:
library(dplyr)
df <- data.frame(
"Region Name" = c("East", "West"),
value = c(1, 2)
)
df %>% filter(`Region.Name` == "East")
2. Non-logical Expression
Error:Error: Problem with filter() input ... Input must be a logical vector
Cause:
The condition inside filter() does not return TRUE/FALSE.
Solution:
Ensure the condition is a logical expression:
library(dplyr)
df <- data.frame(age = c(25, 35))
df %>% filter(age > 30) # Correct
df %>% filter(age) # Incorrect, returns numbers not logical
3. NA Values in Filter Condition
Warning:
Rows with NA in the filter condition are dropped silently.
Solution:
Use is.na() to explicitly handle missing values:
library(dplyr)
df <- data.frame(score = c(NA, 60, 40))
df %>% filter(is.na(score) | score > 50)
4. Non-Standard Column Names
Error:Error: unexpected symbol in "filter(df, First Name == 'Ann')"
Solution:
Use backticks or clean names:
library(dplyr)
df <- data.frame("First Name" = c("Ann", "Bob"))
df %>% filter(`First.Name` == "Ann")
library(janitor)
df <- data.frame("First Name" = c("Ann", "Bob"))
df <- df %>% clean_names()
df %>% filter(first_name == "Ann")
5. Multiple Conditions: AND vs OR
Error:
Unexpected results due to misuse of , (AND) and | (OR).
Solution:
- Use commas for AND:
library(dplyr)
df <- data.frame(age = c(25, 35), city = c("Delhi", "Mumbai"))
df %>% filter(age > 30, city == "Delhi")
- Use
|for OR:
library(dplyr)
df <- data.frame(age = c(25, 35), city = c("Delhi", "Mumbai"))
df %>% filter(age > 30 | city == "Delhi")
6. Filtering on Grouped Data
Warning:filter() after group_by() filters rows, not groups.
Solution:
To filter groups, use summarise() then filter():
library(dplyr)
df <- data.frame(city = c("Delhi", "Delhi", "Mumbai"), value = c(1, 2, 3))
df %>% group_by(city) %>% summarise(n = n()) %>% filter(n > 1)
7. Empty Result
Warning:filter() returns an empty dataframe if no rows match.
Solution:
Check your condition and input data.
library(dplyr)
df <- data.frame(age = c(25, 35))
df %>% filter(age > 100)
8. Ambiguous Column Reference
Error:Error: Column 'region' is ambiguous
Cause:
Column exists in multiple dataframes in a join or environment.
Solution:
Use explicit dataframe reference:
library(dplyr)
df <- data.frame(region = c("East", "West"))
filter(df, df$region == "East")
Summary Table: Error/Warning and Solution
| Scenario | Error/Warning | Solution/Code Example |
|---|---|---|
| Column not found | object 'col' not found | Check names, use backticks |
| Non-logical expression | Must be logical vector | Use logical conditions (==, >, etc.) |
| NA in condition | NA rows dropped | Use is.na() in filter |
| Non-standard col names | unexpected symbol | Use backticks or clean_names() |
| Assignment vs comparison | unexpected '=' | Use == for comparison |
| Factor level not present | 0 rows returned | Check levels before filtering |
| AND/OR confusion | Unexpected results | Use , for AND, ` |
| Grouped data filtering | Filters rows, not groups | Use summarise() then filter() |
| Empty result | Empty dataframe | Check condition and input |
| Ambiguous reference | Column is ambiguous | Use explicit dataframe reference |
**Resource download links**
1.4.6.-SAS-WHERE-vs-R-filter.zip