contact@a2zlearners.com

1.4. Migrating from SAS to R: A Skill Conversion Guide

1.4.6. SAS WHERE vs. R filter()

Purpose

Both WHERE (SAS) and filter() (R) are used to subset rows based on specific conditions—but they differ in when and how they’re applied.


1. Basic Usage

SAS WHERE (Statement or Dataset Option)

  • Filters before data hits the PDV (Program Data Vector).
  • Only accessible for existing dataset variables.

Example:

data filtered;
  set sales;
  where region = 'East';
run;

Explanation: This pulls only those rows from sales where region equals 'East', directly during read-in.

Alternative using dataset option:

data filtered;
  set sales(where=(region = 'East'));
run;

R filter() from dplyr

  • Pipe-compatible and evaluates logical expressions on the fly.
  • Supports temporary/computed variables and chaining with %>%.

Example:

library(dplyr)
sales <- data.frame(
  region = c("East", "West", "East"),
  amount = c(100, 200, 150)
)
filtered <- sales %>% filter(region == "East")
print(filtered)

R Code Explanation

  • filter(region == "East") selects rows where the region column equals "East".
  • The %>% pipe passes the data frame to filter(), making code readable and chainable.
  • This is the standard way to subset rows in R using dplyr.

2. Beyond Basics – Deeper Capabilities

**2.1. Computed Columns**

SAS:

data result;
  set sales;
  if price * quantity > 1000;
run;

Explanation: WHERE can’t use computed logic like price * quantity. This requires IF after loading rows.

R:

library(dplyr)
sales <- data.frame(
  price = c(100, 200, 300),
  quantity = c(5, 2, 1)
)
sales %>% filter(price * quantity > 1000)

R Code Explanation

  • filter(price * quantity > 1000) allows you to use computed expressions directly in the filter condition.
  • This enables flexible, on-the-fly filtering based on calculations, which is not possible with SAS WHERE.

**2.2. Multiple Conditions**

SAS:

where age > 30 and city = 'Delhi';

R:

library(dplyr)
df <- data.frame(
  age = c(25, 35, 40),
  city = c("Delhi", "Mumbai", "Delhi")
)
filter(df, age > 30, city == 'Delhi')

R Code Explanation

  • Multiple conditions in filter() are combined with logical AND.
  • Only rows where both conditions are TRUE are returned.

**2.3. List Matching (IN or %in%)**

SAS:

where region in ('East', 'West');

R:

library(dplyr)
df <- data.frame(
  region = c("East", "West", "North"),
  value = c(1, 2, 3)
)
filter(df, region %in% c("East", "West"))

R Code Explanation

  • %in% checks if the value is in a set of options.
  • This is the R equivalent of SAS IN for list-based filtering.

**2.4. Missing or Blank Strings**

SAS:

where name = '' or name is missing;

With space-trim for added safety:

where strip(name) = '';

R:

library(dplyr)
df <- data.frame(
  name = c("Ann", "", NA, "Bob")
)
filter(df, trimws(name) == "" | is.na(name))

R Code Explanation

  • trimws(name) == "" checks for blank or whitespace-only strings.
  • is.na(name) checks for missing values.
  • The filter returns rows where the name is blank or missing.

**2.5. Text Search (INDEX, FIND)**

SAS:

where index(name, 'Smith') > 0;           /* Case-sensitive */
where find(name, 'smith', 'i') > 0;       /* Case-insensitive */

R:

library(dplyr)
library(stringr)
df <- data.frame(
  name = c("Smith", "smith", "Jones")
)
filter(df, str_detect(name, "Smith"))                      # Case-sensitive
filter(df, str_detect(name, regex("smith", ignore_case=TRUE))) # Case-insensitive

# Or with base R:
filter(df, grepl("smith", name, ignore.case = TRUE))

R Code Explanation

  • str_detect() from stringr checks for substring matches.
  • Use regex(..., ignore_case=TRUE) for case-insensitive matching.
  • grepl() is the base R alternative for pattern matching in filters.

**2.6. Grouped Filtering (Using Aggregates)**

SAS (PROC SQL):

proc sql;
  create table result as
  select dept, sum(sales) as total_sales
  from sales
  group by dept
  having total_sales > 10000;
quit;

R:

library(dplyr)
sales <- data.frame(
  dept = c("A", "A", "B", "B"),
  sales = c(6000, 5000, 4000, 8000)
)
sales %>%
  group_by(dept) %>%
  summarise(total_sales = sum(sales), .groups = "drop") %>%
  filter(total_sales > 10000)

R Code Explanation

  • group_by(dept) groups the data by department.
  • summarise(total_sales = sum(sales)) computes total sales per group.
  • filter(total_sales > 10000) keeps only groups with total sales above 10,000.

**2.7. Date Range Filtering**

SAS:

proc sql;
  select *
  from patients
  where age > 60 
    and diagnosis in ('CVD', 'Diabetes')
    and admission_date between '01JAN2024'd and '31DEC2024'd;

R:

library(dplyr)
patients <- data.frame(
  age = c(61, 59, 70),
  diagnosis = c("CVD", "Diabetes", "Other"),
  admission_date = as.Date(c("2024-02-01", "2024-06-15", "2023-12-31"))
)
patients %>%
  filter(age > 60,
         diagnosis %in% c("CVD", "Diabetes"),
         admission_date >= as.Date("2024-01-01"),
         admission_date <= as.Date("2024-12-31"))

R Code Explanation

  • filter() can handle multiple conditions, including date comparisons.
  • Use as.Date() to ensure date columns are properly compared.
  • This approach is concise and readable for complex filters.

**Summary of SAS `as.Date()` `WHERE` vs R `filder`**

Feature SAS WHERE R filter() (dplyr)
Language Base Procedural (SAS) Functional (Tidyverse R)
Computed filters IF needed post-read Inline expressions supported
Temporary variables Not supported Supported
Group-based filtering PROC SQL or macros group_by() + summarise() + filter()
Null/blank string filter name = '' or is missing `trimws(name) == ""
Case-insensitive text search find(..., 'i') > 0 regex(..., ignore_case=TRUE) via str_detect()
Date handling Native SAS date literals Must convert via as.Date()
Syntax readability Verbose, segmented Pipe-chained, concise

3. Common Errors and Warnings with R filter() and Solutions

1. Column Not Found

Error:
Error in filter(): object 'region' not found

Cause:
The column name is misspelled or does not exist in the dataframe.

Solution:
Check column names with names(df) and use correct spelling.
Use backticks for non-standard names:

library(dplyr)
df <- data.frame(
  "Region Name" = c("East", "West"),
  value = c(1, 2)
)
df %>% filter(`Region.Name` == "East")

2. Non-logical Expression

Error:
Error: Problem with filter() input ... Input must be a logical vector

Cause:
The condition inside filter() does not return TRUE/FALSE.

Solution:
Ensure the condition is a logical expression:

library(dplyr)
df <- data.frame(age = c(25, 35))
df %>% filter(age > 30)   # Correct
df %>% filter(age)        # Incorrect, returns numbers not logical

3. NA Values in Filter Condition

Warning:
Rows with NA in the filter condition are dropped silently.

Solution:
Use is.na() to explicitly handle missing values:

library(dplyr)
df <- data.frame(score = c(NA, 60, 40))
df %>% filter(is.na(score) | score > 50)

4. Non-Standard Column Names

Error:
Error: unexpected symbol in "filter(df, First Name == 'Ann')"

Solution:
Use backticks or clean names:

library(dplyr)
df <- data.frame("First Name" = c("Ann", "Bob"))
df %>% filter(`First.Name` == "Ann")

library(janitor)
df <- data.frame("First Name" = c("Ann", "Bob"))
df <- df %>% clean_names()
df %>% filter(first_name == "Ann")

5. Multiple Conditions: AND vs OR

Error:
Unexpected results due to misuse of , (AND) and | (OR).

Solution:

  • Use commas for AND:
library(dplyr)
df <- data.frame(age = c(25, 35), city = c("Delhi", "Mumbai"))
df %>% filter(age > 30, city == "Delhi")
  • Use | for OR:
library(dplyr)
df <- data.frame(age = c(25, 35), city = c("Delhi", "Mumbai"))
df %>% filter(age > 30 | city == "Delhi")

6. Filtering on Grouped Data

Warning:
filter() after group_by() filters rows, not groups.

Solution:
To filter groups, use summarise() then filter():

library(dplyr)
df <- data.frame(city = c("Delhi", "Delhi", "Mumbai"), value = c(1, 2, 3))
df %>% group_by(city) %>% summarise(n = n()) %>% filter(n > 1)

7. Empty Result

Warning:
filter() returns an empty dataframe if no rows match.

Solution:
Check your condition and input data.

library(dplyr)
df <- data.frame(age = c(25, 35))
df %>% filter(age > 100)

8. Ambiguous Column Reference

Error:
Error: Column 'region' is ambiguous

Cause:
Column exists in multiple dataframes in a join or environment.

Solution:
Use explicit dataframe reference:

library(dplyr)
df <- data.frame(region = c("East", "West"))
filter(df, df$region == "East")

Summary Table: Error/Warning and Solution

Scenario Error/Warning Solution/Code Example
Column not found object 'col' not found Check names, use backticks
Non-logical expression Must be logical vector Use logical conditions (==, >, etc.)
NA in condition NA rows dropped Use is.na() in filter
Non-standard col names unexpected symbol Use backticks or clean_names()
Assignment vs comparison unexpected '=' Use == for comparison
Factor level not present 0 rows returned Check levels before filtering
AND/OR confusion Unexpected results Use , for AND, `
Grouped data filtering Filters rows, not groups Use summarise() then filter()
Empty result Empty dataframe Check condition and input
Ambiguous reference Column is ambiguous Use explicit dataframe reference

**Resource download links**

1.4.6.-SAS-WHERE-vs-R-filter.zip