contact@a2zlearners.com

1.4. Migrating from SAS to R: A Skill Conversion Guide

1.4.13. First and Last Records in SAS vs R equivalent

1. Basic Approaches: Identifying First/Last Records by Group

Capability SAS (first./last. variables) R (dplyr functions)
First record in group first.groupvar temporary variable row_number() == 1 or dplyr::first()
Last record in group last.groupvar temporary variable row_number() == n() or dplyr::last()
Multiple grouping by var1 var2; if first.var2 then... group_by(var1, var2) %>% filter(row_number() == 1)
Required sorting Must sort data before using first./last. No explicit sorting required when using dplyr functions
Efficiency Very efficient with SAS's sorted data model Optimized through dplyr's group processing

SAS Example: Basic First/Last Record Identification

/* Find first observation for each patient */
proc sort data=patients;
  by patientid visitdate;
run;

data first_visits;
  set patients;
  by patientid;
  if first.patientid;
run;

Explanation (SAS):

  • Sorts the dataset by patientid and visitdate using PROC SORT
  • In the DATA step, the by patientid statement establishes groups
  • first.patientid is TRUE only for the first observation of each patient
  • The IF statement retains only records where first.patientid is TRUE
  • Result is a dataset containing only the first visit for each patient

R Example: Basic First/Last Record Identification

library(dplyr)

# Dummy data for patients
patients <- data.frame(
  patientid = c(101, 101, 101, 102, 102, 103, 103),
  visitdate = as.Date(c("2023-01-15", "2023-02-20", "2023-03-25", "2023-01-10", "2023-03-05", "2023-02-01", "2023-02-28")),
  bp_systolic = c(120, 118, 122, 130, 128, 115, 118),
  bp_diastolic = c(80, 78, 82, 85, 84, 75, 76),
  weight = c(70.5, 71.2, 70.8, 65.4, 64.9, 80.2, 79.5)
)

# Find first observation for each patient
first_visits <- patients %>%
  group_by(patientid) %>%
  arrange(patientid, visitdate) %>%
  filter(row_number() == 1)

# Alternative approach
first_visits_alt <- patients %>%
  group_by(patientid) %>%
  slice_min(order_by = visitdate, n = 1)

Explanation (R):

  • Uses the dplyr package for data manipulation
  • group_by(patientid) establishes patient groups
  • arrange(patientid, visitdate) sorts data within each group
  • filter(row_number() == 1) keeps only the first record in each group
  • Alternative approach uses slice_min() to directly select the minimum visitdate
  • No temporary variables needed; operations are performed in a pipeline

Input: Patient Visits Dataset

patientid visitdate bp_systolic bp_diastolic weight
101 2023-01-15 120 80 70.5
101 2023-02-20 118 78 71.2
101 2023-03-25 122 82 70.8
102 2023-01-10 130 85 65.4
102 2023-03-05 128 84 64.9
103 2023-02-01 115 75 80.2
103 2023-02-28 118 76 79.5

Expected Output: First Visit for Each Patient

patientid visitdate bp_systolic bp_diastolic weight
101 2023-01-15 120 80 70.5
102 2023-01-10 130 85 65.4
103 2023-02-01 115 75 80.2

2. Last Records by Group: Identifying Final Observations

SAS Example: Last Record by Group

/* Find last observation for each patient */
proc sort data=patients;
  by patientid visitdate;
run;

data last_visits;
  set patients;
  by patientid;
  if last.patientid;
run;

Explanation (SAS):

  • Same sorting approach as for first records
  • Uses last.patientid to identify the last observation in each patient group
  • Retains only records where last.patientid is TRUE
  • Result is a dataset with only the most recent visit for each patient

R Example: Last Record by Group

# Find last observation for each patient
last_visits <- patients %>%
  group_by(patientid) %>%
  arrange(patientid, visitdate) %>%
  filter(row_number() == n())

# Alternative approach
last_visits_alt <- patients %>%
  group_by(patientid) %>%
  slice_max(order_by = visitdate, n = 1)

Explanation (R):

  • Similar approach to finding first records
  • filter(row_number() == n()) keeps the last record in each group
    • n() returns the number of rows in the current group
  • Alternative approach uses slice_max() to select the record with maximum visitdate
  • Both approaches give the same result: one record per patient with the most recent visit

Expected Output: Last Visit for Each Patient

patientid visitdate bp_systolic bp_diastolic weight
101 2023-03-25 122 82 70.8
102 2023-03-05 128 84 64.9
103 2023-02-28 118 76 79.5

3. Multiple Grouping Variables: Hierarchical First/Last Identification

SAS Example: Hierarchical First/Last with Multiple Groups

/* Find first BP measurement for each visit for each patient */
proc sort data=vital_signs;
  by patientid visitdate measuretime;
run;

data first_bp_per_visit;
  set vital_signs;
  by patientid visitdate;
  if first.visitdate and parameter = "BP";
run;

Explanation (SAS):

  • Sorts by multiple variables: patientid, visitdate, measuretime
  • The by patientid visitdate statement creates hierarchical grouping
  • first.visitdate identifies the first record within each patient-visit combination
  • Additional condition parameter = "BP" ensures only BP measurements are selected
  • Result is first BP measurement for each visit for each patient

R Example: Hierarchical First/Last with Multiple Groups

# Find first BP measurement for each visit for each patient
first_bp_per_visit <- vital_signs %>%
  filter(parameter == "BP") %>%
  group_by(patientid, visitdate) %>%
  arrange(patientid, visitdate, measuretime) %>%
  filter(row_number() == 1)

# Alternative approach using slice
first_bp_per_visit_alt <- vital_signs %>%
  filter(parameter == "BP") %>%
  group_by(patientid, visitdate) %>%
  slice_min(order_by = measuretime, n = 1)

Explanation (R):

  • Filters for BP measurements first
  • Groups by both patientid and visitdate
  • Arranges by all relevant variables including measuretime
  • Uses filter(row_number() == 1) to keep the first record in each group
  • Alternative approach uses slice_min() for more concise code

Input: Vital Signs Dataset

patientid visitdate measuretime parameter value
101 2023-01-15 09:15 BP 120/80
101 2023-01-15 09:30 BP 118/79
101 2023-01-15 09:15 TEMP 37.2
101 2023-02-20 10:00 BP 118/78
101 2023-02-20 10:20 BP 119/79
102 2023-01-10 11:00 BP 130/85
102 2023-01-10 11:30 TEMP 36.8

Expected Output: First BP per Visit

patientid visitdate measuretime parameter value
101 2023-01-15 09:15 BP 120/80
101 2023-02-20 10:00 BP 118/78
102 2023-01-10 11:00 BP 130/85

4. Creating Flags for First/Last Records

SAS Example: Creating First/Last Flags

/* Create flags for first and last visits while keeping all records */
proc sort data=patients;
  by patientid visitdate;
run;

data patients_with_flags;
  set patients;
  by patientid;
  first_visit = first.patientid;
  last_visit = last.patientid;
run;

Explanation (SAS):

  • Creates new variables first_visit and last_visit as flags
  • These variables will be 1 (TRUE) or 0 (FALSE) based on the first./last. values
  • All records are kept, but now include flags identifying first and last visits

R Example: Creating First/Last Flags

# Create flags for first and last visits
patients_with_flags <- patients %>%
  group_by(patientid) %>%
  arrange(patientid, visitdate) %>%
  mutate(
    first_visit = row_number() == 1,
    last_visit = row_number() == n()
  )

Explanation (R):

  • Uses mutate() to add new variables without filtering any records
  • first_visit is TRUE when the row is first in its group, FALSE otherwise
  • last_visit is TRUE when the row is last in its group, FALSE otherwise
  • All records are retained with appropriate flags

Expected Output with Flags

patientid visitdate bp_systolic bp_diastolic weight first_visit last_visit
101 2023-01-15 120 80 70.5 TRUE FALSE
101 2023-02-20 118 78 71.2 FALSE FALSE
101 2023-03-25 122 82 70.8 FALSE TRUE
102 2023-01-10 130 85 65.4 TRUE FALSE
102 2023-03-05 128 84 64.9 FALSE TRUE
103 2023-02-01 115 75 80.2 TRUE FALSE
103 2023-02-28 118 76 79.5 FALSE TRUE

5. Beyond Basics: Finding Records Based on Complex Criteria

Capability SAS Approach R Approach
First record meeting condition if condition and first.group then... filter(condition) %>% group_by(var) %>% slice_head()
Last record before event Requires multiple steps, sort and first./last. filter(date < event_date) %>% slice_max(date)
Nth record in group Custom logic with counters slice(n) or filter(row_number() == n)
First non-missing record if not(missing(var)) and first.group then... filter(!is.na(var)) %>% slice_head()

SAS Example: First Non-Missing Lab Value

/* Find first non-missing lab value for each patient and test */
proc sort data=lab_results;
  by patientid testcode result_dt;
run;

data first_valid_result;
  set lab_results;
  by patientid testcode;
  if not missing(result) and first.testcode then output;
run;

Explanation (SAS):

  • Sorts by patientid, testcode, and result_dt
  • Checks for first occurrence by test within each patient where result is not missing
  • Only outputs records meeting both conditions
  • Result is first valid lab value for each patient-test combination

R Example: First Non-Missing Lab Value

# Find first non-missing lab value for each patient and test
first_valid_result <- lab_results %>%
  filter(!is.na(result)) %>%
  group_by(patientid, testcode) %>%
  arrange(patientid, testcode, result_dt) %>%
  slice_head(n = 1)

# Alternative approach
first_valid_result_alt <- lab_results %>%
  group_by(patientid, testcode) %>%
  arrange(patientid, testcode, result_dt) %>%
  filter(!is.na(result)) %>%
  slice_head(n = 1)

Explanation (R):

  • Two valid approaches with subtle differences:
    1. First approach filters out missing values before grouping
    2. Second approach filters after grouping, which preserves groups even if all values are missing
  • Both use slice_head() to take the first record in each group
  • Result is first non-missing lab value for each test for each patient

Input: Lab Results Dataset

patientid testcode result_dt result unit
101 GLU 2023-01-15 NA mg/dL
101 GLU 2023-02-20 95 mg/dL
101 HGB 2023-01-15 14.2 g/dL
101 HGB 2023-02-20 14.0 g/dL
102 GLU 2023-01-10 110 mg/dL
102 HGB 2023-01-10 NA g/dL
102 HGB 2023-03-05 13.5 g/dL

Expected Output: First Non-Missing Results

patientid testcode result_dt result unit
101 GLU 2023-02-20 95 mg/dL
101 HGB 2023-01-15 14.2 g/dL
102 GLU 2023-01-10 110 mg/dL
102 HGB 2023-03-05 13.5 g/dL

6. Advanced Techniques: Windowed First/Last Operations

SAS Example: First Record in Rolling Window

/* Find first record within 30 days before each treatment */
proc sort data=medical_events;
  by patientid event_date;
run;

data baseline_events;
  set medical_events;
  by patientid;
  retain treatment_date;
  
  if event_type = "TREATMENT" then do;
    treatment_date = event_date;
    output;
  end;
  else if event_type = "ASSESSMENT" and 
          0 <= (treatment_date - event_date) <= 30 then do;
    is_baseline = 1;
    if first.patientid or first_baseline then do;
      first_baseline = 1;
      output;
    end;
  end;
run;

Explanation (SAS):

  • Uses retention and first. variables to track the first assessment within 30 days before each treatment
  • Complex logic with multiple conditions
  • Requires careful handling of date comparisons and group processing

R Example: First Record in Rolling Window

# Find first assessment within 30 days before each treatment
baseline_events <- medical_events %>%
  filter(event_type == "TREATMENT") %>%
  select(patientid, treatment_date = event_date) %>%
  inner_join(
    medical_events %>% filter(event_type == "ASSESSMENT"),
    by = "patientid"
  ) %>%
  filter(
    event_date <= treatment_date,
    event_date >= treatment_date - 30
  ) %>%
  group_by(patientid, treatment_date) %>%
  arrange(patientid, treatment_date, desc(event_date)) %>%
  slice_head(n = 1) %>%
  mutate(is_baseline = 1)

Explanation (R):

  • Uses join operations to connect treatments with assessments
  • Filters for assessments within the 30-day window before treatment
  • Groups by patient and treatment date
  • Selects the most recent assessment in each window with slice_head()
  • Creates clearer, declarative code despite complex logic

Input: Medical Events Dataset

patientid event_type event_date value
101 ASSESSMENT 2023-01-05 45
101 ASSESSMENT 2023-01-25 50
101 TREATMENT 2023-02-01 NA
101 ASSESSMENT 2023-02-15 40
102 ASSESSMENT 2023-01-01 60
102 ASSESSMENT 2023-01-15 55
102 TREATMENT 2023-02-01 NA
102 ASSESSMENT 2023-03-01 50

Expected Output: Baseline Assessments

patientid treatment_date event_type event_date value is_baseline
101 2023-02-01 ASSESSMENT 2023-01-25 50 1
102 2023-02-01 ASSESSMENT 2023-01-15 55 1

7. Beyond Basics: Extreme Values Per Group

SAS Example: Min and Max Values with First/Last

/* Find both minimum and maximum lab values for each patient-test */
proc sort data=lab_results;
  by patientid testcode result;
run;

data min_max_results;
  set lab_results;
  by patientid testcode result;
  if first.testcode then min_flag = "Y";
  if last.testcode then max_flag = "Y";
run;

Explanation (SAS):

  • Sorts by patientid, testcode, and the actual result value
  • Sets a flag for minimum (first) and maximum (last) values
  • Logical approach leveraging first./last. after appropriate sorting

R Example: Min and Max Values

# Find both minimum and maximum lab values for each patient-test
min_max_results <- lab_results %>%
  filter(!is.na(result)) %>%
  group_by(patientid, testcode) %>%
  mutate(
    min_flag = if_else(result == min(result), "Y", NA_character_),
    max_flag = if_else(result == max(result), "Y", NA_character_)
  )

# Alternative approach showing exact first/last record logic
min_max_results_alt <- lab_results %>%
  filter(!is.na(result)) %>%
  group_by(patientid, testcode) %>%
  arrange(patientid, testcode, result) %>%
  mutate(
    min_flag = if_else(row_number() == 1, "Y", NA_character_),
    max_flag = if_else(row_number() == n(), "Y", NA_character_)
  )

Explanation (R):

  • First approach uses min() and max() directly to identify extreme values
  • Second approach follows SAS logic more directly with sorting and row positions
  • Both create flags for minimum and maximum values
  • Demonstrates how R can solve the problem in multiple ways

Input: Lab Results Dataset

patientid testcode result_dt result unit
101 GLU 2023-01-15 90 mg/dL
101 GLU 2023-02-20 95 mg/dL
101 GLU 2023-03-25 88 mg/dL
101 HGB 2023-01-15 14.2 g/dL
101 HGB 2023-02-20 14.0 g/dL
102 GLU 2023-01-10 110 mg/dL
102 GLU 2023-03-05 105 mg/dL

Expected Output: Min and Max Flagged

patientid testcode result_dt result unit min_flag max_flag
101 GLU 2023-01-15 90 mg/dL
101 GLU 2023-02-20 95 mg/dL Y
101 GLU 2023-03-25 88 mg/dL Y
101 HGB 2023-01-15 14.2 g/dL Y
101 HGB 2023-02-20 14.0 g/dL Y
102 GLU 2023-01-10 110 mg/dL Y
102 GLU 2023-03-05 105 mg/dL Y

8. Summary: SAS First/Last vs R Equivalents

Feature SAS First/Last R Approach
Syntax complexity Simple once understood, requires sorting Slightly more verbose, but more flexible
Performance Very efficient for large datasets Efficient with grouped operations
Integration with language Tightly integrated with DATA step processing Fits well within tidyverse data manipulation patterns
Multiple grouping variables Supports hierarchical first/last variables Supports multiple grouping levels
Complex conditions Combines with other DATA step logic Combines well with other dplyr functions
Sorting requirements Must sort data correctly before using Explicitly sort with arrange() when order matters
Readability Concise but requires understanding of implied logic More explicit about operations being performed

Key Takeaways:

  1. SAS first./last. variables provide a powerful way to identify boundary records after sorting
  2. R's dplyr functions offer equivalent functionality with row_number(), slice_head(), slice_tail() and related functions
  3. Both approaches require understanding the underlying data structure and proper grouping
  4. R often separates the sorting step from the selection step, making logic more explicit
  5. For complex scenarios, R may require more planning but offers more flexibility
  6. Both approaches can handle advanced tasks like windowed operations and conditional selections
  7. The choice between SAS and R approaches often depends on familiarity and specific project requirements

**Resource download links**

1.4.13.-First-and-Last-Records-in-SAS-vs-R-equivalent.zip