1.4.13. First and Last Records in SAS vs R equivalent

1.4. Migrating from SAS to R: A Skill Conversion Guide

1.4.13. First and Last Records in SAS vs R equivalent

1. Basic Approaches: Identifying First/Last Records by Group

Capability	SAS (first./last. variables)	R (dplyr functions)
First record in group	`first.groupvar` temporary variable	`row_number() == 1` or `dplyr::first()`
Last record in group	`last.groupvar` temporary variable	`row_number() == n()` or `dplyr::last()`
Multiple grouping	`by var1 var2; if first.var2 then...`	`group_by(var1, var2) %>% filter(row_number() == 1)`
Required sorting	Must sort data before using first./last.	No explicit sorting required when using dplyr functions
Efficiency	Very efficient with SAS's sorted data model	Optimized through dplyr's group processing

SAS Example: Basic First/Last Record Identification

/* Find first observation for each patient */
proc sort data=patients;
  by patientid visitdate;
run;

data first_visits;
  set patients;
  by patientid;
  if first.patientid;
run;

Explanation (SAS):

Sorts the dataset by patientid and visitdate using PROC SORT
In the DATA step, the by patientid statement establishes groups
first.patientid is TRUE only for the first observation of each patient
The IF statement retains only records where first.patientid is TRUE
Result is a dataset containing only the first visit for each patient

R Example: Basic First/Last Record Identification

library(dplyr)

# Dummy data for patients
patients <- data.frame(
  patientid = c(101, 101, 101, 102, 102, 103, 103),
  visitdate = as.Date(c("2023-01-15", "2023-02-20", "2023-03-25", "2023-01-10", "2023-03-05", "2023-02-01", "2023-02-28")),
  bp_systolic = c(120, 118, 122, 130, 128, 115, 118),
  bp_diastolic = c(80, 78, 82, 85, 84, 75, 76),
  weight = c(70.5, 71.2, 70.8, 65.4, 64.9, 80.2, 79.5)
)

# Find first observation for each patient
first_visits <- patients %>%
  group_by(patientid) %>%
  arrange(patientid, visitdate) %>%
  filter(row_number() == 1)

# Alternative approach
first_visits_alt <- patients %>%
  group_by(patientid) %>%
  slice_min(order_by = visitdate, n = 1)

Explanation (R):

Uses the dplyr package for data manipulation
group_by(patientid) establishes patient groups
arrange(patientid, visitdate) sorts data within each group
filter(row_number() == 1) keeps only the first record in each group
Alternative approach uses slice_min() to directly select the minimum visitdate
No temporary variables needed; operations are performed in a pipeline

Input: Patient Visits Dataset

patientid	visitdate	bp_systolic	bp_diastolic	weight
101	2023-01-15	120	80	70.5
101	2023-02-20	118	78	71.2
101	2023-03-25	122	82	70.8
102	2023-01-10	130	85	65.4
102	2023-03-05	128	84	64.9
103	2023-02-01	115	75	80.2
103	2023-02-28	118	76	79.5

Expected Output: First Visit for Each Patient

patientid	visitdate	bp_systolic	bp_diastolic	weight
101	2023-01-15	120	80	70.5
102	2023-01-10	130	85	65.4
103	2023-02-01	115	75	80.2

2. Last Records by Group: Identifying Final Observations

SAS Example: Last Record by Group

/* Find last observation for each patient */
proc sort data=patients;
  by patientid visitdate;
run;

data last_visits;
  set patients;
  by patientid;
  if last.patientid;
run;

Explanation (SAS):

Same sorting approach as for first records
Uses last.patientid to identify the last observation in each patient group
Retains only records where last.patientid is TRUE
Result is a dataset with only the most recent visit for each patient

R Example: Last Record by Group

# Find last observation for each patient
last_visits <- patients %>%
  group_by(patientid) %>%
  arrange(patientid, visitdate) %>%
  filter(row_number() == n())

# Alternative approach
last_visits_alt <- patients %>%
  group_by(patientid) %>%
  slice_max(order_by = visitdate, n = 1)

Explanation (R):

Similar approach to finding first records
filter(row_number() == n()) keeps the last record in each group
- n() returns the number of rows in the current group
Alternative approach uses slice_max() to select the record with maximum visitdate
Both approaches give the same result: one record per patient with the most recent visit

Expected Output: Last Visit for Each Patient

patientid	visitdate	bp_systolic	bp_diastolic	weight
101	2023-03-25	122	82	70.8
102	2023-03-05	128	84	64.9
103	2023-02-28	118	76	79.5

3. Multiple Grouping Variables: Hierarchical First/Last Identification

SAS Example: Hierarchical First/Last with Multiple Groups

/* Find first BP measurement for each visit for each patient */
proc sort data=vital_signs;
  by patientid visitdate measuretime;
run;

data first_bp_per_visit;
  set vital_signs;
  by patientid visitdate;
  if first.visitdate and parameter = "BP";
run;

Explanation (SAS):

Sorts by multiple variables: patientid, visitdate, measuretime
The by patientid visitdate statement creates hierarchical grouping
first.visitdate identifies the first record within each patient-visit combination
Additional condition parameter = "BP" ensures only BP measurements are selected
Result is first BP measurement for each visit for each patient

R Example: Hierarchical First/Last with Multiple Groups

# Find first BP measurement for each visit for each patient
first_bp_per_visit <- vital_signs %>%
  filter(parameter == "BP") %>%
  group_by(patientid, visitdate) %>%
  arrange(patientid, visitdate, measuretime) %>%
  filter(row_number() == 1)

# Alternative approach using slice
first_bp_per_visit_alt <- vital_signs %>%
  filter(parameter == "BP") %>%
  group_by(patientid, visitdate) %>%
  slice_min(order_by = measuretime, n = 1)

Explanation (R):

Filters for BP measurements first
Groups by both patientid and visitdate
Arranges by all relevant variables including measuretime
Uses filter(row_number() == 1) to keep the first record in each group
Alternative approach uses slice_min() for more concise code

Input: Vital Signs Dataset

patientid	visitdate	measuretime	parameter	value
101	2023-01-15	09:15	BP	120/80
101	2023-01-15	09:30	BP	118/79
101	2023-01-15	09:15	TEMP	37.2
101	2023-02-20	10:00	BP	118/78
101	2023-02-20	10:20	BP	119/79
102	2023-01-10	11:00	BP	130/85
102	2023-01-10	11:30	TEMP	36.8

Expected Output: First BP per Visit

patientid	visitdate	measuretime	parameter	value
101	2023-01-15	09:15	BP	120/80
101	2023-02-20	10:00	BP	118/78
102	2023-01-10	11:00	BP	130/85

4. Creating Flags for First/Last Records

SAS Example: Creating First/Last Flags

/* Create flags for first and last visits while keeping all records */
proc sort data=patients;
  by patientid visitdate;
run;

data patients_with_flags;
  set patients;
  by patientid;
  first_visit = first.patientid;
  last_visit = last.patientid;
run;

Explanation (SAS):

Creates new variables first_visit and last_visit as flags
These variables will be 1 (TRUE) or 0 (FALSE) based on the first./last. values
All records are kept, but now include flags identifying first and last visits

R Example: Creating First/Last Flags

# Create flags for first and last visits
patients_with_flags <- patients %>%
  group_by(patientid) %>%
  arrange(patientid, visitdate) %>%
  mutate(
    first_visit = row_number() == 1,
    last_visit = row_number() == n()
  )

Explanation (R):

Uses mutate() to add new variables without filtering any records
first_visit is TRUE when the row is first in its group, FALSE otherwise
last_visit is TRUE when the row is last in its group, FALSE otherwise
All records are retained with appropriate flags

Expected Output with Flags

patientid	visitdate	bp_systolic	bp_diastolic	weight	first_visit	last_visit
101	2023-01-15	120	80	70.5	TRUE	FALSE
101	2023-02-20	118	78	71.2	FALSE	FALSE
101	2023-03-25	122	82	70.8	FALSE	TRUE
102	2023-01-10	130	85	65.4	TRUE	FALSE
102	2023-03-05	128	84	64.9	FALSE	TRUE
103	2023-02-01	115	75	80.2	TRUE	FALSE
103	2023-02-28	118	76	79.5	FALSE	TRUE

5. Beyond Basics: Finding Records Based on Complex Criteria

Capability	SAS Approach	R Approach
First record meeting condition	`if condition and first.group then...`	`filter(condition) %>% group_by(var) %>% slice_head()`
Last record before event	Requires multiple steps, sort and first./last.	`filter(date < event_date) %>% slice_max(date)`
Nth record in group	Custom logic with counters	`slice(n)` or `filter(row_number() == n)`
First non-missing record	`if not(missing(var)) and first.group then...`	`filter(!is.na(var)) %>% slice_head()`

SAS Example: First Non-Missing Lab Value

/* Find first non-missing lab value for each patient and test */
proc sort data=lab_results;
  by patientid testcode result_dt;
run;

data first_valid_result;
  set lab_results;
  by patientid testcode;
  if not missing(result) and first.testcode then output;
run;

Explanation (SAS):

Sorts by patientid, testcode, and result_dt
Checks for first occurrence by test within each patient where result is not missing
Only outputs records meeting both conditions
Result is first valid lab value for each patient-test combination

R Example: First Non-Missing Lab Value

# Find first non-missing lab value for each patient and test
first_valid_result <- lab_results %>%
  filter(!is.na(result)) %>%
  group_by(patientid, testcode) %>%
  arrange(patientid, testcode, result_dt) %>%
  slice_head(n = 1)

# Alternative approach
first_valid_result_alt <- lab_results %>%
  group_by(patientid, testcode) %>%
  arrange(patientid, testcode, result_dt) %>%
  filter(!is.na(result)) %>%
  slice_head(n = 1)

Explanation (R):

Two valid approaches with subtle differences:
1. First approach filters out missing values before grouping
2. Second approach filters after grouping, which preserves groups even if all values are missing
Both use slice_head() to take the first record in each group
Result is first non-missing lab value for each test for each patient

Input: Lab Results Dataset

patientid	testcode	result_dt	result	unit
101	GLU	2023-01-15	NA	mg/dL
101	GLU	2023-02-20	95	mg/dL
101	HGB	2023-01-15	14.2	g/dL
101	HGB	2023-02-20	14.0	g/dL
102	GLU	2023-01-10	110	mg/dL
102	HGB	2023-01-10	NA	g/dL
102	HGB	2023-03-05	13.5	g/dL

Expected Output: First Non-Missing Results

patientid	testcode	result_dt	result	unit
101	GLU	2023-02-20	95	mg/dL
101	HGB	2023-01-15	14.2	g/dL
102	GLU	2023-01-10	110	mg/dL
102	HGB	2023-03-05	13.5	g/dL

6. Advanced Techniques: Windowed First/Last Operations

SAS Example: First Record in Rolling Window

/* Find first record within 30 days before each treatment */
proc sort data=medical_events;
  by patientid event_date;
run;

data baseline_events;
  set medical_events;
  by patientid;
  retain treatment_date;
  
  if event_type = "TREATMENT" then do;
    treatment_date = event_date;
    output;
  end;
  else if event_type = "ASSESSMENT" and 
          0 <= (treatment_date - event_date) <= 30 then do;
    is_baseline = 1;
    if first.patientid or first_baseline then do;
      first_baseline = 1;
      output;
    end;
  end;
run;

Explanation (SAS):

Uses retention and first. variables to track the first assessment within 30 days before each treatment
Complex logic with multiple conditions
Requires careful handling of date comparisons and group processing

R Example: First Record in Rolling Window

# Find first assessment within 30 days before each treatment
baseline_events <- medical_events %>%
  filter(event_type == "TREATMENT") %>%
  select(patientid, treatment_date = event_date) %>%
  inner_join(
    medical_events %>% filter(event_type == "ASSESSMENT"),
    by = "patientid"
  ) %>%
  filter(
    event_date <= treatment_date,
    event_date >= treatment_date - 30
  ) %>%
  group_by(patientid, treatment_date) %>%
  arrange(patientid, treatment_date, desc(event_date)) %>%
  slice_head(n = 1) %>%
  mutate(is_baseline = 1)

Explanation (R):

Uses join operations to connect treatments with assessments
Filters for assessments within the 30-day window before treatment
Groups by patient and treatment date
Selects the most recent assessment in each window with slice_head()
Creates clearer, declarative code despite complex logic

Input: Medical Events Dataset

patientid	event_type	event_date	value
101	ASSESSMENT	2023-01-05	45
101	ASSESSMENT	2023-01-25	50
101	TREATMENT	2023-02-01	NA
101	ASSESSMENT	2023-02-15	40
102	ASSESSMENT	2023-01-01	60
102	ASSESSMENT	2023-01-15	55
102	TREATMENT	2023-02-01	NA
102	ASSESSMENT	2023-03-01	50

Expected Output: Baseline Assessments

patientid	treatment_date	event_type	event_date	value	is_baseline
101	2023-02-01	ASSESSMENT	2023-01-25	50	1
102	2023-02-01	ASSESSMENT	2023-01-15	55	1

7. Beyond Basics: Extreme Values Per Group

SAS Example: Min and Max Values with First/Last

/* Find both minimum and maximum lab values for each patient-test */
proc sort data=lab_results;
  by patientid testcode result;
run;

data min_max_results;
  set lab_results;
  by patientid testcode result;
  if first.testcode then min_flag = "Y";
  if last.testcode then max_flag = "Y";
run;

Explanation (SAS):

Sorts by patientid, testcode, and the actual result value
Sets a flag for minimum (first) and maximum (last) values
Logical approach leveraging first./last. after appropriate sorting

R Example: Min and Max Values

# Find both minimum and maximum lab values for each patient-test
min_max_results <- lab_results %>%
  filter(!is.na(result)) %>%
  group_by(patientid, testcode) %>%
  mutate(
    min_flag = if_else(result == min(result), "Y", NA_character_),
    max_flag = if_else(result == max(result), "Y", NA_character_)
  )

# Alternative approach showing exact first/last record logic
min_max_results_alt <- lab_results %>%
  filter(!is.na(result)) %>%
  group_by(patientid, testcode) %>%
  arrange(patientid, testcode, result) %>%
  mutate(
    min_flag = if_else(row_number() == 1, "Y", NA_character_),
    max_flag = if_else(row_number() == n(), "Y", NA_character_)
  )

Explanation (R):

First approach uses min() and max() directly to identify extreme values
Second approach follows SAS logic more directly with sorting and row positions
Both create flags for minimum and maximum values
Demonstrates how R can solve the problem in multiple ways

Input: Lab Results Dataset

patientid	testcode	result_dt	result	unit
101	GLU	2023-01-15	90	mg/dL
101	GLU	2023-02-20	95	mg/dL
101	GLU	2023-03-25	88	mg/dL
101	HGB	2023-01-15	14.2	g/dL
101	HGB	2023-02-20	14.0	g/dL
102	GLU	2023-01-10	110	mg/dL
102	GLU	2023-03-05	105	mg/dL

Expected Output: Min and Max Flagged

patientid	testcode	result_dt	result	unit	min_flag	max_flag
101	GLU	2023-01-15	90	mg/dL
101	GLU	2023-02-20	95	mg/dL		Y
101	GLU	2023-03-25	88	mg/dL	Y
101	HGB	2023-01-15	14.2	g/dL		Y
101	HGB	2023-02-20	14.0	g/dL	Y
102	GLU	2023-01-10	110	mg/dL		Y
102	GLU	2023-03-05	105	mg/dL	Y

8. Summary: SAS First/Last vs R Equivalents

Feature	SAS First/Last	R Approach
Syntax complexity	Simple once understood, requires sorting	Slightly more verbose, but more flexible
Performance	Very efficient for large datasets	Efficient with grouped operations
Integration with language	Tightly integrated with DATA step processing	Fits well within tidyverse data manipulation patterns
Multiple grouping variables	Supports hierarchical first/last variables	Supports multiple grouping levels
Complex conditions	Combines with other DATA step logic	Combines well with other dplyr functions
Sorting requirements	Must sort data correctly before using	Explicitly sort with arrange() when order matters
Readability	Concise but requires understanding of implied logic	More explicit about operations being performed

Key Takeaways:

SAS first./last. variables provide a powerful way to identify boundary records after sorting
R's dplyr functions offer equivalent functionality with row_number(), slice_head(), slice_tail() and related functions
Both approaches require understanding the underlying data structure and proper grouping
R often separates the sorting step from the selection step, making logic more explicit
For complex scenarios, R may require more planning but offers more flexibility
Both approaches can handle advanced tasks like windowed operations and conditional selections
The choice between SAS and R approaches often depends on familiarity and specific project requirements

Resource download links

1.4.13.-First-and-Last-Records-in-SAS-vs-R-equivalent.zip

⁂

1.4. Migrating from SAS to R: A Skill Conversion Guide