1.4. Migrating from SAS to R: A Skill Conversion Guide
1.4.13. First and Last Records in SAS vs R equivalent
1. Basic Approaches: Identifying First/Last Records by Group
| Capability | SAS (first./last. variables) | R (dplyr functions) |
|---|---|---|
| First record in group | first.groupvar temporary variable |
row_number() == 1 or dplyr::first() |
| Last record in group | last.groupvar temporary variable |
row_number() == n() or dplyr::last() |
| Multiple grouping | by var1 var2; if first.var2 then... |
group_by(var1, var2) %>% filter(row_number() == 1) |
| Required sorting | Must sort data before using first./last. | No explicit sorting required when using dplyr functions |
| Efficiency | Very efficient with SAS's sorted data model | Optimized through dplyr's group processing |
SAS Example: Basic First/Last Record Identification
/* Find first observation for each patient */
proc sort data=patients;
by patientid visitdate;
run;
data first_visits;
set patients;
by patientid;
if first.patientid;
run;
Explanation (SAS):
- Sorts the dataset by
patientidandvisitdateusingPROC SORT - In the DATA step, the
by patientidstatement establishes groups first.patientidis TRUE only for the first observation of each patient- The IF statement retains only records where
first.patientidis TRUE - Result is a dataset containing only the first visit for each patient
R Example: Basic First/Last Record Identification
library(dplyr)
# Dummy data for patients
patients <- data.frame(
patientid = c(101, 101, 101, 102, 102, 103, 103),
visitdate = as.Date(c("2023-01-15", "2023-02-20", "2023-03-25", "2023-01-10", "2023-03-05", "2023-02-01", "2023-02-28")),
bp_systolic = c(120, 118, 122, 130, 128, 115, 118),
bp_diastolic = c(80, 78, 82, 85, 84, 75, 76),
weight = c(70.5, 71.2, 70.8, 65.4, 64.9, 80.2, 79.5)
)
# Find first observation for each patient
first_visits <- patients %>%
group_by(patientid) %>%
arrange(patientid, visitdate) %>%
filter(row_number() == 1)
# Alternative approach
first_visits_alt <- patients %>%
group_by(patientid) %>%
slice_min(order_by = visitdate, n = 1)
Explanation (R):
- Uses the dplyr package for data manipulation
group_by(patientid)establishes patient groupsarrange(patientid, visitdate)sorts data within each groupfilter(row_number() == 1)keeps only the first record in each group- Alternative approach uses
slice_min()to directly select the minimum visitdate - No temporary variables needed; operations are performed in a pipeline
Input: Patient Visits Dataset
| patientid | visitdate | bp_systolic | bp_diastolic | weight |
|---|---|---|---|---|
| 101 | 2023-01-15 | 120 | 80 | 70.5 |
| 101 | 2023-02-20 | 118 | 78 | 71.2 |
| 101 | 2023-03-25 | 122 | 82 | 70.8 |
| 102 | 2023-01-10 | 130 | 85 | 65.4 |
| 102 | 2023-03-05 | 128 | 84 | 64.9 |
| 103 | 2023-02-01 | 115 | 75 | 80.2 |
| 103 | 2023-02-28 | 118 | 76 | 79.5 |
Expected Output: First Visit for Each Patient
| patientid | visitdate | bp_systolic | bp_diastolic | weight |
|---|---|---|---|---|
| 101 | 2023-01-15 | 120 | 80 | 70.5 |
| 102 | 2023-01-10 | 130 | 85 | 65.4 |
| 103 | 2023-02-01 | 115 | 75 | 80.2 |
2. Last Records by Group: Identifying Final Observations
SAS Example: Last Record by Group
/* Find last observation for each patient */
proc sort data=patients;
by patientid visitdate;
run;
data last_visits;
set patients;
by patientid;
if last.patientid;
run;
Explanation (SAS):
- Same sorting approach as for first records
- Uses
last.patientidto identify the last observation in each patient group - Retains only records where
last.patientidis TRUE - Result is a dataset with only the most recent visit for each patient
R Example: Last Record by Group
# Find last observation for each patient
last_visits <- patients %>%
group_by(patientid) %>%
arrange(patientid, visitdate) %>%
filter(row_number() == n())
# Alternative approach
last_visits_alt <- patients %>%
group_by(patientid) %>%
slice_max(order_by = visitdate, n = 1)
Explanation (R):
- Similar approach to finding first records
filter(row_number() == n())keeps the last record in each groupn()returns the number of rows in the current group
- Alternative approach uses
slice_max()to select the record with maximum visitdate - Both approaches give the same result: one record per patient with the most recent visit
Expected Output: Last Visit for Each Patient
| patientid | visitdate | bp_systolic | bp_diastolic | weight |
|---|---|---|---|---|
| 101 | 2023-03-25 | 122 | 82 | 70.8 |
| 102 | 2023-03-05 | 128 | 84 | 64.9 |
| 103 | 2023-02-28 | 118 | 76 | 79.5 |
3. Multiple Grouping Variables: Hierarchical First/Last Identification
SAS Example: Hierarchical First/Last with Multiple Groups
/* Find first BP measurement for each visit for each patient */
proc sort data=vital_signs;
by patientid visitdate measuretime;
run;
data first_bp_per_visit;
set vital_signs;
by patientid visitdate;
if first.visitdate and parameter = "BP";
run;
Explanation (SAS):
- Sorts by multiple variables:
patientid,visitdate,measuretime - The
by patientid visitdatestatement creates hierarchical grouping first.visitdateidentifies the first record within each patient-visit combination- Additional condition
parameter = "BP"ensures only BP measurements are selected - Result is first BP measurement for each visit for each patient
R Example: Hierarchical First/Last with Multiple Groups
# Find first BP measurement for each visit for each patient
first_bp_per_visit <- vital_signs %>%
filter(parameter == "BP") %>%
group_by(patientid, visitdate) %>%
arrange(patientid, visitdate, measuretime) %>%
filter(row_number() == 1)
# Alternative approach using slice
first_bp_per_visit_alt <- vital_signs %>%
filter(parameter == "BP") %>%
group_by(patientid, visitdate) %>%
slice_min(order_by = measuretime, n = 1)
Explanation (R):
- Filters for BP measurements first
- Groups by both
patientidandvisitdate - Arranges by all relevant variables including
measuretime - Uses
filter(row_number() == 1)to keep the first record in each group - Alternative approach uses
slice_min()for more concise code
Input: Vital Signs Dataset
| patientid | visitdate | measuretime | parameter | value |
|---|---|---|---|---|
| 101 | 2023-01-15 | 09:15 | BP | 120/80 |
| 101 | 2023-01-15 | 09:30 | BP | 118/79 |
| 101 | 2023-01-15 | 09:15 | TEMP | 37.2 |
| 101 | 2023-02-20 | 10:00 | BP | 118/78 |
| 101 | 2023-02-20 | 10:20 | BP | 119/79 |
| 102 | 2023-01-10 | 11:00 | BP | 130/85 |
| 102 | 2023-01-10 | 11:30 | TEMP | 36.8 |
Expected Output: First BP per Visit
| patientid | visitdate | measuretime | parameter | value |
|---|---|---|---|---|
| 101 | 2023-01-15 | 09:15 | BP | 120/80 |
| 101 | 2023-02-20 | 10:00 | BP | 118/78 |
| 102 | 2023-01-10 | 11:00 | BP | 130/85 |
4. Creating Flags for First/Last Records
SAS Example: Creating First/Last Flags
/* Create flags for first and last visits while keeping all records */
proc sort data=patients;
by patientid visitdate;
run;
data patients_with_flags;
set patients;
by patientid;
first_visit = first.patientid;
last_visit = last.patientid;
run;
Explanation (SAS):
- Creates new variables
first_visitandlast_visitas flags - These variables will be 1 (TRUE) or 0 (FALSE) based on the first./last. values
- All records are kept, but now include flags identifying first and last visits
R Example: Creating First/Last Flags
# Create flags for first and last visits
patients_with_flags <- patients %>%
group_by(patientid) %>%
arrange(patientid, visitdate) %>%
mutate(
first_visit = row_number() == 1,
last_visit = row_number() == n()
)
Explanation (R):
- Uses
mutate()to add new variables without filtering any records first_visitis TRUE when the row is first in its group, FALSE otherwiselast_visitis TRUE when the row is last in its group, FALSE otherwise- All records are retained with appropriate flags
Expected Output with Flags
| patientid | visitdate | bp_systolic | bp_diastolic | weight | first_visit | last_visit |
|---|---|---|---|---|---|---|
| 101 | 2023-01-15 | 120 | 80 | 70.5 | TRUE | FALSE |
| 101 | 2023-02-20 | 118 | 78 | 71.2 | FALSE | FALSE |
| 101 | 2023-03-25 | 122 | 82 | 70.8 | FALSE | TRUE |
| 102 | 2023-01-10 | 130 | 85 | 65.4 | TRUE | FALSE |
| 102 | 2023-03-05 | 128 | 84 | 64.9 | FALSE | TRUE |
| 103 | 2023-02-01 | 115 | 75 | 80.2 | TRUE | FALSE |
| 103 | 2023-02-28 | 118 | 76 | 79.5 | FALSE | TRUE |
5. Beyond Basics: Finding Records Based on Complex Criteria
| Capability | SAS Approach | R Approach |
|---|---|---|
| First record meeting condition | if condition and first.group then... |
filter(condition) %>% group_by(var) %>% slice_head() |
| Last record before event | Requires multiple steps, sort and first./last. | filter(date < event_date) %>% slice_max(date) |
| Nth record in group | Custom logic with counters | slice(n) or filter(row_number() == n) |
| First non-missing record | if not(missing(var)) and first.group then... |
filter(!is.na(var)) %>% slice_head() |
SAS Example: First Non-Missing Lab Value
/* Find first non-missing lab value for each patient and test */
proc sort data=lab_results;
by patientid testcode result_dt;
run;
data first_valid_result;
set lab_results;
by patientid testcode;
if not missing(result) and first.testcode then output;
run;
Explanation (SAS):
- Sorts by
patientid,testcode, andresult_dt - Checks for first occurrence by test within each patient where result is not missing
- Only outputs records meeting both conditions
- Result is first valid lab value for each patient-test combination
R Example: First Non-Missing Lab Value
# Find first non-missing lab value for each patient and test
first_valid_result <- lab_results %>%
filter(!is.na(result)) %>%
group_by(patientid, testcode) %>%
arrange(patientid, testcode, result_dt) %>%
slice_head(n = 1)
# Alternative approach
first_valid_result_alt <- lab_results %>%
group_by(patientid, testcode) %>%
arrange(patientid, testcode, result_dt) %>%
filter(!is.na(result)) %>%
slice_head(n = 1)
Explanation (R):
- Two valid approaches with subtle differences:
- First approach filters out missing values before grouping
- Second approach filters after grouping, which preserves groups even if all values are missing
- Both use
slice_head()to take the first record in each group - Result is first non-missing lab value for each test for each patient
Input: Lab Results Dataset
| patientid | testcode | result_dt | result | unit |
|---|---|---|---|---|
| 101 | GLU | 2023-01-15 | NA | mg/dL |
| 101 | GLU | 2023-02-20 | 95 | mg/dL |
| 101 | HGB | 2023-01-15 | 14.2 | g/dL |
| 101 | HGB | 2023-02-20 | 14.0 | g/dL |
| 102 | GLU | 2023-01-10 | 110 | mg/dL |
| 102 | HGB | 2023-01-10 | NA | g/dL |
| 102 | HGB | 2023-03-05 | 13.5 | g/dL |
Expected Output: First Non-Missing Results
| patientid | testcode | result_dt | result | unit |
|---|---|---|---|---|
| 101 | GLU | 2023-02-20 | 95 | mg/dL |
| 101 | HGB | 2023-01-15 | 14.2 | g/dL |
| 102 | GLU | 2023-01-10 | 110 | mg/dL |
| 102 | HGB | 2023-03-05 | 13.5 | g/dL |
6. Advanced Techniques: Windowed First/Last Operations
SAS Example: First Record in Rolling Window
/* Find first record within 30 days before each treatment */
proc sort data=medical_events;
by patientid event_date;
run;
data baseline_events;
set medical_events;
by patientid;
retain treatment_date;
if event_type = "TREATMENT" then do;
treatment_date = event_date;
output;
end;
else if event_type = "ASSESSMENT" and
0 <= (treatment_date - event_date) <= 30 then do;
is_baseline = 1;
if first.patientid or first_baseline then do;
first_baseline = 1;
output;
end;
end;
run;
Explanation (SAS):
- Uses retention and first. variables to track the first assessment within 30 days before each treatment
- Complex logic with multiple conditions
- Requires careful handling of date comparisons and group processing
R Example: First Record in Rolling Window
# Find first assessment within 30 days before each treatment
baseline_events <- medical_events %>%
filter(event_type == "TREATMENT") %>%
select(patientid, treatment_date = event_date) %>%
inner_join(
medical_events %>% filter(event_type == "ASSESSMENT"),
by = "patientid"
) %>%
filter(
event_date <= treatment_date,
event_date >= treatment_date - 30
) %>%
group_by(patientid, treatment_date) %>%
arrange(patientid, treatment_date, desc(event_date)) %>%
slice_head(n = 1) %>%
mutate(is_baseline = 1)
Explanation (R):
- Uses join operations to connect treatments with assessments
- Filters for assessments within the 30-day window before treatment
- Groups by patient and treatment date
- Selects the most recent assessment in each window with
slice_head() - Creates clearer, declarative code despite complex logic
Input: Medical Events Dataset
| patientid | event_type | event_date | value |
|---|---|---|---|
| 101 | ASSESSMENT | 2023-01-05 | 45 |
| 101 | ASSESSMENT | 2023-01-25 | 50 |
| 101 | TREATMENT | 2023-02-01 | NA |
| 101 | ASSESSMENT | 2023-02-15 | 40 |
| 102 | ASSESSMENT | 2023-01-01 | 60 |
| 102 | ASSESSMENT | 2023-01-15 | 55 |
| 102 | TREATMENT | 2023-02-01 | NA |
| 102 | ASSESSMENT | 2023-03-01 | 50 |
Expected Output: Baseline Assessments
| patientid | treatment_date | event_type | event_date | value | is_baseline |
|---|---|---|---|---|---|
| 101 | 2023-02-01 | ASSESSMENT | 2023-01-25 | 50 | 1 |
| 102 | 2023-02-01 | ASSESSMENT | 2023-01-15 | 55 | 1 |
7. Beyond Basics: Extreme Values Per Group
SAS Example: Min and Max Values with First/Last
/* Find both minimum and maximum lab values for each patient-test */
proc sort data=lab_results;
by patientid testcode result;
run;
data min_max_results;
set lab_results;
by patientid testcode result;
if first.testcode then min_flag = "Y";
if last.testcode then max_flag = "Y";
run;
Explanation (SAS):
- Sorts by
patientid,testcode, and the actualresultvalue - Sets a flag for minimum (first) and maximum (last) values
- Logical approach leveraging first./last. after appropriate sorting
R Example: Min and Max Values
# Find both minimum and maximum lab values for each patient-test
min_max_results <- lab_results %>%
filter(!is.na(result)) %>%
group_by(patientid, testcode) %>%
mutate(
min_flag = if_else(result == min(result), "Y", NA_character_),
max_flag = if_else(result == max(result), "Y", NA_character_)
)
# Alternative approach showing exact first/last record logic
min_max_results_alt <- lab_results %>%
filter(!is.na(result)) %>%
group_by(patientid, testcode) %>%
arrange(patientid, testcode, result) %>%
mutate(
min_flag = if_else(row_number() == 1, "Y", NA_character_),
max_flag = if_else(row_number() == n(), "Y", NA_character_)
)
Explanation (R):
- First approach uses
min()andmax()directly to identify extreme values - Second approach follows SAS logic more directly with sorting and row positions
- Both create flags for minimum and maximum values
- Demonstrates how R can solve the problem in multiple ways
Input: Lab Results Dataset
| patientid | testcode | result_dt | result | unit |
|---|---|---|---|---|
| 101 | GLU | 2023-01-15 | 90 | mg/dL |
| 101 | GLU | 2023-02-20 | 95 | mg/dL |
| 101 | GLU | 2023-03-25 | 88 | mg/dL |
| 101 | HGB | 2023-01-15 | 14.2 | g/dL |
| 101 | HGB | 2023-02-20 | 14.0 | g/dL |
| 102 | GLU | 2023-01-10 | 110 | mg/dL |
| 102 | GLU | 2023-03-05 | 105 | mg/dL |
Expected Output: Min and Max Flagged
| patientid | testcode | result_dt | result | unit | min_flag | max_flag |
|---|---|---|---|---|---|---|
| 101 | GLU | 2023-01-15 | 90 | mg/dL | ||
| 101 | GLU | 2023-02-20 | 95 | mg/dL | Y | |
| 101 | GLU | 2023-03-25 | 88 | mg/dL | Y | |
| 101 | HGB | 2023-01-15 | 14.2 | g/dL | Y | |
| 101 | HGB | 2023-02-20 | 14.0 | g/dL | Y | |
| 102 | GLU | 2023-01-10 | 110 | mg/dL | Y | |
| 102 | GLU | 2023-03-05 | 105 | mg/dL | Y |
8. Summary: SAS First/Last vs R Equivalents
| Feature | SAS First/Last | R Approach |
|---|---|---|
| Syntax complexity | Simple once understood, requires sorting | Slightly more verbose, but more flexible |
| Performance | Very efficient for large datasets | Efficient with grouped operations |
| Integration with language | Tightly integrated with DATA step processing | Fits well within tidyverse data manipulation patterns |
| Multiple grouping variables | Supports hierarchical first/last variables | Supports multiple grouping levels |
| Complex conditions | Combines with other DATA step logic | Combines well with other dplyr functions |
| Sorting requirements | Must sort data correctly before using | Explicitly sort with arrange() when order matters |
| Readability | Concise but requires understanding of implied logic | More explicit about operations being performed |
Key Takeaways:
- SAS first./last. variables provide a powerful way to identify boundary records after sorting
- R's dplyr functions offer equivalent functionality with
row_number(),slice_head(),slice_tail()and related functions - Both approaches require understanding the underlying data structure and proper grouping
- R often separates the sorting step from the selection step, making logic more explicit
- For complex scenarios, R may require more planning but offers more flexibility
- Both approaches can handle advanced tasks like windowed operations and conditional selections
- The choice between SAS and R approaches often depends on familiarity and specific project requirements
**Resource download links**
1.4.13.-First-and-Last-Records-in-SAS-vs-R-equivalent.zip