2.5.2. Strings: Regular Expressions
1. Introduction
Regular expressions (regexps) are patterns used to match, extract, or manipulate text within strings. In SDTM and clinical data, regexps are invaluable for cleaning, parsing, and validating string variables such as subject IDs, visit names, lab codes, and comments.
2. Why Use Regular Expressions?
- Find patterns in strings (e.g., extract visit numbers, check for valid USUBJID format).
- Clean and standardize text data.
- Extract or replace specific information in SDTM domains.
- Validate data and support reporting checks.
3. Basic Regular Expression Functions in stringr
All stringr functions for regexps start with str_. Key functions include:
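- str_view(): Show where the regex matches in each string, with matches marked by <>. In stringr 1.5.0 and later it highlights every match and, by default, prints only the strings that contain a match.
- str_view_all(): Show all matches in every string, including strings with no match. Deprecated since stringr 1.5.0 in favor of str_view().
- str_count(): Count the number of times the regex matches within each string.
- str_detect(): Return TRUE or FALSE depending on whether the regex is found in each string.
- str_subset(): Return only the strings that match the regex.
- str_extract(): Return the first portion of each string that matches the regex (NA if there is no match).
- str_replace(): Replace the first match in each string with something else.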
4. Anchors: Start and End of String
- ^ matches the start of a string.
- $ matches the end of a string.
R Code:
library(stringr)
fruits <- c("Apple", "Banana", "Mango", "Melon", "Grape")
# Identify strings that start with "M"
str_view(fruits, "^M")
# Identify strings that end with "e"
str_view(fruits, "e$")
# Identify strings that end with "A" (case sensitive)
str_view(fruits, "A$")
Input Table:
| fruits |
|---|
| "Apple" |
| "Banana" |
| "Mango" |
| "Melon" |
| "Grape" |
Output:
> print(str_view(fruits, "^M")) # Strings starting with "M"
[3] │ <M>ango
[4] │ <M>elon
> print(str_view(fruits, "e$")) # Strings ending with "e"
[1] │ Appl<e>
[5] │ Grap<e>
> print(str_view(fruits, "A$"))
Explanation:
- ^M matches fruits starting with "M".
- e$ matches fruits ending with "e" ("Apple", "Grape").
- A$ is case sensitive; none of the fruits end with uppercase "A".
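Using both anchors in one pattern requires the whole string to match, not just a part of it. The following is a minimal sketch of that idea; the visit labels are hypothetical values invented for illustration, not taken from a real dataset.

```r
library(stringr)

# Hypothetical visit labels, invented for illustration
visits <- c("SCREENING", "WEEK 2", "WEEK 12", "UNSCHEDULED WEEK 2")

# "^WEEK" only requires the label to begin with "WEEK"
starts_with_week <- str_detect(visits, "^WEEK")

# "^WEEK 2$" requires the entire label to be exactly "WEEK 2"
exactly_week_2 <- str_detect(visits, "^WEEK 2$")

print(starts_with_week)  # FALSE TRUE TRUE FALSE
print(exactly_week_2)    # FALSE TRUE FALSE FALSE
```

Without the anchors, the pattern "WEEK 2" would also match "UNSCHEDULED WEEK 2", which is usually not what a validation check intends.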
5. Counting Matches: str_count()
R Code:
str_count(fruits, "^M") # Count fruits starting with "M"
str_count(fruits, "a") # Count lowercase "a" in each fruit
Output Table:
| fruits | str_count("^M") | str_count("a") |
|---|---|---|
| Apple | 0 | 0 |
| Banana | 0 | 3 |
| Mango | 1 | 1 |
| Melon | 1 | 0 |
| Grape | 0 | 1 |
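
"Apple" scores 0 in the last column because the pattern "a" is case sensitive. As a minimal sketch, wrapping the pattern in stringr's regex() helper with ignore_case = TRUE counts both cases; the variable names below are illustrative only.

```r
library(stringr)

fruits <- c("Apple", "Banana", "Mango", "Melon", "Grape")

# Case-sensitive count: the uppercase "A" in "Apple" is not counted
lower_a <- str_count(fruits, "a")

# Case-insensitive count: regex() changes how the pattern is matched
any_a <- str_count(fruits, regex("a", ignore_case = TRUE))

print(lower_a)  # 0 3 1 0 1
print(any_a)    # 1 3 1 0 1
```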
6. Detecting Matches: str_detect()
R Code:
str_detect(fruits, "^M")
Output Table:
| fruits | str_detect("^M") |
|---|---|
| Apple | FALSE |
| Banana | FALSE |
| Mango | TRUE |
| Melon | TRUE |
| Grape | FALSE |
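
Because str_detect() returns a logical vector, its result can be summarized or used as a row filter. The sketch below assumes a small data frame named fruit_table, which is a made-up name for illustration.

```r
library(stringr)

fruits <- c("Apple", "Banana", "Mango", "Melon", "Grape")
starts_with_m <- str_detect(fruits, "^M")

# TRUE counts as 1, so sum() gives the number of matching strings
n_match <- sum(starts_with_m)

# mean() gives the proportion of matching strings
prop_match <- mean(starts_with_m)

# Keep only the rows whose value matches the pattern
fruit_table <- data.frame(fruit = fruits)
m_rows <- fruit_table[starts_with_m, , drop = FALSE]

print(n_match)     # 2
print(prop_match)  # 0.4
print(m_rows)
```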
7. Subsetting Matches: str_subset()
R Code:
str_subset(fruits, "^M")
Output Table:
| str_subset("^M") |
|---|
| "Mango" |
| "Melon" |
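
str_subset() behaves like indexing the vector with the result of str_detect(). As a short sketch, the negate argument (available in recent stringr versions) returns the strings that do not match instead.

```r
library(stringr)

fruits <- c("Apple", "Banana", "Mango", "Melon", "Grape")

# Strings that match the pattern
with_m <- str_subset(fruits, "^M")

# The same result, written with str_detect() and indexing
with_m_indexed <- fruits[str_detect(fruits, "^M")]

# Strings that do NOT match the pattern
without_m <- str_subset(fruits, "^M", negate = TRUE)

print(with_m)     # "Mango" "Melon"
print(without_m)  # "Apple" "Banana" "Grape"
```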
8. Extracting Matches: str_extract()
R Code:
str_extract(fruits, "^M")
Output Table:
| fruits | str_extract("^M") |
|---|---|
| Apple | NA |
| Banana | NA |
| Mango | "M" |
| Melon | "M" |
| Grape | NA |
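
str_extract() returns only the first match in each string. When every match is needed, str_extract_all() returns a list with one character vector per input string. A minimal sketch using the same fruits vector:

```r
library(stringr)

fruits <- c("Apple", "Banana", "Mango", "Melon", "Grape")

# First lowercase vowel in each string
first_vowel <- str_extract(fruits, "[aeiou]")

# All lowercase vowels in each string, returned as a list
all_vowels <- str_extract_all(fruits, "[aeiou]")

print(first_vowel)      # "e" "a" "a" "e" "a"
print(all_vowels[[2]])  # "a" "a" "a"  (vowels in "Banana")
```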
9. Replacing Matches: str_replace()
R Code:
str_replace(fruits, "^M", "?")
Output Table:
| fruits | str_replace("^M", "?") |
|---|---|
| Apple | "Apple" |
| Banana | "Banana" |
| Mango | "?ango" |
| Melon | "?elon" |
| Grape | "Grape" |
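
str_replace() changes only the first match in each string, while str_replace_all() changes every match. The difference does not show with an anchored pattern such as "^M", so this sketch uses the unanchored pattern "a" instead.

```r
library(stringr)

fruits <- c("Apple", "Banana", "Mango", "Melon", "Grape")

# Only the first lowercase "a" in each string is replaced
first_only <- str_replace(fruits, "a", "?")

# Every lowercase "a" in each string is replaced
every_match <- str_replace_all(fruits, "a", "?")

print(first_only)   # "Apple" "B?nana" "M?ngo" "Melon" "Gr?pe"
print(every_match)  # "Apple" "B?n?n?" "M?ngo" "Melon" "Gr?pe"
```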
10. Common Regular Expressions
Regular expressions allow you to match specific patterns in strings. Some of the most useful building blocks are listed below. Inside an R string the backslash itself must be escaped, so the regex \d is written "\\d".
- [aeiou]: Matches any single lowercase vowel.
- [^aeiou]: Matches any single character except a lowercase vowel.
- \\d: Matches any digit (0-9).
- \\s: Matches any whitespace character (space, tab, newline).
- .: Matches any character except a newline.
R Code:
fruits <- c("Apple", "Banana", "Mango", "Melon", "Grape")
codes <- c("A123", "B456", "C789", "D000")
# Highlight vowels in fruits
str_view_all(fruits, "[aeiou]")
# Highlight non-vowels in fruits
str_view_all(fruits, "[^aeiou]")
# Highlight digits in codes
str_view_all(codes, "\\d")
# Highlight whitespace in codes (none here)
str_view_all(codes, "\\s")
# Highlight all characters in codes
str_view_all(codes, ".")
Input Table (fruits):
| fruits |
|---|
| Apple |
| Banana |
| Mango |
| Melon |
| Grape |
Output Table (str_view_all(fruits, "[aeiou]")):
[1] │ Appl<e>
[2] │ B<a>n<a>n<a>
[3] │ M<a>ng<o>
[4] │ M<e>l<o>n
[5] │ Gr<a>p<e>
Output Table (str_view_all(fruits, "[^aeiou]")):
[1] │ <A><p><p><l>e
[2] │ <B>a<n>a<n>a
[3] │ <M>a<n><g>o
[4] │ <M>e<l>o<n>
[5] │ <G><r>a<p>e
Output Table (str_view_all(codes, "\\d")):
[1] │ A<1><2><3>
[2] │ B<4><5><6>
[3] │ C<7><8><9>
[4] │ D<0><0><0>
Output Table (str_view_all(codes, "\\s")):
[1] │ A123
[2] │ B456
[3] │ C789
[4] │ D000
Output Table (str_view_all(codes, ".")):
[1] │ <A><1><2><3>
[2] │ <B><4><5><6>
[3] │ <C><7><8><9>
[4] │ <D><0><0><0>
Explanation:
- [aeiou] highlights every lowercase vowel in each string.
- [^aeiou] highlights every character that is not a lowercase vowel, including the uppercase "A" in "Apple".
- \\d highlights each digit in codes individually (e.g., "A123" → <1><2><3>).
- \\s highlights whitespace (none in codes).
- . highlights every character.
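One consequence of . matching any character is that a literal period has to be escaped as "\\." (or the pattern wrapped in stringr's fixed() helper). The sketch below illustrates the difference; the values vector is made up for this example.

```r
library(stringr)

# Hypothetical result strings, invented for illustration
values <- c("5.2", "512", "5-2")

# Unescaped dot: any character may sit between the 5 and the 2
any_char <- str_detect(values, "5.2")

# Escaped dot: only a literal period is accepted
literal_dot <- str_detect(values, "5\\.2")

# fixed() treats the whole pattern as plain text, not as a regex
fixed_dot <- str_detect(values, fixed("5.2"))

print(any_char)     # TRUE TRUE TRUE
print(literal_dot)  # TRUE FALSE FALSE
print(fixed_dot)    # TRUE FALSE FALSE
```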
11. Repetition in Regular Expressions
You can specify how many times a pattern should be matched:
- ?: 0 or 1 times (optional)
- +: 1 or more times (at least once)
- *: 0 or more times (any number, including zero)
- {n}: exactly n times
- {n,}: n or more times
- {n,m}: between n and m times
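As a minimal sketch of the ?, * and {n,} quantifiers, the patterns below are applied to hypothetical visit labels that differ only in how many spaces they contain. The labels and variable names are assumptions made for illustration.

```r
library(stringr)

# Hypothetical visit labels that differ only in spacing
visits <- c("WEEK2", "WEEK 2", "WEEK  2")

# \\s? allows zero or one whitespace character
optional_space <- str_detect(visits, "^WEEK\\s?2$")

# \\s* allows any number of whitespace characters, including none
any_spaces <- str_detect(visits, "^WEEK\\s*2$")

# \\s{2,} requires at least two whitespace characters
two_or_more <- str_detect(visits, "^WEEK\\s{2,}2$")

print(optional_space)  # TRUE TRUE FALSE
print(any_spaces)      # TRUE TRUE TRUE
print(two_or_more)     # FALSE FALSE TRUE
```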
R Code:
codes <- c("A123", "B456", "C789", "D000")
# One or more "0"
str_view_all(codes, "0+")
# Exactly two "1" in a row
str_view_all(codes, "1{2}")
# "2" one or more times
str_view_all(codes, "2+")
# "9" one or two times
str_view_all(codes, "9{1,2}")
# "3" two or three times
str_view_all(codes, "3{2,3}")
Input Table (codes):
| codes |
|---|
| A123 |
| B456 |
| C789 |
| D000 |
Output:
> print(str_view_all(codes, "0+")) # One or more "0"
[1] │ A123
[2] │ B456
[3] │ C789
[4] │ D<000>
> print(str_view_all(codes, "1{2}")) # Exactly two "1"
[1] │ A123
[2] │ B456
[3] │ C789
[4] │ D000
> print(str_view_all(codes, "2+")) # One or more "2"
[1] │ A1<2>3
[2] │ B456
[3] │ C789
[4] │ D000
> print(str_view_all(codes, "9{1,2}")) # One or two "9"
[1] │ A123
[2] │ B456
[3] │ C78<9>
[4] │ D000
> print(str_view_all(codes, "3{2,3}")) # Two or three "3"
[1] │ A123
[2] │ B456
[3] │ C789
[4] │ D000
Explanation:
- "0+" matches one or more consecutive zeros (D000 → "000").
- "1{2}" matches exactly two consecutive ones (none in these codes).
- "2+" matches one or more consecutive twos (A123 → "2").
- "9{1,2}" matches one or two consecutive nines (C789 → "9").
- "3{2,3}" matches two or three consecutive threes (none in these codes).
12. Exploring Beyond the Basics
Regular expressions can be used for advanced string manipulation and validation:
Extracting numbers from strings:
str_extract("WEEK 12", "\\d+") # Output: "12"
- Extracts the first sequence of digits from the string.
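The extracted value is still a character string. A possible next step, sketched here with hypothetical visit labels, is converting it to an integer so that it can be sorted or compared numerically. Labels without digits yield NA.

```r
library(stringr)

# Hypothetical visit labels, invented for illustration
visits <- c("SCREENING", "WEEK 2", "WEEK 12")

# Extract the first run of digits, then convert the text to integer
visit_num <- as.integer(str_extract(visits, "\\d+"))

print(visit_num)  # NA 2 12
```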
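Validating formats (e.g., USUBJID):
usubjid <- c("01-701-101", "123-456-789", "A01-701-101")
str_detect(usubjid, "^\\d{2}-\\d{3}-\\d{3}$") # Output: TRUE FALSE FALSE
- Checks whether each string matches an example subject ID format (two digits, dash, three digits, dash, three digits). "123-456-789" fails because it starts with three digits, and "A01-701-101" fails because it starts with a letter. The actual USUBJID format is study specific, so adapt the pattern to your study's convention.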
Parsing lab codes or comments for specific patterns:
comments <- c("ALT high", "AST normal", "HGB low")
str_subset(comments, "high|low") # Output: "ALT high" "HGB low"
- Returns comments containing "high" or "low".
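A plain alternation such as "high|low" also matches inside longer words. The sketch below shows one way to restrict it to whole words with the \\b word-boundary anchor and a group; the extra comment "ALT below range" is a made-up value added to show the difference.

```r
library(stringr)

# Hypothetical comments; "below" contains the letters "low"
comments <- c("ALT high", "AST normal", "HGB low", "ALT below range")

# Plain alternation: also picks up "low" inside "below"
loose <- str_subset(comments, "high|low")

# Word boundaries: "high" or "low" must stand alone as a word
strict <- str_subset(comments, "\\b(high|low)\\b")

print(loose)   # "ALT high" "HGB low" "ALT below range"
print(strict)  # "ALT high" "HGB low"
```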
Advanced replacements and cleaning:
str_replace_all("Visit 001", "\\d+", "XXX") # Output: "Visit XXX"
- Replaces every digit sequence with "XXX".
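Another common cleaning step is normalizing whitespace. As a minimal sketch, runs of whitespace can be collapsed with "\\s+" and the ends trimmed with str_trim(); the raw values are invented for illustration.

```r
library(stringr)

# Hypothetical free-text values with irregular spacing
raw <- c("  ALT   high", "AST normal  ", "HGB\tlow")

# Collapse every run of whitespace into a single space
collapsed <- str_replace_all(raw, "\\s+", " ")

# Remove leading and trailing whitespace
cleaned <- str_trim(collapsed)

print(cleaned)  # "ALT high" "AST normal" "HGB low"
```

stringr also provides str_squish(), which performs both steps in one call and should give the same result here.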
Explanation:
- Regular expressions are flexible and powerful for extracting, validating, and cleaning string data in SDTM and clinical datasets.