
2.5.2. Strings Regular Expressions

1. Introduction

Regular expressions (regexps) are patterns used to match, extract, or manipulate text within strings. In SDTM and clinical data, regexps are invaluable for cleaning, parsing, and validating string variables such as subject IDs, visit names, lab codes, and comments.


2. Why Use Regular Expressions?

  • Find patterns in strings (e.g., extract visit numbers, check for valid USUBJID format).
  • Clean and standardize text data.
  • Extract or replace specific information in SDTM domains.
  • Validate data and flag problems for reporting.

3. Basic Regular Expression Functions in stringr

All stringr functions for regexps start with str_. Key functions include:

  • str_view(): Display each string with its regex matches highlighted. In stringr 1.5.0 and later it highlights every match and, by default, prints only the strings that contain a match.
  • str_view_all(): Older function that highlights all matches in each string; deprecated since stringr 1.5.0 in favor of str_view().
  • str_count(): Count the number of times the regex matches within each string.
  • str_detect(): Determine whether the regex is found within each string (returns TRUE or FALSE).
  • str_subset(): Return only the strings that match the regex.
  • str_extract(): Return the first portion of each string that matches the regex (NA if there is no match).
  • str_replace(): Replace the first regex match in each string with something else.

4. Anchors: Start and End of String

  • ^ matches the start of a string.
  • $ matches the end of a string.

R Code:

library(stringr)
fruits <- c("Apple", "Banana", "Mango", "Melon", "Grape")

# Identify strings that start with "M"
str_view(fruits, "^M")
# Identify strings that end with "e"
str_view(fruits, "e$")
# Identify strings that end with "A" (case sensitive)
str_view(fruits, "A$")

Input Table:

fruits
"Apple"
"Banana"
"Mango"
"Melon"
"Grape"

Output:

> print(str_view(fruits, "^M"))  # Strings starting with "M"
[3] │ <M>ango
[4] │ <M>elon
> print(str_view(fruits, "e$"))  # Strings ending with "e"
[1] │ Appl<e>
[5] │ Grap<e>
> print(str_view(fruits, "A$"))
  • Explanation:
    • ^M matches fruits starting with "M".
    • e$ matches fruits ending with "e" ("Apple", "Grape").
    • A$ is case sensitive; none of the fruits end with uppercase "A".
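
As a small sketch of how the anchors interact with case, the example below reuses the fruits vector. It assumes stringr's regex() helper with ignore_case = TRUE for case-insensitive matching; the variable names are illustrative and not part of the original example.

```r
library(stringr)

fruits <- c("Apple", "Banana", "Mango", "Melon", "Grape")

# Case-sensitive: no fruit ends with an uppercase "A"
ends_upper_a <- str_detect(fruits, "A$")

# Case-insensitive: with ignore_case = TRUE, "A$" also matches a trailing "a"
ends_any_a <- str_detect(fruits, regex("A$", ignore_case = TRUE))

# Using both anchors requires the whole string to match, not just a part
exact_mango <- str_detect(fruits, "^Mango$")

print(ends_upper_a)
print(ends_any_a)
print(exact_mango)
```

Only "Banana" ends with an "a" of either case, and only "Mango" matches the fully anchored pattern.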

5. Counting Matches: str_count()

R Code:

str_count(fruits, "^M")   # Count fruits starting with "M"
str_count(fruits, "a")    # Count lowercase "a" in each fruit

Output Table:

fruits   str_count(fruits, "^M")   str_count(fruits, "a")
Apple    0                         0
Banana   0                         3
Mango    1                         1
Melon    1                         0
Grape    0                         1
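
Because str_count() returns one count per string, the result can be summarized with ordinary vector functions. The sketch below is illustrative only (the variable names are assumptions): it counts lowercase vowels with a character class and totals the anchored matches.

```r
library(stringr)

fruits <- c("Apple", "Banana", "Mango", "Melon", "Grape")

# Count lowercase vowels in each fruit (the uppercase "A" in "Apple" is not counted)
vowel_counts <- str_count(fruits, "[aeiou]")

# An anchored pattern matches at most once per string,
# so the sum is the number of fruits starting with "M"
n_start_m <- sum(str_count(fruits, "^M"))

print(vowel_counts)
print(n_start_m)
```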

6. Detecting Matches: str_detect()

R Code:

str_detect(fruits, "^M")

Output Table:

fruits   str_detect(fruits, "^M")
Apple    FALSE
Banana   FALSE
Mango    TRUE
Melon    TRUE
Grape    FALSE
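
The logical vector from str_detect() is often used for filtering and for quick summaries. The following is a minimal sketch under that assumption; the variable names are illustrative, and negate = TRUE is the str_detect() argument that inverts the result.

```r
library(stringr)

fruits <- c("Apple", "Banana", "Mango", "Melon", "Grape")

# TRUE/FALSE for each fruit
starts_m <- str_detect(fruits, "^M")

# Number and proportion of matching strings
n_matches <- sum(starts_m)
share_matches <- mean(starts_m)

# Use the logical vector to keep only the matching strings
matching_fruits <- fruits[starts_m]

# negate = TRUE flags the strings that do NOT match
not_m <- str_detect(fruits, "^M", negate = TRUE)

print(matching_fruits)
print(not_m)
```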

7. Subsetting Matches: str_subset()

R Code:

str_subset(fruits, "^M")

Output Table:

str_subset(fruits, "^M")
"Mango"
"Melon"

8. Extracting Matches: str_extract()

R Code:

str_extract(fruits, "^M")

Output Table:

fruits   str_extract(fruits, "^M")
Apple    NA
Banana   NA
Mango    "M"
Melon    "M"
Grape    NA
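
In clinical data, extraction is typically applied to something like a visit name. The sketch below uses a hypothetical visits vector (not from the original example) to show that str_extract() returns only the first match per string, while str_extract_all() returns every match as a list.

```r
library(stringr)

# Hypothetical visit names, for illustration only
visits <- c("SCREENING", "WEEK 2", "WEEK 12")

# First run of digits in each string; NA where there is none
visit_num_chr <- str_extract(visits, "\\d+")

# The extracted text is character, so convert it before doing arithmetic
visit_num <- as.integer(visit_num_chr)

# str_extract_all() returns a list with all matches for each input string
all_numbers <- str_extract_all("Visit 3, Day 15", "\\d+")

print(visit_num)
print(all_numbers[[1]])
```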

9. Replacing Matches: str_replace()

R Code:

str_replace(fruits, "^M", "?")

Output Table:

fruits   str_replace(fruits, "^M", "?")
Apple    "Apple"
Banana   "Banana"
Mango    "?ango"
Melon    "?elon"
Grape    "Grape"
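
Note that str_replace() changes only the first match in each string; str_replace_all() changes every match. A short sketch of the difference, with illustrative variable names and a made-up comment string for the whitespace cleanup:

```r
library(stringr)

fruits <- c("Apple", "Banana", "Mango", "Melon", "Grape")

# Only the first lowercase "a" in each string is replaced
first_only <- str_replace(fruits, "a", "?")

# Every lowercase "a" is replaced
every_match <- str_replace_all(fruits, "a", "?")

# Typical cleaning step: collapse runs of whitespace into a single space
cleaned <- str_replace_all("ALT   high", "\\s+", " ")

print(first_only)
print(every_match)
print(cleaned)
```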

10. Common Regular Expressions

  • Regular expressions allow you to match specific patterns in strings. Here are some of the most useful ones:
    • [aeiou]: Matches any single lowercase vowel.
    • [^aeiou]: Matches any character except lowercase vowels.
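    • \\d: Matches any digit (0-9). The regex itself is \d; the backslash is doubled because it must be escaped inside an R string literal.
    • \\s: Matches any whitespace character (space, tab, newline).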
    • .: Matches any character except newline.

R Code:

fruits <- c("Apple", "Banana", "Mango", "Melon", "Grape")
codes <- c("A123", "B456", "C789", "D000")

# Highlight vowels in fruits
str_view_all(fruits, "[aeiou]")

# Highlight non-vowels in fruits
str_view_all(fruits, "[^aeiou]")

# Highlight digits in codes
str_view_all(codes, "\\d")

# Highlight whitespace in codes (none here)
str_view_all(codes, "\\s")

# Highlight all characters in codes
str_view_all(codes, ".")

Input Table (fruits):

fruits
Apple
Banana
Mango
Melon
Grape

Output Table (str_view_all(fruits, "[aeiou]")):

[1] │ Appl<e>
[2] │ B<a>n<a>n<a>
[3] │ M<a>ng<o>
[4] │ M<e>l<o>n
[5] │ Gr<a>p<e>

Output Table (str_view_all(fruits, "[^aeiou]")):

[1] │ <A><p><p><l>e
[2] │ <B>a<n>a<n>a
[3] │ <M>a<n><g>o
[4] │ <M>e<l>o<n>
[5] │ <G><r>a<p>e

Output Table (str_view_all(codes, "\\d")):

[1] │ A<1><2><3>
[2] │ B<4><5><6>
[3] │ C<7><8><9>
[4] │ D<0><0><0>

Output Table (str_view_all(codes, "\\s")):

[1] │ A123
[2] │ B456
[3] │ C789
[4] │ D000

Output Table (str_view_all(codes, ".")):

[1] │ <A><1><2><3>
[2] │ <B><4><5><6>
[3] │ <C><7><8><9>
[4] │ <D><0><0><0>
  • Explanation:
    • [aeiou] highlights all vowels in each string.
    • [^aeiou] highlights all non-vowel characters.
    • \\d highlights each digit in codes as a separate match (e.g., "A123" → "1", "2", "3").
    • \\s highlights whitespace (none in codes).
    • . highlights every character.
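
Because . matches any character, matching a literal period requires escaping it. The sketch below uses a hypothetical values vector to contrast the two, and also shows \\D (any non-digit) as the complement of \\d. fixed() is stringr's helper for matching text literally; the variable names are illustrative.

```r
library(stringr)

# Hypothetical values, for illustration only
values <- c("1.5", "15", "A.1")

# "." matches any character, so every non-empty string matches
any_char <- str_detect(values, ".")

# "\\." matches only a literal period
literal_dot <- str_detect(values, "\\.")

# fixed() treats the pattern as plain text instead of a regex
literal_dot_fixed <- str_detect(values, fixed("."))

# "\\D" matches any non-digit; removing those leaves only the digits
digits_only <- str_replace_all(c("A123", "B 456"), "\\D", "")

print(literal_dot)
print(digits_only)
```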

11. Repetition in Regular Expressions

  • You can specify how many times a pattern should be matched:
    • ? : 0 or 1 times (optional)
    • + : 1 or more times (at least once)
    • * : 0 or more times (any number, including zero)
    • {n} : exactly n times
    • {n,} : n or more times
    • {n,m} : between n and m times

R Code:

codes <- c("A123", "B456", "C789", "D000")

# One or more "0"
str_view_all(codes, "0+")

# Exactly two "1" in a row
str_view_all(codes, "1{2}")

# "2" one or more times
str_view_all(codes, "2+")

# "9" one or two times
str_view_all(codes, "9{1,2}")

# "3" two or three times
str_view_all(codes, "3{2,3}")

Input Table (codes):

codes
A123
B456
C789
D000

Output:

> print(str_view_all(codes, "0+"))        # One or more "0"
[1] │ A123
[2] │ B456
[3] │ C789
[4] │ D<000>
> print(str_view_all(codes, "1{2}"))      # Exactly two "1"
[1] │ A123
[2] │ B456
[3] │ C789
[4] │ D000
> print(str_view_all(codes, "2+"))        # One or more "2"
[1] │ A1<2>3
[2] │ B456
[3] │ C789
[4] │ D000
> print(str_view_all(codes, "9{1,2}"))    # One or two "9"
[1] │ A123
[2] │ B456
[3] │ C78<9>
[4] │ D000
> print(str_view_all(codes, "3{2,3}"))    # Two or three "3"
[1] │ A123
[2] │ B456
[3] │ C789
[4] │ D000
  • Explanation:
    • "0+" matches one or more consecutive zeros.
    • "1{2}" matches exactly two consecutive ones (none in these codes).
    • "2+" matches one or more consecutive twos (A123 → "2").
    • "9{1,2}" matches one or two consecutive nines (C789 → "9").
    • "3{2,3}" matches two or three consecutive threes (none in these codes).
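
The ?, * and {n,} quantifiers are not exercised by the codes above. As a minimal sketch, the example below uses a hypothetical spellings vector for ? and *, and reuses codes to show that quantifiers are greedy, meaning they match as many characters as the pattern allows.

```r
library(stringr)

# Hypothetical strings, for illustration only
spellings <- c("color", "colour", "colouur")

# "u?" allows zero or one "u"
optional_u <- str_detect(spellings, "^colou?r$")

# "u*" allows any number of "u", including none
any_u <- str_detect(spellings, "^colou*r$")

codes <- c("A123", "B456", "C789", "D000")

# Exactly two digits: the first two digits of each code
two_digits <- str_extract(codes, "\\d{2}")

# Two or more digits: greedy, so all three digits are taken
two_or_more <- str_extract(codes, "\\d{2,}")

print(optional_u)
print(any_u)
print(two_digits)
print(two_or_more)
```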

12. Exploring Beyond the Basics

  • Regular expressions can be used for advanced string manipulation and validation:

    • Extracting numbers from strings:

      str_extract("WEEK 12", "\\d+")
      # Output: "12"
      
      • Extracts the first sequence of digits from the string.
    • Validating formats (e.g., USUBJID):

      usubjid <- c("01-701-101", "123-456-789", "A01-701-101")
      str_detect(usubjid, "^\\d{2}-\\d{3}-\\d{3}$")
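      # Output: TRUE FALSE FALSE
      
      • Checks whether each string matches an example subject ID format (two digits, dash, three digits, dash, three digits). "123-456-789" fails because it starts with three digits, and "A01-701-101" fails because of the leading letter. USUBJID formats vary by study, so adjust the pattern to your study's convention.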
    • Parsing lab codes or comments for specific patterns:

      comments <- c("ALT high", "AST normal", "HGB low")
      str_subset(comments, "high|low")
      # Output: "ALT high" "HGB low"
      
      • Returns comments containing "high" or "low".
    • Advanced replacements and cleaning:

      str_replace_all("Visit 001", "\\d+", "XXX")
      # Output: "Visit XXX"
      
      • Replaces all digit sequences with "XXX".
  • Explanation:

    • Regular expressions are flexible and powerful for extracting, validating, and cleaning string data in SDTM and clinical datasets.
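
A validation check is usually more useful when it reports which values fail. The sketch below is one possible way to do that, assuming the same example ID pattern as above; the data frame layout and variable names are illustrative, not a prescribed SDTM check.

```r
library(stringr)

usubjid <- c("01-701-101", "123-456-789", "A01-701-101")

# Example format only: two digits, dash, three digits, dash, three digits
id_pattern <- "^\\d{2}-\\d{3}-\\d{3}$"

# TRUE where the ID matches the expected format
is_valid <- str_detect(usubjid, id_pattern)

# Side-by-side listing of each ID and its validation result
id_check <- data.frame(usubjid = usubjid, valid = is_valid)

# negate = TRUE returns only the IDs that do NOT match the pattern
invalid_ids <- str_subset(usubjid, id_pattern, negate = TRUE)

print(id_check)
print(invalid_ids)
```

Here only "01-701-101" passes; the other two IDs are returned in invalid_ids for follow-up.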
