contact@a2zlearners.com

1. R Fundamentals

1.2 Getting Started with R Programming

  • Introduction to R Programming:
    • R is a language designed for statistical computing and graphics.
    • Understanding its data structures is essential for efficient data analysis and manipulation.

1.2.2 Data Structures (Objects) in R

  • What are Data Structures in R?
    • Data structures are ways to organize and store data for efficient access and modification.
    • R provides several built-in structures, each suited for different types of data and operations.

Overview of Core Structures

  • Dimensionality vs. Homogeneity:
    • R classifies data structures by their number of dimensions (1D, 2D, ND) and whether they hold only one type of data (homogeneous) or multiple types (heterogeneous).
    • The chart below visually compares these structures.

R Data Structures Comparison Chart

R Data Structures Comparison Chart

  • 1-D homogeneous – atomic vectors (numeric, character, logical, integer, complex)
  • 1-D heterogeneous – lists (can nest any object)
  • 2-D homogeneous – matrices (all elements same type)
  • 2-D heterogeneous – data frames & tibbles (mixed column types)
  • N-D homogeneous – arrays (3-D, 4-D, multi-dimensional)
  • Factors – specialized 1-D categorical vectors with predefined levels

R Data Structures Comparison Chart2


1. Vectors

  • Vectors in R:
    • The most basic data structure in R, holding elements of the same type (numeric, character, logical, etc.).
    • All operations in R are vectorized, meaning they apply to entire vectors at once.

Creating and Basic Operations

  • Creating Vectors:
    • c() combines values into a vector.
    • Named vectors allow you to assign names to elements for easier access.
    • The : operator creates sequences.
# Multiple creation methods
sales_data <- c(1200, 1500, 980, 2100, 1800) # Numeric vector
quarterly_results <- c(Q1=25000, Q2=31000, Q3=28000, Q4=35000) # Named vector
temperature_range <- -5:10  # Sequence from -5 to 10

Output: 1.2.2.1.Vectors Examples Output

  • Explanation:
    • sales_data <- c(1200, 1500, 980, 2100, 1800)
      • This creates a numeric vector called sales_data containing five sales figures.
      • Each value is a separate element in the vector, and all are stored as numbers.
    • quarterly_results <- c(Q1=25000, Q2=31000, Q3=28000, Q4=35000)
      • This creates a named numeric vector, where each element is associated with a name (Q1, Q2, etc.).
      • You can access values by their names, e.g., quarterly_results["Q2"] returns 31000.
    • temperature_range <- -5:10
      • The colon operator : generates a sequence of integers from -5 to 10 (inclusive).
      • The resulting vector contains all integer values in that range, useful for generating regular sequences.
**Common Warning Scenarios**
  • Mixing Data Types

    c(1, "two", 3)
    
    • R coerces everything to character, and usually gives no warning —but this can lead to unexpected behavior if you assume it's numeric.
  • Coercion in Logical Operations

    x <- c(TRUE, FALSE, "maybe")
    
    • The string forces coercion to character. No warning, but semantic confusion!
  • Named Vector Mismatches

    named_vec <- c(A=1, B=2)
    named_vec[c("A","B")]
    named_vec[c("A", "C")]
    

    Output:

    > named_vec <- c(A=1, B=2)
    > named_vec
    A B 
    1 2 
    > named_vec[c("A","B")]
    A B 
    1 2 
    > named_vec[c("A", "C")]
      A <NA> 
      1   NA 
    
    • Accessing a nonexistent name ("C") returns NA silently. No warning, but something to watch out for.
**Typical Error Triggers**
  • Out-of-Bounds Indexing

    x <- c(5, 10)
    x[3]
    
    • No error, but returns NA. Error happens if you try to assign to an index that doesn’t exist in a fixed-length structure like a matrix.
  • Wrong Type for Indexing

    x <- c(5, 10)
    x["not_a_name"]
    
    • If no such name exists, returns NA, but some cases (like a NULL or list index) can throw errors.
  • Non-Numeric Operations on a Numeric Vector

    x <- c(1, 2, 3)
    mean("apple")
    
    • You'll get an error: “argument is not numeric or logical: returning NA”.
  • Length Mismatch in Assignment

    x <- c(1, 2, 3)
    x[] <- c(1, 2)  # Warning: number of items to replace is not a multiple of replacement length
    

Output:

> x <- c(1, 2, 3)
> x[] <- c(1, 2)  # Warning: number of items to replace is not a multiple of replacement length
Warning message:
In x[] <- c(1, 2) :
  number of items to replace is not a multiple of replacement length
> x
[1] 1 2 1
  • This warning occurs because you are trying to replace elements in x with a vector of a different length. R will recycle the shorter vector, but it may not be what you intended.

1.1 Vector Math Operations
  • Element-wise Operations:
    • Arithmetic operations on vectors are performed element by element, without explicit loops.
# Create sample dataset
monthly_sales <- c(15000, 18000, 22000, 19000, 25000, 21000)

# Basic arithmetic (vectorized)
monthly_sales * 1.15        # 15% increase: 17250, 20700, 25300...
monthly_sales + 5000        # Bonus addition: 20000, 23000, 27000...
monthly_sales / 1000        # Convert to thousands: 15.0, 18.0, 22.0...

# Vector interaction
target_sales <- c(16000, 20000, 24000, 18000, 26000, 22000)
performance_ratio <- monthly_sales / target_sales  # Element-wise division
variance <- monthly_sales - target_sales           # Difference calculation

Output: 1.2.2.1.Vectors Math Operation Output

  • Explanation:
    • monthly_sales <- c(15000, 18000, 22000, 19000, 25000, 21000)
      • Creates a vector of monthly sales figures for six months.
    • monthly_sales * 1.15
      • Multiplies each sales value by 1.15, increasing each by 15%.
      • The result is a new vector: each element is the original sales value increased by 15%.
    • monthly_sales + 5000
      • Adds 5000 to each element, simulating a bonus or adjustment to every month's sales.
    • monthly_sales / 1000
      • Divides each sales value by 1000, converting the numbers to thousands for easier reading or plotting.
    • target_sales <- c(16000, 20000, 24000, 18000, 26000, 22000)
      • Creates a vector of target sales for the same six months.
    • performance_ratio <- monthly_sales / target_sales
      • Divides each actual sales value by the corresponding target sales value, producing a ratio for each month.
      • If the ratio is above 1, the target was exceeded; below 1, the target was missed.
    • variance <- monthly_sales - target_sales
      • Subtracts the target sales from the actual sales for each month, showing the difference (positive or negative) for each period.

1.2 Vector Recycling in Practice
  • Vector Recycling:
    • If two vectors are of unequal length, R repeats (recycles) the shorter one to match the longer vector's length.
base_prices <- c(100, 150, 200, 120, 180)
discount_rates <- c(0.1, 0.2)  # Only 2 elements

# R recycles discount_rates: 0.1, 0.2, 0.1, 0.2, 0.1
final_prices <- base_prices * (1 - discount_rates)
# Result: 90, 120, 180, 96, 162

1.2.2.2.Vectors Recycling Output

  • Explanation:
    • base_prices <- c(100, 150, 200, 120, 180)
      • A vector of five product base prices.
    • discount_rates <- c(0.1, 0.2)
      • A vector of two discount rates (10% and 20%).
    • When calculating final_prices, R automatically repeats the discount_rates vector to match the length of base_prices:
      • The operation is performed as:
        • 100 * (1 - 0.1) = 90
        • 150 * (1 - 0.2) = 120
        • 200 * (1 - 0.1) = 180
        • 120 * (1 - 0.2) = 96
        • 180 * (1 - 0.1) = 162
      • This demonstrates how R handles operations between vectors of different lengths.

1.3 Vector Functions
  • Common Vector Functions:
    • Functions like sum(), mean(), min(), max(), and which.max() operate on entire vectors.
product_ratings <- c(4.2, 3.8, 4.7, 4.1, 3.9, 4.5, 4.3)

sum(product_ratings)         # 29.5
mean(product_ratings)        # 4.214
min(product_ratings)         # 3.8
max(product_ratings)         # 4.7
which.max(product_ratings)   # 3 (position of maximum)
  • Explanation:
    • product_ratings is a vector of customer ratings for a product.
    • sum(product_ratings)
      • Adds up all the ratings, giving the total score.
    • mean(product_ratings)
      • Calculates the average rating by dividing the sum by the number of ratings.
    • min(product_ratings) and max(product_ratings)
      • Find the lowest and highest ratings in the vector.
    • which.max(product_ratings)
      • Returns the index (position) of the highest rating in the vector (here, the 3rd element).

2. Vector Math

  • Efficient Computation:
    • R's vectorization allows for fast, concise calculations on entire datasets.
**Arithmetic Operations**
  • Bulk Calculations:
    • Multiplying, adding, or dividing vectors applies the operation to each element.
# E-commerce example
product_costs <- c(25, 40, 15, 60, 35, 20)
markup_multiplier <- 2.5

# Calculate retail prices (vectorized)
retail_prices <- product_costs * markup_multiplier
# Result: 62.5, 100.0, 37.5, 150.0, 87.5, 50.0

# Bulk operations
tax_rate <- 0.08
final_prices <- retail_prices * (1 + tax_rate)
savings <- retail_prices - product_costs
profit_margins <- savings / retail_prices
  • Explanation:
    • retail_prices <- product_costs * markup_multiplier
      • Calculates retail prices by multiplying each product's cost by the markup factor.
    • final_prices <- retail_prices * (1 + tax_rate)
      • Applies tax to each retail price to get the final price.
    • savings <- retail_prices - product_costs
      • Calculates the savings (or profit) for each product by subtracting the cost from the retail price.
    • profit_margins <- savings / retail_prices
      • Calculates the profit margin for each product as a percentage of the retail price.
**Vector Interaction Examples**
  • Logical Comparisons:
    • Logical operations return TRUE/FALSE for each element, useful for filtering or flagging data.
# Inventory management
current_stock <- c(45, 23, 67, 12, 89, 34)
reorder_levels <- c(20, 15, 30, 25, 40, 15)

# Determine reorder needs
needs_reorder <- current_stock < reorder_levels  # Logical vector
reorder_quantity <- pmax(0, reorder_levels - current_stock)
  • Explanation:
    • needs_reorder <- current_stock < reorder_levels
      • Compares current stock levels to reorder levels, returning a logical vector (TRUE/FALSE) indicating which items need reordering.
    • reorder_quantity <- pmax(0, reorder_levels - current_stock)
      • Calculates the quantity to reorder for each item, ensuring no negative values (using pmax to compare with 0).

3. Subsetting

  • Extracting Data:
    • Subsetting allows you to select, exclude, or filter elements from vectors and data frames.

Vector Subsetting Examples

  • Indexing Methods:
    • Positive indexing selects elements.
    • Negative indexing excludes elements.
    • Logical indexing selects elements based on a condition.
    • Named indexing uses element names.
# Customer satisfaction scores
customer_scores <- c(8.5, 6.2, 9.1, 7.8, 5.4, 8.9, 7.2, 9.5, 6.8, 8.1)
names(customer_scores) <- paste0("Customer_", 1:10)

# 1. Positive indexing
customer_scores[c(1, 5, 8)]           # Specific customers
customer_scores[3:7]                  # Range of customers

# 2. Negative indexing (exclusion)
customer_scores[-c(2, 4, 6)]          # Exclude specific customers

# 3. Logical subsetting
high_scores <- customer_scores[customer_scores >= 8.0]
low_satisfaction <- customer_scores[customer_scores < 7.0]

# 4. Named access
customer_scores["Customer_3"]
customer_scores[c("Customer_1", "Customer_9")]

1.2.2.3. Vectors Subsetting Output

  • Explanation:
    • customer_scores[c(1, 5, 8)]
      • Selects the 1st, 5th, and 8th elements from the vector, returning the scores for these specific customers.
    • customer_scores[3:7]
      • Selects a range of elements from the 3rd to the 7th, returning the scores for these customers.
    • customer_scores[-c(2, 4, 6)]
      • Excludes the 2nd, 4th, and 6th customers, returning scores for all other customers.
    • high_scores <- customer_scores[customer_scores >= 8.0]
      • Creates a new vector with scores that are 8.0 or higher.
    • low_satisfaction <- customer_scores[customer_scores < 7.0]
      • Creates a new vector with scores lower than 7.0.
    • customer_scores["Customer_3"]
      • Selects the element named "Customer_3", returning the score for this customer.
    • customer_scores[c("Customer_1", "Customer_9")]
      • Selects multiple elements by their names, returning the scores for these customers.

Data Frame Subsetting

  • Accessing Rows and Columns:
    • Data frames can be subset by row and column indices, names, or logical conditions.
# Employee dataset
employee_data <- data.frame(
  Name = c("Sarah", "Mike", "Lisa", "Tom", "Ana"),
  Department = c("Sales", "IT", "HR", "Sales", "IT"),
  Salary = c(55000, 62000, 58000, 51000, 65000),
  Years_Experience = c(3, 5, 7, 2, 6)
)

# Row and column access
employee_data[2, ]                    # Mike's data
employee_data[, "Salary"]             # All salaries
employee_data[1:3, c("Name", "Department")]  # First 3 employees

# Conditional subsetting
sales_team <- employee_data[employee_data$Department == "Sales", ]
senior_staff <- employee_data[employee_data$Years_Experience >= 5, ]
high_earners <- employee_data[employee_data$Salary > 60000, c("Name", "Salary")]
  • Explanation:
    • employee_data is a data frame with columns for employee name, department, salary, and years of experience.
    • employee_data[2, ]
      • Selects the entire second row, returning all information for "Mike".
    • employee_data[, "Salary"]
      • Selects the "Salary" column for all employees, returning a vector of salaries.
    • employee_data[1:3, c("Name", "Department")]
      • Selects the first three rows and only the "Name" and "Department" columns, returning a smaller data frame.
    • sales_team <- employee_data[employee_data$Department == "Sales", ]
      • Filters the data frame to include only employees in the "Sales" department.
    • senior_staff <- employee_data[employee_data$Years_Experience >= 5, ]
      • Filters for employees with five or more years of experience.
    • high_earners <- employee_data[employee_data$Salary > 60000, c("Name", "Salary")]
      • Selects only the names and salaries of employees earning more than 60,000.

4. R Data Types: Basic Types

  • Atomic Types:
    • R's atomic types are the building blocks for all other structures: logical, integer, numeric, complex, character, and raw.
4.1 Logical
  • TRUE/FALSE Values:
    • Used for flags, conditions, and logical operations.
# Quality control flags
passed_inspection <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)
is_premium <- c(FALSE, TRUE, FALSE, FALSE, TRUE, TRUE)

# Logical operations
all_passed <- all(passed_inspection)          # FALSE
any_premium <- any(is_premium)                # TRUE
premium_count <- sum(is_premium)              # 3 (TRUE counts as 1)

# Combining conditions  
premium_and_passed <- is_premium & passed_inspection
premium_or_passed <- is_premium | passed_inspection
  • Explanation:
    • Shows how to combine logical vectors and count TRUE values.
**Common Warning Scenarios**
  • Mixing logical and numeric types

    x <- c(TRUE, 1, FALSE)
    
    • R coerces logicals to numeric (TRUE = 1, FALSE = 0) silently.
  • Logical operations on non-logical data

    y <- c("yes", "no", "maybe")
    z <- y & TRUE
    
    • Warning: NAs introduced by coercion.
**Typical Error Triggers**
  • Invalid logical comparisons

    x <- c(TRUE, FALSE)
    x > 1
    
    • Returns FALSE, but may not be meaningful.
  • Length mismatch in logical indexing

    x <- c(TRUE, FALSE)
    y <- 1:3
    y[x]
    
    • Warning: longer object length is not a multiple of shorter object length.

4.2 Integer
  • Whole Numbers:
    • Use less memory than numeric (double) types.
# Inventory counts (integers save memory for whole numbers)
widget_inventory <- as.integer(c(150, 89, 234, 67, 445))
batch_numbers <- 1001L:1050L  # L suffix ensures integer type

class(widget_inventory)       # "integer"
is.integer(batch_numbers)     # TRUE
storage.mode(widget_inventory) # "integer" (more memory efficient)
  • Explanation:
    • Demonstrates integer creation and checking type.
**Common Warning Scenarios**
  • Mixing integer and numeric

    x <- c(1L, 2.5)
    
    • Coerces to numeric (double).
  • Large integer values

    x <- 2^31
    
    • Stored as numeric, not integer.
**Typical Error Triggers**
  • Assigning non-integer to integer vector

    x <- integer(2)
    x[1] <- 3.5
    
    • Value truncated to 3.
  • Overflow

    x <- as.integer(2^31)
    
    • Results in NA with warning: NAs introduced by coercion.

4.3 Numeric (Double)
  • Decimal Numbers:
    • Used for precise calculations.
# Financial calculations requiring precision
stock_prices <- c(134.56, 89.23, 156.78, 201.45, 78.90)
price_changes <- c(-2.34, +5.67, -0.89, +12.45, -3.21)
percentage_change <- price_changes / stock_prices * 100

# Mathematical operations
volatility <- sd(price_changes)                    # Standard deviation
moving_average <- mean(stock_prices[1:3])          # First 3 stocks
  • Explanation:
    • Shows financial calculations and statistical summaries.
**Common Warning Scenarios**
  • Precision loss

    x <- 0.1 + 0.2
    x == 0.3
    
    • Returns FALSE due to floating-point precision.
  • Mixing numeric and character

    x <- c(1.5, "two")
    
    • Coerces to character.
**Typical Error Triggers**
  • Non-numeric input to numeric functions
    mean("apple")
    
    • Error: argument is not numeric or logical: returning NA.

4.4 Complex
  • Complex Numbers:
    • Useful in engineering and scientific calculations.
# Engineering calculations with complex numbers
impedance_values <- c(3+4i, 2-5i, 1+1i, 4+0i)
phase_angles <- Arg(impedance_values)              # Phase angles
magnitudes <- Mod(impedance_values)               # Magnitudes

# Complex arithmetic
total_impedance <- sum(impedance_values)
power_factor <- Re(impedance_values) / Mod(impedance_values)
  • Explanation:
    • Demonstrates operations on complex numbers, such as magnitude and phase.
**Common Warning Scenarios**
  • Mixing complex and real

    x <- c(1+2i, 3)
    
    • 3 is coerced to 3+0i.
  • Coercion to complex

    as.complex("abc")
    
    • Warning: NAs introduced by coercion.
**Typical Error Triggers**
  • Invalid operations
    sqrt(-1)
    
    • Returns NaN unless explicitly using complex numbers.

4.5 Character
  • Text Data:
    • Used for names, labels, and string manipulation.
# Customer database
customer_names <- c("Johnson Electronics", "Smith & Co", "Brown Industries")
email_addresses <- c("info@johnson.com", "contact@smith.co", "sales@brown.ind")

# String operations
name_lengths <- nchar(customer_names)             # Character counts
total_contacts <- length(customer_names)          # Number of customers

# String manipulation
full_contact <- paste(customer_names, email_addresses, sep = " - ")
company_codes <- substr(customer_names, 1, 3)    # First 3 characters
uppercase_names <- toupper(customer_names)
  • Explanation:
    • Shows string operations like counting characters, concatenation, and case conversion.
**Common Warning Scenarios**
  • Mixing character and numeric

    x <- c("a", 1)
    
    • Coerces all to character.
  • NA in character operations

    x <- c("a", NA)
    nchar(x)
    
    • Returns NA for missing values.
**Typical Error Triggers**
  • Invalid substring indices

    substr("abc", 2, 1)
    
    • Returns "" (empty string).
  • Non-character input to string functions

    nchar(123)
    
    • Works (coerces to character), but may be unexpected.

4.6 Raw
  • Binary Data:
    • Used for low-level data storage and manipulation.
# Binary data handling
file_header <- charToRaw("PNG")                   # Convert to raw bytes
hex_values <- as.raw(c(137, 80, 78, 71))         # PNG file signature
binary_data <- as.raw(0:255)                     # Full byte range
  • Explanation:
    • Demonstrates conversion to raw bytes and handling binary data.
**Common Warning Scenarios**
  • Coercion to raw

    as.raw(300)
    
    • Warning: out-of-range values are truncated modulo 256.
  • Mixing raw and other types

    x <- c(as.raw(1), 2)
    
    • Coerces to integer.
**Typical Error Triggers**
  • Invalid conversion
    as.raw("abc")
    
    • Error: cannot coerce type 'character' to vector of type 'raw'.

5. R Data Types: Vector

  • Homogeneous Sequences:
    • Vectors are the backbone of R, storing data of a single type.

Vector Characteristics and Operations

  • Properties and Naming:
    • Vectors can be named for easier access and summarized with functions like length() and sum().
# Product catalog
product_names <- c("Laptop", "Mouse", "Keyboard", "Monitor", "Webcam")
product_prices <- c(899.99, 25.50, 89.00, 299.99, 75.00)

# Vector properties
length(product_names)                 # 5
sum(nchar(product_names))            # Total characters: 28

# Named vectors (dictionary-style)
names(product_prices) <- product_names
product_prices["Laptop"]             # 899.99
product_prices[c("Mouse", "Webcam")] # Multiple items
  • Explanation:
    • Shows how to name vector elements and access them by name.

Vector Combination and Recycling

  • Combining and Recycling:
    • Vectors can be combined, and shorter vectors are recycled in operations.
electronics <- c("Smartphone", "Tablet", "Headphones")
accessories <- c("Case", "Charger", "Stand", "Cable")

# Combine vectors
all_products <- c(electronics, accessories)
# Result: "Smartphone", "Tablet", "Headphones", "Case", "Charger", "Stand", "Cable"

# Element-wise string operations
category_prefix <- c("ELEC", "ACC")
product_codes <- paste(category_prefix, 1:7, sep="-")
# Recycling: "ELEC-1", "ACC-2", "ELEC-3", "ACC-4", "ELEC-5", "ACC-6", "ELEC-7"
  • Explanation:
    • Demonstrates combining vectors and recycling in string operations.

6. R Data Types: List

  • Heterogeneous Containers:
    • Lists can hold any type of R object, including other lists.

Basic List Creation and Access

  • Creating Lists:
    • Lists can contain vectors, data frames, and other objects.
# Customer order information
customer_order <- list(
  order_id = "ORD-2024-001",
  customer_info = c(name="Alice Johnson", email="alice@email.com"),
  items = data.frame(
    product = c("Laptop", "Mouse", "Keyboard"),
    quantity = c(1, 2, 1),
    price = c(899.99, 25.50, 89.00)
  ),
  order_total = 939.99,
  shipped = TRUE
)

# Access methods
customer_order$order_id              # "ORD-2024-001"
customer_order[["customer_info"]]    # Named vector
customer_order[[3]]                  # Data frame (items)

Output: 1.2.2.6.-R-Data-Types-List.png

  • Explanation:
    • customer_order is a list containing various elements: a character string for the order ID, a named vector for customer info, a data frame for items, a numeric value for order total, and a logical value for shipment status.
    • customer_order$order_id
      • Extracts the order_id element from the list.
    • customer_order[["customer_info"]]
      • Extracts the customer_info element as a named vector.
    • customer_order[[3]]
      • Extracts the third element (data frame of items) from the list.
**Nested Lists**
  • Lists within Lists:
    • Useful for representing hierarchical or grouped data.
# Sales reporting system
quarterly_report <- list(
  Q1 = list(
    revenue = c(Jan=45000, Feb=52000, Mar=48000),
    top_products = c("Widget A", "Widget B", "Widget C"),
    customer_count = 234
  ),
  Q2 = list(
    revenue = c(Apr=55000, May=61000, Jun=58000),
    top_products = c("Widget B", "Widget D", "Widget A"),
    customer_count = 267
  )
)

**Nested access**
quarterly_report$Q1$revenue["Feb"]   # 52000
quarterly_report[[1]][[1]][2]        # 52000 (same result)
quarterly_report$Q2$customer_count   # 267
  • Explanation:
    • quarterly_report is a nested list: each quarter (Q1, Q2) contains its own list with elements for revenue, top products, and customer count.
    • quarterly_report$Q1$revenue["Feb"]
      • Accesses the revenue for February in Q1.
    • quarterly_report[[1]][[1]][2]
      • Another way to access the same value using double bracket notation.
    • quarterly_report$Q2$customer_count
      • Accesses the customer count for Q2.
**List Manipulation**
  • Modifying Lists:
    • Lists can be expanded or processed with functions like sapply().
# Add new quarter
quarterly_report$Q3 <- list(
  revenue = c(Jul=63000, Aug=59000, Sep=65000),
  top_products = c("Widget D", "Widget A", "Widget E"),
  customer_count = 289
)

# Extract all customer counts
customer_counts <- sapply(quarterly_report, function(q) q$customer_count)
# Result: Q1=234, Q2=267, Q3=289
  • Explanation:
    • quarterly_report$Q3 <- list(...)
      • Adds a new element for Q3 with its own nested list.
    • sapply(quarterly_report, function(q) q$customer_count)
      • Applies a function to each element of the quarterly_report list, extracting the customer_count from each sub-list.

7. R Data Types: Factor

  • Categorical Data:
    • Factors efficiently store categorical variables with fixed levels.

Creating and Managing Factors

  • Defining Levels:
    • Factors can be ordered or unordered, and levels can be set explicitly.
# Customer satisfaction survey
satisfaction_levels <- c("Very Satisfied", "Satisfied", "Neutral", 
                        "Satisfied", "Very Satisfied", "Dissatisfied", 
                        "Very Satisfied", "Neutral", "Satisfied")

satisfaction_factor <- factor(satisfaction_levels, 
  levels = c("Very Dissatisfied", "Dissatisfied", "Neutral", 
             "Satisfied", "Very Satisfied"),
  ordered = TRUE
)

# Factor properties
levels(satisfaction_factor)          # All possible levels
table(satisfaction_factor)           # Frequency count
nlevels(satisfaction_factor)         # 5 levels

Output: 1.2.2.7.-R-Data-Types-Factor.png

  • Explanation:
    • satisfaction_levels is a character vector of survey responses.
    • factor(satisfaction_levels, levels = ..., ordered = TRUE)
      • Converts the character vector to an ordered factor, specifying the order of levels.
    • levels(satisfaction_factor)
      • Returns the defined levels of the factor.
    • table(satisfaction_factor)
      • Provides a frequency count of each level in the factor.
    • nlevels(satisfaction_factor)
      • Returns the number of levels in the factor.

Factor Operations

  • Analyzing and Modifying Factors:
    • Frequency tables and level modification are common operations.
# Product categories
product_categories <- factor(c("Electronics", "Clothing", "Books", 
                              "Electronics", "Home", "Books", "Clothing"),
                            levels = c("Electronics", "Clothing", "Books", "Home", "Sports"))

# Frequency analysis
category_counts <- table(product_categories)
# Electronics: 2, Clothing: 2, Books: 2, Home: 1, Sports: 0

# Level modification
levels(product_categories)[levels(product_categories) == "Electronics"] <- "Tech"
# Changes "Electronics" to "Tech" throughout the factor
  • Explanation:
    • product_categories is a factor representing the category of each product.
    • table(product_categories)
      • Creates a frequency table showing the count of each category.
    • levels(product_categories)[levels(product_categories) == "Electronics"] <- "Tech"
      • Modifies the levels of the factor, renaming "Electronics" to "Tech".

Ordered Factors

  • Ordered Categories:
    • Useful for ordinal data where order matters.
# Priority levels
task_priority <- ordered(c("High", "Medium", "Low", "High", "Medium"),
                        levels = c("Low", "Medium", "High"))

# Ordered comparisons
high_priority_tasks <- task_priority >= "Medium"  # TRUE, TRUE, FALSE, TRUE, TRUE
urgent_count <- sum(task_priority == "High")      # 2
  • Explanation:
    • task_priority is an ordered factor representing the priority of tasks.
    • task_priority >= "Medium"
      • Compares each element to "Medium", returning a logical vector indicating if the condition is TRUE or FALSE for each task.
    • sum(task_priority == "High")
      • Counts the number of tasks with "High" priority.

8. R Data Types: Matrix

  • 2D Homogeneous Storage:
    • Matrices store data of a single type in two dimensions.

Matrix Creation and Properties

  • Creating Matrices:
    • Use matrix() and set row/column names for clarity.
# Sales data by month and region
sales_matrix <- matrix(
  c(15000, 18000, 22000, 19000,    # Region 1
    12000, 16000, 20000, 17000,    # Region 2  
    14000, 15000, 18000, 16000),   # Region 3
  nrow = 3, ncol = 4, byrow = TRUE,
  dimnames = list(
    Region = c("North", "South", "West"),
    Month = c("Jan", "Feb", "Mar", "Apr")
  )
)

dim(sales_matrix)                    # 3 4 (3 rows, 4 columns)
rownames(sales_matrix)               # "North", "South", "West"
colnames(sales_matrix)               # "Jan", "Feb", "Mar", "Apr"

Output:

1.2.2.8.1.R-Data-Types-Matrix.png

  • Explanation:
    • sales_matrix is a matrix with 3 rows and 4 columns, filled by rows with sales data for three regions over four months.
    • dim(sales_matrix)
      • Returns the dimensions of the matrix: 3 rows and 4 columns.
    • rownames(sales_matrix) and colnames(sales_matrix)
      • Return the row and column names, respectively.

Matrix Operations

  • Mathematical Operations:
    • Row/column sums, element-wise arithmetic, and transposition.
# Mathematical operations
total_by_region <- rowSums(sales_matrix)    # Regional totals
total_by_month <- colSums(sales_matrix)     # Monthly totals
overall_total <- sum(sales_matrix)          # Grand total

# Matrix arithmetic
growth_factors <- matrix(c(1.05, 1.08, 1.12, 1.10), nrow = 3, ncol = 4, byrow = TRUE)
projected_sales <- sales_matrix * growth_factors  # Element-wise multiplication

# Matrix transpose
monthly_by_region <- t(sales_matrix)        # 4x3 matrix (months as rows)
  • Explanation:
    • rowSums(sales_matrix) and colSums(sales_matrix)
      • Calculate the sum of each row and each column, respectively.
    • sum(sales_matrix)
      • Calculates the grand total of all elements in the matrix.
    • t(sales_matrix)
      • Transposes the matrix, swapping rows and columns.

Matrix Subsetting

  • Extracting Data:
    • Access elements by row/column names or indices.
# Access specific elements
sales_matrix["North", "Mar"]         # North region, March sales
sales_matrix[1, ]                    # First row (North region)
sales_matrix[, "Feb"]                # February column
sales_matrix[c("North", "South"), c("Jan", "Mar")]  # Subset regions and months
  • Explanation:
    • sales_matrix["North", "Mar"]
      • Selects the element at the intersection of the "North" row and "Mar" column.
    • sales_matrix[1, ]
      • Selects the entire first row, returning all data for the "North" region.
    • sales_matrix[, "Feb"]
      • Selects the entire "Feb" column, returning sales data for February across all regions.
    • sales_matrix[c("North", "South"), c("Jan", "Mar")]
      • Selects specific rows and columns, returning a subset of the matrix.
**Common Warning Scenarios**
  • Mixing types in matrix

    m <- matrix(c(1, "a"), nrow=1)
    
    • All elements coerced to character.
  • Assigning wrong length

    m <- matrix(1:4, 2, 2)
    m[] <- 1:3
    
    • Warning: number of items to replace is not a multiple of replacement length.
**Typical Error Triggers**
  • Invalid subscript

    m <- matrix(1:4, 2, 2)
    m[3, 1]
    
    • Error: subscript out of bounds.
  • Assigning incompatible type

    m <- matrix(1:4, 2, 2)
    m[1,1] <- list(1)
    
    • Error: replacement has length zero.

9. R Data Types: Array

  • Multi-dimensional Data:
    • Arrays generalize matrices to more than two dimensions.

Creating Multi-dimensional Arrays

  • Defining Dimensions:
    • Use array() and specify dimension names for clarity.
# Sales data by Product × Region × Quarter
sales_array <- array(
  data = c(
    # Product A: Regions × Quarters
    15000, 12000, 14000,  # Q1: North, South, West
    18000, 16000, 15000,  # Q2: North, South, West
    # Product B: Regions × Quarters  
    22000, 20000, 18000,  # Q1: North, South, West
    25000, 23000, 21000   # Q2: North, South, West
  ),
  dim = c(3, 2, 2),  # 3 regions, 2 quarters, 2 products
  dimnames = list(
    Region = c("North", "South", "West"),
    Quarter = c("Q1", "Q2"),
    Product = c("Product_A", "Product_B")
  )
)
  • Explanation:
    • sales_array is a 3-dimensional array with dimensions for regions, quarters, and products, containing sales data for two products across two quarters in three regions.
    • dim(sales_array)
      • Returns the dimensions of the array: 3 regions, 2 quarters, and 2 products.
    • dimnames(sales_array)
      • Sets the names for each dimension, allowing for easier identification of data.

Array Access and Operations

  • Accessing and Summarizing:
    • Use indices and apply() to extract and summarize data along dimensions.
# Access specific elements
sales_array["North", "Q2", "Product_B"]     # 25000
sales_array[, "Q1", ]                       # All regions, Q1, all products
sales_array["South", , "Product_A"]         # South region, all quarters, Product A

# Calculations across dimensions
regional_totals <- apply(sales_array, 1, sum)      # Sum by region
quarterly_totals <- apply(sales_array, 2, sum)     # Sum by quarter
product_totals <- apply(sales_array, 3, sum)       # Sum by product
  • Explanation:
    • sales_array["North", "Q2", "Product_B"]
      • Selects the sales figure for Product B in the North region for Q2.
    • apply(sales_array, 1, sum)
      • Applies the sum function across the first dimension (rows), calculating total sales for each region.
    • apply(sales_array, 2, sum)
      • Applies the sum function across the second dimension (columns), calculating total sales for each quarter.
    • apply(sales_array, 3, sum)
      • Applies the sum function across the third dimension (slices), calculating total sales for each product.

Array Manipulation

  • Expanding Arrays:
    • Arrays can be extended by combining with new data (using packages like abind).
# Add new quarter data
Q3_data <- array(c(20000, 18000, 16000, 28000, 26000, 24000), 
                dim = c(3, 1, 2),
                dimnames = list(c("North", "South", "West"), "Q3", c("Product_A", "Product_B")))

# Combine arrays (would require abind package in practice)
# expanded_sales <- abind(sales_array, Q3_data, along = 2)
  • Explanation:
    • Q3_data is a new array containing sales data for Q3, structured similarly to the original sales_array.
    • abind(sales_array, Q3_data, along = 2)
      • This hypothetical line (commented out) would combine the original sales array with the new Q3 data along the second dimension (quarters), expanding the array to include the new data.

10. R Data Types: Data Frame

  • Spreadsheet-like Tables:
    • Data frames store heterogeneous columns, similar to Excel tables.

R DataFrame Structure

Creating and Managing Data Frames

  • Defining Data Frames:
    • Use data.frame() to create tables with named columns of different types.
# Employee management system
employee_df <- data.frame(
  EmployeeID = c(1001, 1002, 1003, 1004, 1005),
  Name = c("Sarah Wilson", "Mike Chen", "Lisa Anderson", "Tom Rodriguez", "Ana Petrov"),
  Department = factor(c("Sales", "Engineering", "HR", "Sales", "Engineering")),
  Salary = c(65000, 85000, 58000, 62000, 90000),
  StartDate = as.Date(c("2020-03-15", "2019-07-22", "2021-01-10", "2020-11-05", "2018-09-30")),
  Remote = c(FALSE, TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Data frame properties
str(employee_df)                     # Structure overview
nrow(employee_df)                    # 5 employees
ncol(employee_df)                    # 6 columns
names(employee_df)                   # Column names
  • Explanation:
    • employee_df is a data frame with columns for employee ID, name, department, salary, start date, and remote work status.
    • str(employee_df)
      • Displays the structure of the data frame, including the type and example values of each column.
    • nrow(employee_df) and ncol(employee_df)
      • Return the number of rows (employees) and columns (attributes) in the data frame, respectively.
    • names(employee_df)
      • Returns the names of the columns in the data frame.

Data Frame Operations

  • Accessing and Filtering:
    • Access columns by name, filter rows by condition, and add new columns.
# Column access
employee_df$Name                     # Name column
employee_df[["Salary"]]             # Salary column  
employee_df$Department              # Factor column

# Row filtering
engineering_team <- employee_df[employee_df$Department == "Engineering", ]
high_earners <- employee_df[employee_df$Salary >= 70000, ]
remote_workers <- employee_df[employee_df$Remote == TRUE, c("Name", "Department")]

# Adding new data
employee_df$YearsService <- as.numeric(Sys.Date() - employee_df$StartDate) / 365.25
employee_df$SalaryGrade <- cut(employee_df$Salary, 
                              breaks = c(0, 60000, 80000, Inf), 
                              labels = c("Junior", "Mid", "Senior"))
  • Explanation:
    • employee_df$Name
      • Extracts the "Name" column as a character vector.
    • employee_df[["Salary"]]
      • Extracts the "Salary" column as a numeric vector.
    • employee_df$Department
      • Extracts the "Department" column, which is a factor (categorical variable).
    • engineering_team <- employee_df[employee_df$Department == "Engineering", ]
      • Filters the data frame to include only employees in the "Engineering" department.
    • high_earners <- employee_df[employee_df$Salary >= 70000, ]
      • Filters for employees with a salary of 70,000 or more.
    • remote_workers <- employee_df[employee_df$Remote == TRUE, c("Name", "Department")]
      • Selects the names and departments of employees who work remotely.
    • employee_df$YearsService <- as.numeric(Sys.Date() - employee_df$StartDate) / 365.25
      • Calculates the number of years each employee has worked by subtracting their start date from the current date and dividing by 365.25 (to account for leap years).
      • Adds this as a new column "YearsService".
    • employee_df$SalaryGrade <- cut(employee_df$Salary, breaks = c(0, 60000, 80000, Inf), labels = c("Junior", "Mid", "Senior"))
      • Categorizes employees into "Junior", "Mid", or "Senior" based on their salary, and adds this as a new column.

Data Frame Summary and Analysis

  • Summarizing Data:
    • Use summary(), mean(), table(), and aggregate() for quick analysis.
# Statistical summaries
summary(employee_df)                 # Overall summary
mean(employee_df$Salary)             # Average salary: 72000
table(employee_df$Department)        # Department distribution

# Advanced operations
by_dept_salary <- aggregate(Salary ~ Department, employee_df, mean)
# Average salary by department
  • Explanation:
    • summary(employee_df)
      • Provides a summary of each column in the data frame, including min, max, mean, and quartiles for numeric columns, and counts for factors.
    • mean(employee_df$Salary)
      • Calculates the average salary across all employees.
    • table(employee_df$Department)
      • Counts the number of employees in each department.
    • by_dept_salary <- aggregate(Salary ~ Department, employee_df, mean)
      • Groups the data by department and calculates the mean salary for each group, returning a summary table.

11. Data Frames: Order and Merge

  • Sorting and Joining:
    • Data frames can be sorted by one or more columns and merged (joined) with others.

Ordering Data Frames

  • Sorting Rows:
    • Use order() to sort by one or more columns.
# Customer transaction data
transaction_df <- data.frame(
  CustomerID = c(101, 102, 101, 103, 102, 104),
  TransactionDate = as.Date(c("2024-01-15", "2024-01-18", "2024-01-20", 
                             "2024-01-22", "2024-01-25", "2024-01-28")),
  Amount = c(150.00, 89.50, 220.00, 175.50, 95.00, 310.00),
  Product = c("Widget A", "Widget B", "Widget C", "Widget A", "Widget B", "Widget D")
)

# Single column sorting
by_amount <- transaction_df[order(transaction_df$Amount), ]              # Ascending
by_amount_desc <- transaction_df[order(transaction_df$Amount, decreasing = TRUE), ] # Descending

# Multiple column sorting
by_customer_date <- transaction_df[order(transaction_df$CustomerID, transaction_df$TransactionDate), ]
by_customer_amount <- transaction_df[order(transaction_df$CustomerID, -transaction_df$Amount), ]
  • Explanation:
    • transaction_df is a data frame containing transaction data with columns for customer ID, transaction date, amount, and product.
    • order(transaction_df$Amount)
      • Returns the order of indices that would sort the Amount column in ascending order.
    • transaction_df[order(transaction_df$Amount), ]
      • Reorders the rows of the data frame according to the sorted order of the Amount column, giving a data frame sorted by transaction amount.
    • transaction_df[order(transaction_df$Amount, decreasing = TRUE), ]
      • Sorts the data frame by transaction amount in descending order.
    • order(transaction_df$CustomerID, transaction_df$TransactionDate)
      • Returns the order of indices that would sort the data first by CustomerID and then by TransactionDate within each customer ID group.
    • transaction_df[order(transaction_df$CustomerID, transaction_df$TransactionDate), ]
      • Sorts the data frame first by customer ID and then by transaction date.

Data Frame Merging

  • Combining Data Frames:
    • Use merge() for inner, left, and full joins.
# Customer information
customer_info <- data.frame(
  CustomerID = c(101, 102, 103, 104, 105),
  CustomerName = c("ABC Corp", "XYZ Ltd", "Johnson Inc", "Smith Co", "Brown LLC"),
  CustomerType = c("Premium", "Standard", "Premium", "Standard", "Premium")
)

# Transaction summary
transaction_summary <- data.frame(
  CustomerID = c(101, 102, 103, 104),
  TotalAmount = c(370.00, 184.50, 175.50, 310.00),
  TransactionCount = c(2, 2, 1, 1)
)

# Different join types
# Inner join (only matching records)
inner_result <- merge(customer_info, transaction_summary, by = "CustomerID")

# Left join (all customers, matched transactions)
left_result <- merge(customer_info, transaction_summary, by = "CustomerID", all.x = TRUE)

# Full outer join (all records from both)
full_result <- merge(customer_info, transaction_summary, by = "CustomerID", all = TRUE)

# Add merge indicator
full_result$merge_flag <- with(full_result,
  ifelse(!is.na(CustomerName) & is.na(TotalAmount), "customer_only",
         ifelse(is.na(CustomerName) & !is.na(TotalAmount), "transaction_only", "both")))
  • Explanation:
    • customer_info and transaction_summary are data frames containing customer details and transaction summaries, respectively.
    • merge(customer_info, transaction_summary, by = "CustomerID")
      • Performs an inner join, merging the two data frames by CustomerID and keeping only the rows with matching CustomerID in both data frames.
    • merge(customer_info, transaction_summary, by = "CustomerID", all.x = TRUE)
      • Performs a left join, keeping all rows from customer_info and adding matching rows from transaction_summary.
      • If there is no match, NA is filled in for columns from transaction_summary.
    • merge(customer_info, transaction_summary, by = "CustomerID", all = TRUE)
      • Performs a full outer join, keeping all rows from both data frames and filling in NA where there are no matches.
    • full_result$merge_flag <- with(full_result, ...)
      • Creates a new column merge_flag to indicate the source of each row: "customer_only", "transaction_only", or "both".

Advanced Merging Examples

  • Merging on Multiple Keys:
    • Useful for combining detailed transaction and product data.
# Multiple key merging
product_sales <- data.frame(
  Product = c("Widget A", "Widget B", "Widget C", "Widget D"),
  Category = c("Electronics", "Electronics", "Home", "Home"),
  UnitPrice = c(25.00, 15.50, 45.00, 65.00)
)

detailed_transactions <- merge(
  transaction_df, 
  product_sales, 
  by = "Product", 
  all.x = TRUE
)

# Calculate extended amounts
detailed_transactions$ExtendedAmount <- 
  detailed_transactions$Amount / detailed_transactions$UnitPrice * detailed_transactions$UnitPrice
  • Explanation:
    • product_sales is a data frame containing product details, including unit price.
    • merge(transaction_df, product_sales, by = "Product", all.x = TRUE)
      • Merges transaction data with product details by Product, adding product information to each transaction.
    • detailed_transactions$ExtendedAmount <- ...
      • Calculates the extended amount for each transaction based on the unit price and amount.

Performance & Memory Optimization

  • Tips for Efficient R Code:
    • Pre-allocate vectors for loops to avoid memory overhead.
    • Use vectorized operations for speed.
    • Store categorical data as factors to save memory.
    • Use matrices for fast linear algebra.
    • For large data, use data.table for performance.

Quick Reference Summary

Structure Creation Function Access Method Key Benefit
Vector c(), seq() vec[i], vec["name"] Vectorized operations
Factor factor(), ordered() fac[i], levels() Categorical efficiency
Matrix matrix(), cbind() mat[i,j] Mathematical operations
Array array() arr[i,j,k,...] Multi-dimensional data
Data Frame data.frame() df$col, df[i,j] Mixed-type datasets
List list() list[[i]], list$name Flexible containers

**Resource download links**

1.2.2.-Data-Structures-or-Objects-in-R.zip