contact@a2zlearners.com

2.3.1. Introduction to Data Wrangling

1. Introduction

Data wrangling is the process of transforming and preparing raw data into a tidy, usable format for analysis. After importing your data and reshaping it as needed, wrangling ensures your dataset is clean, consistent, and ready for further exploration.

2. Why Data Wrangling?

  • Real-world data is often messy, incomplete, or inconsistent.
  • Wrangling helps:
    • Filter and select relevant data.
    • Clean and standardize variable names.
    • Create new variables or transform existing ones.
    • Summarize, group, and join datasets.
    • Prepare data for visualization and modeling.

3. Essential R Packages for Data Wrangling

  • dplyr: Core tidyverse package for data manipulation.
  • tidyr: For reshaping and tidying data.
  • janitor: For cleaning data frames and variable names.
  • skimr: For quick, comprehensive data summaries.

4. Key dplyr Functions

  • %>% : Pipe operator for chaining commands.
  • glimpse() : Overview of dataset structure.
  • filter() : Filter rows by condition.
  • select() : Select, rename, or reorder columns.
  • rename() : Rename columns.
  • arrange() : Reorder rows.
  • mutate() : Create or transform columns.
  • group_by() : Group data for summary.
  • summarize() : Summarize grouped data.
  • left_join() : Merge datasets.
  • tally() : Count rows or sum values.
  • count() : Count unique values.
  • add_count() : Add counts as a new column.
  • add_tally() : Add tally as a new column.

5. Key tidyr Functions

  • unite() : Combine multiple columns into one.
  • separate() : Split one column into multiple columns.

6. Key janitor Functions

  • clean_names() : Standardize column names.
  • tabyl() : Frequency tables for variables.
  • get_dupes() : Identify duplicate rows.
# Install janitor if needed
# install.packages('janitor')
library(janitor)

7. Key skimr Function

  • skim() : Quick, detailed summary of a data frame.
# Install skimr if needed
# install.packages('skimr')
library(skimr)

8. Data Wrangling Examples

8.1. Filtering and Selecting Data

Input Table:

Name Age Gender Score
Alice 30 F 88
Bob 28 M 75
Carol 25 F 92
David 35 M 85
library(dplyr)
filtered <- df %>%
  filter(Age > 28) %>%
  select(Name, Score)
filtered

Output Table:

Name Score
Alice 88
David 85
  • Filters rows where Age > 28 and selects only Name and Score columns.

8.2. Creating and Transforming Columns

Input Table:

Name Score
Alice 88
Bob 75
Carol 92
mutated <- df %>%
  mutate(Passed = Score >= 80)
mutated

Output Table:

Name Score Passed
Alice 88 TRUE
Bob 75 FALSE
Carol 92 TRUE
  • Adds a new column Passed based on Score.

8.3. Grouping and Summarizing Data

Input Table:

Name Gender Score
Alice F 88
Bob M 75
Carol F 92
David M 85
summary <- df %>%
  group_by(Gender) %>%
  summarize(Avg_Score = mean(Score))
summary

Output Table:

Gender Avg_Score
F 90
M 80
  • Groups by Gender and calculates average Score.

8.4. Cleaning Column Names

Input Table:

First Name Last Name Test Score
Alice Smith 88
Bob Jones 75
cleaned <- df %>%
  clean_names()
cleaned

Output Table:

first_name last_name test_score
Alice Smith 88
Bob Jones 75
  • Converts column names to snake_case for consistency.

8.5. Summarizing Data with skimr

Input Table:

Name Age Gender Score
Alice 30 F 88
Bob 28 M 75
Carol 25 F 92
David 35 M 85
library(skimr)
skim(df)

Expected Output (summary excerpt):

variable type missing complete mean sd min max n_unique
Name character 0 4 4
Age numeric 0 4 29.5 4.2 25 35 4
Gender character 0 4 2
Score numeric 0 4 85 7.1 75 92 4
  • skim(df) provides a quick, detailed summary of each column, including type, missing values, mean, standard deviation, min, max, and number of unique values.