2.3.1. Introduction to Data Wrangling

1. Introduction

Data wrangling is the process of transforming and preparing raw data into a tidy, usable format for analysis. After importing your data and reshaping it as needed, wrangling ensures your dataset is clean, consistent, and ready for further exploration.

2. Why Data Wrangling?

Real-world data is often messy, incomplete, or inconsistent.
Wrangling helps you:
- Filter and select relevant data.
- Clean and standardize variable names.
- Create new variables or transform existing ones.
- Summarize, group, and join datasets.
- Prepare data for visualization and modeling.

3. Essential R Packages for Data Wrangling

dplyr: Core tidyverse package for data manipulation.
tidyr: For reshaping and tidying data.
janitor: For cleaning data frames and variable names.
skimr: For quick, comprehensive data summaries.

4. Key dplyr Functions

%>% : Pipe operator for chaining commands; passes the result on its left as the first argument of the function on its right.
glimpse() : Overview of dataset structure.
filter() : Filter rows by condition.
select() : Select, rename, or reorder columns.
rename() : Rename columns.
arrange() : Reorder rows.
mutate() : Create or transform columns.
group_by() : Group data for summary.
summarize() : Summarize grouped data.
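left_join() : Merge two datasets on shared key columns, keeping every row of the left one.
tally() : Count rows in each group (or sum a weight column).
count() : Count rows for each unique value of one or more columns.
add_count() : Like count(), but adds the counts as a new column and keeps all rows.
add_tally() : Like tally(), but adds the counts as a new column and keeps all rows.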
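
Several of the functions listed above have no worked example later in this section, so the following is a minimal sketch of rename(), arrange(), left_join(), count(), add_count(), and tally(). The scores and teams tables and their column names are hypothetical, made up for illustration only.

```r
library(dplyr)

# Hypothetical data, reusing the names from the examples in this section
scores <- data.frame(
  Name   = c("Alice", "Bob", "Carol", "David"),
  Gender = c("F", "M", "F", "M"),
  Score  = c(88, 75, 92, 85),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table; David is left out on purpose
teams <- data.frame(
  Name = c("Alice", "Bob", "Carol"),
  Team = c("Red", "Blue", "Red"),
  stringsAsFactors = FALSE
)

# rename() uses new_name = old_name; arrange() sorts rows, desc() for descending
ranked <- scores %>%
  rename(Points = Score) %>%
  arrange(desc(Points))

# left_join() keeps every row of scores; rows without a match get NA in Team
joined <- left_join(scores, teams, by = "Name")

# count() returns one row per Gender, with the number of rows in column n
gender_counts <- scores %>% count(Gender)

# add_count() computes the same counts but appends them to the original rows
with_counts <- scores %>% add_count(Gender)

# tally() counts the rows in each existing group
tallied <- scores %>%
  group_by(Gender) %>%
  tally()
```

Here joined still has four rows, with NA in Team for David, while gender_counts collapses the data to one row per Gender.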

5. Key tidyr Functions

unite() : Combine multiple columns into one.
separate() : Split one column into multiple columns.
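
As a rough illustration of the two functions above, the sketch below first combines two name columns with unite() and then splits a date string with separate(). The people table and its column names are hypothetical.

```r
library(tidyr)

# Hypothetical table for illustration
people <- data.frame(
  First = c("Alice", "Bob"),
  Last  = c("Smith", "Jones"),
  Born  = c("1994-03-15", "1996-11-02"),
  stringsAsFactors = FALSE
)

# unite(): paste First and Last into one column, separated by a space
# (the input columns are dropped by default)
united <- unite(people, "Full_Name", First, Last, sep = " ")

# separate(): split Born into three columns at each "-"
# convert = TRUE turns the resulting text pieces into integers
split_up <- separate(
  united, Born,
  into = c("Year", "Month", "Day"),
  sep = "-", convert = TRUE
)
```

Recent tidyr releases mark separate() as superseded in favor of functions such as separate_wider_delim(), but separate() remains available.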

6. Key janitor Functions

clean_names() : Standardize column names.
tabyl() : Frequency tables for variables.
get_dupes() : Identify duplicate rows.

# Install janitor if needed
# install.packages('janitor')
library(janitor)
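
clean_names() is demonstrated in section 8.4; the sketch below covers the other two functions. The survey table is hypothetical and contains one deliberately repeated row.

```r
library(janitor)

# Hypothetical table in which Bob appears twice
survey <- data.frame(
  Name   = c("Alice", "Bob", "Bob", "Carol"),
  Gender = c("F", "M", "M", "F"),
  stringsAsFactors = FALSE
)

# tabyl(): frequency table with the columns Gender, n, and percent
gender_table <- tabyl(survey, Gender)

# get_dupes(): rows that share the same values in the named columns,
# plus a dupe_count column giving how often each combination occurs
dupes <- get_dupes(survey, Name, Gender)
```

Naming the columns in get_dupes() makes the duplicate check explicit; if no columns are given, it compares all of them.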

7. Key skimr Function

skim() : Quick, detailed summary of a data frame.

# Install skimr if needed
# install.packages('skimr')
library(skimr)

8. Data Wrangling Examples

8.1. Filtering and Selecting Data

Input Table:

Name	Age	Gender	Score
Alice	30	F	88
Bob	28	M	75
Carol	25	F	92
David	35	M	85

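library(dplyr)
# Recreate the input table above
df <- data.frame(
  Name = c("Alice", "Bob", "Carol", "David"), Age = c(30, 28, 25, 35),
  Gender = c("F", "M", "F", "M"), Score = c(88, 75, 92, 85)
)
filtered <- df %>%
  filter(Age > 28) %>%   # keep rows where Age is above 28
  select(Name, Score)    # keep only these two columns
filtered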

Output Table:

Name	Score
Alice	88
David	85

Filters rows where Age > 28 and selects only Name and Score columns.

8.2. Creating and Transforming Columns

Input Table:

Name	Score
Alice	88
Bob	75
Carol	92

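# Recreate the input table above
df <- data.frame(Name = c("Alice", "Bob", "Carol"), Score = c(88, 75, 92))
mutated <- df %>%
  mutate(Passed = Score >= 80)  # TRUE when Score is at least 80
mutated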

Output Table:

Name	Score	Passed
Alice	88	TRUE
Bob	75	FALSE
Carol	92	TRUE

Adds a new column Passed based on Score.

8.3. Grouping and Summarizing Data

Input Table:

Name	Gender	Score
Alice	F	88
Bob	M	75
Carol	F	92
David	M	85

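# Recreate the input table above
df <- data.frame(Name = c("Alice", "Bob", "Carol", "David"),
                 Gender = c("F", "M", "F", "M"), Score = c(88, 75, 92, 85))
# One row per Gender with the mean Score
avg_by_gender <- df %>%
  group_by(Gender) %>%
  summarize(Avg_Score = mean(Score))
avg_by_gender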

Output Table:

Gender	Avg_Score
F	90
M	80

Groups by Gender and calculates average Score.
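
summarize() can compute several statistics per group in one call. The sketch below is one possible extension of the example above, assuming the same data; the scores name and the N and Max_Score columns are introduced here for illustration.

```r
library(dplyr)

# Same values as the input table in this subsection
scores <- data.frame(
  Name   = c("Alice", "Bob", "Carol", "David"),
  Gender = c("F", "M", "F", "M"),
  Score  = c(88, 75, 92, 85),
  stringsAsFactors = FALSE
)

by_gender <- scores %>%
  group_by(Gender) %>%
  summarize(
    N         = n(),          # number of rows in the group
    Avg_Score = mean(Score),  # group mean
    Max_Score = max(Score),   # group maximum
    .groups   = "drop"        # return an ungrouped result
  )
```

Setting .groups = "drop" is optional here, since grouping by a single variable already yields an ungrouped result, but stating it makes the intent explicit.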

8.4. Cleaning Column Names

Input Table:

First Name	Last Name	Test Score
Alice	Smith	88
Bob	Jones	75

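# Recreate the input table above; check.names = FALSE keeps the spaces in the names
df <- data.frame("First Name" = c("Alice", "Bob"), "Last Name" = c("Smith", "Jones"),
                 "Test Score" = c(88, 75), check.names = FALSE)
cleaned <- df %>%
  clean_names()
cleaned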

Output Table:

first_name	last_name	test_score
Alice	Smith	88
Bob	Jones	75

Converts column names to snake_case for consistency.

8.5. Summarizing Data with skimr

Input Table:

Name	Age	Gender	Score
Alice	30	F	88
Bob	28	M	75
Carol	25	F	92
David	35	M	85

library(skimr)
skim(df)

Expected Output (summary excerpt):

variable	type	complete	mean	sd	min	max	n_unique
Name	character	4					4
Age	numeric	4	29.5	4.2	25	35	4
Gender	character	4					2
Score	numeric	4	85	7.3	75	92	4

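skim(df) provides a quick, detailed summary of each column, including its type, the number of missing values, and type-specific statistics such as the mean, standard deviation, minimum, and maximum for numeric columns and the number of unique values for character columns. The table above is a condensed illustration: the actual printout groups columns by type, and its exact column names depend on the skimr version.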