2.3.1. Introduction to Data Wrangling
1. Introduction
Data wrangling is the process of transforming and preparing raw data into a tidy, usable format for analysis. After importing your data and reshaping it as needed, wrangling ensures your dataset is clean, consistent, and ready for further exploration.
2. Why Data Wrangling?
- Real-world data is often messy, incomplete, or inconsistent.
- Wrangling helps:
- Filter and select relevant data.
- Clean and standardize variable names.
- Create new variables or transform existing ones.
- Summarize, group, and join datasets.
- Prepare data for visualization and modeling.
3. Essential R Packages for Data Wrangling
- dplyr: Core tidyverse package for data manipulation.
- tidyr: For reshaping and tidying data.
- janitor: For cleaning data frames and variable names.
- skimr: For quick, comprehensive data summaries.
4. Key dplyr Functions
%>%: Pipe operator for chaining commands.glimpse(): Overview of dataset structure.filter(): Filter rows by condition.select(): Select, rename, or reorder columns.rename(): Rename columns.arrange(): Reorder rows.mutate(): Create or transform columns.group_by(): Group data for summary.summarize(): Summarize grouped data.left_join(): Merge datasets.tally(): Count rows or sum values.count(): Count unique values.add_count(): Add counts as a new column.add_tally(): Add tally as a new column.
5. Key tidyr Functions
unite(): Combine multiple columns into one.separate(): Split one column into multiple columns.
6. Key janitor Functions
clean_names(): Standardize column names.tabyl(): Frequency tables for variables.get_dupes(): Identify duplicate rows.
# Install janitor if needed
# install.packages('janitor')
library(janitor)
7. Key skimr Function
skim(): Quick, detailed summary of a data frame.
# Install skimr if needed
# install.packages('skimr')
library(skimr)
8. Data Wrangling Examples
8.1. Filtering and Selecting Data
Input Table:
| Name | Age | Gender | Score |
|---|---|---|---|
| Alice | 30 | F | 88 |
| Bob | 28 | M | 75 |
| Carol | 25 | F | 92 |
| David | 35 | M | 85 |
library(dplyr)
filtered <- df %>%
filter(Age > 28) %>%
select(Name, Score)
filtered
Output Table:
| Name | Score |
|---|---|
| Alice | 88 |
| David | 85 |
- Filters rows where Age > 28 and selects only Name and Score columns.
8.2. Creating and Transforming Columns
Input Table:
| Name | Score |
|---|---|
| Alice | 88 |
| Bob | 75 |
| Carol | 92 |
mutated <- df %>%
mutate(Passed = Score >= 80)
mutated
Output Table:
| Name | Score | Passed |
|---|---|---|
| Alice | 88 | TRUE |
| Bob | 75 | FALSE |
| Carol | 92 | TRUE |
- Adds a new column Passed based on Score.
8.3. Grouping and Summarizing Data
Input Table:
| Name | Gender | Score |
|---|---|---|
| Alice | F | 88 |
| Bob | M | 75 |
| Carol | F | 92 |
| David | M | 85 |
summary <- df %>%
group_by(Gender) %>%
summarize(Avg_Score = mean(Score))
summary
Output Table:
| Gender | Avg_Score |
|---|---|
| F | 90 |
| M | 80 |
- Groups by Gender and calculates average Score.
8.4. Cleaning Column Names
Input Table:
| First Name | Last Name | Test Score |
|---|---|---|
| Alice | Smith | 88 |
| Bob | Jones | 75 |
cleaned <- df %>%
clean_names()
cleaned
Output Table:
| first_name | last_name | test_score |
|---|---|---|
| Alice | Smith | 88 |
| Bob | Jones | 75 |
- Converts column names to snake_case for consistency.
8.5. Summarizing Data with skimr
Input Table:
| Name | Age | Gender | Score |
|---|---|---|---|
| Alice | 30 | F | 88 |
| Bob | 28 | M | 75 |
| Carol | 25 | F | 92 |
| David | 35 | M | 85 |
library(skimr)
skim(df)
Expected Output (summary excerpt):
| variable | type | missing | complete | mean | sd | min | max | n_unique |
|---|---|---|---|---|---|---|---|---|
| Name | character | 0 | 4 | 4 | ||||
| Age | numeric | 0 | 4 | 29.5 | 4.2 | 25 | 35 | 4 |
| Gender | character | 0 | 4 | 2 | ||||
| Score | numeric | 0 | 4 | 85 | 7.1 | 75 | 92 | 4 |
skim(df)provides a quick, detailed summary of each column, including type, missing values, mean, standard deviation, min, max, and number of unique values.