2.2.4. Working with Databases and Relational Data

1. Introduction to Databases and Relational Data

Most real-world data is stored across multiple related tables, not just in single files like CSV or Excel.
Relational data: Data organized in multiple tables, linked by unique identifiers (keys).
Relational databases (e.g., SQLite, MySQL, PostgreSQL) efficiently store, manage, and query such data.
R can connect to and work with relational databases using packages like RSQLite, DBI, dplyr, and dbplyr.

2. Why Use Relational Data?

Efficient Data Storage: Avoids duplication by storing related info in separate tables.
Avoids Ambiguity: Unique identifiers (keys) clarify relationships, even when names are duplicated.
Privacy: Sensitive data can be stored in separate tables with restricted access.
Scalability: Databases handle large datasets and complex queries efficiently.

3. Example: Relational Data Structure

Suppose you have four tables:

restaurants: id, name, address, type
inspections: inspection_id, restaurant_id, date, inspector, rating
ratings: rating_id, restaurant_id, stars
owners: owner_id, name, contact

Relational mapping:

restaurants.id is the PK for restaurants.
inspections.restaurant_id and ratings.restaurant_id are FK referencing restaurants.id.
restaurants.owner_id is a FK referencing owners.owner_id.

Example Data: restaurants

id	name	address	type	owner_id
JJ29	Taco Stand	123 Main St	Mexican	OW1
AB12	Pizza Hub	456 Oak Ave	Italian	OW2
XY88	Curry House	789 Pine Rd	Indian	OW1

inspections

inspection_id	restaurant_id	date	inspector	rating
INSP1	JJ29	2023-01-10	Smith	A
INSP2	AB12	2023-02-15	Lee	B
INSP3	XY88	2023-03-05	Patel	A

ratings

rating_id	restaurant_id	stars
R1	JJ29	4.5
R2	AB12	4.0
R3	XY88	4.8

owners

owner_id	name	contact
OW1	Alice Brown	555-1234
OW2	Bob Smith	555-5678

1.2.8.1.-relational_database.png

PK: Primary Key (unique identifier in the table)
FK: Foreign Key (references the primary key in another table)

This mapping shows:

restaurants is central, linked to inspections and ratings via id.
restaurants.owner_id links to owners.owner_id.
inspections and ratings both reference restaurants via restaurant_id.

This structure allows you to join and relate information across all four tables using their keys.

4. Introduction to SQL and SQLite

SQL (Structured Query Language): Standard language for querying relational databases.
SQLite: Lightweight, file-based relational database, easy to use with R.

5. Connecting to a Database with RSQLite

Use the RSQLite package to connect to SQLite databases.
Install the package if not already installed:

# install.packages("RSQLite")

library(RSQLite)

# Specify driver and connect to database

sqlite <- dbDriver("SQLite")
db <- dbConnect(sqlite, "company.db")

# List tables in the database

dbListTables(db)

dbDisconnect(db)

Here's how to create the given relational database schema and populate it using R with the RSQLite package.

1. Create the Tables in the Database

library(RSQLite)

# Connect to the database (creates company.db if not exists)
db <- dbConnect(SQLite(), "company.db")

# Owners table
dbExecute(db, "
  CREATE TABLE owners (
    owner_id TEXT PRIMARY KEY,
    name TEXT,
    contact TEXT
  )
")

# Restaurants table
dbExecute(db, "
  CREATE TABLE restaurants (
    id TEXT PRIMARY KEY,
    name TEXT,
    address TEXT,
    type TEXT,
    owner_id TEXT,
    FOREIGN KEY(owner_id) REFERENCES owners(owner_id)
  )
")

# Inspections table
dbExecute(db, "
  CREATE TABLE inspections (
    inspection_id TEXT PRIMARY KEY,
    restaurant_id TEXT,
    date TEXT,
    inspector TEXT,
    rating TEXT,
    FOREIGN KEY(restaurant_id) REFERENCES restaurants(id)
  )
")

# Ratings table
dbExecute(db, "
  CREATE TABLE ratings (
    rating_id TEXT PRIMARY KEY,
    restaurant_id TEXT,
    stars REAL,
    FOREIGN KEY(restaurant_id) REFERENCES restaurants(id)
  )
")

2. Insert the Data Rows

# Insert owners
dbExecute(db, "INSERT INTO owners VALUES ('OW1', 'Alice Brown', '555-1234')")
dbExecute(db, "INSERT INTO owners VALUES ('OW2', 'Bob Smith', '555-5678')")

# Insert restaurants
dbExecute(db, "INSERT INTO restaurants VALUES ('JJ29', 'Taco Stand',  '123 Main St',  'Mexican', 'OW1')")
dbExecute(db, "INSERT INTO restaurants VALUES ('AB12', 'Pizza Hub',   '456 Oak Ave',  'Italian', 'OW2')")
dbExecute(db, "INSERT INTO restaurants VALUES ('XY88', 'Curry House', '789 Pine Rd',  'Indian',  'OW1')")

# Insert inspections
dbExecute(db, "INSERT INTO inspections VALUES ('INSP1', 'JJ29', '2023-01-10', 'Smith', 'A')")
dbExecute(db, "INSERT INTO inspections VALUES ('INSP2', 'AB12', '2023-02-15', 'Lee', 'B')")
dbExecute(db, "INSERT INTO inspections VALUES ('INSP3', 'XY88', '2023-03-05', 'Patel', 'A')")

# Insert ratings
dbExecute(db, "INSERT INTO ratings VALUES ('R1', 'JJ29', 4.5)")
dbExecute(db, "INSERT INTO ratings VALUES ('R2', 'AB12', 4.0)")
dbExecute(db, "INSERT INTO ratings VALUES ('R3', 'XY88', 4.8)")

3. Verify: List Tables

# List all tables
dbListTables(db)

You should see:

dbListTables(db)
[1] "inspections" "owners"      "ratings"     "restaurants"

4. Disconnect

dbDisconnect(db)

This code will create the four related tables and insert all the example data as you specified, using foreign key constraints to reflect the relationships between entities. If you want to run queries or further interact with the data, let me know!

6. Accessing Tables with dbplyr and dplyr

Use dbplyr and dplyr to work with database tables as if they were data frames.

library(RSQLite)
library(dbplyr)
library(dplyr)

db <- dbConnect(SQLite(), "company.db")

inspections <- tbl(db, "inspections")
owners <- tbl(db, "owners")
ratings <- tbl(db, "ratings")
restaurants <- tbl(db, "restaurants")

inspections

dbDisconnect(db)

Output:

# Source:   table<`inspections`> [?? x 5]
# Database: sqlite 3.50.3 [C:\Users\gnana\OneDrive\Documents\R Programming\a2zlearners.com\SAStoR\2.2.4.-Working-with-Databases-and-Relational-Data\company.db]
  inspection_id restaurant_id date       inspector rating
  <chr>         <chr>         <chr>      <chr>     <chr> 
1 INSP1         JJ29          2023-01-10 Smith     A     
2 INSP2         AB12          2023-02-15 Lee       B     
3 INSP3         XY88          2023-03-05 Patel     A

7. Mutating Joins: Combining Tables

Mutating joins combine variables from two tables using a key (e.g., restaurant_id).
Types of mutating joins:
- inner_join(): Only rows with keys in both tables.
- left_join(): All rows from the first table, matched rows from the second.
- right_join(): All rows from the second table, matched rows from the first.
- full_join(): All rows from both tables.

Input Example:

restaurants table: id, name, address, type, owner_id
inspections table: inspection_id, restaurant_id, date, inspector, rating

R Example:

# Inner join
inner <- inner_join(restaurants, inspections, by = c("id" = "restaurant_id"))
as_tibble(inner)

Expected Output:

id	name	address	type	owner_id	inspection_id	date	inspector	rating
JJ29	Taco Stand	123 Main St	Mexican	OW1	INSP1	2023-01-10	Smith	A
AB12	Pizza Hub	456 Oak Ave	Italian	OW2	INSP2	2023-02-15	Lee	B
XY88	Curry House	789 Pine Rd	Indian	OW1	INSP3	2023-03-05	Patel	A

Other joins:

# Left join
left <- left_join(restaurants, inspections, by = c("id" = "restaurant_id"))
as_tibble(left)

# Right join
right <- right_join(as_tibble(restaurants), as_tibble(inspections), by = c("id" = "restaurant_id"))
as_tibble(right)

# Full join
full <- full_join(as_tibble(restaurants), as_tibble(inspections), by = c("id" = "restaurant_id"))
as_tibble(full)

8. Filtering Joins: Filtering Observations

Filtering joins affect which rows are kept, not which columns.
- semi_join(x, y): Keeps rows in x with a match in y.
- anti_join(x, y): Keeps rows in x with no match in y.

R Example:

# Semi join: restaurants with inspections
semi <- semi_join(restaurants, inspections, by = c("id" = "restaurant_id"))
as_tibble(semi)

# Anti join: restaurants without inspections
anti <- anti_join(restaurants, inspections, by = c("id" = "restaurant_id"))
as_tibble(anti)

Expected Output for semi_join:

id	name	address	type	owner_id
JJ29	Taco Stand	123 Main St	Mexican	OW1
AB12	Pizza Hub	456 Oak Ave	Italian	OW2
XY88	Curry House	789 Pine Rd	Indian	OW1

Expected Output for anti_join:

If all restaurants have inspections, the output will be an empty tibble.

9. Connecting to Remote Databases

For large or remote databases, connect using credentials (host, user, password).
Use DBI and a database-specific package (e.g., RMySQL, RPostgres).

R Example:

# Example only; replace with your credentials
con <- DBI::dbConnect(RMySQL::MySQL(),
  host = "database.host.com",
  user = "yourusername",
  password = rstudioapi::askForPassword("database_password")
)

10. Input and Output Table Summary

Operation	R Function / Package	Input Example	Output Example
Connect to SQLite	`dbConnect()`	.db file	DB connection
List tables	`dbListTables()`	DB connection	Table names
Access table	`tbl()`	DB connection, name	Table (tbl)
Mutating join	`*_join()`	Two tables	Combined table
Filtering join	`semi_join()`, `anti_join()`	Two tables	Filtered table

11. Beyond the Basics: Exploring Databases in R

Query large datasets efficiently: Only pull the data you need, not the whole table.
Write SQL directly in R: Use dbGetQuery() for custom SQL queries.
Create, update, or delete tables: Use dbWriteTable(), dbRemoveTable(), etc.
Work with other databases: Use RPostgres, RMariaDB, RODBC, etc. for PostgreSQL, MariaDB, SQL Server, etc.
Database transactions: Use dbBegin(), dbCommit(), dbRollback() for advanced workflows.
Combine with dplyr verbs: Filter, mutate, group, and summarize directly on database tables.
Export query results: Use collect() to bring data into R as a tibble/data frame for further analysis or export.

R Example: Custom SQL Query

# Get all inspections
dbGetQuery(db, "SELECT * FROM inspections")

12. Summary and Best Practices

Use relational databases for efficient, scalable, and organized data storage.
Use unique identifiers to link tables and avoid ambiguity.
Use RSQLite, dplyr, and dbplyr for seamless database access and manipulation in R.
Use mutating joins to combine tables and filtering joins to filter observations.
For large or remote databases, connect directly and only pull the data you need.
Always close your database connection with dbDisconnect() when done.

Resource download links

2.2.4.-Working-with-Databases-and-Relational-Data.zip

⁂