Beginner's guide to models in R
In this exercise you will use a large dataset that describes houses that were sold in Ames, Iowa. You will use these Ames house data from the package “AmesHousing” and parallel coordinate plots to understand how houses vary by neighborhood. Then you will fit models for each neighborhood based on the size of the house to assess whether a linear model fits all neighbohoods in a similar manner. A scatter plot shows whether this is the case.
Objectives: - Create plots for exploratory data analysis of high-dimensional data - Use abstract, aggregated model-based variables (i.e., overall fit., intercept and slope of a linear model) to compare groups of data - Appreciate how visualizations might help you identify unfair machine learning models
Submit: Complete each section chunck of code below to process the data and create the graphs. I have given you some bits of code to do some transformations that we have not discussed in class. Briefly answer the questions posed in describing what each chunk does and the meaning of the graphs.
Load packages
library(AmesHousing)
library(janitor) # Useful package for converting variable names, such as "Lot shape" to "lot_shape"
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(scales)
library(broom)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggrepel)
Load and clean the data
What the clean_names function does to the names?
It convert the variable names from camelCase to snake_case i.e. removes all spaces, convert to lower case, and use underscore to separate words. example “MS SubClass” to “ms_sub_class”
Why is “where” useful?
It is useful to select variables based on their type. For example, where(is.numeric) will select all numeric variables.
# names(ames_raw)
house.df = ames_raw %>% janitor::clean_names()
# names(house.df)
house.df = house.df %>%
select(pid, sale_price, neighborhood, # Selects specific variables
where(is.numeric), # Selects numeric variables
-order, -(misc_val:sale_condition), -lot_area) %>% # Removes specific variables
select(which(colMeans(is.na(.)) < 0.05)) %>% # Removes columns with more than 5% missing values
filter(sale_price > 15000) %>% # Keeps houses that sold for more than $15,000
group_by(neighborhood) %>% # Groups by neighborhood
filter(n() > 50) %>% # Removes neighborhoods that have had fewer than 50 home sales
ungroup() # Ungroups the data
Create a parallel coordinate chart
- Try mapping color to neighborhood and faceting by neighborhood. Which is most effective and why?
Both the plots with color mapping and faceting are shown below. The faceted plot makes it easier to interpret as it avoids the crowding due to large number of neighboorhoods ploted at once.
- What does the parallel coordinate plot reveal about the neighborhoods?
The plot reveals what variable change more for all the neighborhoods. The plots highlights the variable that are representative of majority of the variance, and possibly could be used as predictors.
## Convert to long format with all the numeric variable in a columns (i.e, pivot longer)
# Hint: specify the "cols" using the selection function from above: where(is.numeric)
long.house.df = house.df %>%
pivot_longer(cols = where(is.numeric), names_to = "variable", values_to = "value")
## Scale values
# Hint: Be sure to group by variable before scaling and to ungroup after
long.house.df = long.house.df %>%
group_by(variable) %>%
mutate(value_sc = scale(value)) %>%
ungroup()
## Create a variable to define what neighborhoods to plot
neighborhood_to_plot = c("StoneBr", "NridgHt", "NoRidge", "Somerst", "Timber")
selected_house_df = long.house.df %>%
mutate(highlight = ifelse(neighborhood %in% neighborhood_to_plot, "yes", "no"))
## Create a parallel coordinate plot of the scaled values for selected neighborhoods
# Hint: Use the following to specify the subset of data to plot
# ggplot(data = long.house.df %>% filter(highlight == "yes"),
# Hint: Consider the following to highlight the zero crossing and sales price
# geom_vline(xintercept = "sale_price", colour = "grey98", size = 3) +
# geom_hline(yintercept = 0, colour = "grey99", size = 4) +
mapping color to neighborhood
ggplot(data = selected_house_df %>% filter(highlight == "yes"), aes(x = variable, y = value_sc, group = pid, color = neighborhood)) +
geom_line(alpha = 0.2) +
geom_vline(xintercept = "sale_price", colour = "red", size = 0.3) +
geom_hline(yintercept = 0, colour = "red", size = 0.3) +
theme_minimal() +
labs(x = NULL, y = NULL, title = "Parallel coordinate plot of house features by neighborhood") +
coord_flip()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

faceting by neighborhood
ggplot(data = selected_house_df %>% filter(highlight == "yes"), aes(x = variable, y = value_sc, group = pid)) +
geom_line(alpha = 0.2) +
geom_vline(xintercept = "sale_price", colour = "red", size = 0.3) +
geom_hline(yintercept = 0, colour = "red", size = 0.3) +
theme_minimal() +
labs(x = NULL, y = NULL, title = "Parallel coordinate plot of house features by neighborhood") +
facet_wrap(~neighborhood, scales = "free_y") +
coord_flip()

Fit models to all the neighborhoods and plot the estimated intercept and slope
- Fit models to predict sale_price as a function of size: sale_price~x1st_flr_sf
- Use glance and tidy to extract both overall model fit data and model parameters
- Hint: After using nest-map-unnest use pivot wider to move the intercept and slope into separate columns
- Hint: After unnesting, use clean names to turn “(Intercept)” into an acceptable R name
- Hint: Use the scale to show axis labels as dollars
-
Map the size of the point the r.squared value. R-square value indicates how well the model can predict the data
- This plot uses a linear model to abstract and aggregate data for each neighborhood. Why this might be useful and why it might be worse than useless?
Linear model is useful to understand relation between the DV and IV and can help predict the DV. Each neighbohood has a different charachterstics, meaning a individual model might be better fit. However, this can be worse than useless if the underlying distribution of the data is not linear.
- If this regression model was guiding admission decisions for university students based on GPA rather than predicting house prices based on their size, how might the r-square value indicate potential unfairness if the points represent different socio-economic groups rather the neigborhoods.
Usign GPA for regression model to guide admission decision, can be very unfair as it will only capture the linear relation and any student deviating from the linear fit would be at disadvatage. Further, it using just one variable (GPA) can be noisy.
## Fit models for each neighborhood
models = house.df %>% ungroup() %>%
group_by(neighborhood) %>%
nest() %>%
mutate(
fit = map(data, ~lm(sale_price ~ x1st_flr_sf, data = .x)),
glanced = map(fit, broom::glance),
tidied = map(fit, broom::tidy),
augmented = map(fit, broom::augment)
)
models
## Unnest the model parameters
tidy_model = models %>%
unnest(tidied)
## Unnest the model fit
glance_model = models %>%
unnest(glanced)
## Use left_join to combine the parameter and fit dataframes
model_df = left_join(tidy_model, glance_model, by = "neighborhood") %>%
select(neighborhood, r.squared, estimate, std.error, p.value.x)
## Define highlighted neighborhoods
# Hint: Adapt the code from the previous section
selected_model_df = model_df %>%
filter(neighborhood %in% neighborhood_to_plot)
## Plot the slope and intercept in a scatter plot with the size of the point mapped to the r-square
ggplot(data = selected_model_df, aes(x = estimate, y = std.error, size = r.squared)) +
geom_point(alpha = 0.5) +
theme_minimal() +
labs(x = "Intercept", y = "Slope", title = "Slope and intercept of linear model for each neighborhood")

Enjoy Reading This Article?
Here are some more articles you might like to read next: