In this exercise you will use a large dataset that describes houses that were sold in Ames, Iowa. You will use these Ames house data from the package “AmesHousing” and parallel coordinate plots to understand how houses vary by neighborhood. Then you will fit models for each neighborhood based on the size of the house to assess whether a linear model fits all neighbohoods in a similar manner. A scatter plot shows whether this is the case.

Objectives: - Create plots for exploratory data analysis of high-dimensional data - Use abstract, aggregated model-based variables (i.e., overall fit., intercept and slope of a linear model) to compare groups of data - Appreciate how visualizations might help you identify unfair machine learning models

Submit: Complete each section chunck of code below to process the data and create the graphs. I have given you some bits of code to do some transformations that we have not discussed in class. Briefly answer the questions posed in describing what each chunk does and the meaning of the graphs.

Load packages

library(AmesHousing)
library(janitor) # Useful package for converting variable names, such as "Lot shape" to "lot_shape"

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(scales)
library(broom)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard()    masks scales::discard()
## ✖ dplyr::filter()     masks stats::filter()
## ✖ dplyr::lag()        masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggrepel)

Load and clean the data

What the clean_names function does to the names?

It convert the variable names from camelCase to snake_case i.e. removes all spaces, convert to lower case, and use underscore to separate words. example “MS SubClass” to “ms_sub_class”

Why is “where” useful?

It is useful to select variables based on their type. For example, where(is.numeric) will select all numeric variables.

# names(ames_raw)
house.df = ames_raw %>% janitor::clean_names()
# names(house.df)

house.df = house.df %>% 
  select(pid, sale_price, neighborhood, # Selects specific variables
         where(is.numeric), # Selects numeric variables
         -order, -(misc_val:sale_condition), -lot_area) %>% # Removes specific variables
  select(which(colMeans(is.na(.)) < 0.05)) %>% # Removes columns with more than 5% missing values
  filter(sale_price > 15000) %>% # Keeps houses that sold for more than $15,000
  group_by(neighborhood) %>% # Groups by neighborhood
  filter(n() > 50) %>% # Removes neighborhoods that have had fewer than 50 home sales
  ungroup() # Ungroups the data

Create a parallel coordinate chart

Try mapping color to neighborhood and faceting by neighborhood. Which is most effective and why?

Both the plots with color mapping and faceting are shown below. The faceted plot makes it easier to interpret as it avoids the crowding due to large number of neighboorhoods ploted at once.

What does the parallel coordinate plot reveal about the neighborhoods?

The plot reveals what variable change more for all the neighborhoods. The plots highlights the variable that are representative of majority of the variance, and possibly could be used as predictors.

## Convert to long format with all the numeric variable in a columns (i.e, pivot longer)
# Hint: specify the "cols" using the selection function from above: where(is.numeric)
long.house.df = house.df %>% 
  pivot_longer(cols = where(is.numeric), names_to = "variable", values_to = "value")

## Scale values
# Hint: Be sure to group by variable before scaling and to ungroup after
long.house.df = long.house.df %>% 
  group_by(variable) %>% 
  mutate(value_sc = scale(value)) %>% 
  ungroup()

## Create a variable to define what neighborhoods to plot
neighborhood_to_plot = c("StoneBr", "NridgHt", "NoRidge", "Somerst", "Timber")

selected_house_df = long.house.df %>% 
  mutate(highlight = ifelse(neighborhood %in% neighborhood_to_plot, "yes", "no"))

## Create a parallel coordinate plot of the scaled values for selected neighborhoods 
# Hint: Use the following to specify the subset of data to plot
#  ggplot(data = long.house.df %>%  filter(highlight == "yes"),
# Hint: Consider the following to highlight the zero crossing and sales price
#  geom_vline(xintercept = "sale_price", colour = "grey98", size = 3) +
#  geom_hline(yintercept = 0, colour = "grey99", size = 4) +

mapping color to neighborhood

ggplot(data = selected_house_df %>%  filter(highlight == "yes"), aes(x = variable, y = value_sc, group = pid, color = neighborhood)) +
  geom_line(alpha = 0.2) +
  geom_vline(xintercept = "sale_price", colour = "red", size = 0.3) +
  geom_hline(yintercept = 0, colour = "red", size = 0.3) +
  theme_minimal() +
  labs(x = NULL, y = NULL, title = "Parallel coordinate plot of house features by neighborhood") +
  coord_flip() 

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

faceting by neighborhood

ggplot(data = selected_house_df %>%  filter(highlight == "yes"), aes(x = variable, y = value_sc, group = pid)) +
  geom_line(alpha = 0.2) +
  geom_vline(xintercept = "sale_price", colour = "red", size = 0.3) +
  geom_hline(yintercept = 0, colour = "red", size = 0.3) +
  theme_minimal() +
  labs(x = NULL, y = NULL, title = "Parallel coordinate plot of house features by neighborhood") +
  facet_wrap(~neighborhood, scales = "free_y") + 
  coord_flip()

Fit models to all the neighborhoods and plot the estimated intercept and slope

Fit models to predict sale_price as a function of size: sale_price~x1st_flr_sf
Use glance and tidy to extract both overall model fit data and model parameters
Hint: After using nest-map-unnest use pivot wider to move the intercept and slope into separate columns
Hint: After unnesting, use clean names to turn “(Intercept)” into an acceptable R name
Hint: Use the scale to show axis labels as dollars
Map the size of the point the r.squared value. R-square value indicates how well the model can predict the data
This plot uses a linear model to abstract and aggregate data for each neighborhood. Why this might be useful and why it might be worse than useless?

Linear model is useful to understand relation between the DV and IV and can help predict the DV. Each neighbohood has a different charachterstics, meaning a individual model might be better fit. However, this can be worse than useless if the underlying distribution of the data is not linear.

If this regression model was guiding admission decisions for university students based on GPA rather than predicting house prices based on their size, how might the r-square value indicate potential unfairness if the points represent different socio-economic groups rather the neigborhoods.

Usign GPA for regression model to guide admission decision, can be very unfair as it will only capture the linear relation and any student deviating from the linear fit would be at disadvatage. Further, it using just one variable (GPA) can be noisy.

## Fit models for each neighborhood
models = house.df %>% ungroup() %>%
        group_by(neighborhood) %>%
        nest() %>%
        mutate(
          fit = map(data, ~lm(sale_price ~ x1st_flr_sf, data = .x)),
          glanced = map(fit, broom::glance),
          tidied = map(fit, broom::tidy),
          augmented = map(fit, broom::augment)
        ) 
models

## Unnest the model parameters 
tidy_model = models %>% 
  unnest(tidied)


## Unnest the model fit
glance_model = models %>% 
  unnest(glanced)


## Use left_join to combine the parameter and fit dataframes
model_df = left_join(tidy_model, glance_model, by = "neighborhood") %>%
  select(neighborhood, r.squared, estimate, std.error, p.value.x)


## Define highlighted neighborhoods 
# Hint: Adapt the code from the previous section
selected_model_df = model_df %>% 
  filter(neighborhood %in% neighborhood_to_plot)


## Plot the slope and intercept in a scatter plot with the size of the point mapped to the r-square
ggplot(data = selected_model_df, aes(x = estimate, y = std.error, size = r.squared)) +
  geom_point(alpha = 0.5) +
  theme_minimal()  +
  labs(x = "Intercept", y = "Slope", title = "Slope and intercept of linear model for each neighborhood") 

Beginner's guide to models in R