House prices - part one

Image credit: Palo Cech

House prices - part one

In the following few posts I will be doing simple exploratory data analysis of the Ames Housing dataset used in the Kaggle competition House Prices: Advanced Regression Techniques. In later posts, I will go on to run statistical models on the data. This is my first blog post here so it’s also an exercise in creating blog posts!

The problem is to try to correctly predict the house prices in the test dataset by training a regression model on the training data.

I will soon do an exploratory data analysis of the data, however to begin with, the data should be cleaned. In particular, the empty values should be filled in.

First, let’s load the data.

library(dplyr)
library(ggplot2)
library(reshape2)

train <- read.csv('train.csv', stringsAsFactors = F)
test <- read.csv('test.csv', stringsAsFactors = F)

Columns

Here are the columns available in the dataset:

names(train)
##  [1] "Id"            "MSSubClass"    "MSZoning"      "LotFrontage"  
##  [5] "LotArea"       "Street"        "Alley"         "LotShape"     
##  [9] "LandContour"   "Utilities"     "LotConfig"     "LandSlope"    
## [13] "Neighborhood"  "Condition1"    "Condition2"    "BldgType"     
## [17] "HouseStyle"    "OverallQual"   "OverallCond"   "YearBuilt"    
## [21] "YearRemodAdd"  "RoofStyle"     "RoofMatl"      "Exterior1st"  
## [25] "Exterior2nd"   "MasVnrType"    "MasVnrArea"    "ExterQual"    
## [29] "ExterCond"     "Foundation"    "BsmtQual"      "BsmtCond"     
## [33] "BsmtExposure"  "BsmtFinType1"  "BsmtFinSF1"    "BsmtFinType2" 
## [37] "BsmtFinSF2"    "BsmtUnfSF"     "TotalBsmtSF"   "Heating"      
## [41] "HeatingQC"     "CentralAir"    "Electrical"    "X1stFlrSF"    
## [45] "X2ndFlrSF"     "LowQualFinSF"  "GrLivArea"     "BsmtFullBath" 
## [49] "BsmtHalfBath"  "FullBath"      "HalfBath"      "BedroomAbvGr" 
## [53] "KitchenAbvGr"  "KitchenQual"   "TotRmsAbvGrd"  "Functional"   
## [57] "Fireplaces"    "FireplaceQu"   "GarageType"    "GarageYrBlt"  
## [61] "GarageFinish"  "GarageCars"    "GarageArea"    "GarageQual"   
## [65] "GarageCond"    "PavedDrive"    "WoodDeckSF"    "OpenPorchSF"  
## [69] "EnclosedPorch" "X3SsnPorch"    "ScreenPorch"   "PoolArea"     
## [73] "PoolQC"        "Fence"         "MiscFeature"   "MiscVal"      
## [77] "MoSold"        "YrSold"        "SaleType"      "SaleCondition"
## [81] "SalePrice"

The last column “SalePrice” is the one we will be measuring so we won’t include this in the full set. We bin the ID column straight away as it’s not needed.

full <-
  train %>% select(-Id, -SalePrice) %>%
  bind_rows(test %>% select(-Id))

Data cleaning

We look for missing values and try to fill these in

full %>% 
  select(which(sapply(full, function(x)any(is.na(x))))) %>%
  sapply(function(x)sum(ifelse(is.na(x), 1, 0))) %>%
  sort(decreasing=T)
##       PoolQC  MiscFeature        Alley        Fence  FireplaceQu 
##         2909         2814         2721         2348         1420 
##  LotFrontage  GarageYrBlt GarageFinish   GarageQual   GarageCond 
##          486          159          159          159          159 
##   GarageType     BsmtCond BsmtExposure     BsmtQual BsmtFinType2 
##          157           82           82           81           80 
## BsmtFinType1   MasVnrType   MasVnrArea     MSZoning    Utilities 
##           79           24           23            4            2 
## BsmtFullBath BsmtHalfBath   Functional  Exterior1st  Exterior2nd 
##            2            2            2            1            1 
##   BsmtFinSF1   BsmtFinSF2    BsmtUnfSF  TotalBsmtSF   Electrical 
##            1            1            1            1            1 
##  KitchenQual   GarageCars   GarageArea     SaleType 
##            1            1            1            1

PoolQC missing values

full %>%
  select(PoolQC) %>%
  group_by(PoolQC) %>%
  summarise(Count = n())
## # A tibble: 4 x 2
##   PoolQC Count
##    <chr> <int>
## 1     Ex     4
## 2     Fa     2
## 3     Gd     4
## 4   <NA>  2909

What to fill in the NA’s with? We will go for a default missing value. The data dictionary says that ‘NA’ means “no pool” so we’ll go with that.

full <-
  full %>%
  mutate(PoolQC = ifelse(is.na(PoolQC), 'None', PoolQC))

What is the situation with PoolArea?

full %>% filter(PoolQC == 'None', PoolArea > 0) %>% select(PoolArea)
##   PoolArea
## 1      368
## 2      444
## 3      561

There are 3 instances of Pools with non-zero area but “don’t exist”. To fix this let’s look at the relationship between PoolQC and PoolArea

full %>%
  filter(PoolQC != 'None') %>% 
  ggplot(aes(x=factor(PoolQC), y=PoolArea)) + 
  geom_boxplot(fill='cornflowerblue')

MiscFeature missing values

Next on the list is MiscFeature. The data dictionary describes this as “Miscellaneous feature not covered in other categories”. Lets do a count of each value:

full %>%
  group_by(MiscFeature) %>%
  summarise(Count = n())
## # A tibble: 5 x 2
##   MiscFeature Count
##         <chr> <int>
## 1        Gar2     5
## 2        Othr     4
## 3        Shed    95
## 4        TenC     1
## 5        <NA>  2814

I think we’re good to go ahead and fill in the missing values of MiscFeature with ‘None’.

full <-
  full %>%
  mutate(MiscFeature = ifelse(is.na(MiscFeature), 'None', MiscFeature))

MiscFeature Second Garages

Value of Gar2 means there is a second Garage. Let’s check whether this is accurately represented in the data

full  %>%
  filter(MiscFeature == 'Gar2')%>% 
  select(GarageType, GarageYrBlt, GarageFinish, GarageCars, GarageArea, GarageQual, GarageCond)
##   GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1     Attchd        1960          RFn          1        301         TA
## 2    BuiltIn        1977          RFn          2        600         TA
## 3     Attchd        1958          Unf          1        312         TA
## 4     Attchd        1953          RFn          1        286         TA
## 5     Detchd        2002          Unf          2        624         TA
##   GarageCond
## 1         TA
## 2         TA
## 3         TA
## 4         TA
## 5         TA

Good, so there’s no existence of 2nd garages without a first. However I wonder if there are garages with more Cars in them without room for them.

full %>% ggplot(aes(x = GarageArea, y=GarageCars)) + geom_point(colour = 'cornflowerblue')
## Warning: Removed 1 rows containing missing values (geom_point).

This looks to be behaving well. There dont seem to be any outliers towards the left and top of the graph.

FireplaceQu missing values

We use the Fireplaces column to fix the FireplaceQu missing values

full %>%
  filter(is.na(FireplaceQu), Fireplaces > 0) %>%
  select(Fireplaces, FireplaceQu)
## [1] Fireplaces  FireplaceQu
## <0 rows> (or 0-length row.names)

Hence, we assign all blank FireplaceQu to ‘None’.

full <-
  full %>%
  mutate(FireplaceQu = ifelse(is.na(FireplaceQu), 'None', FireplaceQu))

LotFrontage missing values

Does lack of a value mean 0 feet as lot frontage?

full %>%
  filter(is.na(LotFrontage) | LotFrontage == 0) %>%
  summarise(Count = n())
##   Count
## 1   486

Since there are only NA’s and no 0’s then let’s assume so

full <-
  full %>%
  mutate(LotFrontage = ifelse(is.na(LotFrontage), 0, LotFrontage))

Garages

full %>%
  filter(is.na(GarageType) |
           is.na(GarageYrBlt) |
           is.na(GarageFinish) |
           is.na(GarageCars) |
           is.na(GarageArea) |
           is.na(GarageQual) |
           is.na(GarageCond)) %>%
  group_by(GarageType, GarageYrBlt, GarageFinish, GarageCars,
           GarageArea, GarageQual, GarageCond) %>%
  summarise(Count = n()) %>%
  knitr::kable()
GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond Count
Detchd NA NA 1 360 NA NA 1
Detchd NA NA NA NA NA NA 1
NA NA NA 0 0 NA NA 157

So we have 2 cases of Detached garages, 1 with entries for GarageCars and GarageArea, and another for no other Garage entries. We also have 157 cases where there are no entries for anything, so this looks like no garage.

We’ll fill in the “no garages” first. For the year we will use the YearBuilt

full <-
  full %>%
  transform(GarageType = ifelse(is.na(GarageType), 'None', GarageType),
            GarageYrBlt = ifelse(is.na(GarageYrBlt), YearBuilt, GarageYrBlt),
            GarageFinish = ifelse(is.na(GarageFinish), 'None', GarageFinish),
            GarageCars = ifelse(is.na(GarageCars), 'None', 0),
            GarageArea = ifelse(is.na(GarageArea), 'None', 0),
            GarageQual = ifelse(is.na(GarageQual), 'None', GarageQual),
            GarageCond = ifelse(is.na(GarageCond), 'None', GarageCond))

Basements cleaning

Next up, the basement. Let’s see where the null values occur

full %>%
  filter(is.na(BsmtQual) |
           is.na(BsmtCond) |
           is.na(BsmtExposure) |
           is.na(BsmtFinType1) |
           is.na(BsmtFinSF1) |
           is.na(BsmtFinType2) |
           is.na(BsmtFinSF2) | 
           is.na(BsmtUnfSF) |
           is.na(BsmtFullBath) |
           is.na(BsmtHalfBath) |
           is.na(TotalBsmtSF)) %>%
  group_by(BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinSF1,
           BsmtFinType2, BsmtFinSF2, BsmtUnfSF, BsmtFullBath, BsmtHalfBath,
           TotalBsmtSF) %>%
  summarise(Count = n()) %>%
  knitr::kable()
BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF BsmtFullBath BsmtHalfBath TotalBsmtSF Count
Gd TA No GLQ 1124 NA 479 1603 1 0 3206 1
Gd TA NA Unf 0 Unf 0 725 0 0 725 1
Gd TA NA Unf 0 Unf 0 936 0 0 936 1
Gd TA NA Unf 0 Unf 0 1595 0 0 1595 1
Gd NA Mn GLQ 1044 Rec 382 0 1 0 1426 1
TA NA Av ALQ 755 Unf 0 240 0 0 995 1
TA NA No BLQ 1033 Unf 0 94 0 1 1127 1
NA Fa No Unf 0 Unf 0 173 0 0 173 1
NA TA No Unf 0 Unf 0 356 0 0 356 1
NA NA NA NA 0 NA 0 0 0 0 0 77
NA NA NA NA 0 NA 0 0 NA NA 0 1
NA NA NA NA NA NA NA NA NA NA NA 1

From the above we see:

  • 2 cases of No Basement (BsmtQual) where BsmtCond is not No Basement - takes values of Fair/Typical
  • These cases occur when the BsmtFinType1 is Unfinished. Square footage is positive here.
  • 5 cases BsmtFinType2 = BsmtFinType1 which shouldn’t happen unless both are NA. Does this happen elsewhere?
  • Check that BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF = TotalBsmtSF

First, let’s check for cases of BsmtFinType2 = BsmtFinType1 and check that total SF is correct

full %>%
  filter(BsmtFinType2 == BsmtFinType1) %>%
  group_by(BsmtFinType1, BsmtFinType2) %>%
  summarise(BsmtFinSF1 = sum(BsmtFinSF1),
            BsmtFinSF2 = sum(BsmtFinSF2),
            BsmtUnfSF = sum(BsmtUnfSF),
            TotalBsmtSF = sum(TotalBsmtSF),
            ChkTotal = sum(BsmtFinSF1) + sum(BsmtFinSF2) + sum(BsmtUnfSF)) %>%
  knitr::kable()
BsmtFinType1 BsmtFinType2 BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF ChkTotal
BLQ BLQ 543 119 261 923 923
Unf Unf 0 0 826432 826432 826432

The check passed, so let’s set BsmtFinType2 to NA, BsmtFinSF1 to BsmtFinSF1 + BsmtFinSF2, and BsmtFinSF2 to 0.

full <-
  full %>%
  transform(BsmtFinType2 = ifelse(BsmtFinType2 == BsmtFinType1, NA, BsmtFinType2),
            BsmtFinSF1 = ifelse(BsmtFinType2 == BsmtFinType1, BsmtFinSF1 + BsmtFinSF2, BsmtFinSF1),
            BsmtFinSF2 = ifelse(BsmtFinType2 == BsmtFinType1, 0, BsmtFinSF2))

Let’s check across the dataset that the total basement SF is correct:

full %>%
  filter(TotalBsmtSF != BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF) %>%
  summarise(Count = n())
##   Count
## 1     0

All good then. Let’s perform the updates:

  • Rows 2-4 BsmtExposure indicate No Basement, however this should be No Exposure
  • One row of 77 cases of completely null values - let’s fill these with the default null values
  • Rows 8,9 fill in BsmtQual with ‘TA’
  • Rows 2-4 BsmtFinType2 set to NA, then convert all NA’s to ‘None’
  • NA’s in BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, BsmtFullBath, BsmtHalfBath should be 0
  • NA’s in BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2 should be ‘None’
full <-
  full %>%
  mutate(BsmtExposure = ifelse(is.na(BsmtExposure) & BsmtQual == 'Gd', 'No', BsmtExposure),
         BsmtFinType2 = ifelse(is.na(BsmtExposure) & BsmtQual == 'Gd', NA, BsmtFinType2),
         
         BsmtFinSF1 = ifelse(is.na(BsmtFinSF1), 0, BsmtFinSF1),
         BsmtFinSF2 = ifelse(is.na(BsmtFinSF2), 0, BsmtFinSF2),
         BsmtUnfSF = ifelse(is.na(BsmtUnfSF), 0, BsmtUnfSF),
         TotalBsmtSF = ifelse(is.na(TotalBsmtSF), 0, TotalBsmtSF),
         BsmtFullBath = ifelse(is.na(BsmtFullBath), 0, BsmtFullBath),
         BsmtHalfBath = ifelse(is.na(BsmtHalfBath), 0, BsmtHalfBath),
         
         BsmtQual = ifelse(is.na(BsmtQual), 'None', BsmtQual),
         BsmtCond = ifelse(is.na(BsmtCond), 'None', BsmtCond),
         BsmtExposure = ifelse(is.na(BsmtExposure), 'None', BsmtExposure),
         BsmtFinType1 = ifelse(is.na(BsmtFinType1), 'None', BsmtFinType1),
         BsmtFinType2 = ifelse(is.na(BsmtFinType2), 'None', BsmtFinType2))

Now what’s left to fill in?

full %>% 
  select(which(sapply(full, function(x)any(is.na(x))))) %>%
  sapply(function(x)sum(ifelse(is.na(x), 1, 0))) %>%
  sort(decreasing=T)
##       Alley       Fence  MasVnrType  MasVnrArea    MSZoning   Utilities 
##        2721        2348          24          23           4           2 
##  Functional Exterior1st Exterior2nd  Electrical KitchenQual    SaleType 
##           2           1           1           1           1           1

Alley and fence

full %>%
  group_by(Alley, Fence) %>%
  summarise(Count = n()) %>%
  knitr::kable()
Alley Fence Count
Grvl GdPrv 5
Grvl GdWo 4
Grvl MnPrv 19
Grvl MnWw 1
Grvl NA 91
Pave GdPrv 3
Pave GdWo 1
Pave MnPrv 2
Pave NA 72
NA GdPrv 110
NA GdWo 107
NA MnPrv 308
NA MnWw 11
NA NA 2185

NA means ‘no alley access’ or ‘no fence’ so let’s fill that with ‘None’

full <-
  full %>%
  mutate(Alley = ifelse(is.na(Alley), 'None', Alley),
         Fence = ifelse(is.na(Fence), 'None', Fence))

Mason veneer

full %>%
  group_by(MasVnrType, ifelse(is.na(MasVnrArea), 0, 1)) %>%
  summarise(Count = n()) %>%
  knitr::kable()
MasVnrType ifelse(is.na(MasVnrArea), 0, 1) Count
BrkCmn 1 25
BrkFace 1 879
None 1 1742
Stone 1 249
NA 0 23
NA 1 1

Fill this in:

full <-
  full %>%
  mutate(MasVnrType = ifelse(is.na(MasVnrType), 'None', MasVnrType),
         MasVnrArea = ifelse(is.na(MasVnrArea), 0, MasVnrArea))

MSZoning

Getting the distribution of MSZoning, we see there are 4 NA’s here:

full %>%
  group_by(MSZoning) %>%
  summarise(Count = n()) %>%
  knitr::kable()
MSZoning Count
C (all) 25
FV 139
RH 26
RL 2265
RM 460
NA 4

Now, let’s see what the most common value of MSZoning is per MSSubClass, and select only the MSSubClass where MSZoning is NA.

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

full %>%
  group_by(MSSubClass) %>%
  summarise(AvgZoning = Mode(MSZoning)) %>%
  inner_join(
    full %>%
      filter(is.na(MSZoning)) %>%
      distinct(),
    by = "MSSubClass"
  ) %>%
  select(MSSubClass, AvgZoning) %>%
  distinct()
## # A tibble: 3 x 2
##   MSSubClass AvgZoning
##        <int>     <chr>
## 1         20        RL
## 2         30        RM
## 3         70        RM

Let’s update the values with this:

full <-
  full %>%
  mutate(MSZoning = ifelse(is.na(MSZoning), ifelse(MSSubClass == 20, 'RL', 'RM'), MSZoning))

Utilities Functional Exterior1st Exterior2nd Electrical KitchenQual SaleType

The rest

full %>%
  filter(is.na(Utilities) |
           is.na(Functional) |
           is.na(Exterior1st) |
           is.na(Exterior2nd) |
           is.na(Electrical) |
           is.na(KitchenQual) |
           is.na(SaleType)) %>%
  group_by(Utilities, Functional, Exterior1st, Exterior2nd, Electrical, KitchenQual, SaleType) %>%
  summarise(Count = n()) %>%
  knitr::kable()
Utilities Functional Exterior1st Exterior2nd Electrical KitchenQual SaleType Count
AllPub Mod Wd Sdng Wd Sdng SBrkr NA COD 1
AllPub Typ Plywood Plywood SBrkr TA NA 1
AllPub Typ VinylSd VinylSd NA Gd WD 1
AllPub Typ NA NA SBrkr Fa WD 1
AllPub NA AsbShng VinylSd FuseA Fa WD 1
AllPub NA Wd Sdng Wd Sdng SBrkr TA COD 1
NA Min1 Wd Sdng Wd Sdng FuseA TA ConLD 1
NA Min2 BrkFace BrkFace FuseA TA WD 1

Let’s fill the rest in with the most common value

full <-
  full %>%
  mutate(Utilities = ifelse(is.na(Utilities), Mode(Utilities), Utilities),
         Functional = ifelse(is.na(Functional), Mode(Functional), Functional),
         Exterior1st = ifelse(is.na(Exterior1st), Mode(Exterior1st), Exterior1st),
         Exterior2nd = ifelse(is.na(Exterior2nd), Mode(Exterior2nd), Exterior2nd),
         Electrical = ifelse(is.na(Electrical), Mode(Electrical), Electrical),
         KitchenQual = ifelse(is.na(KitchenQual), Mode(KitchenQual), KitchenQual),
         SaleType = ifelse(is.na(SaleType), Mode(SaleType), SaleType))

Update the train and test data

Now that we have fixed the missing values in the train and test data, we may update the original datasets so that we are ready to begin the analysis.

t <- nrow(train)
f <- nrow(full)

SP <- train$SalePrice

train <- full[1:t,]
test <- full[t+1:f,]

train$SalePrice <- SP

We are now ready to do some exploratory analysis.

comments powered by Disqus