This file demonstrates a typical process of using R package “cleandata” to prepare data for machine learning.
“cleandata” is a collection of functions that work with data frames to inspect, impute, encode, and partition data. The functions for imputation, encoding, and partitioning can produce log files to help you keep track of the data manipulation process.
Available on CRAN: https://cran.r-project.org/package=cleandata
Source code on GitHub: https://github.com/sherrisherry/cleandata
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, for predicting housing prices, this dataset is a typical example of what a business analyst encounters every day.
According to the description of this dataset, the “NA”s in some columns aren’t missing values. To prevent R from confusing them with true missing values, read in the data files without converting any value to NA in R.
The train set should have exactly one more column, SalePrice, than the test set.
# import 'cleandata' package.
library('cleandata')
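# read in the training and test datasets without converting 'NA's to missing values.
# stringsAsFactors = TRUE keeps text columns as factors on R >= 4.0, which the later steps expect.
train <- read.csv('data/train.csv', na.strings = "", strip.white = TRUE, stringsAsFactors = TRUE)
test <- read.csv('data/test.csv', na.strings = "", strip.white = TRUE, stringsAsFactors = TRUE)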
# summarize the training set and test set
cat(paste('train: ', nrow(train), 'obs. ', ncol(train), 'cols\ncolumn names:\n', toString(colnames(train)),
'\n\ntest: ', nrow(test), 'obs. ', ncol(test), 'cols\ncolumn names:\n', toString(colnames(test)), '\n'))
## train: 1460 obs. 81 cols
## column names:
## Id, MSSubClass, MSZoning, LotFrontage, LotArea, Street, Alley, LotShape, LandContour, Utilities, LotConfig, LandSlope, Neighborhood, Condition1, Condition2, BldgType, HouseStyle, OverallQual, OverallCond, YearBuilt, YearRemodAdd, RoofStyle, RoofMatl, Exterior1st, Exterior2nd, MasVnrType, MasVnrArea, ExterQual, ExterCond, Foundation, BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinSF1, BsmtFinType2, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, Heating, HeatingQC, CentralAir, Electrical, X1stFlrSF, X2ndFlrSF, LowQualFinSF, GrLivArea, BsmtFullBath, BsmtHalfBath, FullBath, HalfBath, BedroomAbvGr, KitchenAbvGr, KitchenQual, TotRmsAbvGrd, Functional, Fireplaces, FireplaceQu, GarageType, GarageYrBlt, GarageFinish, GarageCars, GarageArea, GarageQual, GarageCond, PavedDrive, WoodDeckSF, OpenPorchSF, EnclosedPorch, X3SsnPorch, ScreenPorch, PoolArea, PoolQC, Fence, MiscFeature, MiscVal, MoSold, YrSold, SaleType, SaleCondition, SalePrice
##
## test: 1459 obs. 80 cols
## column names:
## Id, MSSubClass, MSZoning, LotFrontage, LotArea, Street, Alley, LotShape, LandContour, Utilities, LotConfig, LandSlope, Neighborhood, Condition1, Condition2, BldgType, HouseStyle, OverallQual, OverallCond, YearBuilt, YearRemodAdd, RoofStyle, RoofMatl, Exterior1st, Exterior2nd, MasVnrType, MasVnrArea, ExterQual, ExterCond, Foundation, BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinSF1, BsmtFinType2, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, Heating, HeatingQC, CentralAir, Electrical, X1stFlrSF, X2ndFlrSF, LowQualFinSF, GrLivArea, BsmtFullBath, BsmtHalfBath, FullBath, HalfBath, BedroomAbvGr, KitchenAbvGr, KitchenQual, TotRmsAbvGrd, Functional, Fireplaces, FireplaceQu, GarageType, GarageYrBlt, GarageFinish, GarageCars, GarageArea, GarageQual, GarageCond, PavedDrive, WoodDeckSF, OpenPorchSF, EnclosedPorch, X3SsnPorch, ScreenPorch, PoolArea, PoolQC, Fence, MiscFeature, MiscVal, MoSold, YrSold, SaleType, SaleCondition
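To illustrate why the files were read with na.strings = "", the sketch below reads a tiny inline CSV twice. The two-row table and its values are made up for this illustration; only read.csv() and its na.strings argument come from the text above.

```r
# Toy CSV (hypothetical data): "NA" in Alley means "no alley access", not missing.
csv_text <- "Id,Alley,LotArea\n1,NA,8450\n2,Grvl,\n"

# Default behaviour: the string "NA" is converted to a missing value.
default_read <- read.csv(text = csv_text)

# With na.strings = "", only empty fields become missing; "NA" stays a string.
literal_read <- read.csv(text = csv_text, na.strings = "", stringsAsFactors = TRUE)

is.na(default_read$Alley)    # the first value is treated as missing
levels(literal_read$Alley)   # "NA" survives as an ordinary level
is.na(literal_read$LotArea)  # the empty field is still a true missing value
```

Reading the data this way postpones the decision about what each “NA” means until the columns have been checked against the description file.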
To ensure consistency in the following imputation and encoding process across the train set and the test set, I appended the test set to the train set. The SalePrice values of the rows of the test set were set to NA to distinguish them from the rows of the train set. The resulting data frame was called df.
# filling the target columns of the test set with NA then combining test and training sets
test$SalePrice <- NA
df <- rbind(train, test)
rm(train, test)
Function inspect_na
inspect_na() counts the number of NAs in each column and sorts the counts in descending order. In the following operation, inspect_na() returned the top 5 columns with missing values. If you want to see the number of missing values in every column, leave parameter top as default. As expected, only SalePrice contained missing values, and their count equaled the number of rows in the test set.
inspect_na(df, top = 5)
## SalePrice Id MSSubClass MSZoning LotFrontage
## 1459 0 0 0 0
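For readers who want to see the idea without the package, here is a minimal base-R sketch of the same kind of count. It is not the implementation of inspect_na(); the data frame toy and its columns are hypothetical.

```r
# Hypothetical toy data frame with a known pattern of missing values.
toy <- data.frame(a = c(1, NA, 3), b = c(NA, NA, "x"), c = 1:3)

# Count NAs per column and sort the counts in descending order.
na_counts <- sort(colSums(is.na(toy)), decreasing = TRUE)

# Keep only the top 2 columns, similar in spirit to the top parameter.
head(na_counts, 2)
```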
The NAs in the columns listed in NAisNoA were what was referred to as ‘none’-but-not-‘NA’ values. In these columns, NA had only one possible meaning: “not applicable”. I replaced these NAs with NoA to prevent them from being imputed later.
# in the 'NAisNoA' columns, NA means this attribute doesn't apply to them, not missing.
NAisNoA <- c('Alley', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond',
'PoolQC', 'Fence', 'MiscFeature')
for(i in NAisNoA){levels(df[, i])[levels(df[, i]) == "NA"] <- "NoA"}
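The loop above renames a factor level rather than rewriting individual values. A small sketch on a made-up factor (the vector alley is hypothetical) shows what happens to a single column:

```r
# Hypothetical factor in which the level "NA" means "not applicable".
alley <- factor(c("Grvl", "NA", "Pave", "NA"))

# Rename the level in place; the underlying integer codes are untouched.
levels(alley)[levels(alley) == "NA"] <- "NoA"

levels(alley)
table(alley)
```

Note that this only works while the columns are factors; on a character column there are no levels to rename.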
At this stage, I reconstructed the data frame df to inspect the true missing values.
LotFrontage had by far the most missing values: 486 of 2,919 rows, or about 17%. GarageYrBlt followed with 159, and the other columns had few to no missing values.
# write the dataset into a csv file then read this file back to df to reconstruct df
write.csv(df, file = 'data/data.csv', row.names = FALSE)
df <- read.csv('data/data.csv', na.strings = "NA", strip.white = TRUE, stringsAsFactors = TRUE)
# see which predictors have most NAs
inspect_na(df[, -ncol(df)], top = 25)
## LotFrontage GarageYrBlt MasVnrType MasVnrArea MSZoning
## 486 159 24 23 4
## Utilities BsmtFullBath BsmtHalfBath Functional Exterior1st
## 2 2 2 2 1
## Exterior2nd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1 1 1 1 1
## Electrical KitchenQual GarageCars GarageArea SaleType
## 1 1 1 1 1
## Id MSSubClass LotArea Street Alley
## 0 0 0 0 0
Function inspect_map
inspect_map() classifies the columns of a data frame. Before I explain this function further, I’d like to introduce the term ‘scheme’. In package “cleandata”, a scheme refers to the set of all the possible values of an enumerator. The factor objects in R are enumerators.
inspect_map() returns a list of factor_cols (list), factor_levels (list), num_cols (vector), char_cols (vector), ordered_cols (vector), and other_cols (vector). See the common parameter for more information about schemes.
In the following code, I set the common parameter to 2, which specifies that 2 factor columns share the same scheme if their levels have more than 2 values in common. By default, the common parameter is 0, which means that all the levels of 2 factor columns must be the same for them to share a scheme.
# create a map for imputation and encoding
data_map <- inspect_map(df[, -ncol(df)], common = 2)
## Id integer factors: 0 nums: 1 chars: 0 ordered: 0 others: 0
## MSSubClass integer factors: 0 nums: 2 chars: 0 ordered: 0 others: 0
## MSZoning factor factors: 1 nums: 2 chars: 0 ordered: 0 others: 0
## LotFrontage integer factors: 1 nums: 3 chars: 0 ordered: 0 others: 0
## LotArea integer factors: 1 nums: 4 chars: 0 ordered: 0 others: 0
## Street factor factors: 2 nums: 4 chars: 0 ordered: 0 others: 0
## Alley factor factors: 3 nums: 4 chars: 0 ordered: 0 others: 0
## LotShape factor factors: 4 nums: 4 chars: 0 ordered: 0 others: 0
## LandContour factor factors: 5 nums: 4 chars: 0 ordered: 0 others: 0
## Utilities factor factors: 6 nums: 4 chars: 0 ordered: 0 others: 0
## LotConfig factor factors: 7 nums: 4 chars: 0 ordered: 0 others: 0
## LandSlope factor factors: 8 nums: 4 chars: 0 ordered: 0 others: 0
## Neighborhood factor factors: 9 nums: 4 chars: 0 ordered: 0 others: 0
## Condition1 factor factors: 10 nums: 4 chars: 0 ordered: 0 others: 0
## Condition2 factor factors: 11 nums: 4 chars: 0 ordered: 0 others: 0
## BldgType factor factors: 12 nums: 4 chars: 0 ordered: 0 others: 0
## HouseStyle factor factors: 13 nums: 4 chars: 0 ordered: 0 others: 0
## OverallQual integer factors: 13 nums: 5 chars: 0 ordered: 0 others: 0
## OverallCond integer factors: 13 nums: 6 chars: 0 ordered: 0 others: 0
## YearBuilt integer factors: 13 nums: 7 chars: 0 ordered: 0 others: 0
## YearRemodAdd integer factors: 13 nums: 8 chars: 0 ordered: 0 others: 0
## RoofStyle factor factors: 14 nums: 8 chars: 0 ordered: 0 others: 0
## RoofMatl factor factors: 15 nums: 8 chars: 0 ordered: 0 others: 0
## Exterior1st factor factors: 16 nums: 8 chars: 0 ordered: 0 others: 0
## Exterior2nd factor factors: 17 nums: 8 chars: 0 ordered: 0 others: 0
## MasVnrType factor factors: 18 nums: 8 chars: 0 ordered: 0 others: 0
## MasVnrArea integer factors: 18 nums: 9 chars: 0 ordered: 0 others: 0
## ExterQual factor factors: 19 nums: 9 chars: 0 ordered: 0 others: 0
## ExterCond factor factors: 20 nums: 9 chars: 0 ordered: 0 others: 0
## Foundation factor factors: 21 nums: 9 chars: 0 ordered: 0 others: 0
## BsmtQual factor factors: 22 nums: 9 chars: 0 ordered: 0 others: 0
## BsmtCond factor factors: 23 nums: 9 chars: 0 ordered: 0 others: 0
## BsmtExposure factor factors: 24 nums: 9 chars: 0 ordered: 0 others: 0
## BsmtFinType1 factor factors: 25 nums: 9 chars: 0 ordered: 0 others: 0
## BsmtFinSF1 integer factors: 25 nums: 10 chars: 0 ordered: 0 others: 0
## BsmtFinType2 factor factors: 26 nums: 10 chars: 0 ordered: 0 others: 0
## BsmtFinSF2 integer factors: 26 nums: 11 chars: 0 ordered: 0 others: 0
## BsmtUnfSF integer factors: 26 nums: 12 chars: 0 ordered: 0 others: 0
## TotalBsmtSF integer factors: 26 nums: 13 chars: 0 ordered: 0 others: 0
## Heating factor factors: 27 nums: 13 chars: 0 ordered: 0 others: 0
## HeatingQC factor factors: 28 nums: 13 chars: 0 ordered: 0 others: 0
## CentralAir factor factors: 29 nums: 13 chars: 0 ordered: 0 others: 0
## Electrical factor factors: 30 nums: 13 chars: 0 ordered: 0 others: 0
## X1stFlrSF integer factors: 30 nums: 14 chars: 0 ordered: 0 others: 0
## X2ndFlrSF integer factors: 30 nums: 15 chars: 0 ordered: 0 others: 0
## LowQualFinSF integer factors: 30 nums: 16 chars: 0 ordered: 0 others: 0
## GrLivArea integer factors: 30 nums: 17 chars: 0 ordered: 0 others: 0
## BsmtFullBath integer factors: 30 nums: 18 chars: 0 ordered: 0 others: 0
## BsmtHalfBath integer factors: 30 nums: 19 chars: 0 ordered: 0 others: 0
## FullBath integer factors: 30 nums: 20 chars: 0 ordered: 0 others: 0
## HalfBath integer factors: 30 nums: 21 chars: 0 ordered: 0 others: 0
## BedroomAbvGr integer factors: 30 nums: 22 chars: 0 ordered: 0 others: 0
## KitchenAbvGr integer factors: 30 nums: 23 chars: 0 ordered: 0 others: 0
## KitchenQual factor factors: 31 nums: 23 chars: 0 ordered: 0 others: 0
## TotRmsAbvGrd integer factors: 31 nums: 24 chars: 0 ordered: 0 others: 0
## Functional factor factors: 32 nums: 24 chars: 0 ordered: 0 others: 0
## Fireplaces integer factors: 32 nums: 25 chars: 0 ordered: 0 others: 0
## FireplaceQu factor factors: 33 nums: 25 chars: 0 ordered: 0 others: 0
## GarageType factor factors: 34 nums: 25 chars: 0 ordered: 0 others: 0
## GarageYrBlt integer factors: 34 nums: 26 chars: 0 ordered: 0 others: 0
## GarageFinish factor factors: 35 nums: 26 chars: 0 ordered: 0 others: 0
## GarageCars integer factors: 35 nums: 27 chars: 0 ordered: 0 others: 0
## GarageArea integer factors: 35 nums: 28 chars: 0 ordered: 0 others: 0
## GarageQual factor factors: 36 nums: 28 chars: 0 ordered: 0 others: 0
## GarageCond factor factors: 37 nums: 28 chars: 0 ordered: 0 others: 0
## PavedDrive factor factors: 38 nums: 28 chars: 0 ordered: 0 others: 0
## WoodDeckSF integer factors: 38 nums: 29 chars: 0 ordered: 0 others: 0
## OpenPorchSF integer factors: 38 nums: 30 chars: 0 ordered: 0 others: 0
## EnclosedPorch integer factors: 38 nums: 31 chars: 0 ordered: 0 others: 0
## X3SsnPorch integer factors: 38 nums: 32 chars: 0 ordered: 0 others: 0
## ScreenPorch integer factors: 38 nums: 33 chars: 0 ordered: 0 others: 0
## PoolArea integer factors: 38 nums: 34 chars: 0 ordered: 0 others: 0
## PoolQC factor factors: 39 nums: 34 chars: 0 ordered: 0 others: 0
## Fence factor factors: 40 nums: 34 chars: 0 ordered: 0 others: 0
## MiscFeature factor factors: 41 nums: 34 chars: 0 ordered: 0 others: 0
## MiscVal integer factors: 41 nums: 35 chars: 0 ordered: 0 others: 0
## MoSold integer factors: 41 nums: 36 chars: 0 ordered: 0 others: 0
## YrSold integer factors: 41 nums: 37 chars: 0 ordered: 0 others: 0
## SaleType factor factors: 42 nums: 37 chars: 0 ordered: 0 others: 0
## SaleCondition factor factors: 43 nums: 37 chars: 0 ordered: 0 others: 0
summary(data_map)
## Length Class Mode
## factor_cols 31 -none- list
## factor_levels 31 -none- list
## num_cols 37 -none- character
## char_cols 0 -none- NULL
## ordered_cols 0 -none- NULL
## other_cols 0 -none- NULL
This dataset only had factor and numeric columns. I unpacked data_map before heading to imputation and encoding.
factor_cols <- data_map$factor_cols
factor_levels <- data_map$factor_levels
num_cols <- data_map$num_cols
rm(data_map)
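To make the notion of a shared scheme concrete, the sketch below applies the rule described for the common parameter to three made-up factors. The helper share_scheme() is hypothetical and written only from that description; it is not how inspect_map() is implemented.

```r
# Hypothetical helper: do two factors share a scheme under threshold `common`?
share_scheme <- function(x, y, common = 0) {
  if (common == 0) {
    # strict rule: the two level sets must be identical
    return(setequal(levels(x), levels(y)))
  }
  # relaxed rule: more than `common` levels in common
  length(intersect(levels(x), levels(y))) > common
}

qual   <- factor(c("Ex", "Gd", "TA", "Fa"))
cond   <- factor(c("Gd", "TA", "Po", "TA"))
street <- factor(c("Grvl", "Pave", "Pave", "Pave"))

share_scheme(qual, cond)                # strict rule: the level sets differ
share_scheme(qual, cond, common = 1)    # "Gd" and "TA" are shared
share_scheme(qual, street, common = 1)  # nothing in common
```

A relaxed threshold is useful when a rating column happens not to contain every possible rating in the data at hand.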
The functions for imputation and encoding keep track of your process by producing log files. This feature is disabled by default. To enable log files, I created a variable log_arg to store the list of arguments for sink(). In old versions, log_arg had to be passed to the log parameter of every imputation or encoding function, which this version still supports.
# create a list of arguments for producing a log file
log_arg <- list(file = 'log.txt', append = TRUE, split = FALSE)
log_arg can be a list of any arguments for sink(). In this example, the log file was named “log.txt”, new information was appended to the file, and the contents of the log file weren’t printed to the standard output.
In this version, parameter log by default looks for a list called log_arg in the parent environment (dynamic scope) and takes the value of log_arg. If log is assigned a list, it takes the assigned value. If there is no list log_arg in the parent environment and no list is assigned to log, no log file is produced.
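As a rough illustration of how a list of sink() arguments can drive logging, here is a base-R sketch. The helper with_log() and the variable sink_args are hypothetical names used only for this example, and the sketch does not reproduce the package's internals.

```r
# Arguments for sink(), stored in a list as in the text; a temporary file is
# used here so the sketch does not touch the working directory.
log_file <- tempfile(fileext = ".txt")
sink_args <- list(file = log_file, append = TRUE, split = FALSE)

# Hypothetical helper: divert output to the log while `expr` is evaluated.
with_log <- function(expr, args) {
  do.call(sink, args)  # start diverting standard output
  on.exit(sink())      # always restore output, even on error
  force(expr)          # the lazily passed expression runs here
  invisible(NULL)
}

with_log(cat("Columns Imputed by Mode:\nMSZoning, Street\n"), sink_args)
with_log(cat("Columns Imputed by Median:\nLotArea\n"), sink_args)

readLines(log_file)
```

Because append = TRUE, the second call adds to the file instead of overwriting it, which is what lets one log accumulate entries from several steps.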
To prevent leakage, I instructed the imputation functions to use only rows of the train set to calculate the imputation values by passing an index to parameter idx.
Functions impute_mode, impute_median, impute_mean
impute_mode() works with numerical, string, and factor columns. It imputes NAs with the modes of their corresponding columns.
impute_median() and impute_mean() only work with numerical columns. They impute NAs with medians and means respectively.
# impute NAs in factor columns by the mode of corresponding columns
lst <- unlist(factor_cols)
df <- impute_mode(df, cols = lst, idx = !is.na(df$SalePrice))
# impute NAs in numerical columns by the median of corresponding columns
lst <- num_cols
df <- impute_median(df, cols = lst, idx = !is.na(df$SalePrice))
# check the result
inspect_na(df[, -ncol(df)], top = 5)
## Id MSSubClass MSZoning LotFrontage LotArea
## 0 0 0 0 0
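The point of restricting the calculation to train rows can be shown with a few lines of base R. This is a sketch on made-up numbers, not the package's code: the median is computed from the train rows only and then used to fill NAs everywhere.

```r
# Hypothetical toy data: rows 1-4 are "train", rows 5-6 are "test".
toy <- data.frame(
  LotFrontage = c(60, NA, 80, 100, NA, 500),
  SalePrice   = c(200, 180, 250, 300, NA, NA)
)
is_train <- !is.na(toy$SalePrice)

# The median comes from train rows only, so test values (e.g. 500) cannot
# influence it; it is then used to fill NAs in every row.
train_median <- median(toy$LotFrontage[is_train], na.rm = TRUE)
toy$LotFrontage[is.na(toy$LotFrontage)] <- train_median

toy$LotFrontage
```

Had all six rows been used, the extreme test value would have shifted the median and information from the test set would have leaked into the training data.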
Every encoding function prints a summary of the columns before and after encoding by default. The output of encode_ordinal() and encode_binary() is a factor by default. If you want numerical output, set parameter out.int to TRUE after making sure there are no missing values in the input.
In this demo, I kept the encoded columns as factors because I intended to save the dataset into a csv file, which doesn’t distinguish between factor and numerical columns.
In business datasets, we can often find ratings, which are ordinal and use similar schemes. Based on my experience, if many columns share the same scheme, they are likely to be ratings.
summary(factor_cols)
## Length Class Mode
## MSZoning 1 -none- character
## Street 1 -none- character
## Alley 1 -none- character
## LotShape 1 -none- character
## LandContour 1 -none- character
## Utilities 1 -none- character
## LotConfig 1 -none- character
## LandSlope 1 -none- character
## Neighborhood 1 -none- character
## Condition1 2 -none- character
## BldgType 1 -none- character
## HouseStyle 1 -none- character
## RoofStyle 1 -none- character
## RoofMatl 1 -none- character
## Exterior1st 2 -none- character
## MasVnrType 1 -none- character
## ExterQual 10 -none- character
## Foundation 1 -none- character
## BsmtExposure 1 -none- character
## BsmtFinType1 2 -none- character
## Heating 1 -none- character
## CentralAir 1 -none- character
## Electrical 1 -none- character
## Functional 1 -none- character
## GarageType 1 -none- character
## GarageFinish 1 -none- character
## PavedDrive 1 -none- character
## Fence 1 -none- character
## MiscFeature 1 -none- character
## SaleType 1 -none- character
## SaleCondition 1 -none- character
In our dataset, ExterQual and 9 other columns share the same scheme. After I checked their scheme and the description file, I was sure that they were ordinal.
factor_levels$ExterQual
## [1] "Ex" "Fa" "Gd" "NoA" "Po" "TA"
“Po”: poor, “Fa”: fair, “TA”: typical/average, “Gd”: good, “Ex”: excellent
Function encode_ordinal
encode_ordinal() encodes ordinal data into sequential integers by a given order. The argument passed to none is always encoded to 0. The 1st member of the vector passed to order is encoded to 1.
# encoding ordinal columns
i <- 'ExterQual'; lst <- c('Po', 'Fa', 'TA', 'Gd', 'Ex')
df[, factor_cols[[i]]] <- encode_ordinal(df[, factor_cols[[i]]], order = lst, none = 'NoA')
## ExterQual ExterCond BsmtQual BsmtCond HeatingQC KitchenQual
## Ex: 107 Ex: 12 Ex : 258 Fa : 104 Ex:1493 Ex: 205
## Fa: 35 Fa: 67 Fa : 88 Gd : 122 Fa: 92 Fa: 70
## Gd: 979 Gd: 299 Gd :1209 NoA: 82 Gd: 474 Gd:1151
## TA:1798 Po: 3 NoA: 81 Po : 5 Po: 3 TA:1493
## TA:2538 TA :1283 TA :2606 TA: 857
##
## FireplaceQu GarageQual GarageCond PoolQC
## Ex : 43 Ex : 3 Ex : 3 Ex : 4
## Fa : 74 Fa : 124 Fa : 74 Fa : 2
## Gd : 744 Gd : 24 Gd : 15 Gd : 4
## NoA:1420 NoA: 159 NoA: 159 NoA:2909
## Po : 46 Po : 5 Po : 14
## TA : 592 TA :2604 TA :2654
## coded 10 cols 5 levels
## ExterQual ExterCond BsmtQual BsmtCond HeatingQC KitchenQual FireplaceQu
## 5: 107 5: 12 5: 258 2: 104 5:1493 5: 205 5: 43
## 2: 35 2: 67 2: 88 4: 122 2: 92 2: 70 2: 74
## 4: 979 4: 299 4:1209 0: 82 4: 474 4:1151 4: 744
## 3:1798 1: 3 0: 81 1: 5 1: 3 3:1493 0:1420
## 3:2538 3:1283 3:2606 3: 857 1: 46
## 3: 592
## GarageQual GarageCond PoolQC
## 5: 3 5: 3 5: 4
## 2: 124 2: 74 2: 2
## 4: 24 4: 15 4: 4
## 0: 159 0: 159 0:2909
## 1: 5 1: 14
## 3:2604 3:2654
# removing encoded columns from the map
factor_levels[[i]] <- NULL
factor_cols[[i]] <- NULL
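The encoding rule used for these ratings (the none value becomes 0, the first member of the order becomes 1, and so on) can be sketched in base R as follows. The helper ordinal_codes() is hypothetical and only mirrors the rule as described; it is not the package's implementation.

```r
# Hypothetical helper: `none` -> 0, order[1] -> 1, order[2] -> 2, ...
ordinal_codes <- function(x, order, none) {
  # match() returns the position in c(none, order); subtract 1 so none is 0
  match(as.character(x), c(none, order)) - 1L
}

quality <- factor(c("TA", "Ex", "NoA", "Po", "Gd", "Fa"))
codes <- ordinal_codes(quality,
                       order = c("Po", "Fa", "TA", "Gd", "Ex"),
                       none = "NoA")
codes
```

Values that appear in neither none nor order come back as NA in this sketch, which makes unexpected levels easy to spot.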
The Utilities column was binary according to the dataset.
factor_levels$Utilities
## [1] "AllPub" "NoSeWa"
levels(df$Utilities)
## [1] "AllPub" "NoSeWa"
However, the description file indicates that it has 4 possible values: ‘ELO’, ‘NoSeWa’, ‘NoSewr’, ‘AllPub’. Therefore, I encoded it as having 4 levels.
# in dataset only "AllPub" "NoSeWa", with 2 NAs
i <- 'Utilities'; lst <- c('ELO', 'NoSeWa', 'NoSewr', 'AllPub')
df[, factor_cols[[i]]] <- encode_ordinal(df[, factor_cols[[i]], drop=FALSE], order = lst)
## Utilities
## AllPub:2918
## NoSeWa: 1
## coded 1 cols 4 levels
## Utilities
## 4:2918
## 2: 1
factor_levels[[i]]<-NULL
factor_cols[[i]]<-NULL
# find all the 2-level columns
lst <- lapply(factor_levels, length)
lst <- as.data.frame(lst)
lst <- colnames(lst[, lst == 2])
cat(lst)
## Street CentralAir
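The same selection of 2-level columns can also be written without the detour through a data frame. The sketch below is one possible alternative, shown on a made-up list (toy_levels is hypothetical) that has the same shape as factor_levels.

```r
# Hypothetical map of column names to their levels.
toy_levels <- list(
  Street     = c("Grvl", "Pave"),
  CentralAir = c("N", "Y"),
  LandSlope  = c("Gtl", "Mod", "Sev")
)

# lengths() counts the levels per entry; which() keeps the names that match.
two_level <- names(which(lengths(toy_levels) == 2))
two_level
```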
Function encode_binary
encode_binary() encodes binary data into 0 and 1, regardless of order.
# encode all the 2-level columns
i <- unlist(factor_cols[lst])
df[, i] <- encode_binary(df[, i, drop=FALSE])
## Street CentralAir
## Grvl: 12 N: 196
## Pave:2907 Y:2723
## coded 1 cols
## coded 1 cols
## Street CentralAir
## 0: 12 0: 196
## 1:2907 1:2723
factor_levels[lst] <- NULL
factor_cols[lst] <- NULL
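For a single column, a comparable 0/1 coding can be obtained in base R from the factor's integer codes. This is a sketch on a made-up vector, and which level ends up as 0 follows the factor's level order here, which may or may not match what encode_binary() chooses.

```r
# Hypothetical two-level column.
central_air <- factor(c("Y", "N", "Y", "Y"))

# Factor codes start at 1, so subtracting 1 gives 0 for the first level
# (alphabetical by default) and 1 for the second.
central_air_bin <- as.integer(central_air) - 1L
central_air_bin
```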
Although a closer look might have found more ordinal columns, I wanted to speed up the process, so I assumed that all the remaining categorical columns were not ordered.
Function encode_onehot
encode_onehot() encodes categorical data by one-hot encoding.
# encode all the other categorical Columns
i <- unlist(factor_cols)
df0 <- encode_onehot(df[, i, drop=FALSE])
## MSZoning Alley LotShape LandContour LotConfig
## C (all): 25 Grvl: 120 IR1: 968 Bnk: 117 Corner : 511
## FV : 139 NoA :2721 IR2: 76 HLS: 120 CulDSac: 176
## RH : 26 Pave: 78 IR3: 16 Low: 60 FR2 : 85
## RL :2269 Reg:1859 Lvl:2622 FR3 : 14
## RM : 460 Inside :2133
##
##
## LandSlope Neighborhood Condition1 Condition2 BldgType
## Gtl:2778 NAmes : 443 Norm :2511 Norm :2889 1Fam :2425
## Mod: 125 CollgCr: 267 Feedr : 164 Feedr : 13 2fmCon: 62
## Sev: 16 OldTown: 239 Artery : 92 Artery : 5 Duplex: 109
## Edwards: 194 RRAn : 50 PosA : 4 Twnhs : 96
## Somerst: 182 PosN : 39 PosN : 4 TwnhsE: 227
## NridgHt: 166 RRAe : 28 RRNn : 2
## (Other):1428 (Other): 35 (Other): 2
## HouseStyle RoofStyle RoofMatl Exterior1st
## 1Story :1471 Flat : 20 CompShg:2876 VinylSd:1026
## 2Story : 872 Gable :2310 Tar&Grv: 23 MetalSd: 450
## 1.5Fin : 314 Gambrel: 22 WdShake: 9 HdBoard: 442
## SLvl : 128 Hip : 551 WdShngl: 7 Wd Sdng: 411
## SFoyer : 83 Mansard: 11 ClyTile: 1 Plywood: 221
## 2.5Unf : 24 Shed : 5 Membran: 1 CemntBd: 126
## (Other): 27 (Other): 2 (Other): 243
## Exterior2nd MasVnrType Foundation BsmtExposure BsmtFinType1
## VinylSd:1015 BrkCmn : 25 BrkTil: 311 Av : 418 ALQ:429
## MetalSd: 447 BrkFace: 879 CBlock:1235 Gd : 276 BLQ:269
## HdBoard: 406 None :1766 PConc :1308 Mn : 239 GLQ:849
## Wd Sdng: 391 Stone : 249 Slab : 49 No :1904 LwQ:154
## Plywood: 270 Stone : 11 NoA: 82 NoA: 79
## CmentBd: 126 Wood : 5 Rec:288
## (Other): 264 Unf:851
## BsmtFinType2 Heating Electrical Functional GarageType
## ALQ: 52 Floor: 1 FuseA: 188 Maj1: 19 2Types : 23
## BLQ: 68 GasA :2874 FuseF: 50 Maj2: 9 Attchd :1723
## GLQ: 34 GasW : 27 FuseP: 8 Min1: 65 Basment: 36
## LwQ: 87 Grav : 9 Mix : 1 Min2: 70 BuiltIn: 186
## NoA: 80 OthW : 2 SBrkr:2672 Mod : 35 CarPort: 15
## Rec: 105 Wall : 6 Sev : 2 Detchd : 779
## Unf:2493 Typ :2719 NoA : 157
## GarageFinish PavedDrive Fence MiscFeature SaleType
## Fin: 719 N: 216 GdPrv: 118 Gar2: 5 WD :2526
## NoA: 159 P: 62 GdWo : 112 NoA :2814 New : 239
## RFn: 811 Y:2641 MnPrv: 329 Othr: 4 COD : 87
## Unf:1230 MnWw : 12 Shed: 95 ConLD : 26
## NoA :2348 TenC: 1 CWD : 12
## ConLI : 9
## (Other): 20
## SaleCondition
## Abnorml: 190
## AdjLand: 12
## Alloca : 24
## Family : 46
## Normal :2402
## Partial: 245
##
## coded col MSZoning ; 5 levels
## coded col Alley ; 3 levels
## coded col LotShape ; 4 levels
## coded col LandContour ; 4 levels
## coded col LotConfig ; 5 levels
## coded col LandSlope ; 3 levels
## coded col Neighborhood ; 25 levels
## coded col Condition1 ; 9 levels
## coded col Condition2 ; 8 levels
## coded col BldgType ; 5 levels
## coded col HouseStyle ; 8 levels
## coded col RoofStyle ; 6 levels
## coded col RoofMatl ; 8 levels
## coded col Exterior1st ; 15 levels
## coded col Exterior2nd ; 16 levels
## coded col MasVnrType ; 4 levels
## coded col Foundation ; 6 levels
## coded col BsmtExposure ; 5 levels
## coded col BsmtFinType1 ; 7 levels
## coded col BsmtFinType2 ; 7 levels
## coded col Heating ; 6 levels
## coded col Electrical ; 5 levels
## coded col Functional ; 7 levels
## coded col GarageType ; 7 levels
## coded col GarageFinish ; 4 levels
## coded col PavedDrive ; 3 levels
## coded col Fence ; 5 levels
## coded col MiscFeature ; 5 levels
## coded col SaleType ; 9 levels
## coded col SaleCondition ; 6 levels
## MSZoning_C (all) MSZoning_FV MSZoning_RH
## 25 139 26
## MSZoning_RL MSZoning_RM Alley_Grvl
## 2269 460 120
## Alley_NoA Alley_Pave LotShape_IR1
## 2721 78 968
## LotShape_IR2 LotShape_IR3 LotShape_Reg
## 76 16 1859
## LandContour_Bnk LandContour_HLS LandContour_Low
## 117 120 60
## LandContour_Lvl LotConfig_Corner LotConfig_CulDSac
## 2622 511 176
## LotConfig_FR2 LotConfig_FR3 LotConfig_Inside
## 85 14 2133
## LandSlope_Gtl LandSlope_Mod LandSlope_Sev
## 2778 125 16
## Neighborhood_Blmngtn Neighborhood_Blueste Neighborhood_BrDale
## 28 10 30
## Neighborhood_BrkSide Neighborhood_ClearCr Neighborhood_CollgCr
## 108 44 267
## Neighborhood_Crawfor Neighborhood_Edwards Neighborhood_Gilbert
## 103 194 165
## Neighborhood_IDOTRR Neighborhood_MeadowV Neighborhood_Mitchel
## 93 37 114
## Neighborhood_NAmes Neighborhood_NoRidge Neighborhood_NPkVill
## 443 71 23
## Neighborhood_NridgHt Neighborhood_NWAmes Neighborhood_OldTown
## 166 131 239
## Neighborhood_Sawyer Neighborhood_SawyerW Neighborhood_Somerst
## 151 125 182
## Neighborhood_StoneBr Neighborhood_SWISU Neighborhood_Timber
## 51 48 72
## Neighborhood_Veenker Condition1_Artery Condition1_Feedr
## 24 92 164
## Condition1_Norm Condition1_PosA Condition1_PosN
## 2511 20 39
## Condition1_RRAe Condition1_RRAn Condition1_RRNe
## 28 50 6
## Condition1_RRNn Condition2_Artery Condition2_Feedr
## 9 5 13
## Condition2_Norm Condition2_PosA Condition2_PosN
## 2889 4 4
## Condition2_RRAe Condition2_RRAn Condition2_RRNn
## 1 1 2
## BldgType_1Fam BldgType_2fmCon BldgType_Duplex
## 2425 62 109
## BldgType_Twnhs BldgType_TwnhsE HouseStyle_1.5Fin
## 96 227 314
## HouseStyle_1.5Unf HouseStyle_1Story HouseStyle_2.5Fin
## 19 1471 8
## HouseStyle_2.5Unf HouseStyle_2Story HouseStyle_SFoyer
## 24 872 83
## HouseStyle_SLvl RoofStyle_Flat RoofStyle_Gable
## 128 20 2310
## RoofStyle_Gambrel RoofStyle_Hip RoofStyle_Mansard
## 22 551 11
## RoofStyle_Shed RoofMatl_ClyTile RoofMatl_CompShg
## 5 1 2876
## RoofMatl_Membran RoofMatl_Metal RoofMatl_Roll
## 1 1 1
## RoofMatl_Tar&Grv RoofMatl_WdShake RoofMatl_WdShngl
## 23 9 7
## Exterior1st_AsbShng Exterior1st_AsphShn Exterior1st_BrkComm
## 44 2 6
## Exterior1st_BrkFace Exterior1st_CBlock Exterior1st_CemntBd
## 87 2 126
## Exterior1st_HdBoard Exterior1st_ImStucc Exterior1st_MetalSd
## 442 1 450
## Exterior1st_Plywood Exterior1st_Stone Exterior1st_Stucco
## 221 2 43
## Exterior1st_VinylSd Exterior1st_Wd Sdng Exterior1st_WdShing
## 1026 411 56
## Exterior2nd_AsbShng Exterior2nd_AsphShn Exterior2nd_Brk Cmn
## 38 4 22
## Exterior2nd_BrkFace Exterior2nd_CBlock Exterior2nd_CmentBd
## 47 3 126
## Exterior2nd_HdBoard Exterior2nd_ImStucc Exterior2nd_MetalSd
## 406 15 447
## Exterior2nd_Other Exterior2nd_Plywood Exterior2nd_Stone
## 1 270 6
## Exterior2nd_Stucco Exterior2nd_VinylSd Exterior2nd_Wd Sdng
## 47 1015 391
## Exterior2nd_Wd Shng MasVnrType_BrkCmn MasVnrType_BrkFace
## 81 25 879
## MasVnrType_None MasVnrType_Stone Foundation_BrkTil
## 1766 249 311
## Foundation_CBlock Foundation_PConc Foundation_Slab
## 1235 1308 49
## Foundation_Stone Foundation_Wood BsmtExposure_Av
## 11 5 418
## BsmtExposure_Gd BsmtExposure_Mn BsmtExposure_No
## 276 239 1904
## BsmtExposure_NoA BsmtFinType1_ALQ BsmtFinType1_BLQ
## 82 429 269
## BsmtFinType1_GLQ BsmtFinType1_LwQ BsmtFinType1_NoA
## 849 154 79
## BsmtFinType1_Rec BsmtFinType1_Unf BsmtFinType2_ALQ
## 288 851 52
## BsmtFinType2_BLQ BsmtFinType2_GLQ BsmtFinType2_LwQ
## 68 34 87
## BsmtFinType2_NoA BsmtFinType2_Rec BsmtFinType2_Unf
## 80 105 2493
## Heating_Floor Heating_GasA Heating_GasW
## 1 2874 27
## Heating_Grav Heating_OthW Heating_Wall
## 9 2 6
## Electrical_FuseA Electrical_FuseF Electrical_FuseP
## 188 50 8
## Electrical_Mix Electrical_SBrkr Functional_Maj1
## 1 2672 19
## Functional_Maj2 Functional_Min1 Functional_Min2
## 9 65 70
## Functional_Mod Functional_Sev Functional_Typ
## 35 2 2719
## GarageType_2Types GarageType_Attchd GarageType_Basment
## 23 1723 36
## GarageType_BuiltIn GarageType_CarPort GarageType_Detchd
## 186 15 779
## GarageType_NoA GarageFinish_Fin GarageFinish_NoA
## 157 719 159
## GarageFinish_RFn GarageFinish_Unf PavedDrive_N
## 811 1230 216
## PavedDrive_P PavedDrive_Y Fence_GdPrv
## 62 2641 118
## Fence_GdWo Fence_MnPrv Fence_MnWw
## 112 329 12
## Fence_NoA MiscFeature_Gar2 MiscFeature_NoA
## 2348 5 2814
## MiscFeature_Othr MiscFeature_Shed MiscFeature_TenC
## 4 95 1
## SaleType_COD SaleType_Con SaleType_ConLD
## 87 5 26
## SaleType_ConLI SaleType_ConLw SaleType_CWD
## 9 8 12
## SaleType_New SaleType_Oth SaleType_WD
## 239 7 2526
## SaleCondition_Abnorml SaleCondition_AdjLand SaleCondition_Alloca
## 190 12 24
## SaleCondition_Family SaleCondition_Normal SaleCondition_Partial
## 46 2402 245
df[, i] <- NULL
df <- cbind(df, df0)
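One-hot encoding turns each level of a column into its own 0/1 column. The sketch below builds such columns in base R with names of the form oldname_level, the pattern seen in the output above. The helper onehot() is hypothetical and is not the package's implementation.

```r
# Hypothetical helper: one 0/1 column per level, named oldname_level.
onehot <- function(x, name) {
  x <- as.factor(x)
  out <- sapply(levels(x), function(lv) as.integer(x == lv))
  colnames(out) <- paste(name, levels(x), sep = "_")
  as.data.frame(out)
}

alley <- factor(c("Grvl", "NoA", "Pave", "NoA"))
alley_onehot <- onehot(alley, "Alley")
alley_onehot
```

Every row has exactly one 1 across the new columns, so the column sums reproduce the level counts of the original factor.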
Function partition_random
partition_random() partitions a dataset randomly.
# partition the dataset
df0 <- partition_random(df[!is.na(df$SalePrice),], train = 8, test = FALSE)
## Train: 80%, Validation: 20%, Test: 0%
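An 80/20 random split like the one reported above can be sketched in base R with sample(). The data frame toy and the seed are made up for the example, and the sketch makes no claim about the form of the object that partition_random() returns.

```r
set.seed(42)  # arbitrary seed, only to make the sketch reproducible

# Hypothetical data frame with 10 rows.
toy <- data.frame(id = 1:10, y = rnorm(10))

# Draw 80% of the row indices at random for training; the rest is validation.
train_rows <- sample(nrow(toy), size = round(0.8 * nrow(toy)))
train_set <- toy[train_rows, ]
valid_set <- toy[-train_rows, ]

c(train = nrow(train_set), validation = nrow(valid_set))
```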
Let’s check the log file at the end.
cat(paste(readLines('log.txt'), collapse = '\n'))
## Columns Imputed by Mode:
## MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConfig, LandSlope, Neighborhood, Condition1, Condition2, BldgType, HouseStyle, RoofStyle, RoofMatl, Exterior1st, Exterior2nd, MasVnrType, ExterQual, ExterCond, BsmtQual, BsmtCond, HeatingQC, KitchenQual, FireplaceQu, GarageQual, GarageCond, PoolQC, Foundation, BsmtExposure, BsmtFinType1, BsmtFinType2, Heating, CentralAir, Electrical, Functional, GarageType, GarageFinish, PavedDrive, Fence, MiscFeature, SaleType, SaleCondition
##
## Columns Imputed by Median:
## Id, MSSubClass, LotFrontage, LotArea, OverallQual, OverallCond, YearBuilt, YearRemodAdd, MasVnrArea, BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, X1stFlrSF, X2ndFlrSF, LowQualFinSF, GrLivArea, BsmtFullBath, BsmtHalfBath, FullBath, HalfBath, BedroomAbvGr, KitchenAbvGr, TotRmsAbvGrd, Fireplaces, GarageYrBlt, GarageCars, GarageArea, WoodDeckSF, OpenPorchSF, EnclosedPorch, X3SsnPorch, ScreenPorch, PoolArea, MiscVal, MoSold, YrSold
##
## Columns:
## ExterQual, ExterCond, BsmtQual, BsmtCond, HeatingQC, KitchenQual, FireplaceQu, GarageQual, GarageCond, PoolQC
## Scheme:
## NoA Po Fa TA Gd Ex
## 0 1 2 3 4 5
##
## Columns:
## Utilities
## Scheme:
## ELO NoSeWa NoSewr AllPub
## 0 1 2 3 4
##
## Columns:
## Street, CentralAir
## Scheme:
## Grvl Pave
## 0 1
##
## Columns:
## Street, CentralAir
## Scheme:
## N Y
## 0 1
##
## Columns Encoded by Onehot:
## MSZoning, Alley, LotShape, LandContour, LotConfig, LandSlope, Neighborhood, Condition1, Condition2, BldgType, HouseStyle, RoofStyle, RoofMatl, Exterior1st, Exterior2nd, MasVnrType, Foundation, BsmtExposure, BsmtFinType1, BsmtFinType2, Heating, Electrical, Functional, GarageType, GarageFinish, PavedDrive, Fence, MiscFeature, SaleType, SaleCondition
##
## Details:
## Template of New Column Names: oldname_level; Dropping 1st Level: FALSE
##
## Columns Partitioned by Random:
## Partition
##
## Details:
## Train: 80%, Validation: 20%, Test: 0%
=== end ===