Unit 5 Classification and Regression Trees (CART)

5.1 Introduction to trees - Judge, Jury, and Classifier

## 'data.frame':    566 obs. of  9 variables:
##  $ Docket    : Factor w/ 566 levels "00-1011","00-1045",..: 63 69 70 145 97 181 242 289 334 436 ...
##  $ Term      : int  1994 1994 1994 1994 1995 1995 1996 1997 1997 1999 ...
##  $ Circuit   : Factor w/ 13 levels "10th","11th",..: 4 11 7 3 9 11 13 11 12 2 ...
##  $ Issue     : Factor w/ 11 levels "Attorneys","CivilRights",..: 5 5 5 5 9 5 5 5 5 3 ...
##  $ Petitioner: Factor w/ 12 levels "AMERICAN.INDIAN",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Respondent: Factor w/ 12 levels "AMERICAN.INDIAN",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ LowerCourt: Factor w/ 2 levels "conser","liberal": 2 2 2 1 1 1 1 1 1 1 ...
##  $ Unconst   : int  0 0 0 0 0 1 0 1 0 0 ...
##  $ Reverse   : int  1 1 1 1 1 0 1 1 1 1 ...
...

##    predictCART
##      0  1
##   0 41 36
##   1 22 71

## [1] 0.6927105
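The tree behind the confusion matrix and the value above (consistent with an AUC from the class probabilities) can be sketched with `rpart` and `ROCR`; the filename, seed, split ratio, and `minbucket` value are assumptions taken from the course materials:

```r
library(caTools)  # sample.split
library(rpart)    # CART models
library(ROCR)     # ROC / AUC

stevens <- read.csv("stevens.csv")  # assumed filename

set.seed(3000)
split <- sample.split(stevens$Reverse, SplitRatio = 0.7)
train <- subset(stevens, split == TRUE)
test  <- subset(stevens, split == FALSE)

# Classification tree on the case-level predictors
stevensTree <- rpart(Reverse ~ Circuit + Issue + Petitioner + Respondent +
                       LowerCourt + Unconst,
                     data = train, method = "class", minbucket = 25)

# Confusion matrix of actual vs. predicted outcomes on the test set
predictCART <- predict(stevensTree, newdata = test, type = "class")
table(test$Reverse, predictCART)

# AUC from the class probabilities (second column = P(Reverse = 1))
predictROC <- predict(stevensTree, newdata = test)
pred <- prediction(predictROC[, 2], test$Reverse)
as.numeric(performance(pred, "auc")@y.values)
```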

5.2 The D2Hawkeye Story

We predict the cost bucket a patient fell into in 2009 using a CART model.

## 'data.frame':    458005 obs. of  16 variables:
##  $ age              : int  85 59 67 52 67 68 75 70 67 67 ...
##  $ alzheimers       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ arthritis        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ cancer           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ copd             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ depression       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ diabetes         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ heart.failure    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ihd              : int  0 0 0 0 0 0 0 0 0 0 ...
...
## 
##           1           2           3           4           5 
## 0.671267781 0.190170413 0.089466272 0.043324855 0.005770679

5.2.1 Baseline method and Penalty Matrix

The baseline method will predict that the cost bucket for a patient in 2009 will be the same as it was in 2008.

##    
##          1      2      3      4      5
##   1 110138   7787   3427   1452    174
##   2  16000  10721   4629   2931    559
##   3   7006   4629   2774   1621    360
##   4   2688   1943   1415   1539    352
##   5    293    191    160    309    104
## [1] 0.6838135
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    1    2    3    4
## [2,]    2    0    1    2    3
## [3,]    4    2    0    1    2
## [4,]    6    4    2    0    1
## [5,]    8    6    4    2    0
## [1] 0.7386055
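Both numbers printed above follow directly from the two matrices: the baseline confusion matrix (actual 2009 bucket vs. the 2008 bucket used as the prediction) and the penalty matrix. A self-contained check of the arithmetic:

```r
# Confusion matrix from above: rows = actual 2009 bucket, cols = predicted bucket
cm <- matrix(c(110138,  7787, 3427, 1452, 174,
                16000, 10721, 4629, 2931, 559,
                 7006,  4629, 2774, 1621, 360,
                 2688,  1943, 1415, 1539, 352,
                  293,   191,  160,  309, 104),
             nrow = 5, byrow = TRUE)

# Penalty matrix: under-predicting a bucket costs twice as much as over-predicting
penalty <- matrix(c(0, 1, 2, 3, 4,
                    2, 0, 1, 2, 3,
                    4, 2, 0, 1, 2,
                    6, 4, 2, 0, 1,
                    8, 6, 4, 2, 0),
                  nrow = 5, byrow = TRUE)

sum(diag(cm)) / sum(cm)      # accuracy: 0.6838135
sum(cm * penalty) / sum(cm)  # penalty error: 0.7386055
```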

Now we’ll build a CART model to try to improve on the baseline’s accuracy and penalty error.

##    PredictTest
##          1      2      3      4      5
##   1 114987   7884      0    107      0
##   2  18692  16051      0     97      0
##   3   8188   8120      0     82      0
##   4   3176   4567      0    194      0
##   5    349    671      0     37      0
## [1] 0.7126669
## [1] 0.7591238

The CART model rarely predicts buckets 3, 4, and 5 because the data has few observations in those classes. This can be rectified by adding the penalty matrix to the model.
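`rpart` accepts the penalty matrix as a loss matrix through its `parms` argument. A hedged sketch — the frame and variable names and the `cp` value are assumptions:

```r
library(rpart)

# PenaltyMatrix holds the 5x5 loss matrix shown above
claimsTreePenalty <- rpart(bucket2009 ~ age + alzheimers + arthritis + cancer +
                             copd + depression + diabetes + heart.failure + ihd,
                           data = claimsTrain, method = "class",
                           cp = 0.00005,
                           parms = list(loss = PenaltyMatrix))

predictTest <- predict(claimsTreePenalty, newdata = claimsTest, type = "class")
table(claimsTest$bucket2009, predictTest)
```

With the loss matrix in place, the tree is penalized more for under-predicting a bucket than for over-predicting it, which pushes predictions into the higher buckets.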

##    PredictTest
##         1     2     3     4     5
##   1 93651 26529  2571   227     0
##   2  6913 20324  7049   554     0
##   3  3499  8184  4370   337     0
##   4  1264  3403  2702   568     0
##   5   129   375   411   142     0
## [1] 0.6490813
## [1] 0.6377987

5.3 Recitation - Boston Housing Data (Regression Trees)

## 'data.frame':    506 obs. of  16 variables:
##  $ TOWN   : Factor w/ 92 levels "Arlington","Ashland",..: 54 77 77 46 46 46 69 69 69 69 ...
##  $ TRACT  : int  2011 2021 2022 2031 2032 2033 2041 2042 2043 2044 ...
##  $ LON    : num  -71 -71 -70.9 -70.9 -70.9 ...
##  $ LAT    : num  42.3 42.3 42.3 42.3 42.3 ...
##  $ MEDV   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 22.1 16.5 18.9 ...
##  $ CRIM   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ ZN     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ INDUS  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ CHAS   : int  0 0 0 0 0 0 0 0 0 0 ...
...
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3850  0.4490  0.5380  0.5547  0.6240  0.8710

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   17.02   21.20   22.53   25.00   50.00

5.3.1 Geographical Predictions

## 
## Call:
## lm(formula = MEDV ~ LAT + LON, data = boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.460  -5.590  -1.299   3.695  28.129 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
...

The tree is much better at estimating above-median-value houses, although it is still complex. We’ll try specifying a simpler tree with a minimum bucket size and observe its performance.
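A minimum leaf size is passed straight through to `rpart.control`; a sketch of the simpler latitude/longitude tree, assuming the `boston` frame shown above:

```r
library(rpart)

# Regression tree (numeric response, so rpart defaults to method = "anova");
# minbucket = 50 forces every leaf to contain at least 50 census tracts
latlontree <- rpart(MEDV ~ LAT + LON, data = boston, minbucket = 50)
plot(latlontree)
text(latlontree)
```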

The regression tree has carved out the low-value area in the center, which linear regression is incapable of doing.

Now we’ll use more variables than longitude and latitude to build a more accurate model.

## 
## Call:
## lm(formula = MEDV ~ ., data = train[!(names(train) %in% c("TOWN", 
##     "TRACT"))])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.511  -2.712  -0.676   1.793  36.883 
## 
## Coefficients:
...
## [1] 3037.088
## Warning: Bad 'data' field in model 'call' (expected a data.frame or a matrix).
## To silence this warning:
##     Call prp with roundint=FALSE,
##     or rebuild the rpart model with model=TRUE.

## [1] 4328.988
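The two values above are test-set sums of squared errors, SSE = Σ(ŷᵢ − yᵢ)², for the linear model and the tree; a sketch of the comparison, assuming `train`/`test` frames split from `boston`:

```r
library(rpart)

# Drop the identifier columns, as in the lm call above
drop_ids <- function(d) d[!(names(d) %in% c("TOWN", "TRACT"))]

# Linear regression on all remaining predictors
linreg <- lm(MEDV ~ ., data = drop_ids(train))
linreg.pred <- predict(linreg, newdata = test)
sum((linreg.pred - test$MEDV)^2)  # linear regression SSE

# Regression tree on the same predictors
tree <- rpart(MEDV ~ ., data = drop_ids(train))
tree.pred <- predict(tree, newdata = test)
sum((tree.pred - test$MEDV)^2)    # regression tree SSE
```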

5.4 Assignment

5.4.1 Part 1 - Understanding why people vote

5.4.1.1 Problem 1 - Exploration and Logistic Regression

##       sex              yob           voting         hawthorne    
##  Min.   :0.0000   Min.   :1900   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.0000   1st Qu.:1947   1st Qu.:0.0000   1st Qu.:0.000  
##  Median :0.0000   Median :1956   Median :0.0000   Median :0.000  
##  Mean   :0.4993   Mean   :1956   Mean   :0.3159   Mean   :0.111  
##  3rd Qu.:1.0000   3rd Qu.:1965   3rd Qu.:1.0000   3rd Qu.:0.000  
##  Max.   :1.0000   Max.   :1986   Max.   :1.0000   Max.   :1.000  
##    civicduty        neighbors          self           control      
##  Min.   :0.0000   Min.   :0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.0000  
...
##    
##          0      1
##   0 209191  26197
##   1  96675  12021
##       sex              yob           voting    hawthorne        civicduty     
##  Min.   :0.0000   Min.   :1900   Min.   :1   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:1945   1st Qu.:1   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :1954   Median :1   Median :0.0000   Median :0.0000  
##  Mean   :0.4898   Mean   :1953   Mean   :1   Mean   :0.1133   Mean   :0.1106  
##  3rd Qu.:1.0000   3rd Qu.:1962   3rd Qu.:1   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1986   Max.   :1   Max.   :1.0000   Max.   :1.0000  
##    neighbors           self           control      
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
...
## 
## Call:
## glm(formula = voting ~ hawthorne + civicduty + neighbors + self, 
##     family = "binomial", data = gerber)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9744  -0.8691  -0.8389   1.4586   1.5590  
## 
## Coefficients:
...
##    
##      FALSE   TRUE
##   0 134513 100875
##   1  56730  51966
## [1] 0.5419578

## [1] 0.5308461
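The table, accuracy, and AUC above can be sketched from the fitted logistic model; the 0.3 threshold is an assumption consistent with the accuracy shown:

```r
library(ROCR)

logModel <- glm(voting ~ hawthorne + civicduty + neighbors + self,
                data = gerber, family = "binomial")
predLog <- predict(logModel, type = "response")

# Confusion matrix and accuracy at a 0.3 threshold
table(gerber$voting, predLog > 0.3)
mean((predLog > 0.3) == gerber$voting)

# AUC
pred <- prediction(predLog, gerber$voting)
as.numeric(performance(pred, "auc")@y.values)
```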

5.4.1.3 Problem 3 - Interaction Terms

## [1] 0.043362

## 
## Call:
## glm(formula = voting ~ control + sex, family = "binomial", data = gerber)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9220  -0.9012  -0.8290   1.4564   1.5717  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
...
##         1         2         3         4 
## 0.3462559 0.3024455 0.3337375 0.2908065
## 
## Call:
## glm(formula = voting ~ sex + control + sex:control, family = "binomial", 
##     data = gerber)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9213  -0.9019  -0.8284   1.4573   1.5724  
## 
## Coefficients:
...
##         1         2         3         4 
## 0.3458183 0.3027947 0.3341757 0.2904558
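The four probabilities above correspond to the four sex × control combinations; a sketch of the `predict` call that generates them from the interaction model in the `glm` output above:

```r
interModel <- glm(voting ~ sex + control + sex:control,
                  data = gerber, family = "binomial")

# Rows: man/not control, man/control, woman/not control, woman/control
possibilities <- data.frame(sex = c(0, 0, 1, 1), control = c(0, 1, 0, 1))
predict(interModel, newdata = possibilities, type = "response")
```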

5.4.2 Part 2 - Letter Recognition

5.4.3 Part 3 - Predicting Earnings from Census Data

5.4.3.1 Problem 1 - A Logistic Regression Model

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Call:
## glm(formula = over50k ~ ., family = "binomial", data = censusTrain)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.1065  -0.5037  -0.1804  -0.0008   3.3383  
## 
## Coefficients: (1 not defined because of singularities)
##                                            Estimate Std. Error z value Pr(>|z|)
...
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.01388 0.10074 0.24115 0.39293 1.00000
##         
##          FALSE TRUE
##    <=50K  9051  662
##    >50K   1190 1888
## [1] 0.8552107
## 
##  <=50K   >50K 
##   9713   3078
## [1] 0.7593621
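The baseline accuracy above is just the majority-class frequency in the test set; together with the logistic model’s confusion matrix, both accuracies can be checked by hand:

```r
# Logistic model confusion matrix (0.5 threshold), entries from the table above
(9051 + 1888) / (9051 + 662 + 1190 + 1888)  # model accuracy: 0.8552107

# Baseline that always predicts "<=50K"
9713 / (9713 + 3078)                        # baseline accuracy: 0.7593621
```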

5.4.3.2 Problem 2 - A CART model

##  <=50K   >50K 
##  10725   2066
##         predictCART
##           <=50K  >50K
##    <=50K   9243   470
##    >50K    1482  1596
## [1] 0.8473927
##       <=50K             >50K        
##  Min.   :0.01286   Min.   :0.05099  
##  1st Qu.:0.69728   1st Qu.:0.05099  
##  Median :0.94901   Median :0.05099  
##  Mean   :0.75905   Mean   :0.24095  
##  3rd Qu.:0.94901   3rd Qu.:0.30272  
##  Max.   :0.94901   Max.   :0.98714
##        <=50K       >50K
## 2  0.2794982 0.72050176
## 5  0.2794982 0.72050176
## 7  0.9490143 0.05098572
## 8  0.6972807 0.30271934
## 11 0.6972807 0.30271934
## 12 0.2794982 0.72050176
##         
##          FALSE TRUE
##    <=50K  9243  470
##    >50K   1482 1596
## [1] 0.8470256
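The final value is consistent with an AUC computed from the tree’s class probabilities; a hedged sketch (the frame and model names are illustrative):

```r
library(ROCR)

# Second column of the probability matrix = P(over50k = ">50K")
predProb <- predict(censusTree, newdata = censusTest)[, 2]
pred <- prediction(predProb, censusTest$over50k)
as.numeric(performance(pred, "auc")@y.values)
```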