I share here the full code and output for an exploratory data visualization in R Markdown submitted for a contest in January 2017.

The code includes data exploration, feature generation / simplification, summary statistics, plotting distributions of predictors, assessment of correlations, and initial exploration of predictive modeling using random forests.


In January 2017, I had the pleasure of attending the 2017 ComputeFest, hosted by the Institute for Applied Computational Science at Harvard University. The skill-building workshops were excellent, including introductions to a number of topics, e.g.:

  • Data Science in Python
  • Introduction to MATLAB
  • Data Science in Python
  • Introduction to R, R Programming for Data Analysis
  • Deep Learning for Image Segmentation with TensorFlow

There was also a contest in exploratory data visualization hosted by Prodigy Finance. From the contest instructions:

The idea here is to do exploratory visualization in preparation for data analysis. The dataset we will use is from the UCI machine learning repository: https://archive.ics.uci.edu/ml/datasets/Covertype. The dataset contains cartographic variables that might be useful in predicting the kind of tree cover. Your visualizations should be useful to an analyst in creating machine learning models to analyze this dataset: you are looking for correlations between the “features” and the kind of tree cover.

My wife and I decided to use this as an opportunity to practice our data exploration skills in R. It was quite a suprise when our submission was chosen as the winner – the prize was two Amazon Echos!

I share below our submission. Because it was produced using RMarkdown, it was a simple copy / paste to have it built with blogdown / Hugo – the only non-functional part is the code-folding. There are solutions posted for implementing code-folding in blogdown, but hopefully it will be incorporated in future versions of blogdown.

Without further ado:

Introduction

Exploratory visualization in preparation for data analysis, of the Covertype Data Set from the UCI machine learning repository with cartographic features for predicting the classes of tree cover.

Goals:

  • look for correlations between the features and the kind of tree cover
  • label axes clearly and caption and legend your images
  • report which puts together these visualizations with some description in PDF or HTML format

As this is only for practice, a test set is not being set aside, as would be normal practice prior to exploratory data analysis in preparation for machine learning.

library(tidyverse)
library(corrplot)
library(randomForest)
library(knitr)
library(rgl) # for 3D scatterplot
knit_hooks$set(webgl = hook_webgl)
knitr::opts_chunk$set(cache=TRUE)
# setupKnitr()

Dataset loading

The dataset consists of 581,012 rows with the target (forest cover type class) and 54 predictors consisting of 10 continuous variables and 44 factors. The 44 factors actually only represent one of four wilderness areas and one of forty soil types. The forty soil types, in-turn, specify a climatic zone and a geologic zone.

# d <- read.csv(file = 'covtype.data', header = FALSE)
# save(d, file = 'data.Rda')
load('data.Rda')

# Name                                     Data Type    Measurement                       Description
# Elevation                               quantitative    meters                       Elevation in meters
# Aspect                                  quantitative    azimuth                      Aspect in degrees azimuth
# Slope                                   quantitative    degrees                      Slope in degrees
# Horizontal_Distance_To_Hydrology        quantitative    meters                       Horz Dist to nearest surface water features
# Vertical_Distance_To_Hydrology          quantitative    meters                       Vert Dist to nearest surface water features
# Horizontal_Distance_To_Roadways         quantitative    meters                       Horz Dist to nearest roadway
# Hillshade_9am                           quantitative    0 to 255 index               Hillshade index at 9am, summer solstice
# Hillshade_Noon                          quantitative    0 to 255 index               Hillshade index at noon, summer soltice
# Hillshade_3pm                           quantitative    0 to 255 index               Hillshade index at 3pm, summer solstice
# Horizontal_Distance_To_Fire_Points      quantitative    meters                       Horz Dist to nearest wildfire ignition points
# Wilderness_Area (4 binary columns)      qualitative     0 (absence) or 1 (presence)  Wilderness area designation
# Soil_Type (40 binary columns)           qualitative     0 (absence) or 1 (presence)  Soil Type designation
# Cover_Type (7 types)                    integer         1 to 7                       Forest Cover Type designation

names(d) <- c(
    'elevation','aspect','slope','hydro_h','hydro_v','road_h','shade_09','shade_12','shade_15','fire_h',
    'wilderness_1','wilderness_2','wilderness_3','wilderness_4',
    paste0(c('soil_'), c(1:40)),
    'cover'
  )

Dataset pre-processing

The data preprocessing extracts the wilderness areas, soil types, climatic zones, and geologic zones, and replaces the category codes for with human-readable labels.

cover_types <- c('Spruce/Fir','Lodgepole Pine','Ponderosa Pine','Cottonwood/Willow','Aspen','Douglas-fir','Krummholz')
wilderness_areas <- c('Rawah', 'Neota', 'Comanche Peak', 'Cache la Poudre')

climate_zone <- c(2,2,2,2,2,2,3,3,4,4,4,4,4,5,5,6,6,6,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,8,8,8,8,8,8)
climatic_zones <- c('lower montane dry','lower montane','montane dry','montane','montane dry and montane','montane and subalpine','subalpine','alpine')
climate_zone <- climatic_zones[climate_zone]

geologic_zone <- c(7,7,7,7,7,7,5,5,2,7,7,7,7,1,1,1,1,7,1,1,1,2,2,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7)
geologic_zones <- c('alluvium','glacial','shale','sandstone','mixed sedimentary','unspecified','igneous and metamorphic','volcanic')
geologic_zone <- geologic_zones[geologic_zone]

d <- d %>%
  mutate(
    id = row_number(),
    cover = cover_types[cover]
  )
## Warning: package 'bindrcpp' was built under R version 3.4.4
d_soil <- d %>%
  select(id, starts_with('soil_')) %>%
  gather(key = 'soil_type', value = 'value', -id) %>%
  filter(value != 0) %>%
  select(-value) %>%
  mutate(
    soil_type = sub('soil_', '', soil_type) %>% as.numeric(),
    climate = climate_zone[soil_type],
    climate = factor(climate,
      levels = c('alpine','subalpine', 'montane and subalpine',
        'montane dry and montane', 'montane', 'montane dry', 'lower montane')),
    geologic = geologic_zone[soil_type]
  
  )

d_wilderness <- d %>%
  select(id, starts_with('wilderness_')) %>%
  gather(key = 'wilderness_type', value = 'value', -id) %>%
  filter(value != 0) %>%
  select(-value) %>%
  mutate(
    wilderness_type = sub('wilderness_', '', wilderness_type) %>% as.numeric(),
    wilderness_type = wilderness_areas[wilderness_type]
  )

d <- d[, c(1:10, 55, 56)] %>%
  inner_join(d_soil, by = 'id') %>%
  inner_join(d_wilderness, by = 'id') %>%
  mutate_each(funs(factor), 11:16) %>%
  select(-id)
## `mutate_each()` is deprecated.
## Use `mutate_all()`, `mutate_at()` or `mutate_if()` instead.
## To map `funs` over a selection of variables, use `mutate_at()`

Initial univariate data exploration

Summary statistics

summary(d)
   elevation        aspect          slope         hydro_h      
 Min.   :1859   Min.   :  0.0   Min.   : 0.0   Min.   :   0.0  
 1st Qu.:2809   1st Qu.: 58.0   1st Qu.: 9.0   1st Qu.: 108.0  
 Median :2996   Median :127.0   Median :13.0   Median : 218.0  
 Mean   :2959   Mean   :155.7   Mean   :14.1   Mean   : 269.4  
 3rd Qu.:3163   3rd Qu.:260.0   3rd Qu.:18.0   3rd Qu.: 384.0  
 Max.   :3858   Max.   :360.0   Max.   :66.0   Max.   :1397.0  
                                                               
    hydro_v            road_h        shade_09        shade_12    
 Min.   :-173.00   Min.   :   0   Min.   :  0.0   Min.   :  0.0  
 1st Qu.:   7.00   1st Qu.:1106   1st Qu.:198.0   1st Qu.:213.0  
 Median :  30.00   Median :1997   Median :218.0   Median :226.0  
 Mean   :  46.42   Mean   :2350   Mean   :212.1   Mean   :223.3  
 3rd Qu.:  69.00   3rd Qu.:3328   3rd Qu.:231.0   3rd Qu.:237.0  
 Max.   : 601.00   Max.   :7117   Max.   :254.0   Max.   :254.0  
                                                                 
    shade_15         fire_h                   cover       
 Min.   :  0.0   Min.   :   0   Aspen            :  9493  
 1st Qu.:119.0   1st Qu.:1024   Cottonwood/Willow:  2747  
 Median :143.0   Median :1710   Douglas-fir      : 17367  
 Mean   :142.5   Mean   :1980   Krummholz        : 20510  
 3rd Qu.:168.0   3rd Qu.:2550   Lodgepole Pine   :283301  
 Max.   :254.0   Max.   :7173   Ponderosa Pine   : 35754  
                                Spruce/Fir       :211840  
   soil_type                         climate      
 29     :115247   alpine                 : 40437  
 23     : 57752   subalpine              :401983  
 32     : 52519   montane and subalpine  :  8166  
 33     : 45154   montane dry and montane:   602  
 22     : 33373   montane                : 93593  
 10     : 32634   montane dry            :   284  
 (Other):244333   lower montane          : 35947  
                    geologic             wilderness_type  
 alluvium               : 20987   Cache la Poudre: 36968  
 glacial                : 92272   Comanche Peak  :253364  
 igneous and metamorphic:467469   Neota          : 29884  
 mixed sedimentary      :   284   Rawah          :260796  
                                                          
                                                          
                                                          

Distribution of the 10 continuous predictor variables

A plot of the distributions of all 10 continuous predictors (on different scales) confirms that each predictor has adequate variance.

d[, 1:10] %>%
  gather(key = measure, value = value) %>%
  ggplot(aes(value)) +
  geom_histogram(fill = 'steelblue3', bins = 30) +
  facet_wrap(~measure, scales = 'free', ncol = 3) +
  ylab('') +
  theme_bw() +
  theme(
    axis.text.y = element_blank(),
    axis.ticks = element_blank(),
    axis.title.x = element_blank()
  )

Marked class imbalance of prediction target

A histogram shows marked class imbalance of the seven forest cover categories. This is important to be aware of because of its potential to affect classifier training later on.

g <- d %>%
  ggplot(aes(cover, color = cover, fill = cover)) +
  geom_bar(alpha = 0.5) +
  labs(title = 'Frequency of Forest Cover Types in the Dataset', y = 'Frequency') +
  theme_bw() +
  guides(fill = FALSE, color = FALSE) +
  theme(
    axis.text.x = element_text(angle=45, hjust = 1),
    axis.title.x = element_blank()
  )
g

Multivariate data exploration

Assessment of correlation between numeric predictors

A correlation matrix of the 10 continuous predictor features shows some degree of correlation, both positive and negative, between some pairs of predictors, but overall there does not appear to be an excessive amount.

c <- cor(d[, 1:10])
corrplot(c, order = 'hclust', tl.cex = 0.75)

Pairwise analysis of continuous variables by balanced sampling

The following plot shows pairwise comparisons of each of the 10 continuous predictors with 30 randomly selected points from each of the 7 types of forest cover (color-coded). Selecting 30 at random from each of the forest covers counteracts the severe class imbalance, making it easier to see clusters of forest cover.

The plot works better on a large display, but is quite effective at demonstrating interesting associations between predictors as well as clustering of forest cover type as related to pairs of predictors.

Note the non-linear correlation between aspect or slope and the three shade-related features, which makes sense given the relationship between direction to sun + generation of shade by slope. Some of the U-shaped associations between the predictors are easily appreciated here, while not being apparent in the correlation matrix above.

# library(scales)
# show_col(hue_pal()(7)) # show the default ggplot2 color palette for 7 colors

d2 <- d[, c(1:11)] %>% group_by(cover) %>% sample_n(size = 30) # sample equal number from each class?a
d2[,c(1:10)] %>%
  pairs(
    main = 'Pairs plot of continuous predictors and forest cover type',
    pch = 21, cex = 0.55,
    bg = c('#F8766D','#C49A00','#53B400','#00C094','#00B6EB','#A58AFF','#FB61D7')[unclass(d2$cover)],
    col = alpha('#FFFFFF', 0.0), upper.panel = NULL)

Correlations of target with predictors

Variation in distribution of forest cover by each predictor

A density plot of each of the 7 classes of forest cover was examined for each of the 10 numeric predictors.

Some interesting features are immediately apparent; for example, in the panel for elevation, one can see that cottonwoods are predominantly found at lower elevations, aspens in the mid-range, and krummholz at higher elevations.

d[, c(1:11)] %>%
  gather(key = measure, value = value, -cover) %>%
  ggplot(aes(value, color = cover, fill = cover)) +
  geom_density(alpha = 0.2) + theme_bw() +
  facet_wrap(~measure, scales = 'free', ncol = 3) +
  labs(title = 'Distribution of Forest Cover by Continuous Numeric Predictors') +
  ylab('') +
  theme(
    axis.text.y = element_blank(),
    axis.ticks = element_blank(),
    axis.title.x = element_blank(),
    legend.title = element_blank()
    # legend.position = c(0.9, 0.1),  legend.justification = c("right", "bottom"),
    # legend.direction = "horizontal"
    )

Elevation

d %>%
  ggplot(aes(elevation, fill = cover, color = cover)) +
  geom_density(alpha = 0.3) +
  labs(
    title = 'Distribution of Forest Cover by Elevation',
    x = 'Elevation', y = 'Density'
    ) +
  theme_bw() +
  theme(
    legend.title = element_blank(),
    axis.text.y = element_blank()
    )

Violin plot of distribution of cover types by numeric predictor

To reduce the effect of the overlapping density distributions, violin plots are another way to visualize the relationship between each type of forest cover and the continuous variables.

Elevation shows significantly more distinct profiles for the different forest cover types than any other single predictor.

d[, c(1:11)] %>%
  gather(key = measure, value = value, -cover) %>%
  ggplot(aes(cover, value, color = cover)) +
  geom_violin(aes(fill = cover), alpha = 0.5) +
#  geom_boxplot(width = 0.15) +
  facet_wrap(~measure, scales = 'free', ncol = 3) +
  labs(
    title = 'Distribution of forest cover by continuous numeric predictors',
    x = '', y = '') +
  theme_bw() +
  theme(
    axis.text.y = element_blank(), axis.text.x = element_blank(),
    axis.ticks = element_blank(), legend.title = element_blank())

Density plot of cover by pairs of predictors

Visualizing forest cover types based on the density plot of two predictors shows better separation between forest cover types than with single predictors.

For the sake of space, only the density of forest cover with respect to elevation and fire point distance are shown, although other pairs of predictors were also assessed.

d %>% ggplot(aes(elevation, fire_h, color = cover)) +
  geom_density_2d() +
  facet_wrap(~cover) +
  labs(
    title = '2D-density of forest cover type by elevation and fire point distance',
    x = 'Elevation', y = 'Horizontal distance to fire points'
  ) +
  theme_bw() +
  guides(color=FALSE)

Interactive 3D scatter plot of forest cover by 3 predictors

Clustering of forest types can be appreciated upon rotation of the 3D scatterplot by 3 predictor features – elevation, aspect, and fire point distance. Again, to deal with the forest cover class imbalance, 60 of each forest class were randomly sampled from the dataset to ensure all classes are represented.

d2 <- d[, c(1:11)] %>% group_by(cover) %>% sample_n(size = 60)

with(d2,
  plot3d(
  elevation, aspect, fire_h, 
  type="s", size = 1.2,
  col = c('#F8766D','#C49A00','#53B400','#00C094','#00B6EB','#A58AFF','#FB61D7')[unclass(cover)]
))
rglwidget()

Absolute counts of forest cover types by soil type

The soil types representation is very unbalanced within the data set, as shown below. Because of this, the subsequent plots will show the proportions of each type of forest cover, within particular predictor classes.

d %>% ggplot(aes(soil_type, fill = cover, color = cover)) +
  geom_bar(alpha = 0.5) +
  theme_bw() +
  labs(
    title = 'Counts of forest cover type by soil type',
    x = 'Soil type', y = 'Counts'
    ) +
  theme(
    panel.grid = element_blank(),
    axis.text.x = element_text(size = 8),
    legend.title = element_blank(),
    axis.text.y = element_blank()
    )

Proportion of forest cover types by soil type

d %>% ggplot(aes(soil_type, fill = cover, color = cover)) +
  geom_bar(alpha = 0.5, position = 'fill') + 
  theme_bw() +
  labs(
    title = 'Proportion of forest cover type by soil type',
    x = 'Soil type', y = 'Proportion'
    ) +
  theme(
    panel.grid = element_blank(),
    axis.text.x = element_text(size = 8),
    legend.title = element_blank(),
    axis.text.y = element_blank()
    )

Forest cover by climatic zone

d %>% ggplot(aes(climate, fill = cover, color = cover)) +
  geom_bar(position = 'fill', alpha = 0.5) +
  labs(title = 'Proportion of Forest Cover by Climatic Zone', y = 'Proportion') +
  theme_bw() +
  theme(panel.grid = element_blank(), axis.text.x = element_text(angle=20, hjust = 1),
    legend.title = element_blank(), axis.title.x = element_blank())

Forest cover by geologic zone

d %>% ggplot(aes(geologic, fill = cover, color = cover)) +
  geom_bar(position = 'fill', alpha = 0.5) +
  labs(title = 'Proportion of Forest Cover by Geologic Zone', y = 'Proportion') +
  theme_bw() +
  theme(
    panel.grid = element_blank(),
    axis.text.x = element_text(angle=20, hjust = 1),
    axis.title.x = element_blank(),
    legend.title = element_blank()
  )

Forest cover by wilderness area

d %>% ggplot(aes(wilderness_type, fill = cover, color = cover)) +
  geom_bar(position = 'fill', alpha = 0.5) +
  labs(title = 'Proportion of Forest Cover by Wilderness Area', y = 'Proportion') +
  theme_bw() +
  theme(
    panel.grid = element_blank(),
    axis.title.x = element_blank(),
    legend.title = element_blank()
  )

Proportion of forest cover type by climatic zone, geologic zone, and wilderness area

The following plot demonstrates the interactions between the target category and three different categorical predictors.

d %>% ggplot(aes(wilderness_type, fill = cover, color = cover)) +
  geom_bar(position = 'fill', alpha = 0.5) +
  facet_grid(climate ~ geologic) +
  theme_bw() +
  labs(
    title = 'Proportion of forest cover type by climatic, geologic zone, and wilderness area',
    y = 'Climatic Zone and Forest Cover', x = 'Geologic Zone and Wilderness Area'
    ) +
  theme(
    panel.grid = element_blank(),
    axis.text.x = element_text(angle=55, hjust = 1),
    legend.title = element_blank(),
    # axis.title.x = element_blank(),
    axis.text.y = element_blank(),
    # axis.text.x = element_blank(),
    axis.ticks = element_blank()
    )

Initial exploration of predictive modeling

Just for fun: Random Forests for Forest Cover Prediction

(It just seems right to use this particular model to predict this particular target, right?)

The data dictionary reports classification performance from ~1998, using the first 15,120 records used for training and validation, and the remaining 565,892 records for model performance testing, with the following classification accuracy:

  • 70% Neural Network (backpropagation)
  • 58% Linear Discriminant Analysis

Without doing proper cross-validation for determination of optimal hyperparameters, the first 15,120 records were used for training a Random Forest classifier.

A training set out-of-bag accuracy of 83.5% was obtained.

train <- d[c(1:15120), ]
test <- d[-c(1:15120), ]
rfmodel <- randomForest(cover ~ ., data = train, importance = TRUE, ntree = 300)
# save(rfmodel, file = 'rfmodel.Rda')
# load('rfmodel.Rda')
rfmodel

Call:
 randomForest(formula = cover ~ ., data = train, importance = TRUE,      ntree = 300) 
               Type of random forest: classification
                     Number of trees: 300
No. of variables tried at each split: 3

        OOB estimate of  error rate: 13.08%
Confusion matrix:
                  Aspen Cottonwood/Willow Douglas-fir Krummholz
Aspen              2057                 0          23         0
Cottonwood/Willow     0              2101          22         0
Douglas-fir           9                50        1912         0
Krummholz             0                 0           0      2087
Lodgepole Pine      146                 0          46        16
Ponderosa Pine       21                92         237         0
Spruce/Fir           41                 0           4       129
                  Lodgepole Pine Ponderosa Pine Spruce/Fir class.error
Aspen                         46             29          5  0.04768519
Cottonwood/Willow              0             37          0  0.02731481
Douglas-fir                   13            176          0  0.11481481
Krummholz                      4              0         69  0.03379630
Lodgepole Pine              1534             53        365  0.28981481
Ponderosa Pine                 9           1801          0  0.16620370
Spruce/Fir                   334              2       1650  0.23611111

Variable importance in random forest model

importance(rfmodel) %>%
  as.data.frame %>%
  cbind(list(names = row.names(importance(rfmodel)))) %>%
  select(c(10,8,9)) %>%
  arrange(desc(MeanDecreaseAccuracy))
             names MeanDecreaseAccuracy MeanDecreaseGini
1          hydro_h             85.03970         774.2142
2        elevation             83.08764        2967.1819
3         shade_12             68.12083         516.7484
4           fire_h             60.03840         959.7472
5           road_h             58.79357        1162.8486
6        soil_type             57.91295        1544.9187
7         shade_15             57.66940         541.3058
8          hydro_v             55.97477         631.2222
9         shade_09             54.73617         643.6233
10          aspect             48.88866         594.7532
11           slope             45.62555         398.7576
12 wilderness_type             38.61512         749.6550
13         climate             32.70137        1262.6618
14        geologic             22.86959         154.1603

Accuracy on test set

Using the random forest model trained on the first 15,120 samples yielded a test set (remaining 565,892 samples) accuracy of 75.0%, similar to the performance of the neural network from ~1998. Optimizing model training would likely yield substantial improvements.

The difference in classification accuracy from the OOB training set and the test set seems a bit extreme. One possibility is that the ordering of the samples is NOT random, so choosing the first 15,120 samples may not be a representative sample.

pred <- predict(rfmodel, test)
sum(pred == test$cover) / nrow(test)
## [1] 0.7546776