Executive Summary

This project builds a model that classifies whether participants performed barbell lifts correctly or incorrectly, and then uses the model to predict the manner of each lift in the test data set.

The manner in which each lift was performed is recorded in the variable classe.

Since the main focus of this project is machine learning, the steps for getting and cleaning the data are kept in separate files, which can be found in this directory.

These files are DownloadData.R, LoadData.R and TidyData.R, each of which is sourced in the Prepare Data section.

The data source is listed in the Citation section.

Prepare Data

Download the data to the local machine.

## RUN Once only after you set up your working directory
source("DownloadData.R")

Load the training and testing data into the environment.

source("LoadData.R")

Tidy the data and save the results as tidyTraining and tidyTesting.

source("TidyData.R")

Import Library

library(dplyr)
library(caret)

Data Exploration

head(tidyTraining,3)
##   X user_name raw_timestamp_part_1 raw_timestamp_part_2 cvtd_timestamp
## 1 1  carlitos           1323084231               788290     2011-12-05
## 2 2  carlitos           1323084231               808298     2011-12-05
## 3 3  carlitos           1323084231               820366     2011-12-05
##   new_window num_window roll_belt pitch_belt yaw_belt total_accel_belt
## 1         no         11      1.41       8.07    -94.4                3
## 2         no         11      1.41       8.07    -94.4                3
## 3         no         11      1.42       8.07    -94.4                3
##   kurtosis_roll_belt kurtosis_picth_belt kurtosis_yaw_belt skewness_roll_belt
## 1                 NA                  NA                NA                 NA
## 2                 NA                  NA                NA                 NA
## 3                 NA                  NA                NA                 NA
##   skewness_roll_belt.1 skewness_yaw_belt max_roll_belt max_picth_belt
## 1                   NA                NA            NA             NA
## 2                   NA                NA            NA             NA
## 3                   NA                NA            NA             NA
##   max_yaw_belt min_roll_belt min_pitch_belt min_yaw_belt amplitude_roll_belt
## 1           NA            NA             NA           NA                  NA
## 2           NA            NA             NA           NA                  NA
## 3           NA            NA             NA           NA                  NA
##   amplitude_pitch_belt amplitude_yaw_belt var_total_accel_belt avg_roll_belt
## 1                   NA                 NA                   NA            NA
## 2                   NA                 NA                   NA            NA
## 3                   NA                 NA                   NA            NA
##   stddev_roll_belt var_roll_belt avg_pitch_belt stddev_pitch_belt
## 1               NA            NA             NA                NA
## 2               NA            NA             NA                NA
## 3               NA            NA             NA                NA
##   var_pitch_belt avg_yaw_belt stddev_yaw_belt var_yaw_belt gyros_belt_x
## 1             NA           NA              NA           NA         0.00
## 2             NA           NA              NA           NA         0.02
## 3             NA           NA              NA           NA         0.00
##   gyros_belt_y gyros_belt_z accel_belt_x accel_belt_y accel_belt_z
## 1            0        -0.02          -21            4           22
## 2            0        -0.02          -22            4           22
## 3            0        -0.02          -20            5           23
##   magnet_belt_x magnet_belt_y magnet_belt_z roll_arm pitch_arm yaw_arm
## 1            -3           599          -313     -128      22.5    -161
## 2            -7           608          -311     -128      22.5    -161
## 3            -2           600          -305     -128      22.5    -161
##   total_accel_arm var_accel_arm avg_roll_arm stddev_roll_arm var_roll_arm
## 1              34            NA           NA              NA           NA
## 2              34            NA           NA              NA           NA
## 3              34            NA           NA              NA           NA
##   avg_pitch_arm stddev_pitch_arm var_pitch_arm avg_yaw_arm stddev_yaw_arm
## 1            NA               NA            NA          NA             NA
## 2            NA               NA            NA          NA             NA
## 3            NA               NA            NA          NA             NA
##   var_yaw_arm gyros_arm_x gyros_arm_y gyros_arm_z accel_arm_x accel_arm_y
## 1          NA        0.00        0.00       -0.02        -288         109
## 2          NA        0.02       -0.02       -0.02        -290         110
## 3          NA        0.02       -0.02       -0.02        -289         110
##   accel_arm_z magnet_arm_x magnet_arm_y magnet_arm_z kurtosis_roll_arm
## 1        -123         -368          337          516                NA
## 2        -125         -369          337          513                NA
## 3        -126         -368          344          513                NA
##   kurtosis_picth_arm kurtosis_yaw_arm skewness_roll_arm skewness_pitch_arm
## 1                 NA               NA                NA                 NA
## 2                 NA               NA                NA                 NA
## 3                 NA               NA                NA                 NA
##   skewness_yaw_arm max_roll_arm max_picth_arm max_yaw_arm min_roll_arm
## 1               NA           NA            NA          NA           NA
## 2               NA           NA            NA          NA           NA
## 3               NA           NA            NA          NA           NA
##   min_pitch_arm min_yaw_arm amplitude_roll_arm amplitude_pitch_arm
## 1            NA          NA                 NA                  NA
## 2            NA          NA                 NA                  NA
## 3            NA          NA                 NA                  NA
##   amplitude_yaw_arm roll_dumbbell pitch_dumbbell yaw_dumbbell
## 1                NA      13.05217      -70.49400    -84.87394
## 2                NA      13.13074      -70.63751    -84.71065
## 3                NA      12.85075      -70.27812    -85.14078
##   kurtosis_roll_dumbbell kurtosis_picth_dumbbell kurtosis_yaw_dumbbell
## 1                     NA                      NA                    NA
## 2                     NA                      NA                    NA
## 3                     NA                      NA                    NA
##   skewness_roll_dumbbell skewness_pitch_dumbbell skewness_yaw_dumbbell
## 1                     NA                      NA                    NA
## 2                     NA                      NA                    NA
## 3                     NA                      NA                    NA
##   max_roll_dumbbell max_picth_dumbbell max_yaw_dumbbell min_roll_dumbbell
## 1                NA                 NA               NA                NA
## 2                NA                 NA               NA                NA
## 3                NA                 NA               NA                NA
##   min_pitch_dumbbell min_yaw_dumbbell amplitude_roll_dumbbell
## 1                 NA               NA                      NA
## 2                 NA               NA                      NA
## 3                 NA               NA                      NA
##   amplitude_pitch_dumbbell amplitude_yaw_dumbbell total_accel_dumbbell
## 1                       NA                     NA                   37
## 2                       NA                     NA                   37
## 3                       NA                     NA                   37
##   var_accel_dumbbell avg_roll_dumbbell stddev_roll_dumbbell var_roll_dumbbell
## 1                 NA                NA                   NA                NA
## 2                 NA                NA                   NA                NA
## 3                 NA                NA                   NA                NA
##   avg_pitch_dumbbell stddev_pitch_dumbbell var_pitch_dumbbell avg_yaw_dumbbell
## 1                 NA                    NA                 NA               NA
## 2                 NA                    NA                 NA               NA
## 3                 NA                    NA                 NA               NA
##   stddev_yaw_dumbbell var_yaw_dumbbell gyros_dumbbell_x gyros_dumbbell_y
## 1                  NA               NA                0            -0.02
## 2                  NA               NA                0            -0.02
## 3                  NA               NA                0            -0.02
##   gyros_dumbbell_z accel_dumbbell_x accel_dumbbell_y accel_dumbbell_z
## 1                0             -234               47             -271
## 2                0             -233               47             -269
## 3                0             -232               46             -270
##   magnet_dumbbell_x magnet_dumbbell_y magnet_dumbbell_z roll_forearm
## 1              -559               293               -65         28.4
## 2              -555               296               -64         28.3
## 3              -561               298               -63         28.3
##   pitch_forearm yaw_forearm kurtosis_roll_forearm kurtosis_picth_forearm
## 1         -63.9        -153                    NA                     NA
## 2         -63.9        -153                    NA                     NA
## 3         -63.9        -152                    NA                     NA
##   kurtosis_yaw_forearm skewness_roll_forearm skewness_pitch_forearm
## 1                   NA                    NA                     NA
## 2                   NA                    NA                     NA
## 3                   NA                    NA                     NA
##   skewness_yaw_forearm max_roll_forearm max_picth_forearm max_yaw_forearm
## 1                   NA               NA                NA              NA
## 2                   NA               NA                NA              NA
## 3                   NA               NA                NA              NA
##   min_roll_forearm min_pitch_forearm min_yaw_forearm amplitude_roll_forearm
## 1               NA                NA              NA                     NA
## 2               NA                NA              NA                     NA
## 3               NA                NA              NA                     NA
##   amplitude_pitch_forearm amplitude_yaw_forearm total_accel_forearm
## 1                      NA                    NA                  36
## 2                      NA                    NA                  36
## 3                      NA                    NA                  36
##   var_accel_forearm avg_roll_forearm stddev_roll_forearm var_roll_forearm
## 1                NA               NA                  NA               NA
## 2                NA               NA                  NA               NA
## 3                NA               NA                  NA               NA
##   avg_pitch_forearm stddev_pitch_forearm var_pitch_forearm avg_yaw_forearm
## 1                NA                   NA                NA              NA
## 2                NA                   NA                NA              NA
## 3                NA                   NA                NA              NA
##   stddev_yaw_forearm var_yaw_forearm gyros_forearm_x gyros_forearm_y
## 1                 NA              NA            0.03            0.00
## 2                 NA              NA            0.02            0.00
## 3                 NA              NA            0.03           -0.02
##   gyros_forearm_z accel_forearm_x accel_forearm_y accel_forearm_z
## 1           -0.02             192             203            -215
## 2           -0.02             192             203            -216
## 3            0.00             196             204            -213
##   magnet_forearm_x magnet_forearm_y magnet_forearm_z classe
## 1              -17              654              476      A
## 2              -18              661              473      A
## 3              -18              658              469      A

Feature Selection

Exploring the data shows that column X appears to be a row index and user_name identifies the participant, so neither is a meaningful predictor of classe. We therefore drop them from tidyTraining and save the result as fsTraining.

dropCol <- c("X","user_name")
fsTraining <- tidyTraining %>% select(-one_of(dropCol))

According to Section 5.1 (Feature extraction and selection) of the research paper, features such as the mean, variance, standard deviation, max, min, amplitude, kurtosis and skewness are calculated from the Euler angles of the four sensors. These fields can therefore be dropped safely, since they only summarise the raw measurements.

pattern <- "avg_|var_|stddev_|max_|min_|amplitude_|kurtosis_|skewness_"
fsTraining <- fsTraining %>% select(-matches(pattern, ignore.case = TRUE))
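
As a quick check of what the pattern removes, here is a minimal base R sketch. The vector cols is a hypothetical sample of column names taken from the head() output, and grepl() stands in for the matches() helper used above.

```r
# Hypothetical sample of column names, taken from the head() output
cols <- c("roll_belt", "avg_roll_belt", "var_total_accel_belt",
          "max_picth_belt", "gyros_belt_x", "total_accel_belt", "classe")
pattern <- "avg_|var_|stddev_|max_|min_|amplitude_|kurtosis_|skewness_"

# grepl() flags the summary columns; negating it keeps the raw measurements
kept <- cols[!grepl(pattern, cols, ignore.case = TRUE)]
print(kept)
```

Note that the pattern is not anchored, so it matches these prefixes anywhere in a column name. That is what removes var_total_accel_belt while keeping total_accel_belt.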

Splitting Data

To support model selection and the out-of-sample error calculation, we split fsTraining 60%/40% into fsTrainingSub and fsTestingSub respectively.

set.seed(1337)
trainIndex <- createDataPartition(fsTraining$classe, p = .6, 
                                  list = FALSE, 
                                  times = 1)
fsTrainingSub <- fsTraining[trainIndex,]
fsTestingSub <- fsTraining[-trainIndex,]
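
createDataPartition samples within each level of the outcome, so both subsets keep roughly the same class balance. The idea can be sketched in base R as follows. The toy classe vector is hypothetical, and this is an illustration of stratified sampling, not caret's actual implementation.

```r
set.seed(1337)
# Toy labels (hypothetical), standing in for fsTraining$classe
classe <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 60% of the row indices within each class
idx <- unlist(lapply(split(seq_along(classe), classe),
                     function(i) sample(i, size = round(0.6 * length(i)))))
trainSub <- classe[idx]
testSub  <- classe[-idx]

# Class proportions are the same in both subsets
print(prop.table(table(trainSub)))
print(prop.table(table(testSub)))
```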

Model Selection

We will fit a few candidate models and compare them to decide which one to use.

Rpart

A simple classification tree is used to create a model.

modelRpart <- train(classe~., data = fsTrainingSub, method = "rpart")
confusionMatrix(predict(modelRpart,fsTestingSub), fsTestingSub$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1981  644  630  544  143
##          B   44  521   41  231  104
##          C  178  353  697  451  267
##          D    0    0    0    0    0
##          E   29    0    0   60  928
## 
## Overall Statistics
##                                           
##                Accuracy : 0.526           
##                  95% CI : (0.5149, 0.5371)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3818          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8875   0.3432  0.50950   0.0000   0.6436
## Specificity            0.6507   0.9336  0.80719   1.0000   0.9861
## Pos Pred Value         0.5025   0.5537  0.35817      NaN   0.9125
## Neg Pred Value         0.9357   0.8556  0.88627   0.8361   0.9247
## Prevalence             0.2845   0.1935  0.17436   0.1639   0.1838
## Detection Rate         0.2525   0.0664  0.08884   0.0000   0.1183
## Detection Prevalence   0.5024   0.1199  0.24802   0.0000   0.1296
## Balanced Accuracy      0.7691   0.6384  0.65835   0.5000   0.8148

With an accuracy of only 52.6%, and class D never predicted, this model is too weak, so we will not use it.

Random Forest

modelRf <- train(classe~., data = fsTrainingSub, method = "rf")
confusionMatrix(predict(modelRf,fsTestingSub), fsTestingSub$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2232    1    0    0    0
##          B    0 1517    1    0    0
##          C    0    0 1364    4    0
##          D    0    0    3 1280    0
##          E    0    0    0    2 1442
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9986          
##                  95% CI : (0.9975, 0.9993)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9982          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9993   0.9971   0.9953   1.0000
## Specificity            0.9998   0.9998   0.9994   0.9995   0.9997
## Pos Pred Value         0.9996   0.9993   0.9971   0.9977   0.9986
## Neg Pred Value         1.0000   0.9998   0.9994   0.9991   1.0000
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2845   0.1933   0.1738   0.1631   0.1838
## Detection Prevalence   0.2846   0.1935   0.1744   0.1635   0.1840
## Balanced Accuracy      0.9999   0.9996   0.9982   0.9974   0.9998

Since the random forest model has a much higher accuracy (99.86% versus 52.6%), we select it as our project model.

Out of Sample Error Rate

The out-of-sample error rate is estimated on the held-out fsTestingSub data as one minus the accuracy:

cm <- confusionMatrix(predict(modelRf,fsTestingSub), fsTestingSub$classe)
errorRate <- 1 - cm$overall["Accuracy"]
errorRate
##    Accuracy 
## 0.001401988
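
The same figure can be reproduced by hand from the random forest confusion matrix. The sketch below copies the counts from that output into a plain matrix (the name cmTable is illustrative) and divides the off-diagonal total by the number of test rows.

```r
# Random forest confusion matrix copied from the output above
# (rows = prediction, columns = reference)
cmTable <- matrix(c(2232,    1,    0,    0,    0,
                       0, 1517,    1,    0,    0,
                       0,    0, 1364,    4,    0,
                       0,    0,    3, 1280,    0,
                       0,    0,    0,    2, 1442),
                  nrow = 5, byrow = TRUE,
                  dimnames = list(Prediction = LETTERS[1:5],
                                  Reference  = LETTERS[1:5]))

correct <- sum(diag(cmTable))       # correctly classified rows: 7835
total <- sum(cmTable)               # all rows in fsTestingSub: 7846
errorRate <- 1 - correct / total    # 11 misclassified out of 7846
print(errorRate)
```

Only 11 of the 7846 held-out observations are misclassified, which gives the error rate of about 0.14%.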

Cross Validation

K-fold cross-validation is used to estimate the accuracy of the final model and to guard against overfitting.

The number of folds selected is 5.

Parallel processing is also turned on in this setting to minimize model building time.

These settings are stored in the trainControlSett variable.

# Configure train control setting
trainControlSett <- trainControl(method = "cv", 
                              number = 5,
                              allowParallel = TRUE)
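
To make the 5-fold setting concrete, here is a minimal base R sketch of how rows can be assigned to folds. It uses simple random assignment, which is an assumption made for illustration; caret builds its folds with the class balance in mind, so its exact fold sizes can differ slightly.

```r
set.seed(1337)
n <- 19622   # number of rows in fsTraining
k <- 5       # number of folds

# Assign each row to one of k folds, then shuffle the assignment
folds <- sample(rep(seq_len(k), length.out = n))

# Each of the k models is trained on the rows outside its held-out fold
trainSizes <- n - as.vector(table(folds))
print(trainSizes)
```

Each row is held out exactly once, and every model is trained on roughly four fifths of the data.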

Parallel Processing

Set up parallel processing.

# Configure parallel processing
# Note for parallel processing: https://github.com/lgreski/datasciencectacontent/blob/master/markdown/pml-randomForestPerformance.md
library(parallel)
library(doParallel)
cluster <- makeCluster(max(1, detectCores() - 1)) # leave 1 core for the OS, but keep at least 1 worker
registerDoParallel(cluster)

Build Model

To predict accurately in this project, we require at least 99.9% accuracy.

Hence, the classification model selected will be randomForest, which provides high prediction accuracy.

The model is trained on the full fsTraining data set.

The trainControlSett settings are applied when building the model.

A seed is also set here to ensure reproducibility.

set.seed(1337)
model <- train(classe~., data = fsTraining, method = "rf", 
               trControl = trainControlSett)
print(model)
## Random Forest 
## 
## 19622 samples
##    57 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 15697, 15697, 15697, 15698, 15699 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9974519  0.9967769
##   29    0.9994394  0.9992910
##   57    0.9988279  0.9985174
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 29.

From the model result, we can see that the highest cross-validated accuracy (0.9994) is achieved with mtry = 29, that is, when 29 of the 57 predictors are randomly sampled as candidates at each split.
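
The three mtry values tried (2, 29 and 57) are consistent with an evenly spaced grid from 2 up to the number of predictors, rounded down. The sketch below reproduces those values. Whether caret builds its default grid exactly this way is an assumption here, not something the output confirms.

```r
p <- 57   # number of predictors reported by print(model)

# Three evenly spaced candidates between 2 and p, rounded down
mtryGrid <- floor(seq(2, p, length.out = 3))
print(mtryGrid)
```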

Turn Off Parallel Processing

Stop the cluster and deregister parallel processing with the code below.

stopCluster(cluster)
registerDoSEQ()
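
If model building fails part-way, the cluster above would stay running. One way to guard against that is to tie the shutdown to on.exit(). This is a sketch only: withCluster is a hypothetical helper, not part of this project, and it uses the parallel package alone rather than doParallel.

```r
library(parallel)

# Hypothetical helper: run fun() with a temporary cluster that is
# always stopped afterwards
withCluster <- function(fun, workers = max(1, detectCores() - 1, na.rm = TRUE)) {
  cluster <- makeCluster(workers)
  on.exit(stopCluster(cluster), add = TRUE)  # runs even if fun() fails
  fun(cluster)
}

# Usage: square four numbers on two workers
squares <- withCluster(function(cl) parSapply(cl, 1:4, function(x) x^2),
                       workers = 2)
print(squares)
```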

Predict Result

The tidyTesting data set is used to predict the 20 values of classe with the final model.

predict(model,newdata = tidyTesting)
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

Citation

Data source:

Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H.

Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.