
Principal Components Analysis (PCA)

Let’s recall the example from the lecture notes. Our survey asks respondents how much they care about price, software, aesthetics and brand when choosing a new computer. There are 16 respondents with differing operating systems (OS) (0 is Windows, 1 is Mac). The data set is as follows; Likert answers from 1 to 7 form a scale from strongly disagree to strongly agree.

#Customer Survey Data
survey_data <- data.frame(
  customer = 1:16,
  OS = c(0,0,0,0,1,0,0,0,1,1,0,1,1,1,1,1),
  Price = c(6,7,6,5,7,6,5,6,3,1,2,5,2,3,1,2),
  Software = c(5,3,4,7,7,4,7,5,5,3,6,7,4,5,6,3),
  Aesthetics = c(3,2,4,1,5,2,2,4,6,7,6,7,5,6,5,7),
  Brand = c(4,2,5,3,5,3,1,4,7,5,7,6,6,5,5,7)
)

#Let's do some exploratory analysis
summary(survey_data[,3:6])
##      Price          Software       Aesthetics       Brand      
##  Min.   :1.000   Min.   :3.000   Min.   :1.00   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:4.000   1st Qu.:2.75   1st Qu.:3.750  
##  Median :5.000   Median :5.000   Median :5.00   Median :5.000  
##  Mean   :4.188   Mean   :5.062   Mean   :4.50   Mean   :4.688  
##  3rd Qu.:6.000   3rd Qu.:6.250   3rd Qu.:6.00   3rd Qu.:6.000  
##  Max.   :7.000   Max.   :7.000   Max.   :7.00   Max.   :7.000
#Correlation Matrix
cor(survey_data[,3:6])
##                 Price   Software Aesthetics      Brand
## Price       1.0000000  0.1856123 -0.6320222 -0.5802668
## Software    0.1856123  1.0000000 -0.1462152 -0.1185864
## Aesthetics -0.6320222 -0.1462152  1.0000000  0.8528544
## Brand      -0.5802668 -0.1185864  0.8528544  1.0000000
#Do PCA with princomp function and use correlation matrix to create components
survey_pca <- princomp(as.matrix(survey_data[,3:6]),cor=T)
summary(survey_pca,loadings=TRUE)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3     Comp.4
## Standard deviation     1.5589391 0.9804092 0.6816673 0.37925777
## Proportion of Variance 0.6075727 0.2403006 0.1161676 0.03595911
## Cumulative Proportion  0.6075727 0.8478733 0.9640409 1.00000000
## 
## Loadings:
##            Comp.1 Comp.2 Comp.3 Comp.4
## Price      -0.523         0.848       
## Software   -0.177  0.977 -0.120       
## Aesthetics  0.597  0.134  0.295 -0.734
## Brand       0.583  0.167  0.423  0.674

As you can clearly see, the first two PCs are responsible for about 85% of the variation. The first PC loads negatively on Price and Software and positively on Aesthetics and Brand; the second PC is dominated by Software. With the third PC, 96.4% of the variation is explained.
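
Since the first PC contrasts Price and Software with Aesthetics and Brand, one quick way to explore the result is to plot the respondents on the first two PC scores, colored by OS. This is a minimal sketch using the survey_pca object created above; whether the two OS groups actually separate along the first PC is left for the plot to show.

#Plot respondents on the first two principal component scores
#OS is recoded as a factor (0 = Windows, 1 = Mac)
library(ggplot2)
pca_scores <- data.frame(survey_pca$scores[,1:2], OS = factor(survey_data$OS))
ggplot(pca_scores, aes(x = Comp.1, y = Comp.2, color = OS)) + geom_point(size = 3)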

library(ggplot2)
#Plot the cumulative proportion of variance explained by the components
ggplot(data.frame(pc=1:4,cum_var=c(0.6075727,0.8478733,0.9640409,1.00000000)),aes(x=pc,y=cum_var)) + geom_point() + geom_line()

As you can see, there is a trade-off between explanatory power and the number of explanatory variables (dimensions) you keep.
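
Besides the cumulative-variance plot above, base R's screeplot() gives a quick view of the individual component variances, which is another common way to decide where to cut. A minimal sketch on the survey PCA:

#Scree plot of the component variances (eigenvalues) for the survey PCA
screeplot(survey_pca, type = "lines")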

Properties of PCA

  • PCA is a method to reduce the number of explanatory variables. The resulting principal components (PCs) are linear transformations of your explanatory variables (e.g. \(PC_i = a_{i1}x_1 + a_{i2}x_2 + \dots + a_{ip}x_p\)).
  • If some or most of your explanatory variables are highly correlated, then PCA is the way to go.
  • The total number of PCs is equal to the total number of explanatory variables, though it is possible to eliminate some PCs without considerable loss in explanatory power.
  • PCs are uncorrelated with each other by construction.
  • You cannot use categorical or binary variables in PCA.
  • PCA is more of an exploration tool than an explanation or prediction tool, though it is possible to use PCs as input to other methods (e.g. regression).
  • PCA captures linear relationships, so it is not suitable for non-linear relations (e.g. \(x_1 = x_2^2\)). It also assumes approximately normally distributed data, so fat-tailed distributions are a poor fit for PCA.
  • It is possible to use the covariance matrix rather than the correlation matrix.
  • Centering and scaling can be done to get better results: centering shifts the mean of each covariate to 0, and scaling rescales each explanatory variable to unit variance.
  • In R you can use both the prcomp and princomp functions. The former uses singular value decomposition to calculate the principal components, while the latter uses the eigenvalues and eigenvectors of the correlation (or covariance) matrix; a short comparison follows this list.
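
Below is a minimal sketch of that comparison on the survey data from above. With centering and scaling, prcomp should report the same standard deviations as princomp with cor=T, and the loadings should agree up to sign.

#Compare prcomp (SVD-based) and princomp (eigen-based) on the survey data
pr_result <- prcomp(survey_data[,3:6], center = TRUE, scale. = TRUE)
#Standard deviations should match; loadings may differ only by sign
pr_result$sdev
survey_pca$sdev
pr_result$rotation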

Examples

Made-up Example regarding Transformations

set.seed(58)
#Randomly create data points around the normal distribution
x1=rnorm(30,mean=2,sd=4)
#Create one exact linear transformation (x2) and two nonlinear transformations (x3, x4) of the data
pca_data<-data.frame(x1,x2=x1*2,x3=(x1^2),x4=abs(x1)^(0.5)+rnorm(30))
#See the correlation matrix
pca_cor<-cor(pca_data)
pca_cor
##            x1         x2        x3         x4
## x1 1.00000000 1.00000000 0.6648541 0.08208783
## x2 1.00000000 1.00000000 0.6648541 0.08208783
## x3 0.66485406 0.66485406 1.0000000 0.40646124
## x4 0.08208783 0.08208783 0.4064612 1.00000000
#See the eigenvalues and eigenvectors
pca_eigen<-eigen(pca_cor)
pca_eigen
## eigen() decomposition
## $values
## [1] 2.625027e+00 1.067917e+00 3.070557e-01 4.510095e-16
## 
## $vectors
##            [,1]       [,2]       [,3]          [,4]
## [1,] -0.5856026  0.2603411  0.2988179  7.071068e-01
## [2,] -0.5856026  0.2603411  0.2988179 -7.071068e-01
## [3,] -0.5269453 -0.2545741 -0.8108765  6.106227e-16
## [4,] -0.1909657 -0.8942243  0.4048395 -9.714451e-17
#See the standard deviations and proportion of variances
sqrt(pca_eigen$values)
## [1] 1.620194e+00 1.033401e+00 5.541261e-01 2.123698e-08
pca_eigen$values/sum(pca_eigen$values)
## [1] 6.562569e-01 2.669792e-01 7.676393e-02 1.127524e-16
cumsum(pca_eigen$values/sum(pca_eigen$values))
## [1] 0.6562569 0.9232361 1.0000000 1.0000000
#Run PCA
pca_result<-princomp(pca_data,cor=T)
#See the PCA
summary(pca_result,loadings=TRUE)
## Importance of components:
##                           Comp.1    Comp.2     Comp.3 Comp.4
## Standard deviation     1.6201937 1.0334006 0.55412609      0
## Proportion of Variance 0.6562569 0.2669792 0.07676393      0
## Cumulative Proportion  0.6562569 0.9232361 1.00000000      1
## 
## Loadings:
##    Comp.1 Comp.2 Comp.3 Comp.4
## x1 -0.586  0.260  0.299  0.707
## x2 -0.586  0.260  0.299 -0.707
## x3 -0.527 -0.255 -0.811       
## x4 -0.191 -0.894  0.405
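
The standard deviations reported by princomp are simply the square roots of the correlation-matrix eigenvalues computed above, and the fourth component carries essentially no variance because x2 is an exact linear copy of x1. A quick side-by-side check (a sketch):

#Compare princomp standard deviations with the square roots of the eigenvalues
rbind(princomp_sdev = as.numeric(pca_result$sdev),
      eigen_sqrt    = sqrt(pca_eigen$values))
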
#Re-run PCA after dropping the redundant x2 column
pca_result2<-princomp(pca_data[,c("x1","x3","x4")],cor=T)
summary(pca_result2,loadings=TRUE)
## Importance of components:
##                           Comp.1    Comp.2     Comp.3
## Standard deviation     1.3481348 0.9628649 0.50539453
## Proportion of Variance 0.6058225 0.3090363 0.08514121
## Cumulative Proportion  0.6058225 0.9148588 1.00000000
## 
## Loadings:
##    Comp.1 Comp.2 Comp.3
## x1 -0.601 -0.517  0.609
## x3 -0.690        -0.722
## x4 -0.403  0.855  0.327

Young People Survey

The following data set is from Kaggle and is named Young People Survey. It is a simple survey with lots of questions and some demographic data. The resulting PCA helps us retain 90% of the variance with just two-thirds of the survey questions.

#Prepare data
#dplyr is needed for the pipe (%>%) and the data manipulation verbs below
library(dplyr)
yr_data <-
  read.csv("data/youth_responses.csv",sep=",") %>%
  filter(complete.cases(.)) %>%
  # mutate(id=row_number()) %>%
  tbl_df()

#Prepare PCA data (only the integer-valued survey questions)
yr_pca <-
  yr_data[,sapply(yr_data,class)=="integer"] %>%
  select(Music:Spending.on.healthy.eating)

#Run PCA analysis
yr_pca_result<-princomp(yr_pca,cor=T)

#Plot the cumulative proportion of variance explained by the components
ggplot(data=data.frame(PC=1:length(yr_pca_result$sdev),var_exp=cumsum(yr_pca_result$sdev^2/sum(yr_pca_result$sdev^2))),aes(x=PC,y=var_exp)) +
  geom_line() + geom_point() +
  scale_y_continuous(labels = scales::percent,breaks=seq(0,1,length.out=11)) +
  scale_x_continuous(breaks=seq(0,135,by=5))
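
To check the two-thirds claim programmatically, here is a short sketch that counts how many components are needed to pass 90% cumulative variance and compares that with the total number of survey questions used.

#Number of components needed to reach 90% cumulative variance
cum_var <- cumsum(yr_pca_result$sdev^2)/sum(yr_pca_result$sdev^2)
which(cum_var >= 0.9)[1]
#Total number of questions used in the PCA
ncol(yr_pca)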

Multidimensional Scaling (MDS)

Classical multidimensional scaling is a kind of reverse-engineering method: given a distance matrix, it generates approximate locations of the points in a coordinate system.

The most classical example: if a distance matrix between cities is used as input, MDS gives their coordinates (up to rotation, reflection and translation) as output. In the following example we are going to “generate” 10 cities, plot them, compute the distance matrix, run MDS on the distance matrix and get the coordinates back.

#Set the seed for reproducibility
set.seed(58)
#Create a coordinate matrix (all coordinates are between 0 and 1).
#Suppose the places of cities A to J
coord_mat<-matrix(round(runif(20),6),ncol=2)
#Column names of the coordinate matrix, say x and y
colnames(coord_mat)<-c("x","y")
#Row names of the coordinates are cities.
#LETTERS is a predefined vector with letters of the English alphabet.
rownames(coord_mat)<-LETTERS[1:10]
#Create distance matrix
dist_mat<-dist(coord_mat)
#Display coordinate matrix
print(coord_mat)
##          x        y
## A 0.326454 0.031954
## B 0.144960 0.653293
## C 0.666349 0.392102
## D 0.942411 0.569635
## E 0.836936 0.690193
## F 0.363652 0.234271
## G 0.240290 0.970383
## H 0.728265 0.592039
## I 0.267996 0.802786
## J 0.582656 0.164829
ggplot(as.data.frame(coord_mat),aes(x=x,y=y)) + geom_text(label=rownames(coord_mat))

print(dist_mat)
##           A         B         C         D         E         F         G
## B 0.6473038                                                            
## C 0.4952123 0.5831528                                                  
## D 0.8176209 0.8018271 0.3282197                                        
## E 0.8329889 0.6929592 0.3434504 0.1601849                              
## F 0.2057082 0.4726580 0.3413738 0.6689028 0.6571625                    
## G 0.9423764 0.3311101 0.7182863 0.8084385 0.6591607 0.7463773          
## H 0.6893093 0.5865124 0.2093046 0.2153148 0.1464363 0.5108234 0.6174656
## I 0.7730455 0.1936131 0.5721420 0.7135790 0.5799741 0.5765062 0.1698716
## J 0.2886091 0.6558772 0.2421932 0.5415640 0.5836657 0.2297497 0.8752895
##           H         I
## B                    
## C                    
## D                    
## E                    
## F                    
## G                    
## H                    
## I 0.5062231          
## J 0.4513428 0.7113368
#Now let's employ Multidimensional Scaling (MDS)
#Base R has a lovely command called cmdscale (Classical multidimensional scaling)
mds_data<-cmdscale(dist_mat,k=2)
colnames(mds_data)<-c("x","y")
#Print the output
print(mds_data)
##             x           y
## A -0.36165938  0.36271265
## B  0.27965223  0.27483980
## C -0.17157590 -0.09456567
## D -0.12215882 -0.41904394
## E  0.03094720 -0.37195089
## F -0.19213662  0.24618843
## G  0.53023233  0.05840622
## H -0.01431325 -0.23268465
## I  0.36591977  0.10150796
## J -0.34490755  0.07459010
#Let's plot the output
ggplot(as.data.frame(mds_data),aes(x=x,y=y)) + geom_text(label=rownames(mds_data)) + labs(title="MDS Output")
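
The recovered coordinates differ from the originals by rotation, reflection and translation, but since the original configuration is two-dimensional and Euclidean, the distances implied by the MDS output should reproduce the original distance matrix almost exactly. A quick check (a minimal sketch):

#Largest absolute difference between MDS-implied distances and the original distances
max(abs(dist(mds_data) - dist_mat))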