1. Understanding Data Set 1

The aim of this study is to find connections between (o)esophageal cancer and 3 variables such as: + Alcohol consumption + Tobacco consumption + Age

I will be using “esoph” sample data set existing locally in R package. Data set consists of 88 oberservations for 6 different age groups, 4 different alcohol consumption groups and 4 different tobacco consumption groups. Each combination of groups has its own number of cancer cases and controls. For this data set has no missing values, it didn’t require cleanup in my analysis.

1.1 Load Required Libraries And Get Structure Of Data

#Load required libraries
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
library(ggplot2)
library(ggcorrplot)

#Create custom colors vector
custom_colors=c("#E32800", "#FDB205","#FDF505","#009BDF","#E3FD05","#A7FD05","#7CBE00","#639700","#972000","#871D00","#50FF95","#00DEAF","#00B891","#00B5B8","#0080B8","#0063E8","#0047A7","#9F55FF","#C69BFF","#D69BFF","#B956FE","#DF56FE","#FE5681","#9BDF00")

#Look at the data structure and sample of first 10 observations 
head(esoph,10)
##    agegp     alcgp    tobgp ncases ncontrols
## 1  25-34 0-39g/day 0-9g/day      0        40
## 2  25-34 0-39g/day    10-19      0        10
## 3  25-34 0-39g/day    20-29      0         6
## 4  25-34 0-39g/day      30+      0         5
## 5  25-34     40-79 0-9g/day      0        27
## 6  25-34     40-79    10-19      0         7
## 7  25-34     40-79    20-29      0         4
## 8  25-34     40-79      30+      0         7
## 9  25-34    80-119 0-9g/day      0         2
## 10 25-34    80-119    10-19      0         1
# Control data ranges for every column

summary(esoph)
##    agegp          alcgp         tobgp        ncases         ncontrols    
##  25-34:15   0-39g/day:23   0-9g/day:24   Min.   : 0.000   Min.   : 1.00  
##  35-44:15   40-79    :23   10-19   :24   1st Qu.: 0.000   1st Qu.: 3.00  
##  45-54:16   80-119   :21   20-29   :20   Median : 1.000   Median : 6.00  
##  55-64:16   120+     :21   30+     :20   Mean   : 2.273   Mean   :11.08  
##  65-74:15                                3rd Qu.: 4.000   3rd Qu.:14.00  
##  75+  :11                                Max.   :17.000   Max.   :60.00

2. Analyzing Relations In Esoph Data Set

Initially, I want to see the first ten combination groups having the highest risks by ordering all observations according to case numbers and control numbers.

##    agegp     alcgp    tobgp ncases ncontrols
## 1  65-74     40-79 0-9g/day     17        34
## 2  55-64     40-79 0-9g/day      9        40
## 3  55-64    80-119 0-9g/day      9        18
## 4  55-64    80-119    10-19      8        15
## 5  45-54     40-79 0-9g/day      6        38
## 6  55-64     40-79    10-19      6        21
## 7  45-54    80-119    10-19      6        14
## 8  65-74    80-119 0-9g/day      6        13
## 9  55-64      120+    10-19      6         7
## 10 65-74 0-39g/day 0-9g/day      5        48

It seems that three age groups having the highest number of cancer cases are:

When we look at top three alcohol consumption amounts in the highest risk groups, we get:

Let’s view general distributions of percentage of cancer cases normalized by number of controls.

#Grouping by age group, calculate percentage by sum of cases divided by sum of control counts 
d1<- esoph %>%  group_by(agegp) %>%
    summarise(count = n(), total_cases = sum(ncases), total_controls = sum(ncontrols),
              percentage=total_cases*100/total_controls)

ggplot(d1, aes(x=d1$agegp, y=d1$percentage,fill=d1$agegp)) +
   geom_bar(stat="identity", position = "dodge") +
   scale_fill_brewer(palette ="Set1")+
   labs(x= 'Age Groups', y= 'Percentage Of Cancer Cases')+
   guides(fill=guide_legend(title="Age Groups"))

We see that cancer cases increase by age. This shows the harmful affects of both tobacco and alcohol consumption add up with time. Now let’s look at the effects of each habit. First, tobacco consumptions by age groups:

#Group by both age and tobacco consumption groups
d2<- esoph %>%  group_by(agegp,tobgp) %>%
    summarise(count = n(), total_cases = sum(ncases), total_controls = sum(ncontrols),
              percentage=total_cases*100/total_controls)


ggplot(d2, aes(x=d2$agegp, y=d2$percentage,fill=d2$tobgp)) +
   geom_bar(stat="identity", position = "dodge") +
   scale_fill_brewer(palette ="Set1")+
   labs(x= 'Age Groups', y= 'Percentage Of Cancer Cases')+
   guides(fill=guide_legend(title="Tobacco Consumption Groups Per Age Group"))

In the graph above, it’s seen that the cancer making effect of tobacco consumption increases by both age and amount of tobacco used per day. People who consume tobacco 120+ g/day involve in cancer cases dramatically higher than other consumption groups for especially 45-64 age range.

Now let’s look at alcohol consumption by age groups:

 d3<- esoph %>%  group_by(agegp,alcgp) %>%
     summarise(count = n(), total_cases = sum(ncases), total_controls = sum(ncontrols),
               percentage=total_cases*100/total_controls)


ggplot(d3, aes(x=d3$agegp, y=d3$percentage,fill=d3$alcgp)) +
   geom_bar(stat="identity", position = "dodge") +
   scale_fill_brewer(palette ="Set1")+
   labs(x= 'Age Groups', y= 'Percentage Of Cancer Cases')+
   guides(fill=guide_legend(title="Alcohol Consumption Groups Per Age Group"))

From the alcohol consumption view, we see that alcohol consumption has more negative effect than tobacco consumption for age groups 55-64 and over having alcohol consumption 40-79 g/day and over. Finally, I wanted to look at the added effects of both alcohol and tobacco consumption for the whole observations.

ggplot(esoph, aes(factor(esoph$alcgp), esoph$ncases*100/esoph$ncontrols, fill = esoph$tobgp)) +
  geom_bar(stat="identity", position = "dodge") +
  scale_fill_brewer(palette ="Set1")+
  labs(x= 'Alcohol Consumption Groups', y= 'Percentage Of Cancer Cases')+
  guides(fill=guide_legend(title="Tobacco Consumption Groups"))

The graph above shows that the amount of alcohol consumption has a lower effect on cancer cases than tobacco consumption. But as alcohol consumption per day increases, its effect also catches up with tobacco consumption. If alcohol consumption is over 80 g/day then not consuming any tobacco almost has no effect to prevent cancer.

3. Getting Data and Understanding Data Set 2 - Young People Survey

As a second data set, I will analyse a subset of Young People Survey data set acquired from Kaggle. This data set consists of survey answers from 1010 people (thus containing 1010 rows) and 150 variables. 139 columns have numerical values (between 1 to 5 for “Strongly Disagree” to “Strongly Agree”) while 11 columns are categorical variables.

I will try to apply statistical models to this data set to acquire meaningful information about relations or related subgroups in the selected subset. This subset includes data from the columns named below:

#Read data from csv file
s=read.csv("responses.csv")

#Convert to data frame
s=as.data.frame(s)

#Take only the columns to be analysed : "History" to "Pets" 
s=s[32:63]


#display first 5 values
head(s)
##   History Psychology Politics Mathematics Physics Internet PC
## 1       1          5        1           3       3        5  3
## 2       1          3        4           5       2        4  4
## 3       1          2        1           5       2        4  2
## 4       4          4        5           4       1        3  1
## 5       3          2        3           2       2        2  2
## 6       5          3        4           2       3        4  4
##   Economy.Management Biology Chemistry Reading Geography Foreign.languages
## 1                  5       3         3       3         3                 5
## 2                  5       1         1       4         4                 5
## 3                  4       1         1       5         2                 5
## 4                  2       3         3       5         4                 4
## 5                  2       3         3       5         2                 3
## 6                  1       4         4       3         3                 4
##   Medicine Law Cars Art.exhibitions Religion Countryside..outdoors Dancing
## 1        3   1    1               1        1                     5       3
## 2        1   2    2               2        1                     1       1
## 3        2   3    1               5        5                     5       5
## 4        2   5    1               5        4                     1       1
## 5        3   2    3               1        4                     4       1
## 6        4   3    5               2        2                     5       1
##   Musical.instruments Writing Passive.sport Active.sport Gardening
## 1                   3       2             1            5         5
## 2                   1       1             1            1         1
## 3                   5       5             5            2         1
## 4                   1       3             1            1         1
## 5                   3       1             3            1         4
## 6                   5       1             5            4         2
##   Celebrities Shopping Science.and.technology Theatre Fun.with.friends
## 1           1        4                      4       2                5
## 2           2        3                      3       2                4
## 3           1        4                      2       5                5
## 4           2        4                      3       1                2
## 5           3        3                      3       2                4
## 6           1        2                      3       1                3
##   Adrenaline.sports Pets
## 1                 4    4
## 2                 2    5
## 3                 5    5
## 4                 1    1
## 5                 2    1
## 6                 3    2
#display statistical descriptions
summary(s)
##     History        Psychology       Politics      Mathematics   
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :3.000   Median :3.000   Median :2.000   Median :2.000  
##  Mean   :3.207   Mean   :3.138   Mean   :2.596   Mean   :2.335  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:3.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##  NA's   :2       NA's   :5       NA's   :1       NA's   :3      
##     Physics         Internet           PC        Economy.Management
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000     
##  1st Qu.:1.000   1st Qu.:4.000   1st Qu.:2.000   1st Qu.:1.000     
##  Median :2.000   Median :4.000   Median :3.000   Median :2.000     
##  Mean   :2.065   Mean   :4.176   Mean   :3.136   Mean   :2.644     
##  3rd Qu.:3.000   3rd Qu.:5.000   3rd Qu.:4.000   3rd Qu.:4.000     
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000     
##  NA's   :3       NA's   :4       NA's   :6       NA's   :5         
##     Biology        Chemistry        Reading        Geography    
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:1.000   1st Qu.:2.000   1st Qu.:2.000  
##  Median :2.000   Median :2.000   Median :3.000   Median :3.000  
##  Mean   :2.665   Mean   :2.165   Mean   :3.159   Mean   :3.083  
##  3rd Qu.:4.000   3rd Qu.:3.000   3rd Qu.:5.000   3rd Qu.:4.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##  NA's   :6       NA's   :10      NA's   :6       NA's   :9      
##  Foreign.languages    Medicine          Law             Cars      
##  Min.   :1.000     Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:3.000     1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :4.000     Median :2.000   Median :2.000   Median :3.000  
##  Mean   :3.778     Mean   :2.516   Mean   :2.257   Mean   :2.687  
##  3rd Qu.:5.000     3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:4.000  
##  Max.   :5.000     Max.   :5.000   Max.   :5.000   Max.   :5.000  
##  NA's   :5         NA's   :5       NA's   :1       NA's   :4      
##  Art.exhibitions    Religion     Countryside..outdoors    Dancing     
##  Min.   :1.00    Min.   :1.000   Min.   :1.000         Min.   :1.000  
##  1st Qu.:1.00    1st Qu.:1.000   1st Qu.:3.000         1st Qu.:1.000  
##  Median :2.00    Median :2.000   Median :4.000         Median :2.000  
##  Mean   :2.59    Mean   :2.273   Mean   :3.687         Mean   :2.462  
##  3rd Qu.:4.00    3rd Qu.:3.000   3rd Qu.:5.000         3rd Qu.:4.000  
##  Max.   :5.00    Max.   :5.000   Max.   :5.000         Max.   :5.000  
##  NA's   :6       NA's   :3       NA's   :7             NA's   :3      
##  Musical.instruments    Writing      Passive.sport    Active.sport  
##  Min.   :1.000       Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000       1st Qu.:1.000   1st Qu.:2.000   1st Qu.:2.000  
##  Median :2.000       Median :1.000   Median :3.000   Median :3.000  
##  Mean   :2.324       Mean   :1.901   Mean   :3.388   Mean   :3.291  
##  3rd Qu.:4.000       3rd Qu.:3.000   3rd Qu.:5.000   3rd Qu.:5.000  
##  Max.   :5.000       Max.   :5.000   Max.   :5.000   Max.   :5.000  
##  NA's   :1           NA's   :6       NA's   :15      NA's   :4      
##    Gardening      Celebrities       Shopping     Science.and.technology
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000         
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:2.000   1st Qu.:2.000         
##  Median :1.000   Median :2.000   Median :3.000   Median :3.000         
##  Mean   :1.907   Mean   :2.362   Mean   :3.277   Mean   :3.234         
##  3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:4.000   3rd Qu.:4.000         
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000         
##  NA's   :7       NA's   :2       NA's   :2       NA's   :6             
##     Theatre      Fun.with.friends Adrenaline.sports      Pets      
##  Min.   :1.000   Min.   :2.000    Min.   :1.000     Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:4.000    1st Qu.:2.000     1st Qu.:2.000  
##  Median :3.000   Median :5.000    Median :3.000     Median :4.000  
##  Mean   :3.025   Mean   :4.558    Mean   :2.948     Mean   :3.335  
##  3rd Qu.:4.000   3rd Qu.:5.000    3rd Qu.:4.000     3rd Qu.:5.000  
##  Max.   :5.000   Max.   :5.000    Max.   :5.000     Max.   :5.000  
##  NA's   :8       NA's   :4        NA's   :3         NA's   :4

4. Remove Missing Data

#Get NAs summary in percentage
colMeans(is.na(s))*100
##                History             Psychology               Politics 
##              0.1980198              0.4950495              0.0990099 
##            Mathematics                Physics               Internet 
##              0.2970297              0.2970297              0.3960396 
##                     PC     Economy.Management                Biology 
##              0.5940594              0.4950495              0.5940594 
##              Chemistry                Reading              Geography 
##              0.9900990              0.5940594              0.8910891 
##      Foreign.languages               Medicine                    Law 
##              0.4950495              0.4950495              0.0990099 
##                   Cars        Art.exhibitions               Religion 
##              0.3960396              0.5940594              0.2970297 
##  Countryside..outdoors                Dancing    Musical.instruments 
##              0.6930693              0.2970297              0.0990099 
##                Writing          Passive.sport           Active.sport 
##              0.5940594              1.4851485              0.3960396 
##              Gardening            Celebrities               Shopping 
##              0.6930693              0.1980198              0.1980198 
## Science.and.technology                Theatre       Fun.with.friends 
##              0.5940594              0.7920792              0.3960396 
##      Adrenaline.sports                   Pets 
##              0.2970297              0.3960396
#remove NAs
s=s[complete.cases(s),]

#Let's display correlation matrix for first 5 columns
cor(s[,1:5])
##                 History Psychology  Politics Mathematics    Physics
## History     1.000000000 0.29637359 0.4062343 0.008786356 0.07088193
## Psychology  0.296373585 1.00000000 0.1910873 0.043974943 0.07142061
## Politics    0.406234287 0.19108731 1.0000000 0.103492328 0.13388052
## Mathematics 0.008786356 0.04397494 0.1034923 1.000000000 0.60785444
## Physics     0.070881929 0.07142061 0.1338805 0.607854436 1.00000000

5. Principal Component Analysis

# Principal Component Analysis to get information about which components have higher effects on variance

pca <- princomp(as.matrix(s[,1:32]),cor=T)
summary(pca,loadings=TRUE)
## Importance of components:
##                           Comp.1    Comp.2     Comp.3     Comp.4
## Standard deviation     2.0433958 1.8069970 1.60438109 1.46391156
## Proportion of Variance 0.1304833 0.1020387 0.08043871 0.06696991
## Cumulative Proportion  0.1304833 0.2325220 0.31296072 0.37993063
##                            Comp.5     Comp.6     Comp.7     Comp.8
## Standard deviation     1.26463164 1.18461175 1.06955650 1.05602251
## Proportion of Variance 0.04997791 0.04385328 0.03574847 0.03484949
## Cumulative Proportion  0.42990854 0.47376182 0.50951029 0.54435978
##                            Comp.9    Comp.10    Comp.11    Comp.12
## Standard deviation     1.04363010 1.00633062 0.95951254 0.93789360
## Proportion of Variance 0.03403637 0.03164692 0.02877076 0.02748889
## Cumulative Proportion  0.57839615 0.61004306 0.63881382 0.66630271
##                           Comp.13    Comp.14    Comp.15    Comp.16
## Standard deviation     0.93168719 0.89503805 0.87262765 0.85510068
## Proportion of Variance 0.02712628 0.02503416 0.02379622 0.02284991
## Cumulative Proportion  0.69342899 0.71846315 0.74225937 0.76510928
##                          Comp.17    Comp.18    Comp.19    Comp.20
## Standard deviation     0.8329419 0.80296355 0.78632469 0.76747805
## Proportion of Variance 0.0216810 0.02014845 0.01932208 0.01840695
## Cumulative Proportion  0.7867903 0.80693874 0.82626082 0.84466777
##                           Comp.21    Comp.22    Comp.23    Comp.24
## Standard deviation     0.74821762 0.72513227 0.71054805 0.70905599
## Proportion of Variance 0.01749468 0.01643178 0.01577745 0.01571126
## Cumulative Proportion  0.86216245 0.87859422 0.89437168 0.91008294
##                           Comp.25    Comp.26    Comp.27    Comp.28
## Standard deviation     0.69087699 0.65750430 0.64296368 0.60888823
## Proportion of Variance 0.01491597 0.01350975 0.01291882 0.01158578
## Cumulative Proportion  0.92499891 0.93850866 0.95142748 0.96301326
##                          Comp.29     Comp.30     Comp.31     Comp.32
## Standard deviation     0.5954011 0.551814415 0.539017589 0.483770803
## Proportion of Variance 0.0110782 0.009515598 0.009079374 0.007313568
## Cumulative Proportion  0.9740915 0.983607058 0.992686432 1.000000000
## 
## Loadings:
##                        Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## History                -0.196        -0.184  0.243 -0.130  0.243 -0.142
## Psychology             -0.244                0.119 -0.112         0.102
## Politics               -0.131 -0.202 -0.255  0.204 -0.270  0.153       
## Mathematics                   -0.295  0.126  0.133        -0.309  0.220
## Physics                       -0.344  0.241  0.136        -0.155  0.151
## Internet                      -0.237 -0.127               -0.400       
## PC                            -0.377                0.134 -0.291 -0.137
## Economy.Management            -0.174 -0.306        -0.214 -0.182  0.166
## Biology                -0.269         0.377 -0.151 -0.221              
## Chemistry              -0.199         0.421        -0.230              
## Reading                -0.282  0.188         0.176                     
## Geography              -0.160 -0.145 -0.151                0.218 -0.226
## Foreign.languages      -0.208        -0.217               -0.117  0.165
## Medicine               -0.269         0.317 -0.118 -0.266              
## Law                    -0.143 -0.147 -0.277        -0.363  0.116       
## Cars                          -0.350        -0.168                     
## Art.exhibitions        -0.313                       0.201              
## Religion               -0.230                0.166               -0.128
## Countryside..outdoors  -0.194               -0.102  0.334        -0.167
## Dancing                -0.241               -0.230                0.107
## Musical.instruments    -0.210                       0.361              
## Writing                -0.231                0.161  0.194 -0.114 -0.214
## Passive.sport                 -0.162        -0.222  0.135  0.116       
## Active.sport                  -0.200        -0.241  0.138  0.340       
## Gardening              -0.192               -0.178               -0.445
## Celebrities                          -0.183 -0.351 -0.181 -0.289 -0.234
## Shopping                       0.136 -0.178 -0.399 -0.153 -0.241       
## Science.and.technology        -0.348                                   
## Theatre                -0.302  0.128                0.144         0.256
## Fun.with.friends                     -0.135 -0.260  0.154         0.498
## Adrenaline.sports             -0.254        -0.221  0.200  0.313  0.179
## Pets                                        -0.256               -0.138
##                        Comp.8 Comp.9 Comp.10 Comp.11 Comp.12 Comp.13
## History                -0.194  0.162                                
## Psychology                     0.214 -0.181  -0.232  -0.272   0.536 
## Politics                0.108                -0.139          -0.187 
## Mathematics             0.273         0.360           0.129   0.147 
## Physics                 0.151  0.127  0.160                         
## Internet               -0.370        -0.107  -0.147           0.264 
## PC                     -0.144        -0.183                         
## Economy.Management      0.288 -0.163  0.189                         
## Biology                                                             
## Chemistry                                             0.142         
## Reading                -0.173         0.116   0.130   0.150         
## Geography              -0.243 -0.335  0.105   0.349   0.228   0.113 
## Foreign.languages      -0.227 -0.364 -0.122   0.187           0.114 
## Medicine                      -0.135 -0.178                         
## Law                     0.142                -0.135          -0.179 
## Cars                           0.101                 -0.125  -0.331 
## Art.exhibitions                0.166                         -0.156 
## Religion                      -0.220         -0.370  -0.456  -0.146 
## Countryside..outdoors  -0.119 -0.240  0.371          -0.412         
## Dancing                 0.326 -0.204          0.108  -0.114   0.102 
## Musical.instruments     0.158 -0.106 -0.226  -0.270   0.141  -0.110 
## Writing                 0.171  0.212 -0.319           0.309   0.149 
## Passive.sport                 -0.150  0.259  -0.560   0.466   0.125 
## Active.sport            0.286        -0.204   0.214           0.153 
## Gardening               0.200  0.108  0.220                  -0.109 
## Celebrities                                                         
## Shopping                                                     -0.108 
## Science.and.technology -0.195  0.194 -0.169   0.186  -0.121  -0.194 
## Theatre                -0.115  0.190  0.145                  -0.258 
## Fun.with.friends       -0.208                -0.173          -0.147 
## Adrenaline.sports              0.103 -0.120                   0.141 
## Pets                   -0.132  0.467  0.336          -0.108   0.306 
##                        Comp.14 Comp.15 Comp.16 Comp.17 Comp.18 Comp.19
## History                -0.252  -0.233           0.141          -0.130 
## Psychology             -0.153   0.266   0.125           0.116         
## Politics                       -0.146                   0.149  -0.140 
## Mathematics                            -0.183  -0.171                 
## Physics                -0.130          -0.150  -0.125   0.126         
## Internet               -0.174  -0.157           0.195           0.216 
## PC                             -0.100           0.226           0.103 
## Economy.Management      0.190   0.148   0.333                         
## Biology                                                               
## Chemistry                      -0.101                                 
## Reading                        -0.177                  -0.411         
## Geography              -0.155                  -0.236   0.302         
## Foreign.languages       0.408                          -0.329  -0.154 
## Medicine                0.101                                   0.165 
## Law                     0.158                   0.187           0.301 
## Cars                    0.100   0.174  -0.164                   0.145 
## Art.exhibitions        -0.114   0.343  -0.122                   0.313 
## Religion               -0.147  -0.103  -0.125  -0.435  -0.322         
## Countryside..outdoors                                   0.192         
## Dancing                -0.145  -0.200           0.345   0.202  -0.211 
## Musical.instruments     0.428  -0.197  -0.200           0.303         
## Writing                                 0.169  -0.250                 
## Passive.sport                   0.296                  -0.128  -0.274 
## Active.sport           -0.175          -0.220   0.172  -0.432  -0.108 
## Gardening                      -0.146   0.564          -0.193         
## Celebrities            -0.185          -0.248  -0.252   0.131         
## Shopping               -0.105          -0.101  -0.178          -0.282 
## Science.and.technology          0.182   0.189                  -0.554 
## Theatre                         0.252           0.185                 
## Fun.with.friends       -0.130  -0.471   0.304  -0.160                 
## Adrenaline.sports                       0.111  -0.365           0.278 
## Pets                    0.465  -0.209  -0.240                         
##                        Comp.20 Comp.21 Comp.22 Comp.23 Comp.24 Comp.25
## History                 0.369                   0.129  -0.487         
## Psychology              0.242   0.163                   0.228  -0.162 
## Politics                                0.154   0.119   0.337   0.355 
## Mathematics             0.169                                         
## Physics                                -0.162  -0.172   0.129         
## Internet               -0.225  -0.113                           0.219 
## PC                                      0.249                         
## Economy.Management     -0.159           0.224   0.180  -0.443  -0.235 
## Biology                                 0.141          -0.102         
## Chemistry                      -0.102                  -0.221   0.135 
## Reading                        -0.271  -0.131   0.303          -0.144 
## Geography              -0.223   0.285                          -0.175 
## Foreign.languages       0.303   0.202  -0.111           0.174   0.149 
## Medicine                                                       -0.110 
## Law                            -0.262  -0.183  -0.438   0.128         
## Cars                            0.380  -0.464   0.262          -0.220 
## Art.exhibitions                 0.286   0.199  -0.126           0.242 
## Religion               -0.259                                         
## Countryside..outdoors   0.189  -0.397  -0.125  -0.135          -0.142 
## Dancing                -0.311   0.108  -0.310   0.274   0.102   0.142 
## Musical.instruments     0.213           0.230          -0.103         
## Writing                -0.224  -0.178  -0.358  -0.110  -0.165  -0.197 
## Passive.sport                                                         
## Active.sport                            0.301  -0.258          -0.139 
## Gardening               0.189   0.236                   0.112   0.157 
## Celebrities             0.254  -0.187   0.201   0.318   0.237  -0.228 
## Shopping                0.122          -0.118  -0.342  -0.322   0.255 
## Science.and.technology -0.102  -0.148                   0.107         
## Theatre                -0.147           0.153                         
## Fun.with.friends                0.126          -0.126          -0.305 
## Adrenaline.sports              -0.278           0.290           0.419 
## Pets                   -0.263   0.152                                 
##                        Comp.26 Comp.27 Comp.28 Comp.29 Comp.30 Comp.31
## History                        -0.247   0.164          -0.179         
## Psychology             -0.152   0.144  -0.131           0.195         
## Politics                0.335   0.385           0.120           0.113 
## Mathematics                                    -0.195           0.575 
## Physics                                         0.263  -0.123  -0.650 
## Internet                        0.206   0.359  -0.223                 
## PC                             -0.189  -0.442   0.499           0.162 
## Economy.Management              0.101                          -0.239 
## Biology                                                               
## Chemistry               0.165                           0.623         
## Reading                -0.245   0.303  -0.348  -0.174          -0.155 
## Geography              -0.272                                         
## Foreign.languages       0.178  -0.190   0.137   0.118                 
## Medicine                                0.155          -0.632         
## Law                    -0.253  -0.319                                 
## Cars                            0.261                                 
## Art.exhibitions         0.332  -0.122  -0.262  -0.311          -0.102 
## Religion                       -0.142                   0.111         
## Countryside..outdoors   0.280   0.132                                 
## Dancing                        -0.260                                 
## Musical.instruments    -0.257   0.171          -0.179                 
## Writing                 0.289                   0.140           0.150 
## Passive.sport                  -0.112  -0.129                         
## Active.sport            0.125   0.204   0.126                         
## Gardening              -0.193           0.135                         
## Celebrities             0.169  -0.194                                 
## Shopping               -0.186   0.295  -0.215   0.163  -0.134         
## Science.and.technology         -0.175          -0.435                 
## Theatre                -0.202           0.486   0.372   0.175   0.161 
## Fun.with.friends        0.136                                         
## Adrenaline.sports      -0.245                                         
## Pets                                                                  
##                        Comp.32
## History                       
## Psychology                    
## Politics                      
## Mathematics                   
## Physics                       
## Internet                      
## PC                            
## Economy.Management            
## Biology                 0.792 
## Chemistry              -0.390 
## Reading                       
## Geography                     
## Foreign.languages             
## Medicine               -0.397 
## Law                           
## Cars                          
## Art.exhibitions               
## Religion                      
## Countryside..outdoors         
## Dancing                       
## Musical.instruments           
## Writing                       
## Passive.sport                 
## Active.sport                  
## Gardening                     
## Celebrities                   
## Shopping                      
## Science.and.technology        
## Theatre                       
## Fun.with.friends              
## Adrenaline.sports             
## Pets

PCA shows that the first 2 principal component is responsible for 23% of the variation, while first 7 component is responsible for more than 50% of the variation. For components greater than seventh component have less impact on variance, these are excluded from dataset and a new PCA will be applied for the focused group.

pca <- princomp(as.matrix(s[,1:7]),cor=T)
summary(pca,loadings=TRUE)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3    Comp.4     Comp.5
## Standard deviation     1.4418693 1.2491623 1.0493025 0.9060352 0.75668768
## Proportion of Variance 0.2969981 0.2229152 0.1572908 0.1172714 0.08179661
## Cumulative Proportion  0.2969981 0.5199133 0.6772041 0.7944755 0.87627215
##                            Comp.6     Comp.7
## Standard deviation     0.70619949 0.60611651
## Proportion of Variance 0.07124539 0.05248246
## Cumulative Proportion  0.94751754 1.00000000
## 
## Loadings:
##             Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## History      0.184 -0.603  0.158 -0.169  0.711 -0.200       
## Psychology   0.121 -0.501         0.801 -0.196  0.217       
## Politics     0.263 -0.503  0.135 -0.508 -0.631              
## Mathematics  0.509  0.137 -0.443               -0.407  0.595
## Physics      0.527        -0.444         0.116  0.125 -0.699
## Internet     0.344  0.213  0.647  0.259 -0.120 -0.526 -0.247
## PC           0.477  0.247  0.376         0.157  0.673  0.298

In the second PCA, first 2 PC is responsible for more than half of the variation. First PC is highly affected by Mathematics, Physics and PC, while third PC is highly affected by Internet and PC and fourth by Psychology. With the sixth PC, 94.7% of the variation is explained.

Let’s visualize the effect of components from PCA 2:

ggplot(data.frame(pc=1:7,cum_var=c(0.2969981,0.5199133,0.6772041,0.7944755,0.87627215,0.94751754,1.00000000)),aes(x=pc,y=cum_var)) + 
  geom_point() + 
  geom_line()

The plot above implies that with the first 3 component, nearly 70% of the variation is explained. If the first 6 components are selected, nearly 95% of the variation is explained.

6. Multidimensional Scaling

#create table for multidimensional scaling
s_mds_data <- s[,sapply(s,class)=="integer"] %>%
  select(History:Pets) %>%
  tbl_df()

#Take negative of correlation values and add 1 to make the distance values start from 0
s_mds_distance <- 1 - cor(s_mds_data)
#Apply Multidimensional Scaling
s_mds <- cmdscale(s_mds_distance,k=2)
#Provide column names
colnames(s_mds) <- c("x","y")

#print coordinates for categories
print(s_mds)
##                                  x            y
## History                 0.06372563  0.012956654
## Psychology              0.23826552 -0.024422880
## Politics               -0.14749496  0.107881476
## Mathematics            -0.32345626 -0.297702691
## Physics                -0.32464296 -0.499240678
## Internet               -0.43673303  0.187118496
## PC                     -0.58516404 -0.107666761
## Economy.Management     -0.30422083  0.383264449
## Biology                 0.25483318 -0.395803325
## Chemistry               0.12745384 -0.486209735
## Reading                 0.54412403 -0.002498553
## Geography              -0.04741594  0.042633809
## Foreign.languages       0.20488850  0.223464512
## Medicine                0.21225026 -0.352629731
## Law                    -0.07751035  0.252871501
## Cars                   -0.58387376  0.046912232
## Art.exhibitions         0.37143545  0.009747048
## Religion                0.19597972 -0.174122589
## Countryside..outdoors   0.14091867 -0.071530114
## Dancing                 0.27743365  0.136486519
## Musical.instruments     0.16521053 -0.134880364
## Writing                 0.30297822 -0.032446839
## Passive.sport          -0.30096850  0.108795592
## Active.sport           -0.24240576  0.047405197
## Gardening               0.19439442 -0.077481786
## Celebrities             0.06494596  0.467218540
## Shopping                0.20807212  0.463704403
## Science.and.technology -0.36445481 -0.278443856
## Theatre                 0.44547786  0.054248884
## Fun.with.friends       -0.05316373  0.253721685
## Adrenaline.sports      -0.30291870  0.041430796
## Pets                    0.08203608  0.095218109
#Display relational distance values of categories acquired from Multidimensional Scaling
ggplot(data.frame(s_mds),aes(x=x,y=y)) +
  geom_text(label=rownames(s_mds),size=3) +
  labs(x="x",y="y", title="MDS - Categories By Relational Distance") 

In the graph above, categories that seems to have close relationship with each other are:

When we think about real life relationship, the clusters MDS technique found out is not absurd at all. Furthermore, it really helps to decide about which groups to start include for possible pairings.

7. References