The aim of this study is to find connections between (o)esophageal cancer and 3 variables such as: + Alcohol consumption + Tobacco consumption + Age
I will be using “esoph” sample data set existing locally in R package. Data set consists of 88 oberservations for 6 different age groups, 4 different alcohol consumption groups and 4 different tobacco consumption groups. Each combination of groups has its own number of cancer cases and controls. For this data set has no missing values, it didn’t require cleanup in my analysis.
#Load required libraries
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
library(ggplot2)
library(ggcorrplot)
#Create custom colors vector
custom_colors=c("#E32800", "#FDB205","#FDF505","#009BDF","#E3FD05","#A7FD05","#7CBE00","#639700","#972000","#871D00","#50FF95","#00DEAF","#00B891","#00B5B8","#0080B8","#0063E8","#0047A7","#9F55FF","#C69BFF","#D69BFF","#B956FE","#DF56FE","#FE5681","#9BDF00")
#Look at the data structure and sample of first 10 observations
head(esoph,10)
## agegp alcgp tobgp ncases ncontrols
## 1 25-34 0-39g/day 0-9g/day 0 40
## 2 25-34 0-39g/day 10-19 0 10
## 3 25-34 0-39g/day 20-29 0 6
## 4 25-34 0-39g/day 30+ 0 5
## 5 25-34 40-79 0-9g/day 0 27
## 6 25-34 40-79 10-19 0 7
## 7 25-34 40-79 20-29 0 4
## 8 25-34 40-79 30+ 0 7
## 9 25-34 80-119 0-9g/day 0 2
## 10 25-34 80-119 10-19 0 1
# Control data ranges for every column
summary(esoph)
## agegp alcgp tobgp ncases ncontrols
## 25-34:15 0-39g/day:23 0-9g/day:24 Min. : 0.000 Min. : 1.00
## 35-44:15 40-79 :23 10-19 :24 1st Qu.: 0.000 1st Qu.: 3.00
## 45-54:16 80-119 :21 20-29 :20 Median : 1.000 Median : 6.00
## 55-64:16 120+ :21 30+ :20 Mean : 2.273 Mean :11.08
## 65-74:15 3rd Qu.: 4.000 3rd Qu.:14.00
## 75+ :11 Max. :17.000 Max. :60.00
Initially, I want to see the first ten combination groups having the highest risks by ordering all observations according to case numbers and control numbers.
## agegp alcgp tobgp ncases ncontrols
## 1 65-74 40-79 0-9g/day 17 34
## 2 55-64 40-79 0-9g/day 9 40
## 3 55-64 80-119 0-9g/day 9 18
## 4 55-64 80-119 10-19 8 15
## 5 45-54 40-79 0-9g/day 6 38
## 6 55-64 40-79 10-19 6 21
## 7 45-54 80-119 10-19 6 14
## 8 65-74 80-119 0-9g/day 6 13
## 9 55-64 120+ 10-19 6 7
## 10 65-74 0-39g/day 0-9g/day 5 48
It seems that three age groups having the highest number of cancer cases are:
When we look at top three alcohol consumption amounts in the highest risk groups, we get:
Let’s view general distributions of percentage of cancer cases normalized by number of controls.
#Grouping by age group, calculate percentage by sum of cases divided by sum of control counts
d1<- esoph %>% group_by(agegp) %>%
summarise(count = n(), total_cases = sum(ncases), total_controls = sum(ncontrols),
percentage=total_cases*100/total_controls)
ggplot(d1, aes(x=d1$agegp, y=d1$percentage,fill=d1$agegp)) +
geom_bar(stat="identity", position = "dodge") +
scale_fill_brewer(palette ="Set1")+
labs(x= 'Age Groups', y= 'Percentage Of Cancer Cases')+
guides(fill=guide_legend(title="Age Groups"))
We see that cancer cases increase by age. This shows the harmful affects of both tobacco and alcohol consumption add up with time. Now let’s look at the effects of each habit. First, tobacco consumptions by age groups:
#Group by both age and tobacco consumption groups
d2<- esoph %>% group_by(agegp,tobgp) %>%
summarise(count = n(), total_cases = sum(ncases), total_controls = sum(ncontrols),
percentage=total_cases*100/total_controls)
ggplot(d2, aes(x=d2$agegp, y=d2$percentage,fill=d2$tobgp)) +
geom_bar(stat="identity", position = "dodge") +
scale_fill_brewer(palette ="Set1")+
labs(x= 'Age Groups', y= 'Percentage Of Cancer Cases')+
guides(fill=guide_legend(title="Tobacco Consumption Groups Per Age Group"))
In the graph above, it’s seen that the cancer making effect of tobacco consumption increases by both age and amount of tobacco used per day. People who consume tobacco 120+ g/day involve in cancer cases dramatically higher than other consumption groups for especially 45-64 age range.
Now let’s look at alcohol consumption by age groups:
d3<- esoph %>% group_by(agegp,alcgp) %>%
summarise(count = n(), total_cases = sum(ncases), total_controls = sum(ncontrols),
percentage=total_cases*100/total_controls)
ggplot(d3, aes(x=d3$agegp, y=d3$percentage,fill=d3$alcgp)) +
geom_bar(stat="identity", position = "dodge") +
scale_fill_brewer(palette ="Set1")+
labs(x= 'Age Groups', y= 'Percentage Of Cancer Cases')+
guides(fill=guide_legend(title="Alcohol Consumption Groups Per Age Group"))
From the alcohol consumption view, we see that alcohol consumption has more negative effect than tobacco consumption for age groups 55-64 and over having alcohol consumption 40-79 g/day and over. Finally, I wanted to look at the added effects of both alcohol and tobacco consumption for the whole observations.
ggplot(esoph, aes(factor(esoph$alcgp), esoph$ncases*100/esoph$ncontrols, fill = esoph$tobgp)) +
geom_bar(stat="identity", position = "dodge") +
scale_fill_brewer(palette ="Set1")+
labs(x= 'Alcohol Consumption Groups', y= 'Percentage Of Cancer Cases')+
guides(fill=guide_legend(title="Tobacco Consumption Groups"))
The graph above shows that the amount of alcohol consumption has a lower effect on cancer cases than tobacco consumption. But as alcohol consumption per day increases, its effect also catches up with tobacco consumption. If alcohol consumption is over 80 g/day then not consuming any tobacco almost has no effect to prevent cancer.
As a second data set, I will analyse a subset of Young People Survey data set acquired from Kaggle. This data set consists of survey answers from 1010 people (thus containing 1010 rows) and 150 variables. 139 columns have numerical values (between 1 to 5 for “Strongly Disagree” to “Strongly Agree”) while 11 columns are categorical variables.
I will try to apply statistical models to this data set to acquire meaningful information about relations or related subgroups in the selected subset. This subset includes data from the columns named below:
#Read data from csv file
s=read.csv("responses.csv")
#Convert to data frame
s=as.data.frame(s)
#Take only the columns to be analysed : "History" to "Pets"
s=s[32:63]
#display first 5 values
head(s)
## History Psychology Politics Mathematics Physics Internet PC
## 1 1 5 1 3 3 5 3
## 2 1 3 4 5 2 4 4
## 3 1 2 1 5 2 4 2
## 4 4 4 5 4 1 3 1
## 5 3 2 3 2 2 2 2
## 6 5 3 4 2 3 4 4
## Economy.Management Biology Chemistry Reading Geography Foreign.languages
## 1 5 3 3 3 3 5
## 2 5 1 1 4 4 5
## 3 4 1 1 5 2 5
## 4 2 3 3 5 4 4
## 5 2 3 3 5 2 3
## 6 1 4 4 3 3 4
## Medicine Law Cars Art.exhibitions Religion Countryside..outdoors Dancing
## 1 3 1 1 1 1 5 3
## 2 1 2 2 2 1 1 1
## 3 2 3 1 5 5 5 5
## 4 2 5 1 5 4 1 1
## 5 3 2 3 1 4 4 1
## 6 4 3 5 2 2 5 1
## Musical.instruments Writing Passive.sport Active.sport Gardening
## 1 3 2 1 5 5
## 2 1 1 1 1 1
## 3 5 5 5 2 1
## 4 1 3 1 1 1
## 5 3 1 3 1 4
## 6 5 1 5 4 2
## Celebrities Shopping Science.and.technology Theatre Fun.with.friends
## 1 1 4 4 2 5
## 2 2 3 3 2 4
## 3 1 4 2 5 5
## 4 2 4 3 1 2
## 5 3 3 3 2 4
## 6 1 2 3 1 3
## Adrenaline.sports Pets
## 1 4 4
## 2 2 5
## 3 5 5
## 4 1 1
## 5 2 1
## 6 3 2
#display statistical descriptions
summary(s)
## History Psychology Politics Mathematics
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000
## Median :3.000 Median :3.000 Median :2.000 Median :2.000
## Mean :3.207 Mean :3.138 Mean :2.596 Mean :2.335
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:3.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## NA's :2 NA's :5 NA's :1 NA's :3
## Physics Internet PC Economy.Management
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:4.000 1st Qu.:2.000 1st Qu.:1.000
## Median :2.000 Median :4.000 Median :3.000 Median :2.000
## Mean :2.065 Mean :4.176 Mean :3.136 Mean :2.644
## 3rd Qu.:3.000 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## NA's :3 NA's :4 NA's :6 NA's :5
## Biology Chemistry Reading Geography
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:2.000
## Median :2.000 Median :2.000 Median :3.000 Median :3.000
## Mean :2.665 Mean :2.165 Mean :3.159 Mean :3.083
## 3rd Qu.:4.000 3rd Qu.:3.000 3rd Qu.:5.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## NA's :6 NA's :10 NA's :6 NA's :9
## Foreign.languages Medicine Law Cars
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:3.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :4.000 Median :2.000 Median :2.000 Median :3.000
## Mean :3.778 Mean :2.516 Mean :2.257 Mean :2.687
## 3rd Qu.:5.000 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## NA's :5 NA's :5 NA's :1 NA's :4
## Art.exhibitions Religion Countryside..outdoors Dancing
## Min. :1.00 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.00 1st Qu.:1.000 1st Qu.:3.000 1st Qu.:1.000
## Median :2.00 Median :2.000 Median :4.000 Median :2.000
## Mean :2.59 Mean :2.273 Mean :3.687 Mean :2.462
## 3rd Qu.:4.00 3rd Qu.:3.000 3rd Qu.:5.000 3rd Qu.:4.000
## Max. :5.00 Max. :5.000 Max. :5.000 Max. :5.000
## NA's :6 NA's :3 NA's :7 NA's :3
## Musical.instruments Writing Passive.sport Active.sport
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:2.000
## Median :2.000 Median :1.000 Median :3.000 Median :3.000
## Mean :2.324 Mean :1.901 Mean :3.388 Mean :3.291
## 3rd Qu.:4.000 3rd Qu.:3.000 3rd Qu.:5.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## NA's :1 NA's :6 NA's :15 NA's :4
## Gardening Celebrities Shopping Science.and.technology
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:2.000
## Median :1.000 Median :2.000 Median :3.000 Median :3.000
## Mean :1.907 Mean :2.362 Mean :3.277 Mean :3.234
## 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## NA's :7 NA's :2 NA's :2 NA's :6
## Theatre Fun.with.friends Adrenaline.sports Pets
## Min. :1.000 Min. :2.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:4.000 1st Qu.:2.000 1st Qu.:2.000
## Median :3.000 Median :5.000 Median :3.000 Median :4.000
## Mean :3.025 Mean :4.558 Mean :2.948 Mean :3.335
## 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## NA's :8 NA's :4 NA's :3 NA's :4
#Get NAs summary in percentage
colMeans(is.na(s))*100
## History Psychology Politics
## 0.1980198 0.4950495 0.0990099
## Mathematics Physics Internet
## 0.2970297 0.2970297 0.3960396
## PC Economy.Management Biology
## 0.5940594 0.4950495 0.5940594
## Chemistry Reading Geography
## 0.9900990 0.5940594 0.8910891
## Foreign.languages Medicine Law
## 0.4950495 0.4950495 0.0990099
## Cars Art.exhibitions Religion
## 0.3960396 0.5940594 0.2970297
## Countryside..outdoors Dancing Musical.instruments
## 0.6930693 0.2970297 0.0990099
## Writing Passive.sport Active.sport
## 0.5940594 1.4851485 0.3960396
## Gardening Celebrities Shopping
## 0.6930693 0.1980198 0.1980198
## Science.and.technology Theatre Fun.with.friends
## 0.5940594 0.7920792 0.3960396
## Adrenaline.sports Pets
## 0.2970297 0.3960396
#remove NAs
s=s[complete.cases(s),]
#Let's display correlation matrix for first 5 columns
cor(s[,1:5])
## History Psychology Politics Mathematics Physics
## History 1.000000000 0.29637359 0.4062343 0.008786356 0.07088193
## Psychology 0.296373585 1.00000000 0.1910873 0.043974943 0.07142061
## Politics 0.406234287 0.19108731 1.0000000 0.103492328 0.13388052
## Mathematics 0.008786356 0.04397494 0.1034923 1.000000000 0.60785444
## Physics 0.070881929 0.07142061 0.1338805 0.607854436 1.00000000
# Principal Component Analysis to get information about which components have higher effects on variance
pca <- princomp(as.matrix(s[,1:32]),cor=T)
summary(pca,loadings=TRUE)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 2.0433958 1.8069970 1.60438109 1.46391156
## Proportion of Variance 0.1304833 0.1020387 0.08043871 0.06696991
## Cumulative Proportion 0.1304833 0.2325220 0.31296072 0.37993063
## Comp.5 Comp.6 Comp.7 Comp.8
## Standard deviation 1.26463164 1.18461175 1.06955650 1.05602251
## Proportion of Variance 0.04997791 0.04385328 0.03574847 0.03484949
## Cumulative Proportion 0.42990854 0.47376182 0.50951029 0.54435978
## Comp.9 Comp.10 Comp.11 Comp.12
## Standard deviation 1.04363010 1.00633062 0.95951254 0.93789360
## Proportion of Variance 0.03403637 0.03164692 0.02877076 0.02748889
## Cumulative Proportion 0.57839615 0.61004306 0.63881382 0.66630271
## Comp.13 Comp.14 Comp.15 Comp.16
## Standard deviation 0.93168719 0.89503805 0.87262765 0.85510068
## Proportion of Variance 0.02712628 0.02503416 0.02379622 0.02284991
## Cumulative Proportion 0.69342899 0.71846315 0.74225937 0.76510928
## Comp.17 Comp.18 Comp.19 Comp.20
## Standard deviation 0.8329419 0.80296355 0.78632469 0.76747805
## Proportion of Variance 0.0216810 0.02014845 0.01932208 0.01840695
## Cumulative Proportion 0.7867903 0.80693874 0.82626082 0.84466777
## Comp.21 Comp.22 Comp.23 Comp.24
## Standard deviation 0.74821762 0.72513227 0.71054805 0.70905599
## Proportion of Variance 0.01749468 0.01643178 0.01577745 0.01571126
## Cumulative Proportion 0.86216245 0.87859422 0.89437168 0.91008294
## Comp.25 Comp.26 Comp.27 Comp.28
## Standard deviation 0.69087699 0.65750430 0.64296368 0.60888823
## Proportion of Variance 0.01491597 0.01350975 0.01291882 0.01158578
## Cumulative Proportion 0.92499891 0.93850866 0.95142748 0.96301326
## Comp.29 Comp.30 Comp.31 Comp.32
## Standard deviation 0.5954011 0.551814415 0.539017589 0.483770803
## Proportion of Variance 0.0110782 0.009515598 0.009079374 0.007313568
## Cumulative Proportion 0.9740915 0.983607058 0.992686432 1.000000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## History -0.196 -0.184 0.243 -0.130 0.243 -0.142
## Psychology -0.244 0.119 -0.112 0.102
## Politics -0.131 -0.202 -0.255 0.204 -0.270 0.153
## Mathematics -0.295 0.126 0.133 -0.309 0.220
## Physics -0.344 0.241 0.136 -0.155 0.151
## Internet -0.237 -0.127 -0.400
## PC -0.377 0.134 -0.291 -0.137
## Economy.Management -0.174 -0.306 -0.214 -0.182 0.166
## Biology -0.269 0.377 -0.151 -0.221
## Chemistry -0.199 0.421 -0.230
## Reading -0.282 0.188 0.176
## Geography -0.160 -0.145 -0.151 0.218 -0.226
## Foreign.languages -0.208 -0.217 -0.117 0.165
## Medicine -0.269 0.317 -0.118 -0.266
## Law -0.143 -0.147 -0.277 -0.363 0.116
## Cars -0.350 -0.168
## Art.exhibitions -0.313 0.201
## Religion -0.230 0.166 -0.128
## Countryside..outdoors -0.194 -0.102 0.334 -0.167
## Dancing -0.241 -0.230 0.107
## Musical.instruments -0.210 0.361
## Writing -0.231 0.161 0.194 -0.114 -0.214
## Passive.sport -0.162 -0.222 0.135 0.116
## Active.sport -0.200 -0.241 0.138 0.340
## Gardening -0.192 -0.178 -0.445
## Celebrities -0.183 -0.351 -0.181 -0.289 -0.234
## Shopping 0.136 -0.178 -0.399 -0.153 -0.241
## Science.and.technology -0.348
## Theatre -0.302 0.128 0.144 0.256
## Fun.with.friends -0.135 -0.260 0.154 0.498
## Adrenaline.sports -0.254 -0.221 0.200 0.313 0.179
## Pets -0.256 -0.138
## Comp.8 Comp.9 Comp.10 Comp.11 Comp.12 Comp.13
## History -0.194 0.162
## Psychology 0.214 -0.181 -0.232 -0.272 0.536
## Politics 0.108 -0.139 -0.187
## Mathematics 0.273 0.360 0.129 0.147
## Physics 0.151 0.127 0.160
## Internet -0.370 -0.107 -0.147 0.264
## PC -0.144 -0.183
## Economy.Management 0.288 -0.163 0.189
## Biology
## Chemistry 0.142
## Reading -0.173 0.116 0.130 0.150
## Geography -0.243 -0.335 0.105 0.349 0.228 0.113
## Foreign.languages -0.227 -0.364 -0.122 0.187 0.114
## Medicine -0.135 -0.178
## Law 0.142 -0.135 -0.179
## Cars 0.101 -0.125 -0.331
## Art.exhibitions 0.166 -0.156
## Religion -0.220 -0.370 -0.456 -0.146
## Countryside..outdoors -0.119 -0.240 0.371 -0.412
## Dancing 0.326 -0.204 0.108 -0.114 0.102
## Musical.instruments 0.158 -0.106 -0.226 -0.270 0.141 -0.110
## Writing 0.171 0.212 -0.319 0.309 0.149
## Passive.sport -0.150 0.259 -0.560 0.466 0.125
## Active.sport 0.286 -0.204 0.214 0.153
## Gardening 0.200 0.108 0.220 -0.109
## Celebrities
## Shopping -0.108
## Science.and.technology -0.195 0.194 -0.169 0.186 -0.121 -0.194
## Theatre -0.115 0.190 0.145 -0.258
## Fun.with.friends -0.208 -0.173 -0.147
## Adrenaline.sports 0.103 -0.120 0.141
## Pets -0.132 0.467 0.336 -0.108 0.306
## Comp.14 Comp.15 Comp.16 Comp.17 Comp.18 Comp.19
## History -0.252 -0.233 0.141 -0.130
## Psychology -0.153 0.266 0.125 0.116
## Politics -0.146 0.149 -0.140
## Mathematics -0.183 -0.171
## Physics -0.130 -0.150 -0.125 0.126
## Internet -0.174 -0.157 0.195 0.216
## PC -0.100 0.226 0.103
## Economy.Management 0.190 0.148 0.333
## Biology
## Chemistry -0.101
## Reading -0.177 -0.411
## Geography -0.155 -0.236 0.302
## Foreign.languages 0.408 -0.329 -0.154
## Medicine 0.101 0.165
## Law 0.158 0.187 0.301
## Cars 0.100 0.174 -0.164 0.145
## Art.exhibitions -0.114 0.343 -0.122 0.313
## Religion -0.147 -0.103 -0.125 -0.435 -0.322
## Countryside..outdoors 0.192
## Dancing -0.145 -0.200 0.345 0.202 -0.211
## Musical.instruments 0.428 -0.197 -0.200 0.303
## Writing 0.169 -0.250
## Passive.sport 0.296 -0.128 -0.274
## Active.sport -0.175 -0.220 0.172 -0.432 -0.108
## Gardening -0.146 0.564 -0.193
## Celebrities -0.185 -0.248 -0.252 0.131
## Shopping -0.105 -0.101 -0.178 -0.282
## Science.and.technology 0.182 0.189 -0.554
## Theatre 0.252 0.185
## Fun.with.friends -0.130 -0.471 0.304 -0.160
## Adrenaline.sports 0.111 -0.365 0.278
## Pets 0.465 -0.209 -0.240
## Comp.20 Comp.21 Comp.22 Comp.23 Comp.24 Comp.25
## History 0.369 0.129 -0.487
## Psychology 0.242 0.163 0.228 -0.162
## Politics 0.154 0.119 0.337 0.355
## Mathematics 0.169
## Physics -0.162 -0.172 0.129
## Internet -0.225 -0.113 0.219
## PC 0.249
## Economy.Management -0.159 0.224 0.180 -0.443 -0.235
## Biology 0.141 -0.102
## Chemistry -0.102 -0.221 0.135
## Reading -0.271 -0.131 0.303 -0.144
## Geography -0.223 0.285 -0.175
## Foreign.languages 0.303 0.202 -0.111 0.174 0.149
## Medicine -0.110
## Law -0.262 -0.183 -0.438 0.128
## Cars 0.380 -0.464 0.262 -0.220
## Art.exhibitions 0.286 0.199 -0.126 0.242
## Religion -0.259
## Countryside..outdoors 0.189 -0.397 -0.125 -0.135 -0.142
## Dancing -0.311 0.108 -0.310 0.274 0.102 0.142
## Musical.instruments 0.213 0.230 -0.103
## Writing -0.224 -0.178 -0.358 -0.110 -0.165 -0.197
## Passive.sport
## Active.sport 0.301 -0.258 -0.139
## Gardening 0.189 0.236 0.112 0.157
## Celebrities 0.254 -0.187 0.201 0.318 0.237 -0.228
## Shopping 0.122 -0.118 -0.342 -0.322 0.255
## Science.and.technology -0.102 -0.148 0.107
## Theatre -0.147 0.153
## Fun.with.friends 0.126 -0.126 -0.305
## Adrenaline.sports -0.278 0.290 0.419
## Pets -0.263 0.152
## Comp.26 Comp.27 Comp.28 Comp.29 Comp.30 Comp.31
## History -0.247 0.164 -0.179
## Psychology -0.152 0.144 -0.131 0.195
## Politics 0.335 0.385 0.120 0.113
## Mathematics -0.195 0.575
## Physics 0.263 -0.123 -0.650
## Internet 0.206 0.359 -0.223
## PC -0.189 -0.442 0.499 0.162
## Economy.Management 0.101 -0.239
## Biology
## Chemistry 0.165 0.623
## Reading -0.245 0.303 -0.348 -0.174 -0.155
## Geography -0.272
## Foreign.languages 0.178 -0.190 0.137 0.118
## Medicine 0.155 -0.632
## Law -0.253 -0.319
## Cars 0.261
## Art.exhibitions 0.332 -0.122 -0.262 -0.311 -0.102
## Religion -0.142 0.111
## Countryside..outdoors 0.280 0.132
## Dancing -0.260
## Musical.instruments -0.257 0.171 -0.179
## Writing 0.289 0.140 0.150
## Passive.sport -0.112 -0.129
## Active.sport 0.125 0.204 0.126
## Gardening -0.193 0.135
## Celebrities 0.169 -0.194
## Shopping -0.186 0.295 -0.215 0.163 -0.134
## Science.and.technology -0.175 -0.435
## Theatre -0.202 0.486 0.372 0.175 0.161
## Fun.with.friends 0.136
## Adrenaline.sports -0.245
## Pets
## Comp.32
## History
## Psychology
## Politics
## Mathematics
## Physics
## Internet
## PC
## Economy.Management
## Biology 0.792
## Chemistry -0.390
## Reading
## Geography
## Foreign.languages
## Medicine -0.397
## Law
## Cars
## Art.exhibitions
## Religion
## Countryside..outdoors
## Dancing
## Musical.instruments
## Writing
## Passive.sport
## Active.sport
## Gardening
## Celebrities
## Shopping
## Science.and.technology
## Theatre
## Fun.with.friends
## Adrenaline.sports
## Pets
PCA shows that the first 2 principal component is responsible for 23% of the variation, while first 7 component is responsible for more than 50% of the variation. For components greater than seventh component have less impact on variance, these are excluded from dataset and a new PCA will be applied for the focused group.
pca <- princomp(as.matrix(s[,1:7]),cor=T)
summary(pca,loadings=TRUE)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 1.4418693 1.2491623 1.0493025 0.9060352 0.75668768
## Proportion of Variance 0.2969981 0.2229152 0.1572908 0.1172714 0.08179661
## Cumulative Proportion 0.2969981 0.5199133 0.6772041 0.7944755 0.87627215
## Comp.6 Comp.7
## Standard deviation 0.70619949 0.60611651
## Proportion of Variance 0.07124539 0.05248246
## Cumulative Proportion 0.94751754 1.00000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## History 0.184 -0.603 0.158 -0.169 0.711 -0.200
## Psychology 0.121 -0.501 0.801 -0.196 0.217
## Politics 0.263 -0.503 0.135 -0.508 -0.631
## Mathematics 0.509 0.137 -0.443 -0.407 0.595
## Physics 0.527 -0.444 0.116 0.125 -0.699
## Internet 0.344 0.213 0.647 0.259 -0.120 -0.526 -0.247
## PC 0.477 0.247 0.376 0.157 0.673 0.298
In the second PCA, first 2 PC is responsible for more than half of the variation. First PC is highly affected by Mathematics, Physics and PC, while third PC is highly affected by Internet and PC and fourth by Psychology. With the sixth PC, 94.7% of the variation is explained.
Let’s visualize the effect of components from PCA 2:
ggplot(data.frame(pc=1:7,cum_var=c(0.2969981,0.5199133,0.6772041,0.7944755,0.87627215,0.94751754,1.00000000)),aes(x=pc,y=cum_var)) +
geom_point() +
geom_line()
The plot above implies that with the first 3 component, nearly 70% of the variation is explained. If the first 6 components are selected, nearly 95% of the variation is explained.
#create table for multidimensional scaling
s_mds_data <- s[,sapply(s,class)=="integer"] %>%
select(History:Pets) %>%
tbl_df()
#Take negative of correlation values and add 1 to make the distance values start from 0
s_mds_distance <- 1 - cor(s_mds_data)
#Apply Multidimensional Scaling
s_mds <- cmdscale(s_mds_distance,k=2)
#Provide column names
colnames(s_mds) <- c("x","y")
#print coordinates for categories
print(s_mds)
## x y
## History 0.06372563 0.012956654
## Psychology 0.23826552 -0.024422880
## Politics -0.14749496 0.107881476
## Mathematics -0.32345626 -0.297702691
## Physics -0.32464296 -0.499240678
## Internet -0.43673303 0.187118496
## PC -0.58516404 -0.107666761
## Economy.Management -0.30422083 0.383264449
## Biology 0.25483318 -0.395803325
## Chemistry 0.12745384 -0.486209735
## Reading 0.54412403 -0.002498553
## Geography -0.04741594 0.042633809
## Foreign.languages 0.20488850 0.223464512
## Medicine 0.21225026 -0.352629731
## Law -0.07751035 0.252871501
## Cars -0.58387376 0.046912232
## Art.exhibitions 0.37143545 0.009747048
## Religion 0.19597972 -0.174122589
## Countryside..outdoors 0.14091867 -0.071530114
## Dancing 0.27743365 0.136486519
## Musical.instruments 0.16521053 -0.134880364
## Writing 0.30297822 -0.032446839
## Passive.sport -0.30096850 0.108795592
## Active.sport -0.24240576 0.047405197
## Gardening 0.19439442 -0.077481786
## Celebrities 0.06494596 0.467218540
## Shopping 0.20807212 0.463704403
## Science.and.technology -0.36445481 -0.278443856
## Theatre 0.44547786 0.054248884
## Fun.with.friends -0.05316373 0.253721685
## Adrenaline.sports -0.30291870 0.041430796
## Pets 0.08203608 0.095218109
#Display relational distance values of categories acquired from Multidimensional Scaling
ggplot(data.frame(s_mds),aes(x=x,y=y)) +
geom_text(label=rownames(s_mds),size=3) +
labs(x="x",y="y", title="MDS - Categories By Relational Distance")
In the graph above, categories that seems to have close relationship with each other are:
When we think about real life relationship, the clusters MDS technique found out is not absurd at all. Furthermore, it really helps to decide about which groups to start include for possible pairings.