1 Introduction

1.1 What type of analysis can be done with this data set?

1.2 Load packages and data

# Download data file from source

# Install tidyverse if not already installed
if (!("tidyverse" %in% installed.packages())) {
    install.packages("tidyverse", repos = "https://cran.r-project.org")
}
# Load tidyverse package
library(tidyverse)
# For correlogram plots
library(ggcorrplot)
library(corrplot)
# Load the data into variable d

#Create custom colors vector
custom_colors=c("#E32800", "#FDB205","#FDF505","#009BDF","#E3FD05","#A7FD05","#7CBE00","#639700","#972000","#871D00","#50FF95","#00DEAF","#00B891","#00B5B8","#0080B8","#0063E8","#0047A7","#9F55FF","#C69BFF","#D69BFF","#B956FE","#DF56FE","#FE5681","#9BDF00")

1.3 General View Of Data And Assumptions

As shown below, most numerical survey responses were collected on a scale of 1 to 5. We therefore need a systematic way to first group these answers and then compare the resulting groups. I separated the answers into two main groups. These groups correspond to:

I ignored neutral answers (a response of 3 in any category) in this analysis in order to capture the stronger tendencies at both ends of the scale.
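
The grouping idea can be sketched as follows. The exact group definitions are not shown above, so the cut points used here (1 and 2 as "low", 4 and 5 as "high") are an assumption, and `group_response` is a hypothetical helper name, not part of the original analysis.

```r
# Hypothetical helper: collapse a 1-5 response into two groups.
# Assumed cut points: 1-2 -> "low", 4-5 -> "high"; neutral 3 and NA are dropped to NA.
group_response <- function(x) {
  out <- rep(NA_character_, length(x))
  out[x %in% c(1, 2)] <- "low"
  out[x %in% c(4, 5)] <- "high"
  factor(out, levels = c("low", "high"))
}

# Toy responses for illustration only (not taken from the survey)
responses <- c(1, 2, 3, 4, 5, NA, 3, 5)
grouped <- group_response(responses)

# Count how many answers fall in each group, including the ignored ones
table(grouped, useNA = "ifany")
```

Returning a factor with fixed levels keeps the two groups in the same order in later tables and plots, even if one group happens to be empty for a given question.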

Each question in the survey also belongs to a general category. These are:

I will therefore look for possible correlations within the categories I select, and also between two categories of interest.
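
As a rough illustration of the two kinds of comparison, the sketch below computes correlations within one category and between two categories. The data frame `toy` and its column names are hypothetical stand-ins for survey columns, and the values are random.

```r
set.seed(1)

# Toy 1-5 responses; the column names are hypothetical, not the survey's
toy <- data.frame(
  music_a = sample(1:5, 50, replace = TRUE),
  music_b = sample(1:5, 50, replace = TRUE),
  hobby_a = sample(1:5, 50, replace = TRUE),
  hobby_b = sample(1:5, 50, replace = TRUE)
)
music_cols <- c("music_a", "music_b")
hobby_cols <- c("hobby_a", "hobby_b")

# Correlations inside one category
within_music <- cor(toy[music_cols], use = "pairwise.complete.obs")

# Correlations between two categories: rows are music columns, columns are hobby columns
between <- cor(toy[music_cols], toy[hobby_cols], use = "pairwise.complete.obs")
```

Using `use = "pairwise.complete.obs"` keeps as many rows as possible for each pair of columns, which matters when many columns contain missing values.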

#First five observations and data types for all columns
head(d, 5)
str(d)

1.4 Data Integrity Check

Let’s check whether we have rows with NA values.

#Get NA totals by columns
na_count <- sapply(d, function(y) sum(is.na(y)))
na_count <- data.frame(na_count)

Because the number of missing values is large relative to the total row count, removing every row that contains an NA is not preferred. Instead, rows will be removed only for the columns involved in each visualization, where necessary.
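
The difference between the two approaches can be seen in a small sketch. The data frame `toy` and its values are made up for illustration, and base R is used here so that the example runs without extra packages.

```r
# Toy data with one NA in a different column on three of the four rows
toy <- data.frame(
  Age = c(19, 20, NA, 22),
  Gender = c("female", NA, "male", "male"),
  Alcohol = c("never", "social drinker", "never", NA)
)

# Dropping every row that has an NA anywhere leaves very little data
all_complete <- na.omit(toy)

# Dropping rows only where the columns needed for one plot are NA keeps more
age_gender <- toy[complete.cases(toy[, c("Age", "Gender")]), ]

nrow(all_complete)
nrow(age_gender)
```

With this toy data, the column-wise filter keeps two rows where `na.omit()` keeps only one.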

2 Visualizations

Below are some sample visualizations to help understand the data set. First, I want to see the respondents’ profile by age, gender, and education level.

#See how many empty values exist in Gender
sum(as.character(d$Gender) == "", na.rm = TRUE)

#Replace empty values with NA
d$Gender[as.character(d$Gender)==""] <- NA

#Create histogram plot with variable and label parameters
ggplot(d, aes(x=Age, fill=Gender))+
  geom_histogram(binwidth=1, alpha=.5, position="dodge")+
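  labs(y="Participant Count",x="Age",title="Participant Numbers By Age (Gender)")+
  # The scale call that opened these arguments is missing in the source; scale_fill_discrete() is assumed
  scale_fill_discrete(
                    #breaks=c("Female", "Male", "NA"),
                    labels=c("Female", "Male", "NA"))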


#Replace empty values with NA
d$Education[as.character(d$Education)==""] <- NA

#Create histogram plot with variable, axis and label parameters
ggplot(d, aes(x=Age, fill=Education))+
  geom_histogram(binwidth=1.25, alpha=1, position="dodge")+
  scale_x_continuous(breaks = c(seq(from = 10, to = 32, by = 1)),limits = c(15,31))+
  scale_y_continuous(breaks = c(seq(from = 0, to = 300, by = 20)),limits = c(0,160))+
  labs(y="Participant Count",x="Age",title="Participant Numbers By Age (Education Level)")

music_pref <- d  %>%
  na.omit() %>%

# Correlation matrix
corr <- round(cor(music_pref), 2)

ggcorrplot(corr, hc.order = TRUE, 
           type = "lower", 
           lab = TRUE, 
           lab_size = 2, 
           colors = custom_colors, 
           title="Correlogram of Music Preferences")


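# M is not defined in the source; the correlation matrix corr computed above is assumed
corrplot(corr, diag = FALSE, order = "FPC", tl.pos = "td", tl.cex = 0.8,
         method = "color", type = "upper",
         col = colorRampPalette(c("dark blue", "white", "orange"))(200))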

According to the correlation matrix, the strongest negative correlation is between Pop and Metal-Hardrock. Positive correlations exist between Opera and Classical.Music, and among Metal-Hardrock, Rock, and Punk.
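
Reading the strongest pairs off a plot can be error-prone, so one possible way to list them directly is sketched below. The matrix `corr_toy`, its variable names A, B, C, and its values are made up for illustration. They are not results from the survey.

```r
# Toy symmetric correlation matrix with hypothetical variables and values
vars <- c("A", "B", "C")
corr_toy <- matrix(
  c( 1.0, -0.2, -0.3,
    -0.2,  1.0,  0.5,
    -0.3,  0.5,  1.0),
  nrow = 3, dimnames = list(vars, vars)
)

# Look only at the lower triangle so each pair appears once
idx <- which(lower.tri(corr_toy), arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(corr_toy)[idx[, "row"]],
  var2 = colnames(corr_toy)[idx[, "col"]],
  r = corr_toy[idx]
)

# Sort from most negative to most positive
pairs <- pairs[order(pairs$r), ]
head(pairs, 1)  # most negative pair
tail(pairs, 1)  # most positive pair
```

The same steps should work on any square correlation matrix with row and column names, such as the one returned by `cor()`.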

# Create plots for habits by age
#p1 <- 
  ggplot(d, aes(x=Age, fill=Alcohol))+
  geom_histogram(binwidth=1.25, alpha=1, position="dodge")+
  scale_x_continuous(breaks = c(seq(from = 10, to = 32, by = 1)),limits = c(15,31))+
  scale_y_continuous(breaks = c(seq(from = 0, to = 300, by = 20)),limits = c(0,150))+
  labs(y="Participant Count",x="Age",title="Participant Numbers By Age (Alcohol Consumption)")

  ggplot(d, aes(x=Age, fill=Smoking))+
  geom_histogram(binwidth=1.25, alpha=1, position="dodge")+
  scale_x_continuous(breaks = c(seq(from = 10, to = 32, by = 1)),limits = c(15,31))+
  scale_y_continuous(breaks = c(seq(from = 0, to = 300, by = 20)),limits = c(0,100))+
  labs(y="Participant Count",x="Age",title="Participant Numbers By Age (Smoking)")


The graphs show that participant counts for these harmful habits also peak between ages 18 and 21.
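
Because the histograms show counts, ages with more respondents will naturally have taller bars. A complementary check is to look at the share of each answer within an age, as sketched below. The data frame `toy` and the answer labels are hypothetical and only illustrate the calculation.

```r
# Toy data; the ages and answer labels are made up for illustration
toy <- data.frame(
  Age = c(18, 18, 18, 18, 25, 25),
  Smoking = c("current smoker", "never smoked", "never smoked",
              "never smoked", "current smoker", "never smoked")
)

# Counts per age and answer (what a dodged histogram displays)
counts <- table(toy$Age, toy$Smoking)

# Row-wise shares: the proportion of each answer within an age
shares <- prop.table(counts, margin = 1)
shares
```

In this toy data both ages have one current smoker, but the share is lower at 18 because that age has more respondents.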
