First we import the egm_example_data to R with readr
package as a tibble. And check with glimpse function for its structure.
egm_df <- read_excel("egm_example_data.xlsx")
glimpse(egm_df)
## Observations: 1,368
## Variables: 15
## $ date <chr> "06.01.2017", "06.01.2017"...
## $ pension_fund_company <chr> "Aegon Emeklilik ve Hayat"...
## $ n_of_participants <dbl> 42826, 118867, 726684, 111...
## $ fund_size_participants <dbl> 139193651, 1840701596, 738...
## $ gov_contribution <dbl> 14409322, 231156574, 81535...
## $ contribution <dbl> 109589132, 1466908275, 579...
## $ n_of_pensioners <chr> "269", "2300", "5640", "11...
## $ n_of_ind_contracts <dbl> 34003, 113222, 627972, 938...
## $ n_of_group_ind_contracts <chr> "14127", "18813", "148377"...
## $ n_of_employer_group_certificates <chr> "884", "4488", "98772", "2...
## $ n_total <dbl> 49014, 136523, 875121, 128...
## $ size_of_ind_contracts <dbl> 87071132, 1165167454, 4001...
## $ size_of_group_ind_contracts <dbl> 17725697, 244504362, 13248...
## $ size_of_employer_group_certificates <dbl> 1808842, 42707063, 3775014...
## $ size_total <dbl> 106605671, 1452378880, 570...
Here some date and numeric columns are imported as character. So we need to tidy the data. We convert columns 7,9 and 10 to integer.
cols = c(7, 9, 10)
egm_df[,cols] = apply(egm_df[,cols], 2, function(x) as.integer(as.character(x)))
Then convert the date column to the date type with lubridate
package
# converting the date column to the date type
egm_df <- egm_df %>% mutate(date=dmy(date))
# creating a year and amonth column
egm_df <- egm_df %>% mutate(month=month(date))
egm_df <- egm_df %>% mutate(year=year(date))
First we check the number of records for each company
##number of records for companies
egm_df %>% group_by(pension_fund_company) %>% count()
## # A tibble: 19 x 2
## # Groups: pension_fund_company [19]
## pension_fund_company n
## <chr> <int>
## 1 Aegon Emeklilik ve Hayat 76
## 2 Allianz Hayat ve Emeklilik 76
## 3 Allianz Yasam ve Emeklilik 76
## 4 Anadolu Hayat Emeklilik 76
## 5 Asya Emeklilik ve Hayat 7
## 6 Avivasa Emeklilik ve Hayat 76
## 7 Axa Hayat ve Emeklilik 76
## 8 Bereket Emeklilik ve Hayat 69
## 9 BNP Paribas Cardif Emeklilik 76
## 10 Cigna Finans Emeklilik ve Hayat 76
## 11 Fiba Emeklilik ve Hayat 76
## 12 Garanti Emeklilik ve Hayat 76
## 13 Groupama Emeklilik 76
## 14 Halk Hayat ve Emeklilik 76
## 15 Katilim Emeklilik ve Hayat 76
## 16 Metlife Emeklilik ve Hayat 76
## 17 NN Hayat ve Emeklilik 76
## 18 Vakif Emeklilik ve Hayat 76
## 19 Ziraat Hayat ve Emeklilik 76
Then we will remove the Asya Emeklilik ve Hayat and Bereket Emeklilik ve Hayat due to the missing months. Then we group for those companies that have an average number of pensioners over 5000 for the year 2017 and 2018
egm_df %>% filter(pension_fund_company != c("Asya Emeklilik ve Hayat", "Bereket Emeklilik ve Hayat")) %>% group_by(pension_fund_company, year) %>% summarise(number_of_pensioners = mean(n_of_pensioners)) %>% filter(number_of_pensioners > 5000)
## # A tibble: 9 x 3
## # Groups: pension_fund_company [5]
## pension_fund_company year number_of_pensioners
## <chr> <dbl> <dbl>
## 1 Allianz Yasam ve Emeklilik 2017 6861.
## 2 Allianz Yasam ve Emeklilik 2018 8349.
## 3 Anadolu Hayat Emeklilik 2017 13922.
## 4 Anadolu Hayat Emeklilik 2018 16974.
## 5 Avivasa Emeklilik ve Hayat 2017 11003.
## 6 Avivasa Emeklilik ve Hayat 2018 13260.
## 7 Garanti Emeklilik ve Hayat 2017 7306.
## 8 Garanti Emeklilik ve Hayat 2018 9223.
## 9 Vakif Emeklilik ve Hayat 2018 5764.
Then we plot the change of the number of pensioners over time for the resulting companies