This is an exercise about weather data of four travel destinations. The data consists of temperature (in Celsius) history of 4 popular travel destinations (NYC, Amsterdam, London and Venice) between November 2015 and October 2017. Raw data is gathered from Weather Underground and it is only for educational purposes.
setwd("F:/DUYGU/MEF_BDA/503/TravelWeather")
I am going to use tidyverse extensively trough out the exercise.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.4.3
## -- Attaching packages ----------------------------------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## <U+221A> ggplot2 2.2.1 <U+221A> purrr 0.2.4
## <U+221A> tibble 1.3.4 <U+221A> dplyr 0.7.4
## <U+221A> tidyr 0.7.2 <U+221A> stringr 1.2.0
## <U+221A> readr 1.1.1 <U+221A> forcats 0.2.0
## Warning: package 'ggplot2' was built under R version 3.4.3
## -- Conflicts -------------------------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Let’s load the data set.
load("travel_weather.RData")
The data frame is in tibble type.
travel_weather %>% tbl_df()
## # A tibble: 731 x 7
## year month day Amsterdam London NYC Venice
## * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2015 11 1 8 8 16 13
## 2 2015 11 2 10 11 15 10
## 3 2015 11 3 9 11 16 9
## 4 2015 11 4 12 11 17 10
## 5 2015 11 5 13 13 18 12
## 6 2015 11 6 16 14 21 13
## 7 2015 11 7 16 14 17 14
## 8 2015 11 8 12 12 11 13
## 9 2015 11 9 13 12 11 11
## 10 2015 11 10 14 14 12 11
## # ... with 721 more rows
The names of the columns are as follows;
names(travel_weather)
## [1] "year" "month" "day" "Amsterdam" "London" "NYC"
## [7] "Venice"
Let’s check the data with glimpse, too. We have 731 observations for 7 variables.
glimpse(travel_weather)
## Observations: 731
## Variables: 7
## $ year <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015...
## $ month <dbl> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, ...
## $ day <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ Amsterdam <dbl> 8, 10, 9, 12, 13, 16, 16, 12, 13, 14, 13, 13, 11, 11...
## $ London <dbl> 8, 11, 11, 11, 13, 14, 14, 12, 12, 14, 13, 12, 10, 1...
## $ NYC <dbl> 16, 15, 16, 17, 18, 21, 17, 11, 11, 12, 12, 13, 11, ...
## $ Venice <dbl> 13, 10, 9, 10, 12, 13, 14, 13, 11, 11, 9, 11, 8, 11,...
Let’s only select Venice’s dayly weather data.
travel_weather %>% select(year, month, day, Venice)
## # A tibble: 731 x 4
## year month day Venice
## * <dbl> <dbl> <dbl> <dbl>
## 1 2015 11 1 13
## 2 2015 11 2 10
## 3 2015 11 3 9
## 4 2015 11 4 10
## 5 2015 11 5 12
## 6 2015 11 6 13
## 7 2015 11 7 14
## 8 2015 11 8 13
## 9 2015 11 9 11
## 10 2015 11 10 11
## # ... with 721 more rows
The data of all cities are:
travel_weather %>% select(Amsterdam:Venice)
## # A tibble: 731 x 4
## Amsterdam London NYC Venice
## * <dbl> <dbl> <dbl> <dbl>
## 1 8 8 16 13
## 2 10 11 15 10
## 3 9 11 16 9
## 4 12 11 17 10
## 5 13 13 18 12
## 6 16 14 21 13
## 7 16 14 17 14
## 8 12 12 11 13
## 9 13 12 11 11
## 10 14 14 12 11
## # ... with 721 more rows
Exclude NYC and London
travel_weather %>% select(-London, -NYC)
## # A tibble: 731 x 5
## year month day Amsterdam Venice
## * <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2015 11 1 8 13
## 2 2015 11 2 10 10
## 3 2015 11 3 9 9
## 4 2015 11 4 12 10
## 5 2015 11 5 13 12
## 6 2015 11 6 16 13
## 7 2015 11 7 16 14
## 8 2015 11 8 12 13
## 9 2015 11 9 13 11
## 10 2015 11 10 14 11
## # ... with 721 more rows
Let’ write Big Apple’s name more properly using back ticks and see whether the column names are changed or not.
travel_weather %>% rename(`New York` = NYC) %>% names()
## [1] "year" "month" "day" "Amsterdam" "London" "New York"
## [7] "Venice"
travel_weather %>% select(NYC) %>% rename(`New York` = NYC) %>% names()
## [1] "New York"
names(travel_weather)
## [1] "year" "month" "day" "Amsterdam" "London" "NYC"
## [7] "Venice"
Let’s check temperature values of first 3 days of each month
travel_weather %>% filter(day<=3)
## # A tibble: 72 x 7
## year month day Amsterdam London NYC Venice
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2015 11 1 8 8 16 13
## 2 2015 11 2 10 11 15 10
## 3 2015 11 3 9 11 16 9
## 4 2015 12 1 9 11 9 6
## 5 2015 12 2 10 12 11 8
## 6 2015 12 3 9 11 10 8
## 7 2016 1 1 4 3 3 2
## 8 2016 1 2 6 10 2 0
## 9 2016 1 3 7 8 4 3
## 10 2016 2 1 10 12 11 6
## # ... with 62 more rows
We are looking for the dates in November which Venice is warmer than NYC.
travel_weather %>% filter(month==11 & Venice > NYC)
## # A tibble: 20 x 7
## year month day Amsterdam London NYC Venice
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2015 11 8 12 12 11 13
## 2 2015 11 14 11 10 8 11
## 3 2015 11 15 12 14 9 11
## 4 2015 11 17 13 13 8 9
## 5 2015 11 23 3 3 4 6
## 6 2015 11 24 5 8 4 6
## 7 2016 11 1 10 9 9 11
## 8 2016 11 6 7 4 11 12
## 9 2016 11 7 4 6 8 11
## 10 2016 11 12 1 8 7 9
## 11 2016 11 19 6 4 10 11
## 12 2016 11 20 7 7 3 11
## 13 2016 11 21 10 10 4 12
## 14 2016 11 22 10 9 4 14
## 15 2016 11 23 8 7 4 14
## 16 2016 11 24 6 9 6 13
## 17 2016 11 25 3 7 10 13
## 18 2016 11 26 3 6 7 12
## 19 2016 11 27 5 7 7 11
## 20 2016 11 28 1 6 7 8
Find the dates whether Amsterdam is warmer than either London or Venice in July and create a table just to compare these three cities
travel_weather %>% filter(month == 7 & Amsterdam > (London | Venice)) %>% select(-NYC)
## # A tibble: 62 x 6
## year month day Amsterdam London Venice
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2016 7 1 16 16 25
## 2 2016 7 2 16 14 25
## 3 2016 7 3 14 15 24
## 4 2016 7 4 16 16 24
## 5 2016 7 5 16 17 23
## 6 2016 7 6 14 16 24
## 7 2016 7 7 16 18 26
## 8 2016 7 8 18 19 25
## 9 2016 7 9 18 19 27
## 10 2016 7 10 20 20 28
## # ... with 52 more rows
Lets focus on NYC and list the temperature values in ascending order(i.e. from min. to max.).
travel_weather %>% arrange(NYC) %>% select(year, month, day, NYC)
## # A tibble: 731 x 4
## year month day NYC
## <dbl> <dbl> <dbl> <dbl>
## 1 2016 2 14 -14
## 2 2016 2 13 -10
## 3 2016 1 5 -7
## 4 2017 1 9 -7
## 5 2016 1 19 -6
## 6 2016 2 12 -6
## 7 2016 12 16 -6
## 8 2017 1 8 -6
## 9 2017 1 7 -5
## 10 2017 3 11 -5
## # ... with 721 more rows
Let’s arrange temperature values such that they are in ascending order in NYC but descending in Amsterdam
travel_weather %>% arrange(NYC, desc(Amsterdam)) %>% select(-London, -Venice)
## # A tibble: 731 x 5
## year month day Amsterdam NYC
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2016 2 14 2 -14
## 2 2016 2 13 1 -10
## 3 2016 1 5 6 -7
## 4 2017 1 9 6 -7
## 5 2016 12 16 6 -6
## 6 2017 1 8 4 -6
## 7 2016 2 12 2 -6
## 8 2016 1 19 -2 -6
## 9 2017 3 15 9 -5
## 10 2017 3 11 7 -5
## # ... with 721 more rows
Let’s arrange the data with decreasing date
travel_weather %>% arrange(desc(year),desc(month),desc(day))
## # A tibble: 731 x 7
## year month day Amsterdam London NYC Venice
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2017 10 31 9 9 11 11
## 2 2017 10 30 8 6 12 13
## 3 2017 10 29 11 11 18 9
## 4 2017 10 28 12 10 17 10
## 5 2017 10 27 12 9 13 13
## 6 2017 10 26 13 10 13 13
## 7 2017 10 25 13 14 17 13
## 8 2017 10 24 13 16 21 13
## 9 2017 10 23 13 13 20 13
## 10 2017 10 22 11 11 19 13
## # ... with 721 more rows
The data is ordered such that the absolute temperature difference between London and Amsterdam increasing.
travel_weather %>% arrange(abs(London-Amsterdam))
## # A tibble: 731 x 7
## year month day Amsterdam London NYC Venice
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2015 11 1 8 8 16 13
## 2 2015 11 5 13 13 18 12
## 3 2015 11 8 12 12 11 13
## 4 2015 11 10 14 14 12 11
## 5 2015 11 11 13 13 12 9
## 6 2015 11 16 12 12 15 10
## 7 2015 11 17 13 13 8 9
## 8 2015 11 18 12 12 11 11
## 9 2015 11 19 11 11 16 10
## 10 2015 11 23 3 3 4 6
## # ... with 721 more rows
Temperature difference between Venice and Amsterdam is added as a new column as follows.
travel_weather %>% mutate(VAdiff = Venice - Amsterdam)
## # A tibble: 731 x 8
## year month day Amsterdam London NYC Venice VAdiff
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2015 11 1 8 8 16 13 5
## 2 2015 11 2 10 11 15 10 0
## 3 2015 11 3 9 11 16 9 0
## 4 2015 11 4 12 11 17 10 -2
## 5 2015 11 5 13 13 18 12 -1
## 6 2015 11 6 16 14 21 13 -3
## 7 2015 11 7 16 14 17 14 -2
## 8 2015 11 8 12 12 11 13 1
## 9 2015 11 9 13 12 11 11 -2
## 10 2015 11 10 14 14 12 11 -3
## # ... with 721 more rows
The condition for Venice to be warmer than Amsterdam is added as a column, as follows:
travel_weather %>% mutate(VwarmerA = Venice>Amsterdam)
## # A tibble: 731 x 8
## year month day Amsterdam London NYC Venice VwarmerA
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
## 1 2015 11 1 8 8 16 13 TRUE
## 2 2015 11 2 10 11 15 10 FALSE
## 3 2015 11 3 9 11 16 9 FALSE
## 4 2015 11 4 12 11 17 10 FALSE
## 5 2015 11 5 13 13 18 12 FALSE
## 6 2015 11 6 16 14 21 13 FALSE
## 7 2015 11 7 16 14 17 14 FALSE
## 8 2015 11 8 12 12 11 13 TRUE
## 9 2015 11 9 13 12 11 11 FALSE
## 10 2015 11 10 14 14 12 11 FALSE
## # ... with 721 more rows
The dates when the Venice is warmer(or colder) than Amsterdam or equal to Amsterdam is tagged as “warmer” (or “colder”) or as “equal to” and added as a new column, as follows:
travel_weather %>%
transmute(year,month,day,
VwarmerA = ifelse(Venice == Amsterdam,"equal to", ifelse(Venice>Amsterdam,"warmer","colder")))
## # A tibble: 731 x 4
## year month day VwarmerA
## <dbl> <dbl> <dbl> <chr>
## 1 2015 11 1 warmer
## 2 2015 11 2 equal to
## 3 2015 11 3 equal to
## 4 2015 11 4 colder
## 5 2015 11 5 colder
## 6 2015 11 6 colder
## 7 2015 11 7 colder
## 8 2015 11 8 warmer
## 9 2015 11 9 colder
## 10 2015 11 10 colder
## # ... with 721 more rows
Calculate the avarage temperature of Venice and NYC:
travel_weather %>% summarise(Venice_avg=mean(Venice), NYC_avg=mean(NYC))
## # A tibble: 1 x 2
## Venice_avg NYC_avg
## <dbl> <dbl>
## 1 14.31601 14.41313
As you see, NYC is slightly warmer than Venice on avarage. Let’s calculate avarage temperatures of each month in Amsterdam
travel_weather %>% group_by(month) %>% summarise(Amsterdam_avg =mean(Amsterdam)) %>% round(digits=2)
## # A tibble: 12 x 2
## month Amsterdam_avg
## <dbl> <dbl>
## 1 1 3.00
## 2 2 4.32
## 3 3 6.92
## 4 4 8.43
## 5 5 14.48
## 6 6 17.28
## 7 7 18.02
## 8 8 17.68
## 9 9 15.95
## 10 10 11.68
## 11 11 7.65
## 12 12 6.97
Calculate the number of days Amsterdam is warmer than NYC for each monthe of each year.
travel_weather %>% group_by(year, month) %>% summarise(WarmerDays = sum(Amsterdam>NYC))
## # A tibble: 24 x 3
## # Groups: year [?]
## year month WarmerDays
## <dbl> <dbl> <int>
## 1 2015 11 11
## 2 2015 12 12
## 3 2016 1 23
## 4 2016 2 16
## 5 2016 3 5
## 6 2016 4 10
## 7 2016 5 8
## 8 2016 6 1
## 9 2016 7 1
## 10 2016 8 0
## # ... with 14 more rows
The max., the min and median temperatures of London city for each month and year.
travel_weather %>% group_by(year,month) %>% summarise(minT = min(London), T_med = median(London), maxT = max(London))
## # A tibble: 24 x 5
## # Groups: year [?]
## year month minT T_med maxT
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2015 11 1 11 14
## 2 2015 12 0 10 14
## 3 2016 1 0 6 11
## 4 2016 2 1 4 12
## 5 2016 3 2 6 11
## 6 2016 4 4 8 11
## 7 2016 5 8 13 16
## 8 2016 6 11 16 19
## 9 2016 7 14 18 24
## 10 2016 8 14 18 24
## # ... with 14 more rows
Temperature difference between consecutive days in Amsterdam can be calculated as below.
travel_weather %>% transmute(year,month,day, Amsterdam, deltaT = lead(Amsterdam)- Amsterdam)
## # A tibble: 731 x 5
## year month day Amsterdam deltaT
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2015 11 1 8 2
## 2 2015 11 2 10 -1
## 3 2015 11 3 9 3
## 4 2015 11 4 12 1
## 5 2015 11 5 13 3
## 6 2015 11 6 16 0
## 7 2015 11 7 16 -4
## 8 2015 11 8 12 1
## 9 2015 11 9 13 1
## 10 2015 11 10 14 -1
## # ... with 721 more rows
As you can see, the temperature value of each day can be calculated from summation of T and deltaT of the previous day. We could also obtain a similar table by using lag, but this this time we should add the deltaT to temperature value of the previous day to find the current value.
travel_weather %>% transmute(year, month, day, Amsterdam, deltaT = Amsterdam - lag(Amsterdam) )
## # A tibble: 731 x 5
## year month day Amsterdam deltaT
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2015 11 1 8 NA
## 2 2015 11 2 10 2
## 3 2015 11 3 9 -1
## 4 2015 11 4 12 3
## 5 2015 11 5 13 1
## 6 2015 11 6 16 3
## 7 2015 11 7 16 0
## 8 2015 11 8 12 -4
## 9 2015 11 9 13 1
## 10 2015 11 10 14 1
## # ... with 721 more rows
Let’s see 1st three days of the whole data
travel_weather %>% slice(1:3)
## # A tibble: 3 x 7
## year month day Amsterdam London NYC Venice
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2015 11 1 8 8 16 13
## 2 2015 11 2 10 11 15 10
## 3 2015 11 3 9 11 16 9
We could also group data by year and then look at the 1st three observations of each year as follow;
travel_weather %>% group_by(year) %>% slice(1:3)
## # A tibble: 9 x 7
## # Groups: year [3]
## year month day Amsterdam London NYC Venice
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2015 11 1 8 8 16 13
## 2 2015 11 2 10 11 15 10
## 3 2015 11 3 9 11 16 9
## 4 2016 1 1 4 3 3 2
## 5 2016 1 2 6 10 2 0
## 6 2016 1 3 7 8 4 3
## 7 2017 1 1 1 7 7 2
## 8 2017 1 2 3 2 3 1
## 9 2017 1 3 4 2 5 3
Now we will transform data from wide to long format. To see a summary table of avarage temperatures of each city for each month we employed gather() method.
travel_weather_long <- travel_weather %>% gather(City, Temperature, Amsterdam:Venice)
travel_weather_long
## # A tibble: 2,924 x 5
## year month day City Temperature
## <dbl> <dbl> <dbl> <chr> <dbl>
## 1 2015 11 1 Amsterdam 8
## 2 2015 11 2 Amsterdam 10
## 3 2015 11 3 Amsterdam 9
## 4 2015 11 4 Amsterdam 12
## 5 2015 11 5 Amsterdam 13
## 6 2015 11 6 Amsterdam 16
## 7 2015 11 7 Amsterdam 16
## 8 2015 11 8 Amsterdam 12
## 9 2015 11 9 Amsterdam 13
## 10 2015 11 10 Amsterdam 14
## # ... with 2,914 more rows
Now, let’s group the temperature values in long format by month and city and then summarise to get avarage temperatures for each month in each city. Finally, we will reshape the data such that month will appear as columns.
travel_weather_long %>% group_by(month,City) %>% summarise(avgT=round(mean(Temperature))) %>% spread(month,avgT)
## # A tibble: 4 x 13
## City `1` `2` `3` `4` `5` `6` `7` `8` `9` `10`
## * <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Amsterdam 3 4 7 8 14 17 18 18 16 12
## 2 London 4 6 8 9 13 17 18 18 16 12
## 3 NYC 2 4 7 13 17 22 26 25 22 16
## 4 Venice 2 7 11 14 18 22 25 25 20 14
## # ... with 2 more variables: `11` <dbl>, `12` <dbl>
Let’s observe avarage temperature of all cities by using summarise_all
travel_weather %>% select(Amsterdam:Venice) %>% summarise_all(funs(round(mean(.))))
## # A tibble: 1 x 4
## Amsterdam London NYC Venice
## <dbl> <dbl> <dbl> <dbl>
## 1 11 12 14 14
Alternatively, we could do the same thing by using summarise_at as follows:
travel_weather %>% summarise_at(vars(Amsterdam:Venice),funs(round(mean(.))))
## # A tibble: 1 x 4
## Amsterdam London NYC Venice
## <dbl> <dbl> <dbl> <dbl>
## 1 11 12 14 14
Let’s use mutate_at to see temperature difference betwee NYC and the other cities.
travel_weather %>% mutate_at(vars(Amsterdam,London,Venice),funs(deltaT = abs(NYC-.))) %>% select(-Amsterdam,-London,-Venice)
## # A tibble: 731 x 7
## year month day NYC Amsterdam_deltaT London_deltaT Venice_deltaT
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2015 11 1 16 8 8 3
## 2 2015 11 2 15 5 4 5
## 3 2015 11 3 16 7 5 7
## 4 2015 11 4 17 5 6 7
## 5 2015 11 5 18 5 5 6
## 6 2015 11 6 21 5 7 8
## 7 2015 11 7 17 1 3 3
## 8 2015 11 8 11 1 1 2
## 9 2015 11 9 11 2 1 0
## 10 2015 11 10 12 2 2 1
## # ... with 721 more rows
The dates which Amsterdam is strictly warmer than London but colder than Venice can be calculated as follows:
travel_weather %>% filter(Amsterdam>London&Amsterdam<Venice)
## # A tibble: 165 x 7
## year month day Amsterdam London NYC Venice
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2015 11 21 5 3 9 8
## 2 2015 11 22 3 1 9 8
## 3 2016 1 13 4 3 -3 6
## 4 2016 1 16 2 1 8 4
## 5 2016 2 3 5 4 11 8
## 6 2016 2 11 4 3 -4 7
## 7 2016 2 12 2 1 -6 6
## 8 2016 2 23 4 3 3 11
## 9 2016 2 24 2 1 9 10
## 10 2016 2 25 2 1 9 8
## # ... with 155 more rows
Average temperature difference for each month of each year between NYC and Amsterdam for the days NYC is warmer than Amsterdam, in descending order.
travel_weather %>% select(-London,-Venice) %>% filter(NYC>Amsterdam) %>% group_by(year,month) %>% summarise(deltaT_NYC_A = round(mean(NYC-Amsterdam),1)) %>% arrange(desc(deltaT_NYC_A))
## # A tibble: 24 x 3
## # Groups: year [3]
## year month deltaT_NYC_A
## <dbl> <dbl> <dbl>
## 1 2016 8 8.4
## 2 2016 7 8.1
## 3 2017 9 7.9
## 4 2016 4 7.5
## 5 2017 4 7.4
## 6 2017 7 7.3
## 7 2017 8 6.5
## 8 2016 11 6.4
## 9 2016 3 6.3
## 10 2016 6 6.0
## # ... with 14 more rows
Finally, let’s find the warmest city and its temperature of each day. I’m still working on it.
travel_weather %>% group_by(year,month,day)%>% summarise_at(vars(Amsterdam:Venice),funs(maxT=max(.)))#gather(key=City, value=Temperature,-year,-month,-day)
## # A tibble: 731 x 7
## # Groups: year, month [?]
## year month day Amsterdam_maxT London_maxT NYC_maxT Venice_maxT
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2015 11 1 8 8 16 13
## 2 2015 11 2 10 11 15 10
## 3 2015 11 3 9 11 16 9
## 4 2015 11 4 12 11 17 10
## 5 2015 11 5 13 13 18 12
## 6 2015 11 6 16 14 21 13
## 7 2015 11 7 16 14 17 14
## 8 2015 11 8 12 12 11 13
## 9 2015 11 9 13 12 11 11
## 10 2015 11 10 14 14 12 11
## # ... with 721 more rows