##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.600   5.800   6.600   6.442   7.200   9.500
##    title_year        num
##  Min.   :1916   Min.   :  1.00
##  1st Qu.:1948   1st Qu.:  3.00
##  Median :1971   Median : 10.00
##  Mean   :1971   Mean   : 54.82
##  3rd Qu.:1994   3rd Qu.: 58.00
##  Max.   :2016   Max.   :260.00
##  NA's   :1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 166 7526 3000 349000
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max.       NA's
## -1.221e+10 -1.027e+07  8.516e+05  5.845e+06  2.475e+07  5.235e+08       1152
Our dataset consists of 28 variables and 5,043 observations.
[1] movie_title -> Categorical, Nominal
[2] genres -> Categorical, Nominal
[3] title_year -> Categorical, Ordinal
[4] plot_keywords -> Categorical, Nominal
[5] aspect_ratio -> Categorical, Ordinal
[6] color -> Categorical, Nominal
[7] duration -> Categorical, Ordinal
[8] content_rating -> Categorical, Nominal
[9] language -> Categorical, Nominal
[10] country -> Categorical, Nominal
[11] director_name -> Categorical, Nominal
[12] actor_1_name -> Categorical, Nominal
[13] actor_2_name -> Categorical, Nominal
[14] actor_3_name -> Categorical, Nominal
[15] movie_imdb_link -> Categorical, Nominal
[16] director_facebook_likes -> Numerical, Discrete
[17] actor_1_facebook_likes -> Numerical, Discrete
[18] actor_2_facebook_likes -> Numerical, Discrete
[19] actor_3_facebook_likes -> Numerical, Discrete
[20] cast_total_facebook_likes -> Numerical, Discrete
[21] movie_facebook_likes -> Numerical, Discrete
[22] imdb_score -> Numerical, Continuous
[23] num_critic_for_reviews -> Numerical, Discrete
[24] num_user_for_reviews -> Numerical, Discrete
[25] num_voted_users -> Numerical, Discrete
[26] gross -> Numerical, Continuous
[27] budget -> Numerical, Continuous
[28] facenumber_in_poster -> Numerical, Discrete
names(mmd)
## [1] "color" "director_name"
## [3] "num_critic_for_reviews" "duration"
## [5] "director_facebook_likes" "actor_3_facebook_likes"
## [7] "actor_2_name" "actor_1_facebook_likes"
## [9] "gross" "genres"
## [11] "actor_1_name" "movie_title"
## [13] "num_voted_users" "cast_total_facebook_likes"
## [15] "actor_3_name" "facenumber_in_poster"
## [17] "plot_keywords" "movie_imdb_link"
## [19] "num_user_for_reviews" "language"
## [21] "country" "content_rating"
## [23] "budget" "title_year"
## [25] "actor_2_facebook_likes" "imdb_score"
## [27] "aspect_ratio" "movie_facebook_likes"
## [29] "profit"
head(mmd)
## color director_name num_critic_for_reviews duration
## 1 Color James Cameron 723 178
## 2 Color Gore Verbinski 302 169
## 3 Color Sam Mendes 602 148
## 4 Color Christopher Nolan 813 164
## 5 Doug Walker NA NA
## 6 Color Andrew Stanton 462 132
## director_facebook_likes actor_3_facebook_likes actor_2_name
## 1 0 855 Joel David Moore
## 2 563 1000 Orlando Bloom
## 3 0 161 Rory Kinnear
## 4 22000 23000 Christian Bale
## 5 131 NA Rob Walker
## 6 475 530 Samantha Morton
## actor_1_facebook_likes gross genres
## 1 1000 760505847 Action|Adventure|Fantasy|Sci-Fi
## 2 40000 309404152 Action|Adventure|Fantasy
## 3 11000 200074175 Action|Adventure|Thriller
## 4 27000 448130642 Action|Thriller
## 5 131 NA Documentary
## 6 640 73058679 Action|Adventure|Sci-Fi
## actor_1_name movie_title
## 1 CCH Pounder Avatar
## 2 Johnny Depp Pirates of the Caribbean: At World's End
## 3 Christoph Waltz Spectre
## 4 Tom Hardy The Dark Knight Rises
## 5 Doug Walker Star Wars: Episode VII - The Force Awakens
## 6 Daryl Sabara John Carter
## num_voted_users cast_total_facebook_likes actor_3_name
## 1 886204 4834 Wes Studi
## 2 471220 48350 Jack Davenport
## 3 275868 11700 Stephanie Sigman
## 4 1144337 106759 Joseph Gordon-Levitt
## 5 8 143
## 6 212204 1873 Polly Walker
## facenumber_in_poster
## 1 0
## 2 0
## 3 1
## 4 0
## 5 0
## 6 1
## plot_keywords
## 1 avatar|future|marine|native|paraplegic
## 2 goddess|marriage ceremony|marriage proposal|pirate|singapore
## 3 bomb|espionage|sequel|spy|terrorist
## 4 deception|imprisonment|lawlessness|police officer|terrorist plot
## 5
## 6 alien|american civil war|male nipple|mars|princess
## movie_imdb_link
## 1 http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1
## 2 http://www.imdb.com/title/tt0449088/?ref_=fn_tt_tt_1
## 3 http://www.imdb.com/title/tt2379713/?ref_=fn_tt_tt_1
## 4 http://www.imdb.com/title/tt1345836/?ref_=fn_tt_tt_1
## 5 http://www.imdb.com/title/tt5289954/?ref_=fn_tt_tt_1
## 6 http://www.imdb.com/title/tt0401729/?ref_=fn_tt_tt_1
## num_user_for_reviews language country content_rating budget
## 1 3054 English USA PG-13 237000000
## 2 1238 English USA PG-13 300000000
## 3 994 English UK PG-13 245000000
## 4 2701 English USA PG-13 250000000
## 5 NA NA
## 6 738 English USA PG-13 263700000
## title_year actor_2_facebook_likes imdb_score aspect_ratio
## 1 2009 936 7.9 1.78
## 2 2007 5000 7.1 2.35
## 3 2015 393 6.8 2.35
## 4 2012 23000 8.5 2.35
## 5 NA 12 7.1 NA
## 6 2012 632 6.6 2.35
## movie_facebook_likes profit
## 1 33000 523505847
## 2 0 9404152
## 3 85000 -44925825
## 4 164000 198130642
## 5 0 NA
## 6 24000 -190641321
The main feature that I am interested in is imdb_score. I would like to understand how it is affected by other variables and to find patterns in that relationship.
I would like to support my investigation with:
[11] director_name -> Categorical, Nominal
[8] content_rating -> Categorical, Nominal
[3] title_year -> Categorical, Ordinal
[21] movie_facebook_likes -> Numerical, Discrete
[25] num_voted_users -> Numerical, Discrete
Yes, I have created a variable called profit, which equals “gross - budget”.
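The derived variable can be sketched as below. This is a minimal illustration, not necessarily the exact code used in the analysis: `mmd_toy` is a hypothetical stand-in for `mmd`, filled with three rows copied from the `head(mmd)` output above.

```r
# Toy data frame built from rows printed by head(mmd) above;
# the name mmd_toy is an assumption made for this sketch.
mmd_toy <- data.frame(
  movie_title = c("Avatar", "Spectre", "John Carter"),
  gross       = c(760505847, 200074175, 73058679),
  budget      = c(237000000, 245000000, 263700000),
  stringsAsFactors = FALSE
)

# profit is gross minus budget; an NA in either input yields an NA profit
mmd_toy$profit <- mmd_toy$gross - mmd_toy$budget

print(mmd_toy[, c("movie_title", "profit")])
```

The resulting values agree with the profit column shown by `head(mmd)` for the same three movies.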
[1] Director distribution: The unfiltered director data is not meaningful, so I created a new data frame containing only directors with more than 10 movies.
[2] Content_rating distribution: Focusing only on “PG”, “PG-13”, “R”, and “G” seems sufficient to understand the effect of content rating on IMDb score.
[3] Movie_facebook_likes distribution: Focusing on a narrower range makes the data easier to read, so I limited movie_facebook_likes to 50000 and the count to 1000 occurrences.
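The three restrictions above could be expressed roughly as follows. This is a sketch under assumptions: the data frame `toy` and its director names are invented for illustration, and item [3] is shown as a row filter, whereas the original plot may simply have used axis limits instead.

```r
# Hypothetical toy data; names and values are invented for illustration only.
toy <- data.frame(
  director_name        = c(rep("Director A", 12), rep("Director B", 3)),
  content_rating       = c(rep(c("PG", "R", "TV-MA"), 4), "G", "PG-13", "Unrated"),
  movie_facebook_likes = c(seq(0, 55000, by = 5000), 100, 60000, 2000),
  stringsAsFactors = FALSE
)

# [1] keep only directors that appear in more than 10 rows (movies)
director_counts <- table(toy$director_name)
frequent        <- names(director_counts)[director_counts > 10]
by_director     <- toy[toy$director_name %in% frequent, ]

# [2] keep only the four main content ratings
main_ratings <- c("PG", "PG-13", "R", "G")
by_rating    <- toy[toy$content_rating %in% main_ratings, ]

# [3] keep rows with at most 50000 likes (assumed reading of the limit)
keep     <- !is.na(toy$movie_facebook_likes) & toy$movie_facebook_likes <= 50000
by_likes <- toy[keep, ]

print(c(directors = nrow(by_director),
        ratings   = nrow(by_rating),
        likes     = nrow(by_likes)))
```

Filtering into separate data frames keeps the original data intact, so each restriction can be applied or dropped independently.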
[1] Imdb Score vs. Title Year of Movies: The IMDb score decreases over the years. This seems to be driven by the small number of movies in the early periods. As the number of movies grows, the average IMDb score falls, reaching its low in the late 2000s. The early years have fewer titles, but those titles score better, as the graph below shows.
[2] Title Year vs. Imdb Score Distribution: Title year is one of the key indicators of how the IMDb score changes over time. It will help support my investigation into the features of interest.
[3] Budget vs Imdb Score: The IMDb score looks roughly uniform across budgets. Based on this data, a higher budget does not guarantee a higher IMDb score.
[4] Content Rating vs. Average Imdb Score: Looking deeper into content rating, “R” (restricted) and “PG-13” movies have the most observations, but this does not place them at the high end of the IMDb score scale. “TV-MA” (unsuitable for children under 17) ranks at the top, possibly because these titles are mostly broadcast on TV and tend to be viewed by a much broader audience.
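The comparison in item [4] relies on two numbers per rating: the average score and the number of observations. A minimal sketch of that computation follows; the data frame `ratings_toy` and its values are hypothetical, chosen only to show that a rating with few titles can still have the highest average.

```r
# Hypothetical toy data; the scores are invented for illustration.
ratings_toy <- data.frame(
  content_rating = c("R", "R", "R", "PG-13", "PG-13", "TV-MA"),
  imdb_score     = c(6.0, 7.0, 6.5, 6.2, 6.8, 8.4),
  stringsAsFactors = FALSE
)

# mean imdb_score per content rating (groups come back sorted by rating)
avg_by_rating <- aggregate(imdb_score ~ content_rating,
                           data = ratings_toy, FUN = mean)

# number of observations behind each mean, matched by rating name
counts <- table(ratings_toy$content_rating)
avg_by_rating$n <- as.vector(counts[as.character(avg_by_rating$content_rating)])

print(avg_by_rating)
```

Reporting `n` next to the mean makes it easier to judge whether a top-ranked rating rests on many titles or only a handful.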
[1] Budget vs. Title Year of Movies: Movie budgets seem to increase from the 1920s to the early 2000s and then decrease. This might be caused by the rise of independent movies, as the graph below shows.
[2] Budget vs profit: The graph suggests that you might increase profit by increasing the budget. Given the general audience, this makes sense, because people tend to watch high-budget movies, which offer spectacular effects.
[3] Facebook Likes vs. Title Year: I checked the relationship between movie Facebook likes and title year. As I expected, the number of movie Facebook likes starts increasing around 2007, which shows that newer movies tend to have more likes. Old movies do not have a Facebook page unless fans of the movie created one. In contrast, newer movies launch with a Facebook page. Finding the reason for the decrease around 2015 would require more data.
[4] Number of Voted Users vs. Title Year: The plot shows that as a movie's IMDb score increases, people tend to like the movie on Facebook. This makes sense: popular things reach a wider audience.
[1] Profit vs. Imdb Score: When I checked the relationship between these two variables, I found that as the IMDb score increases, the profit of the movie increases in a near-quadratic way. The main reasons for this might be:
[a] The movie became popular on IMDb and then generated a large profit.
[b] The movie generated a large profit, which means a high gross, and since it was seen by a much wider audience, it tends to receive high votes on IMDb.
*** The reasons will be explored in the multivariate analysis.
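One way to put a number on the profit and IMDb score relationship is a rank correlation, sketched below. This is an illustration rather than the method used in the analysis: `mmd_head` is a hypothetical name, and its six rows are copied from the `head(mmd)` output above, so the result says nothing about the full data set. A correlation also cannot tell explanation [a] from explanation [b].

```r
# Rows copied from the head(mmd) output above; row 5 has no profit (NA).
mmd_head <- data.frame(
  imdb_score = c(7.9, 7.1, 6.8, 8.5, 7.1, 6.6),
  profit     = c(523505847, 9404152, -44925825, 198130642, NA, -190641321)
)

# Spearman rank correlation measures monotone association without assuming
# a linear or quadratic shape; incomplete rows are dropped before ranking.
rho <- cor(mmd_head$imdb_score, mmd_head$profit,
           method = "spearman", use = "complete.obs")

print(rho)
```

A rank-based measure is used here because profit spans several orders of magnitude and includes large negative values, which can dominate a Pearson correlation.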
[1] Title Year vs. Imdb Score vs. Facebook likes: There is a clear trend starting in the 2000s: as imdb_score increases, movie Facebook likes increase. This effect also strengthens after the 2000s. Keeping in mind that the early adopters of Facebook are millennials, it is expected that they mostly vote for movies made after 2000. This is not hard to assume, since the ability to create lifelike animation has improved significantly.
[1] Title Year vs. Imdb Score vs. Content Rating: One of the surprising cases appeared when I analysed content rating. In the univariate analysis, “R” and “PG-13” content dominates the general audience. With this in mind, I assumed that people tend to like these kinds of movies and give them high IMDb scores. In the multivariate analysis, however, content rating has a nearly uniform distribution over IMDb score. I concentrated on the booming years of movie production.
[2] avg.Imdb Score vs. avg.Profit grouped by title Year: This analysis is another surprising case. I had assumed that movies made after 2000 would have a high average profit and IMDb score. On the contrary, earlier movies have a high average profit and IMDb score. This suggests that as the number of movie productions increased in the 2000s, the average quality of movies decreased. In addition, the earlier years have few movies, but their quality is high.
[3] Profit vs. imdb_score vs. num_voted_users grouped by Director Name: There is a pattern showing that highly voted movies have a higher IMDb score, but they do not always make huge profits. The effect on profit could be examined in another study for deeper analysis.
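The per-year averages behind item [2] can be sketched as below. The data frame `years_toy` and all of its values are hypothetical; the point is only to show the grouping step and why the number of titles per year belongs next to the averages.

```r
# Hypothetical toy data; years, scores and profits are invented.
years_toy <- data.frame(
  title_year = c(1950, 1950, 2005, 2005, 2005, 2005),
  imdb_score = c(8.0, 7.6, 6.0, 7.0, 5.5, 6.5),
  profit     = c(2e6, 4e6, -1e6, 5e6, 0, 4e6)
)

# average score and average profit per title year (sorted by year)
avg_by_year <- aggregate(cbind(imdb_score, profit) ~ title_year,
                         data = years_toy, FUN = mean)

# number of titles per year, in the same sorted order
avg_by_year$n <- as.vector(table(years_toy$title_year))

print(avg_by_year)
```

In this toy example the early year has the higher averages but only two titles, which mirrors the caution that yearly averages from sparse years rest on few observations.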
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.600 5.800 6.600 6.442 7.200 9.500
The distribution of IMDb scores appears to be approximately normal, with a median of 6.6 and a mean of 6.442.
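Since the mean (6.442) sits slightly below the median (6.6), the distribution may lean a little to the left rather than being perfectly symmetric. A quick way to check this is the sample skewness, sketched below. The helper `skewness()` and the vector `scores_toy` are assumptions made for this example, not part of the original analysis, and the values are invented.

```r
# Moment-based sample skewness: 0 for symmetric data, negative when the
# left tail is longer. Hypothetical helper written for this sketch.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Hypothetical scores with a long left tail (not taken from the data set)
scores_toy <- c(1.6, 5.8, 6.2, 6.6, 6.9, 7.2, 7.5, 9.5)

print(c(mean     = mean(scores_toy),
        median   = median(scores_toy),
        skewness = skewness(scores_toy)))
```

On the real data, the same call would be `skewness(mmd$imdb_score)`; a value close to zero would support the normal-looking shape.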
The effect of budget on IMDb score appears to be uniform, so budget seems to have no real effect on IMDb score.
The plot is a more focused version of the “Title Year vs. Imdb Score vs. Facebook likes” plot. It clearly indicates the rise in Facebook likes between 2008 and 2013. High IMDb scores are also common in the part of the plot with high Facebook likes.
The IMDb movie dataset, downloaded from kaggle.com, consists of 28 variables and 5,043 observations. The data set contains 15 categorical and 13 numerical variables. I started by understanding the individual variables in the data set, and then I explored interesting questions and leads as I continued to make observations on plots. Eventually, I explored the IMDb score of movies across many variables and tried to find the patterns that affect the IMDb score.
There was a clear pattern between budget and IMDb score: budget seems to have a uniform effect on IMDb score. In addition, profit seems to have a near-exponential effect on IMDb score, which makes gross the real indicator. In the further analysis of other individual variables, Facebook likes and directors appear to have a possible effect on IMDb score.
Some limitations of this research stem from the source of the data. The source data is limited and does not cover a large number of observations. Moreover, many of the individual variables have missing data (NA). Since such rows are dropped during the analysis, this might have a positive or negative effect on the patterns presented. A dataset with more observations and complete data would be better for predicting the effects of individual variables on IMDb score. To investigate this data further, I would examine how profit is affected by the individual variables in this data set.