What Are the Effects of Variables on IMDb Score? by CEM KILICLI

Univariate Plots Section

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.600   5.800   6.600   6.442   7.200   9.500
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##    title_year        num        
##  Min.   :1916   Min.   :  1.00  
##  1st Qu.:1948   1st Qu.:  3.00  
##  Median :1971   Median : 10.00  
##  Mean   :1971   Mean   : 54.82  
##  3rd Qu.:1994   3rd Qu.: 58.00  
##  Max.   :2016   Max.   :260.00  
##  NA's   :1
## Warning: Removed 108 rows containing non-finite values (stat_bin).

## Warning: Removed 108 rows containing missing values (geom_point).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0     166    7526    3000  349000
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 186 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).

## Warning: Removed 3779 rows containing non-finite values (stat_bin).

## Warning: Removed 1 rows containing missing values (geom_bar).

##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -1.221e+10 -1.027e+07  8.516e+05  5.845e+06  2.475e+07  5.235e+08 
##       NA's 
##       1152
## Warning: Removed 507 rows containing non-finite values (stat_bin).

## Warning: Removed 111 rows containing non-finite values (stat_bin).

Univariate Analysis

What is the structure of your dataset?

Our dataset consists of 28 variables and 5043 observations. The output below lists 29 columns because it also includes the derived profit variable.

[1] movie_title -> Categorical, Nominal
[2] genres -> Categorical, Nominal
[3] title_year -> Categorical, Ordinal
[4] plot_keywords -> Categorical, Nominal
[5] aspect_ratio -> Categorical, Ordinal
[6] color -> Categorical, Nominal
[7] duration -> Categorical, Ordinal
[8] content_rating -> Categorical, Nominal
[9] language -> Categorical, Nominal
[10] country -> Categorical, Nominal
[11] director_name -> Categorical, Nominal
[12] actor_1_name -> Categorical, Nominal
[13] actor_2_name -> Categorical, Nominal
[14] actor_3_name -> Categorical, Nominal
[15] movie_imdb_link -> Categorical, Nominal
[16] director_facebook_likes -> Numerical, Discrete
[17] actor_1_facebook_likes -> Numerical, Discrete
[18] actor_2_facebook_likes -> Numerical, Discrete
[19] actor_3_facebook_likes -> Numerical, Discrete
[20] cast_total_facebook_likes -> Numerical, Discrete
[21] movie_facebook_likes -> Numerical, Discrete
[22] imdb_score -> Numerical, Continuous
[23] num_critic_for_reviews -> Numerical, Discrete
[24] num_user_for_reviews -> Numerical, Discrete
[25] num_voted_users -> Numerical, Discrete
[26] gross -> Numerical, Continuous
[27] budget -> Numerical, Continuous
[28] facenumber_in_poster -> Numerical, Discrete

names(mmd)
##  [1] "color"                     "director_name"            
##  [3] "num_critic_for_reviews"    "duration"                 
##  [5] "director_facebook_likes"   "actor_3_facebook_likes"   
##  [7] "actor_2_name"              "actor_1_facebook_likes"   
##  [9] "gross"                     "genres"                   
## [11] "actor_1_name"              "movie_title"              
## [13] "num_voted_users"           "cast_total_facebook_likes"
## [15] "actor_3_name"              "facenumber_in_poster"     
## [17] "plot_keywords"             "movie_imdb_link"          
## [19] "num_user_for_reviews"      "language"                 
## [21] "country"                   "content_rating"           
## [23] "budget"                    "title_year"               
## [25] "actor_2_facebook_likes"    "imdb_score"               
## [27] "aspect_ratio"              "movie_facebook_likes"     
## [29] "profit"
head(mmd)
##   color     director_name num_critic_for_reviews duration
## 1 Color     James Cameron                    723      178
## 2 Color    Gore Verbinski                    302      169
## 3 Color        Sam Mendes                    602      148
## 4 Color Christopher Nolan                    813      164
## 5             Doug Walker                     NA       NA
## 6 Color    Andrew Stanton                    462      132
##   director_facebook_likes actor_3_facebook_likes     actor_2_name
## 1                       0                    855 Joel David Moore
## 2                     563                   1000    Orlando Bloom
## 3                       0                    161     Rory Kinnear
## 4                   22000                  23000   Christian Bale
## 5                     131                     NA       Rob Walker
## 6                     475                    530  Samantha Morton
##   actor_1_facebook_likes     gross                          genres
## 1                   1000 760505847 Action|Adventure|Fantasy|Sci-Fi
## 2                  40000 309404152        Action|Adventure|Fantasy
## 3                  11000 200074175       Action|Adventure|Thriller
## 4                  27000 448130642                 Action|Thriller
## 5                    131        NA                     Documentary
## 6                    640  73058679         Action|Adventure|Sci-Fi
##      actor_1_name                                             movie_title
## 1     CCH Pounder                                                 Avatar 
## 2     Johnny Depp               Pirates of the Caribbean: At World's End 
## 3 Christoph Waltz                                                Spectre 
## 4       Tom Hardy                                  The Dark Knight Rises 
## 5     Doug Walker Star Wars: Episode VII - The Force Awakens             
## 6    Daryl Sabara                                            John Carter 
##   num_voted_users cast_total_facebook_likes         actor_3_name
## 1          886204                      4834            Wes Studi
## 2          471220                     48350       Jack Davenport
## 3          275868                     11700     Stephanie Sigman
## 4         1144337                    106759 Joseph Gordon-Levitt
## 5               8                       143                     
## 6          212204                      1873         Polly Walker
##   facenumber_in_poster
## 1                    0
## 2                    0
## 3                    1
## 4                    0
## 5                    0
## 6                    1
##                                                      plot_keywords
## 1                           avatar|future|marine|native|paraplegic
## 2     goddess|marriage ceremony|marriage proposal|pirate|singapore
## 3                              bomb|espionage|sequel|spy|terrorist
## 4 deception|imprisonment|lawlessness|police officer|terrorist plot
## 5                                                                 
## 6               alien|american civil war|male nipple|mars|princess
##                                        movie_imdb_link
## 1 http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1
## 2 http://www.imdb.com/title/tt0449088/?ref_=fn_tt_tt_1
## 3 http://www.imdb.com/title/tt2379713/?ref_=fn_tt_tt_1
## 4 http://www.imdb.com/title/tt1345836/?ref_=fn_tt_tt_1
## 5 http://www.imdb.com/title/tt5289954/?ref_=fn_tt_tt_1
## 6 http://www.imdb.com/title/tt0401729/?ref_=fn_tt_tt_1
##   num_user_for_reviews language country content_rating    budget
## 1                 3054  English     USA          PG-13 237000000
## 2                 1238  English     USA          PG-13 300000000
## 3                  994  English      UK          PG-13 245000000
## 4                 2701  English     USA          PG-13 250000000
## 5                   NA                                        NA
## 6                  738  English     USA          PG-13 263700000
##   title_year actor_2_facebook_likes imdb_score aspect_ratio
## 1       2009                    936        7.9         1.78
## 2       2007                   5000        7.1         2.35
## 3       2015                    393        6.8         2.35
## 4       2012                  23000        8.5         2.35
## 5         NA                     12        7.1           NA
## 6       2012                    632        6.6         2.35
##   movie_facebook_likes     profit
## 1                33000  523505847
## 2                    0    9404152
## 3                85000  -44925825
## 4               164000  198130642
## 5                    0         NA
## 6                24000 -190641321

What is/are the main feature(s) of interest in your dataset?

The main feature that I am interested in is imdb_score. I would like to understand how it is affected by the other variables and try to find a pattern.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I would like to support my investigation with:

[11] director_name -> Categorical, Nominal
[8] content_rating -> Categorical, Nominal
[3] title_year -> Categorical, Ordinal
[21] movie_facebook_likes -> Numerical, Discrete
[25] num_voted_users -> Numerical, Discrete

Did you create any new variables from existing variables in the dataset?

Yes, I have created a variable called profit, which is equal to “gross - budget”.
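As a minimal sketch of how such a derived column can be computed in base R, the snippet below rebuilds three rows from the head(mmd) output above. The data frame name `movies` is a hypothetical stand-in, and this is not necessarily the exact code used for the report.

```r
# Toy rows copied from the head(mmd) output above; `movies` is a hypothetical name.
movies <- data.frame(
  movie_title = c("Avatar",
                  "Pirates of the Caribbean: At World's End",
                  "Spectre"),
  gross  = c(760505847, 309404152, 200074175),
  budget = c(237000000, 300000000, 245000000),
  stringsAsFactors = FALSE
)

# Derived variable: profit is gross minus budget.
# If either gross or budget is NA, profit is NA as well.
movies$profit <- movies$gross - movies$budget

print(movies[, c("movie_title", "profit")])
```

The three profit values match the profit column printed by head(mmd): a positive value for the first two titles and a negative one for Spectre.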

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

[1] Director distribution - Unfiltered data for directors is not meaningful. I created a new data frame with directors that have more than 10

[2] Content_rating distribution - It seems that focusing only on “PG”, “PG-13”, “R” and “G” makes sense for understanding their effect on imdb score.

[3] Movie_facebook_likes distribution - It seems that focusing on a certain area gives me a better understanding of the data, so I limited movie_facebook_likes to 50000 and the count to 1000 occurrences.
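The three tidying steps above can be sketched in base R roughly as follows. The data is synthetic, the object names are hypothetical, and the director threshold is assumed to mean "more than 10 rows (movies) per director", which the text does not state explicitly.

```r
# Synthetic data for illustration only; column names follow the dataset.
movies <- data.frame(
  director_name = c(rep("Director A", 12), rep("Director B", 3)),
  content_rating = c(rep(c("PG", "PG-13", "R", "G"), 3),
                     "TV-MA", "R", "Unrated"),
  movie_facebook_likes = c(seq(0, 55000, by = 5000), 100, 60000, 2000),
  stringsAsFactors = FALSE
)

# [1] Keep directors with more than 10 rows
#     (assumption: the threshold refers to the number of movies).
director_counts <- table(movies$director_name)
frequent <- names(director_counts[director_counts > 10])
by_director <- movies[movies$director_name %in% frequent, ]

# [2] Keep only the four content ratings of interest.
main_ratings <- c("PG", "PG-13", "R", "G")
by_rating <- movies[movies$content_rating %in% main_ratings, ]

# [3] Restrict movie_facebook_likes to at most 50000.
#     subset() also drops rows where the condition is NA.
by_likes <- subset(movies, movie_facebook_likes <= 50000)

print(c(directors = nrow(by_director),
        ratings   = nrow(by_rating),
        likes     = nrow(by_likes)))
```

In a plot, the limit on the count would more likely be applied as an axis limit than as a row filter, so it is left out of this sketch.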

Bivariate Plots Section

## `geom_smooth()` using method = 'gam'
## Warning: Removed 108 rows containing non-finite values (stat_smooth).
## Warning: Removed 108 rows containing missing values (geom_point).

## `geom_smooth()` using method = 'gam'
## Warning: Removed 108 rows containing non-finite values (stat_smooth).

## Warning: Removed 108 rows containing missing values (geom_point).

## `geom_smooth()` using method = 'gam'
## Warning: Removed 8 rows containing non-finite values (stat_smooth).
## Warning: Removed 8 rows containing missing values (geom_point).

## Warning: Ignoring unknown parameters: binwidth
## `geom_smooth()` using method = 'gam'

## `geom_smooth()` using method = 'gam'
## Warning: Removed 109 rows containing non-finite values (stat_smooth).
## Warning: Removed 109 rows containing missing values (geom_point).

## `geom_smooth()` using method = 'gam'
## Warning: Removed 661 rows containing non-finite values (stat_smooth).
## Warning: Removed 661 rows containing missing values (geom_point).

## `geom_smooth()` using method = 'gam'

## `geom_smooth()` using method = 'gam'
## Warning: Removed 10 rows containing non-finite values (stat_smooth).
## Warning: Removed 10 rows containing missing values (geom_point).

## `geom_smooth()` using method = 'gam'

## Warning: Removed 108 rows containing missing values (geom_point).

## Warning: Removed 487 rows containing missing values (geom_point).

## `geom_smooth()` using method = 'gam'
## Warning: Removed 507 rows containing non-finite values (stat_smooth).
## Warning: Removed 507 rows containing missing values (geom_point).

## `geom_smooth()` using method = 'gam'
## Warning: Removed 525 rows containing non-finite values (stat_smooth).
## Warning: Removed 526 rows containing missing values (geom_point).

## `geom_smooth()` using method = 'gam'
## Warning: Removed 507 rows containing non-finite values (stat_smooth).
## Warning: Removed 508 rows containing missing values (geom_point).

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

[1] Imdb Score vs. Title Year of Movies: The average imdb score decreases over the years. This seems to be driven by the small number of movies in the early periods. As the number of movies grows, the average imdb score falls, reaching its lowest levels in the late 2000’s. In other words, there are fewer early titles, but they have better imdb scores. This is displayed in the graph below.

[2] Title Year vs. Imdb Score Distribution: Title year is one of the key indicators of how the imdb score changes over time. It will help support my investigation into the feature of interest.

[3] Budget vs. Imdb Score: The relation between imdb score and budget appears to be roughly uniform. With this data we can say that a higher budget does not guarantee a higher imdb score.

[4] Content Rating vs. Average Imdb Score: Looking deeper into content rating, we see that “R” (restricted) and “PG-13” movies have the most observations, but this does not place them at the high end of the imdb score scale. “TV-MA” (unsuitable for viewers under 17) ranks at the top, possibly because these titles are mostly broadcast on TV and tend to be viewed by a much broader audience.
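The claim in [1], that early years have few movies but higher average scores, rests on two per-year summaries: the number of movies and the mean imdb score. A minimal base R sketch of that computation is shown below; the data is synthetic and the names `movies` and `per_year` are hypothetical.

```r
# Synthetic data for illustration only.
movies <- data.frame(
  title_year = c(1950, 1950, 2010, 2010, 2010, 2010),
  imdb_score = c(8.0, 7.6, 6.0, 7.0, 5.5, 6.5)
)

# Mean score and number of movies for each title year.
# aggregate() returns the groups sorted by title_year.
mean_score <- aggregate(imdb_score ~ title_year, data = movies, FUN = mean)
n_movies   <- aggregate(imdb_score ~ title_year, data = movies, FUN = length)

per_year <- data.frame(
  title_year = mean_score$title_year,
  num        = n_movies$imdb_score,
  avg_score  = mean_score$imdb_score
)

print(per_year)
```

Putting the count next to the average makes it easier to judge whether a high yearly average is based on only a handful of titles.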

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

[1] Budget vs. Title Year of Movies: The budget of movies seems to increase from the 1920’s to the early 2000’s, and then it decreases. This might be caused by the rise of independent movies. This is displayed in the graph below.

[2] Budget vs. Profit: The graph suggests that you might increase profit by increasing the budget. For a general audience this makes sense, because people tend to watch high-budget movies, which offer spectacular visual effects.

[3] Facebook Likes vs. Title Year: I checked the relationship between movie facebook likes and title year. As I expected, the number of movie facebook likes starts increasing around 2007, which shows that newer movies tend to have more likes. Old movies do not have a facebook page unless fans of the movie created one. In contrast, newer movies are launched with a facebook page. We need more data to find the reason for the decrease around 2015.

[4] Number of Voted Users vs. Title Year: The plot shows that as a movie’s imdb score increases, people tend to like the movie on facebook. This makes sense: popular things reach a wider audience.

What was the strongest relationship you found?

[1] Profit vs. Imdb Score: When I checked the relationship between these two variables, I found that as the imdb score increases, the profit of the movie increases in a near-quadratic way. The main reason for this might be: [a] the movie became popular on imdb and then generated a large profit; or [b] the movie generated a high profit, which means a high gross, and since it was viewed by a much wider audience, it tended to be rated highly on imdb. These possible reasons will be explored in the multivariate analysis.
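One way to check a "near-quadratic" impression numerically is to compare a linear fit with a quadratic one. The sketch below does this with lm() on synthetic data that is generated to be quadratic on purpose, so it only illustrates the comparison and says nothing about the real dataset. All object names are hypothetical.

```r
# Synthetic data, deliberately quadratic; not taken from the movie dataset.
set.seed(1)
imdb_score <- seq(2, 9, by = 0.25)
profit <- 2e6 * (imdb_score - 2)^2 + rnorm(length(imdb_score), sd = 5e6)

# Straight-line fit versus a fit with an added squared term.
linear_fit    <- lm(profit ~ imdb_score)
quadratic_fit <- lm(profit ~ imdb_score + I(imdb_score^2))

r2 <- c(linear    = summary(linear_fit)$r.squared,
        quadratic = summary(quadratic_fit)$r.squared)
print(round(r2, 3))

# F-test for whether the squared term improves the fit.
print(anova(linear_fit, quadratic_fit))
```

Because the models are nested, the quadratic R-squared can never be lower than the linear one, so the F-test is the more informative output. Neither result indicates which of the explanations [a] or [b] is correct.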

Multivariate Plots Section

## `geom_smooth()` using method = 'gam'
## Warning: Removed 109 rows containing non-finite values (stat_smooth).
## Warning: Removed 109 rows containing missing values (geom_point).

## `geom_smooth()` using method = 'gam'
## Warning: Removed 44 rows containing non-finite values (stat_smooth).
## Warning: Removed 44 rows containing missing values (geom_point).

## Warning: Removed 2350 rows containing missing values (geom_point).

## `geom_smooth()` using method = 'loess'

## `geom_smooth()` using method = 'loess'
## Warning: Removed 5 rows containing non-finite values (stat_smooth).
## Warning: Removed 5 rows containing missing values (geom_point).
## Warning: Removed 5 rows containing missing values (geom_label_repel).

## `geom_smooth()` using method = 'loess'

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

[1] Title Year vs. Imdb Score vs. Facebook likes: There is a clear trend starting in the 2000’s: as imdb_score increases, movie facebook likes increase, and this effect strengthens after the 2000’s. Keeping in mind that the early adopters of facebook are millennials, it is expected that they mostly vote for movies made after the 2000’s. This is a reasonable assumption, since the ability to create lifelike animations has increased significantly.

Were there any interesting or surprising interactions between features?

[1] Title Year vs. Imdb Score vs. Content Rating: One surprising case appeared when I analysed content rating. In the univariate analysis, “R” and “PG-13” content has a huge dominance over general audience content. With this in mind, I assumed that people tend to like these kinds of movies and give them high imdb scores. But the multivariate plot shows that content ratings have a nearly uniform distribution over imdb score. I concentrated on the booming years of movie production.

[2] Avg. Imdb Score vs. Avg. Profit grouped by Title Year: Another surprising case is this analysis. I had assumed that movies made after the 2000’s would have a high average profit and imdb score. On the contrary, earlier movies have a high average profit and imdb score. This suggests that with the increasing number of movie productions in the 2000’s, the average quality of movies decreased. In addition, there are few movies from the earlier years, but their quality is high.

[3] Profit vs. imdb_score vs. num_voted_users grouped by Director Name: There is a pattern showing that highly voted movies have higher imdb scores but do not always make huge profits. The effect on profit could be examined in another study for deeper analysis.
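The yearly averages discussed in [2] can be computed in one step with aggregate(). The sketch below uses synthetic data and hypothetical names. Note one assumption-laden detail: with the formula interface, a row with NA in either variable is dropped from both averages.

```r
# Synthetic data for illustration only.
movies <- data.frame(
  title_year = c(1975, 1975, 2012, 2012, 2012),
  imdb_score = c(8.2, 7.8, 6.0, 6.6, 5.4),
  profit     = c(4e7, 2e7, -1e7, 3e7, NA)
)

# Average imdb score and average profit per title year.
# The formula interface uses na.action = na.omit, so the last row
# (profit is NA) is excluded from BOTH averages for 2012.
yearly <- aggregate(cbind(imdb_score, profit) ~ title_year,
                    data = movies, FUN = mean)

print(yearly)
```

If the score average should still use rows with a missing profit, the two columns would need to be aggregated separately with `na.rm = TRUE`.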

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.


Final Plots and Summary

Plot One

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.600   5.800   6.600   6.442   7.200   9.500

Description One

The distribution of imdb scores appears to be approximately normal. It has a median of 6.6 and a mean of 6.442; the mean sitting slightly below the median suggests a mild left skew.
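A quick numeric check of how close a distribution is to symmetric is to compare the mean with the median and to compute the sample skewness. The sketch below uses a small synthetic vector, not the real scores, and a simple moment-based skewness written out by hand so that no extra package is needed.

```r
# Synthetic scores for illustration only; not the real imdb_score column.
scores <- c(1.6, 4.9, 5.8, 6.3, 6.6, 6.9, 7.2, 7.6, 8.1)

m   <- mean(scores)
med <- median(scores)

# Moment-based sample skewness: third central moment divided by
# the second central moment raised to the power 1.5.
skew <- mean((scores - m)^3) / mean((scores - m)^2)^1.5

print(c(mean = m, median = med, skewness = skew))
```

A negative skewness together with a mean below the median points to a longer left tail, while values near zero are consistent with a symmetric shape.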

Plot Two

## Warning: Ignoring unknown parameters: binwidth
## `geom_smooth()` using method = 'gam'

Description Two

The effect of budget on imdb score appears to be uniform, so budget seems to have no real effect on imdb score.

Plot Three

## `geom_smooth()` using method = 'gam'
## Warning: Removed 44 rows containing non-finite values (stat_smooth).
## Warning: Removed 44 rows containing missing values (geom_point).

Description Three

The plot is a more focused version of the “Title Year vs. Imdb Score vs. Facebook likes” plot. It clearly indicates the rise in facebook likes between 2008 and 2013. Highly rated movies are also common in this section of the plot, together with high facebook likes.


Reflection

The imdb movie dataset, which was downloaded from kaggle.com, consists of 28 variables and 5043 observations. The data set has 15 categorical and 13 numerical variables. I started by understanding the individual variables in the data set, and then I explored interesting questions and leads as I continued to make observations on plots. Eventually, I explored the imdb score of movies across many variables and tried to find the patterns that affect the imdb score.

There was a clear pattern between budget and imdb score: budget seems to have a uniform effect on imdb score. In addition, profit seems to have a near exponential relationship with imdb score, which suggests that gross is the real driver. In the further analysis with other individual variables, facebook likes and directors appear to have a possible effect on imdb score.

Some limitations of this research stem from the source of the data. The source data is limited and does not cover a large number of observations. Moreover, many of the individual variables have missing data (NA). Since such rows are removed during the analysis, this might have a positive or negative effect on the patterns presented. A dataset with more observations and complete data would be better for predicting the effects of individual variables on imdb score. To investigate this data further, I would examine how profit is affected by the individual variables in this data set.
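To gauge how much the missing values matter, it can help to count the NAs per variable and see how many rows survive when incomplete rows are dropped. The sketch below uses three rows taken from the head(mmd) output earlier in the report (rows 1, 5 and 6); the name `movies` is a hypothetical stand-in.

```r
# Rows 1, 5 and 6 from the head(mmd) output; row 5 has several NAs.
movies <- data.frame(
  gross      = c(760505847, NA, 73058679),
  budget     = c(237000000, NA, 263700000),
  title_year = c(2009, NA, 2012),
  imdb_score = c(7.9, 7.1, 6.6)
)

# Number of missing values in each column.
na_counts <- colSums(is.na(movies))
print(na_counts)

# Rows with no missing value in any column.
complete <- movies[complete.cases(movies), ]
print(nrow(complete))
```

Comparing `nrow(complete)` with `nrow(movies)` on the full dataset would show how many observations each plot silently loses.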