Below are some interesting R examples of Exploratory Data Analysis.
This is an early Exploratory Data Analysis for the “Personalized Medicine: Redefining Cancer Treatment” challenge. It uses ggplot2 and tidyverse tools to study and visualise the structures in the data.
The data comes in four files: two csv files and two text files.
training/test variants: These are csv catalogues of the gene mutations together with the target value Class, which is the (manually) classified assessment of the mutation. The feature variables are Gene, the specific gene where the mutation took place, and Variation, the nature of the mutation. The test data of course doesn’t have the Class values; that is what we have to predict. Each of these two files is linked through an ID variable to a corresponding text file, namely:
training/test text: These contain an extensive description of the evidence that was used (by experts) to manually label the mutation classes.
The text information holds the key to the classification problem and will have to be understood and modelled well to achieve useful accuracy.
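A minimal sketch of loading and joining these files with the tidyverse, assuming the standard Kaggle paths and the “||” separator used in the raw text files:

```r
library(tidyverse)

# Read the csv catalogue of mutations (ID, Gene, Variation, Class);
# the path assumes the usual Kaggle input layout.
train_variants <- read_csv("../input/training_variants")

# The text file is delimited as "ID||Text" (an assumption based on the raw
# file format), so read it line by line and split on the double pipe.
train_text <- tibble(line = read_lines("../input/training_text", skip = 1)) %>%
  separate(line, into = c("ID", "Text"), sep = "\\|\\|", extra = "merge") %>%
  mutate(ID = as.integer(ID))

# Link each variant to its clinical evidence through the shared ID variable.
train <- train_variants %>% left_join(train_text, by = "ID")

# First look at the balance of the target variable.
train %>% count(Class, sort = TRUE)
```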
https://www.kaggle.com/headsortails/personalised-medicine-eda-with-tidy-r
Instacart is an internet-based grocery delivery service whose slogan is “Groceries Delivered in an Hour”. The purpose of this exercise is to analyze trends in customer buying patterns on Instacart and to suggest combinations of products that can be sold together under various offers.
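A hedged sketch of such a market basket analysis with the arules package, assuming the Instacart release files order_products__prior.csv and products.csv:

```r
library(tidyverse)
library(arules)

# Assumed file names from the Instacart data release.
order_products <- read_csv("order_products__prior.csv")   # order_id, product_id
products       <- read_csv("products.csv")                # product_id, product_name

# Build one basket per order, with product names as items
# (unique() guards against duplicate items within an order).
baskets <- order_products %>%
  left_join(products, by = "product_id") %>%
  group_by(order_id) %>%
  summarise(items = list(unique(product_name))) %>%
  ungroup()

trans <- as(baskets$items, "transactions")

# Mine association rules to suggest products frequently bought together.
rules <- apriori(trans,
                 parameter = list(supp = 0.001, conf = 0.2, minlen = 2))
inspect(head(sort(rules, by = "lift"), 10))
```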
https://www.kaggle.com/swapnil2129/exploratory-data-analysis-market-basket-analysis
IBM HR Analytics Employee Attrition & Performance: predict attrition of your valuable employees.
Uncover the factors that lead to employee attrition and explore important questions such as “show me a breakdown of distance from home by job role and attrition” or “compare average monthly income by education and attrition”. This is a fictional data set created by IBM data scientists.
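A minimal ggplot2 sketch of those two views, assuming the column names of the IBM csv (Attrition, JobRole, DistanceFromHome, MonthlyIncome, Education):

```r
library(tidyverse)

# Assumed file name from the IBM HR Analytics dataset.
hr <- read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")

# Breakdown of distance from home by job role and attrition.
ggplot(hr, aes(x = JobRole, y = DistanceFromHome, fill = Attrition)) +
  geom_boxplot() +
  coord_flip() +
  labs(title = "Distance from home by job role and attrition")

# Compare average monthly income by education and attrition.
hr %>%
  group_by(Education, Attrition) %>%
  summarise(avg_income = mean(MonthlyIncome)) %>%
  ungroup() %>%
  ggplot(aes(x = factor(Education), y = avg_income, fill = Attrition)) +
  geom_col(position = "dodge") +
  labs(x = "Education level", y = "Average monthly income")
```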
https://www.kaggle.com/mbkinaci/comprehensive-eda-on-r-with-logistic-regression
Using data from Text Normalization Challenge - English Language
Extensive Exploratory Data Analysis for the Google Text Normalization Challenge - English Language competition with tidy R, ggplot2, and tidytext.
The aim of this challenge is to “automate the process of developing text normalization grammars via machine learning” (see the challenge description). Text normalisation describes the process of transforming language into a specific, self-consistent grammar system with well-defined rules. This analysis aims to convert “written expressions into appropriate ‘spoken’ forms”.
The data comes in the shape of two files: ../input/en_train.csv and ../input/en_test.csv. Each row contains a single language element (such as words, letters, or punctuation) together with its associated identifiers. The evaluation metric is the total percentage of correctly translated tokens.
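A small sketch of that metric on the training file, assuming its usual columns (sentence_id, token_id, class, before, after). The trivial baseline of leaving every token unchanged gives a reference point, and a per-class breakdown shows where normalisation actually changes tokens:

```r
library(tidyverse)

# Assumed columns: sentence_id, token_id, class, before, after.
train <- read_csv("../input/en_train.csv")

# Baseline: predict that each token stays unchanged (prediction = before)
# and measure the fraction of correctly "translated" tokens.
train %>%
  mutate(correct = before == after) %>%
  summarise(baseline_accuracy = mean(correct))

# Share of unchanged tokens by class, to see where normalisation matters.
train %>%
  group_by(class) %>%
  summarise(unchanged = mean(before == after), n = n()) %>%
  arrange(unchanged)
```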
https://www.kaggle.com/headsortails/watch-your-language-update-feature-engineering
A snapshot of American history in the making: Charlottesville is home to a statue of Robert E. Lee which is slated to be removed. The dataset below captures the discussion, and copious amounts of anger, revolving around this past week’s events.
This data set consists of a random sample of 50,000 tweets per day (in accordance with the Twitter Developer Agreement) mentioning Charlottesville or containing “#charlottesville”, extracted via the Twitter Streaming API starting on August 15. The files were copied from a large Postgres database that currently contains over 2 million tweets.
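A minimal tidytext sketch of a first look at the tweet text; the file and column names here (aug15_sample.csv, full_text) are assumptions for illustration, not the dataset’s documented schema:

```r
library(tidyverse)
library(tidytext)

# Assumed: one csv per sampled day, with the tweet text in a column full_text.
tweets <- read_csv("aug15_sample.csv")

# Tokenise the tweet text and count the most frequent terms,
# dropping stop words, URL fragments, and the search term itself.
tweets %>%
  unnest_tokens(word, full_text) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!word %in% c("charlottesville", "https", "t.co", "rt", "amp")) %>%
  count(word, sort = TRUE) %>%
  head(20)
```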
https://www.kaggle.com/vincela9/r-exploratory-data-analysis-on-charlottesville/data