Essentials of Data Analysis
MEF University
This course has a very simple objective: to teach you the ropes of the exploratory data analysis pipeline.
The main stages of the pipeline are
Check similar courses from this link.
These examples are from assignments and projects of previous years’ students.
Automotive sales data: Total sales per brand per year, domestic auto shares per brand, top sellers of 2019, Mercedes vs Volkswagen
Credit card spending analysis: Credit Card and Debit Card Transaction Amount, Credit Card Transaction Amount by Sector, Credit Card and Debit Card Transaction Amount for Market
You can find more examples in this post. For some short analyses, you can check out the #tidytuesday hashtag on Twitter and its GitHub repo (see this bonus also).
In simple terms, reproducibility is the ability to reproduce results and analysis without much effort.
A fully reproducible analysis should include raw data and clear methodology (also, if possible, code) behind the analysis.
There are consequences if a study is not reproducible. Read more about the reproducibility crisis here.
See a webinar here about how to create reproducible reports with R Markdown. We will learn the basics of R Markdown as soon as possible and do similar work.
Additionally, if you can generalize your templates, you can reproduce the study for different data sets in the same topic. It can be financial reports for companies, weather reports for cities etc. See an example here about elections (in Turkish) where each city and county’s reports are generated using a single template.
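The single-template idea can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the weather data frame holds made-up values, and the names weather and report_template are hypothetical, not part of the course material.

```r
# Made-up illustrative data: one row per city (values are NOT real measurements).
weather <- data.frame(
  city = c("Istanbul", "Ankara", "Izmir"),
  avg_temp_c = c(15.5, 12.0, 18.5),
  stringsAsFactors = FALSE
)

# A single template with placeholders: %s for the city, %.1f for the temperature.
report_template <- "Weather report for %s: average temperature %.1f C"

# sprintf() is vectorized, so one call fills the template once per row.
reports <- sprintf(report_template, weather$city, weather$avg_temp_c)

print(reports)
```

With R Markdown, the same idea is usually expressed through the params field of a document and the params argument of rmarkdown::render, so one .Rmd file can produce a report per city or company.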
The download checklist is below. Everything is completely free.
Optional and advanced
Packages are collections of code that serve general-purpose or specific needs. You can also think of them as modules.
R has many useful packages. Our course depends on several main packages and package collections.
In addition, we will be using many other packages during the course, either as dependencies (e.g. tibble) or as supporting packages (e.g. lubridate).
We install packages with the install.packages command to make them available for our use. You are effectively downloading the package. Remember, you need to install a package only once!
Once a package is installed, we can load it with the library command (you can also use require). Then we can use everything inside that package.
Advanced: There is an alternative way to use functions in packages. Once a package is installed, you can refer to its functions directly using the :: operator, in the form package::function.
p.s: You can learn more about packages from this tutorial.
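A minimal sketch of the three ways of working with packages described above. It uses stats and utils, which ship with every R installation, so that it runs without downloading anything; the course packages (dplyr, ggplot2, etc.) are used in the same way once installed.

```r
# Install once per machine (downloads the package from CRAN).
# Commented out so that running this sketch does not trigger a download.
# install.packages("dplyr")

# Load a package for the current session.
library(stats)

# require() also loads a package, but returns TRUE or FALSE
# instead of stopping with an error when the package is missing.
has_utils <- require(utils)

# The :: operator calls a function from an installed package
# without attaching the whole package first.
middle <- stats::median(c(1, 3, 10))

print(has_utils)
print(middle)
```

Using :: is handy when you need a single function or when two loaded packages export functions with the same name.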
We are going to make use of Datacamp courses throughout the semester.
Datacamp for Education provides free access to Datacamp resources for 6 months. You are going to get your invitations. Datacamp usually charges monthly fees but you do not need to pay anything since you are a student of this course.
Since this semester is completely online, we will rely more on online sources. You will have timed assignments on Datacamp but they will not affect your grade.
Optionally you are encouraged to learn from other sources. A list will be given in a separate slide.
We are going to use Github Classroom as the main assignment submission platform.
All students are going to have two repositories (where you will submit the code files): an individual repository for their assignments and a group repository for their group assignments and group project.
Invitation links will be provided by the instructor to student email addresses.
The first assignment will be to set up your Github Pages. You can check the tutorial in the link (choose Project Site) but Github Classroom will handle the initial setup.
Don’t forget to download Github Desktop. If you use Linux, try shiftkey releases.
IMPORTANT (optional): It is highly recommended to get your Github Student Pack
Markdown is a special and minimal syntax “language” which is also used in R Markdown documents.
Although Markdown flavors differ somewhat in syntax, Github’s guide is a good start. Markdown is very quick and easy to learn (it takes ~5 min to get the basics and you can always use a cheatsheet).
Here are some markdown editors as playground: Stackedit, Dillinger, jbt
R Markdown is essentially Markdown + R Code. You can start learning from here.
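To make "Markdown + R Code" concrete, here is a small sketch that writes a minimal R Markdown document to a temporary file and renders it only if the rmarkdown package and pandoc are available. The file name first_report.Rmd, the title and the section heading are illustrative choices, not course requirements.

```r
# Three backticks, built programmatically so this example
# can itself be shown inside a code fence.
fence <- strrep("`", 3)

rmd_lines <- c(
  "---",                                # YAML header: document metadata
  "title: \"My first report\"",
  "output: html_document",
  "---",
  "",
  "## Speed and stopping distance",     # Markdown heading
  "",
  "Plain **Markdown** text goes here.", # Markdown text with bold emphasis
  "",
  paste0(fence, "{r}"),                 # start of an R code chunk
  "summary(cars)",                      # R code, run when the document is rendered
  fence                                 # end of the chunk
)

# Write the document to a temporary folder.
rmd_path <- file.path(tempdir(), "first_report.Rmd")
writeLines(rmd_lines, rmd_path)

# Render to HTML only when the required tools are installed.
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  html_path <- rmarkdown::render(rmd_path, quiet = TRUE)
  print(html_path)
}
```

In practice you would create the .Rmd file in RStudio and press Knit; the sketch only shows which parts are Markdown and which part is R.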
In July 2022, RStudio announced that they will change their name to Posit starting from November 2022. They are going to be more inclusive towards Python.
They announced Shiny for Python (still in alpha).
They also announced that Quarto will supersede R Markdown. The announcement came a little late, so there are no lecture notes about Quarto. However, a list of sources and quality guides will be included in a separate document.
Ultimately these updates are to your benefit. You will be able to integrate Python and R as seamlessly as possible. Also, some of what you learned in this class will be applicable to Python.
It is actually not very complicated :) Here is a checklist to get you started.
Download required programs and make sure they are working.
Open a Github account.
Open a Slack account.
Accept the Datacamp invitation when it arrives.
Accept the Github Classroom invitation when it arrives.
Minimal learning expectations from this course are very clearly defined.
Fair understanding of data manipulation (using dplyr) and data visualization (using ggplot2).
Ability to analyze data and communicate findings with clear and coherent reporting (using rmarkdown).
Ability to create interactive analysis systems (using shiny).
Ability to deploy and publish (using Github Pages and shinyapps.io).
Basic understanding of Python & R interoperability.
It is up to the student to extend their learning experience. Both R and the data science field have much to offer. For instance, there are topics like cloud computing, package making/management, containerization, advanced modelling, process automation. There are also numerous interesting R packages making analytics life easier.
We have lectures every second Wednesday from 18:30 to 21:30. Times may vary occasionally.
The schedule may change as we progress. We may also have some bonus material to cover.
Week 1: Setup (rmarkdown) and Base R
Weeks 2-3: Data Processing, Visualization and Reporting (dplyr + ggplot2)
Week 4: Interactive Analysis (shiny) and Packaging
Week 5: Data Processing, Visualization and Reporting - 2
Week 6: R & Python Interoperability
Week 7: Recap, Presentations and Final (Take Home)
At the start of each session, the instructor will deliver a short lecture (15-30 mins), either live or as a recorded video.
Most communication will be on Slack (and perhaps Google Meet 1-to-1) for “live support”. It means questions of any kind can be asked on Slack.
There might be “tasks” at each lecture to be finished within lecture hours. Most of the time it will be self-learning from a source and exercises to be finished.
Most 3-hour lectures will run as a single block, ending at 21:30. Since most of the time will be spent on self-learning, students can arrange their own breaks.
No regular office hours. We will use Slack.
We may have guests from both academia and industry to talk about data science practices in real-life problems.
In-class exercises & Homework: 30%
Group Project: 30%
Take home final: 40%
Bonuses!
As a group project you will be asked to do a complete data analysis on a relevant real-life data set. Check previous examples.
Groups should have 4 to 5 students. Smaller or larger groups will not be allowed.
You need to form your groups by the next lecture (18:30). Students without a group and groups short of members will be assigned randomly.
Once settled, you cannot change your group under any circumstances.
There will be group assignments as well.
There are lots of fantastic resources out there. I suggest signing up for R newsletters and following people on Twitter. Static sources are on the course webpage.
These are very simple rules to follow to ensure a fair and productive course experience.
Please do collaborate with your classmates.
Please do check the internet and other sources for solutions. You may occasionally come across exact solutions to some homework assignments and exercises, but it is up to you whether to learn from them or simply paste and pass.
Please do not blatantly copy and paste, and always provide references to your sources (with links). It is completely OK to say “oh I found this amazing plotting code at this link and I am using it for my analysis”. It is actually very desirable.
But it is not OK to submit “Here is a complete copy paste of some guy’s analysis from Kaggle as my assignment”. That means a direct F and a report to the department.
This course is about end-to-end analytics and reproducibility.
Complete your setup (install programs and packages, sign up for services; remember, everything is free!).
Build your Github Pages webpage. Learn how to push to your repository.
Finish your first assignment by next week.
Determine your group and email it to the instructor (group name + members) as soon as possible.
Course webpage https://mef-bda503.github.io/fall22