MEF BDA 503

Essentials of Data Analysis

Berk Orbay

MEF University

Introduction

What is this course about?

This course has a very simple objective: to teach you the ropes of the exploratory data analysis pipeline.

The main stages of the pipeline are:

Importing raw data
Preprocessing data to prepare for analysis
Exploring and analyzing data to find valuable information
Communicating findings with verbal reporting, visualization and interactive tools
Making everything reproducible
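
The stages above can be sketched in a few lines of base R. This is only a minimal illustration, not the course's method: the data set is made-up toy data, and the column names (brand, year, units) are assumptions chosen for the example.

```r
# Toy data, made up for illustration (not real sales figures).
raw_csv <- "brand,year,units
A,2018,10
A,2019,15
B,2018,7
B,2019,NA"

# 1. Import raw data (here from a string; normally from a file or URL)
raw <- read.csv(text = raw_csv)

# 2. Preprocess: drop rows with missing unit counts
clean <- raw[!is.na(raw$units), ]

# 3. Explore: total units per brand
totals <- aggregate(units ~ brand, data = clean, FUN = sum)

# 4. Communicate: print a table (a plot or a report would go here)
print(totals)
```

Keeping every step in a script like this, rather than editing data by hand, is what makes the analysis reproducible.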

Check similar courses from this link.

What are some examples of exploratory data analysis?

These examples are from assignments and projects of previous years’ students.

Automotive sales data: Total sales per brand per year, domestic auto shares per brand, top sellers of 2019, Mercedes vs Volkswagen
Credit card spending analysis: Credit Card and Debit Card Transaction Amount, Credit Card Transaction Amount by Sector, Credit Card and Debit Card Transaction Amount for Market
BIST 30 stock analysis
Turkish Super League analysis
Global Terrorism Database analysis

You can find more examples in this post. For some short analyses you can check out the #tidytuesday hashtag on Twitter and the Github repo (see this bonus also).

What is reproducibility?

In simple terms, it is the ability to reproduce results and analysis without much effort.

A fully reproducible analysis should include raw data and clear methodology (also, if possible, code) behind the analysis.

There are consequences if a study is not reproducible. Read more about the reproducibility crisis here.

See a webinar here about how to create reproducible reports with R Markdown. We will learn the basics of R Markdown as soon as possible and do similar stuff.

Additionally, if you can generalize your templates, you can reproduce the study for different data sets in the same topic. It can be financial reports for companies, weather reports for cities etc. See an example here about elections (in Turkish) where each city and county’s reports are generated using a single template.
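
A single template can be rendered once per data set with a parameterized R Markdown report. The sketch below is an assumption-laden illustration: the template file city_report.Rmd, its city parameter, and the city list are hypothetical, and rendering only happens if such a template actually exists (with the rmarkdown package and pandoc installed).

```r
# Hypothetical inputs: a template declaring a `city` parameter in its YAML header.
cities <- c("Istanbul", "Ankara", "Izmir")
template <- "city_report.Rmd"

# One output file name per city
output_files <- paste0("report_", tolower(cities), ".html")

# Render the same template once per city, only if the template is present.
if (file.exists(template)) {
  for (i in seq_along(cities)) {
    rmarkdown::render(
      input = template,
      params = list(city = cities[i]),
      output_file = output_files[i]
    )
  }
}
```

Inside the template, the current value would be available as params$city.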

Setup

Which tools are we going to use? (Programs)

The download checklist is below. Everything is completely free.

R Programming Language to code (we will also install packages)
RStudio Desktop as IDE
pandoc for document conversion
LaTeX for PDF documents (we will not cover LaTeX notation)
- Alternatively you can use TinyTeX
Github Desktop for codebase updates (you can use alternatives)
- You will need a Github account
Slack for live communication (you can also use the web client)

Optional and advanced

Docker for containerization (if time permits)

Which tools are we going to use? (Packages)

Packages are collections of code that serve either general-purpose or specific needs. You can also think of them as modules.

R has many useful packages. Our course depends on several main packages and package collections.

tidyverse is a very useful package collection
- dplyr for data manipulation
- ggplot2 for data visualization
RMarkdown for document creation
- We will also use xaringan package for presentations.
- Alternatively you can use Quarto for both document creation and presentations. (Both your course page and this presentation are created with Quarto)
shiny for interactive analysis

Plus, we will be using many other packages during the course, either as dependencies (e.g. tibble) or as supporting packages (e.g. lubridate).
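
To give a first taste of dplyr and ggplot2 working together, here is a minimal sketch using the mtcars data set that ships with R. The choice of data set and the variable names (summary_tbl, mean_mpg, p) are assumptions for illustration; the code is guarded so it only runs when both packages are already installed.

```r
# Run only if both packages are installed (install.packages("tidyverse") otherwise).
if (requireNamespace("dplyr", quietly = TRUE) &&
    requireNamespace("ggplot2", quietly = TRUE)) {
  library(dplyr)
  library(ggplot2)

  # dplyr: average miles per gallon for each cylinder count
  summary_tbl <- mtcars %>%
    group_by(cyl) %>%
    summarise(mean_mpg = mean(mpg))

  # ggplot2: bar chart of the summary table
  p <- ggplot(summary_tbl, aes(x = factor(cyl), y = mean_mpg)) +
    geom_col() +
    labs(x = "Cylinders", y = "Mean miles per gallon")

  print(p)
}
```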

Packages 101 - A short introduction

We install packages with the install.packages command to make them available for our use. This effectively downloads the package. Remember, you need to install only once!

Once installed, we can make use of (load) packages with library command (You can also use require). Then we can use everything inside that package.

Advanced: There is an alternative way to use functions in packages. Once installed, you can directly refer to them using ::.
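
The three ideas above (install, load, and ::) look roughly like this in practice. This is a small sketch: dplyr is used only as an example package name, and the stats package is used for the runnable lines because it ships with R.

```r
# Install once per machine (needs an internet connection), so it is commented out here.
# install.packages("dplyr")

# Load a package for the current session. "stats" ships with R,
# so this works without installing anything.
library(stats)
med_loaded <- median(c(3, 1, 2))

# Call a function without loading its package, using ::
med_direct <- stats::median(c(3, 1, 2))

# Check whether a package is installed before relying on it.
has_dplyr <- requireNamespace("dplyr", quietly = TRUE)
```

The :: form is handy when you need a single function, or when two loaded packages export functions with the same name.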

p.s: You can learn more about packages from this tutorial.

Datacamp and Supplementary Online Learning

We are going to make use of Datacamp courses throughout the semester.

Datacamp for Education provides free access to Datacamp resources for 6 months. You are going to get your invitations. Datacamp usually charges monthly fees but you do not need to pay anything since you are a student of this course.

Since this semester is completely online, we will rely more on online sources. You will have timed assignments on Datacamp but it will not affect your grade.

Optionally you are encouraged to learn from other sources. A list will be given in a separate slide.

Github & Github Classroom

We are going to use Github Classroom as the main assignment submission platform.

All students are going to have two repositories (where you will submit the code files):
- Individual repository for their assignments
- Group repository for their group assignments and group project

Invitation links will be provided by the instructor to student email addresses.

The first assignment will be to set up your Github Pages. You can check the tutorial in the link (choose Project Site) but Github Classroom will handle the initial setup.

Don’t forget to download Github Desktop. If you use Linux, try shiftkey releases.

IMPORTANT (optional): It is highly recommended to get your Github Student Pack

Markdown (and R Markdown)

Markdown is a special and minimal syntax “language” which is also used in R Markdown documents.

Although there are some differences between Markdown flavors, Github’s guide is a good start. It is very quick and easy to learn (takes ~5 min to get the basics and you can always use a cheatsheet).

Here are some markdown editors to use as playgrounds: Stackedit, Dillinger, jbt

R Markdown is essentially Markdown + R Code. You can start learning from here.
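
To make "Markdown + R Code" concrete, the sketch below writes a tiny R Markdown file from R. The file name, title, and contents are made up for illustration; rendering is left commented out because it needs the rmarkdown package and pandoc.

```r
# Three backticks open and close a code chunk in R Markdown.
fence <- strrep("`", 3)

rmd_lines <- c(
  "---",                                   # YAML header: document settings
  "title: \"My first report\"",
  "output: html_document",
  "---",
  "",
  "## A heading in Markdown",              # plain Markdown
  "",
  "Some *emphasized* text, then an R chunk:",
  "",
  paste0(fence, "{r}"),                    # start of an R code chunk
  "summary(cars)",                         # R code that runs when rendering
  fence                                    # end of the chunk
)

# Write the document to a temporary folder
rmd_path <- file.path(tempdir(), "first_report.Rmd")
writeLines(rmd_lines, rmd_path)

# rmarkdown::render(rmd_path)  # uncomment once rmarkdown and pandoc are installed
```

In RStudio you would normally create such a file from the menu and press Knit instead of building it by hand.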

Important Update - Quarto (2022)

In July 2022, RStudio announced that they will change their name to Posit starting from November 2022. They are going to be more inclusive towards Python.
They announced Python Shiny (still in Alpha phase).
They also announced Quarto will supersede RMarkdown. Announcement came a little late, so there are no lecture notes about Quarto. However, a list of sources and quality guides will be included in a separate document.
Ultimately these updates are to your benefit. You will be able to integrate Python and R as seamlessly as possible. Also, some of what you learned in this class will be applicable to Python.

Setup Checklist

It is actually not very complicated :) Here is a checklist to get you started.

Download required programs and make sure they are working.
Open a Github account.
Open a Slack account.
When it comes, accept the Datacamp invitation.
When it comes, accept the Github Classroom invitation.

7-week Course Schedule

Course Expectations

Minimal learning expectations from this course are very clearly defined.

Fair understanding of data manipulation (using dplyr) and data visualization (using ggplot2).
Ability to analyze data and communicate findings with clear and coherent reporting (using rmarkdown).
Ability to create interactive analysis systems (using shiny).
Ability to deploy and publish (using Github Pages and shinyapps.io).
Basic understanding of Python & R interoperability.

It is up to the student to extend their learning experience. Both R and the data science field have much to offer. For instance, there are topics like cloud computing, package making/management, containerization, advanced modelling, process automation. There are also numerous interesting R packages making analytics life easier.

7-week Course Schedule

We have lectures every second Wednesday between 18:30-21:30. Times may vary occasionally.

Tentative Schedule

Schedule may change with progress. We can have some bonus material to cover.

Week 1: Setup (rmarkdown) and Base R

Weeks 2-3: Data Processing, Visualization and Reporting (dplyr + ggplot2)

Week 4: Interactive Analysis (shiny) and Packaging

Week 5: Data Processing, Visualization and Reporting - 2

Week 6: R & Python Interoperability

Week 7: Recap, Presentations and Final (Take Home)

Course Medium: Online Lectures

At the start, the instructor will deliver a short lecture (15-30 mins) either live or from a recorded video.
Most communication will be on Slack (and perhaps Google Meet 1-to-1) for “live support”. It means questions of any kind can be asked on Slack.
There might be “tasks” at each lecture to be finished within lecture hours. Most of the time it will be self-learning from a source and exercises to be finished.
Most 3-hour lectures will be held as a single block, ending at 21:30. Since most of it will be spent with self-learning, students can arrange their own breaks.
There are no regular office hours. We will use Slack.
We may have guests from both academia and professional domains to talk about data science practices in real-life problems.

Grading

In-class exercises & Homework: 30%
- There will be a number of graded exercises and homeworks. There are no individual weights; the grade will be determined as a whole.
- All homeworks should be on your public Github Pages and explicitly linked from your main Progress Journal. Otherwise, they will not be graded.
Group Project: 30%
Take home final: 40%
Bonuses!

Groups and Group Project

As a group project you will be asked to do a complete data analysis on a relevant real-life data set. Check previous examples.
Groups should have 4 to 5 students. Fewer or more will not be allowed.
You need to form your groups by the next lecture at 18:30. For those without a group or groups without enough members, there will be a random assignment.
Once settled, you cannot change your group under any circumstances.
There will be group assignments as well.

External Resources (Streams)

There are lots of fantastic resources out there. I suggest signing up for R newsletters and following people on Twitter. Static sources are on the course webpage.


Course “Ethics”

These are very simple rules to follow to ensure a fair and productive course experience.

Please do collaborate with your classmates.
Please do check the internet and other sources for solutions. You may occasionally come across exact solutions to some homeworks and exercises but it is up to you to learn or simply paste and pass.
Please do not blatantly copy and paste stuff, and always provide references to your sources (with links). It is completely OK to say “oh I found this amazing plotting code on this link and using it for my analysis”. It is actually very desirable.

But it is not OK to say “Here is a complete copy paste of some guy’s analysis from Kaggle as my assignment”. It is a direct F and a report to the department.

Please do not submit the same work as your classmate verbatim. Write your own code even if you peek.

Summary

This course is about end-to-end analytics and reproducibility.
Complete your setup (install programs and packages, sign up for services; remember, everything is free!).
Build your Github Pages webpage. Learn how to push to your repository.
Finish your first assignment by next week.
Determine your group and email it to the instructor (group name + members) as soon as possible.

Thanks!

Course webpage https://mef-bda503.github.io/fall22