How People Order Online? Analysis of 3 million Instacart orders.

BDA-503 - Term Project - Berk Orbay
29.10.2017

As group cleveR, we will work on the online order behaviours of Instacart customers with group members:

About the Company

Instacart is a San Francisco based online shopping website that provide service for online grocery shopping orders and same-day delivery service from various local stores. Instacart has founded in San Francisco on June 2012 and operating in 39 states across United Satates of America today.

Our Purpose

In this project, our work will be composed of 3 main areas.

1.Exploratory Analysis on the data
2.Segmentation
3.Recommendation

Our study will begin with exploratory data analysis. We will try to explain customer behaviours and pruchasing patterns.We will try to reach facts like “what time of the day customers order most”, “What is avarege basket size?”, “Which products ordered togather” etc.

After exploratory analysis, we will try to group customers according to their purchasing behaviours. We will exemine their orders according to product category, order time, order amount in order to find similarities and differences.

At the end, we will try to find what a customer most likely to buy in his/her next order and try to suggest them the products.

About the Data

The dataset contains over 3 million grocery orders and consist of 6 csv files. One of them is a train dataset. Lets look at the row numbers and variables in each file:

Products

products<-read.csv("data/products.csv")

Product has 4 attributes which are product_id, product_name, aisle_id, department_id

kable(head(products,5))
product_id product_name aisle_id department_id
1 Chocolate Sandwich Cookies 61 19
2 All-Seasons Salt 104 13
3 Robust Golden Unsweetened Oolong Tea 94 7
4 Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce 38 1
5 Green Chile Anytime Sauce 5 13

Aisles

aisles<-read.csv("data/aisles.csv")

Aisle has 4 attributes which are product_id, product_name, aisle_id, department_id

kable(head(aisles,5))
aisle_id aisle
1 prepared soups salads
2 specialty cheeses
3 energy granola bars
4 instant foods
5 marinades meat preparation

Departments

departments<-read.csv("data/departments.csv")

Department has 4 attributes which are product_id, product_name, aisle_id, department_id

kable(head(departments,5))
department_id department
1 frozen
2 other
3 bakery
4 produce
5 alcohol

Orders

orders<-read.csv("data/orders.csv")

Order has 4 attributes which are product_id, product_name, aisle_id, department_id

kable(head(orders,5))
order_id user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order
2539329 1 prior 1 2 8 NA
2398795 1 prior 2 3 7 15
473747 1 prior 3 3 12 21
2254736 1 prior 4 4 7 29
431534 1 prior 5 4 15 28

Order_Products_Train

op_train<-read.csv("data/order_products__train.csv")

Order product train has 4 attributes which are order_id, product_id, add_to_cart_order, reordered

kable(head(op_train,5))
order_id product_id add_to_cart_order reordered
1 49302 1 1
1 11109 2 1
1 10246 3 0
1 49683 4 0
1 43633 5 1

Order_Products_Prior

op_prior<-read.csv("data/order_products__prior.csv")

Order product train has 4 attributes which are order_id, product_id, add_to_cart_order, reordered

kable(head(op_prior,5))
order_id product_id add_to_cart_order reordered
2 33120 1 1
2 28985 2 1
2 9327 3 0
2 45918 4 1
2 30035 5 0

DataJoinProcess

orders_product<-rbind(op_prior,op_train)
orders_full<-left_join(orders,orders_product,by="order_id")
orders_full<-left_join(orders_full,products,by="product_id")
orders_full<-left_join(orders_full,aisles,by="aisle_id")
orders_full<-left_join(orders_full,departments,by="department_id")
rm(orders_product, aisles,op_prior,op_train,orders,products,departments)
#orders_full <-orders_full %>% filter(row_number()<=1000000)
order_number_by_user_id<-orders_full %>%  group_by(user_id) %>% summarise(number_of_order=n_distinct(order_id)) 


General Information

Data includes 3421083 order. There are 206209 users who placed on order. Minumum of order placed by the users is 4 . Maxiumum of order placed by the users is 100

#rm(order_number_by_user_id)
temp<- orders_full %>% group_by(department) %>% summarise(count=n()) %>% mutate(per=count/sum(count) ) %>% arrange(desc(per))
per<- temp %>% filter(row_number()<=4) %>% summarise(sum=round(sum(per),2)*100) 


Bestseller Departmants

When we look at the number of products sold in each department, departmant produce and dairy eggs are obvious leaders in sales quantity. Produce departmant contain products such as vegatable and fruit in the form of fresh or package whereas dairy eggs contain yogurt, packaged cheese, milk and eggs, etc. Also we can see that 63 % of the sales are come from 4 departrments.

Product_number<-orders_full %>% filter(!is.na(department)) %>%  group_by(department) %>% summarise(Number_of_Product=n()) %>% arrange(desc(Number_of_Product)) 
ggplot(Product_number,aes(x=reorder(department,-Number_of_Product),y=Number_of_Product)) +geom_bar(fill="#56B4E6", stat = "identity") + scale_x_discrete(name="Department") + theme( axis.text.x = element_text(angle =90)) + ggtitle('Number of Product Sold by Department') 

rm(Product_number,per,temp)


Bestseller Aisles

Most ordered product’s aisles are looking in tandem with product’s departmant. Fresh vegatables and fruits aisles take the lead in sales quantity by far.

Product_number<-orders_full %>% filter(!is.na(aisle)) %>%  group_by(aisle) %>% summarise(Number_of_Product=n()) %>% arrange(desc(Number_of_Product)) 
ggplot(Product_number %>%  filter(row_number()<=15),aes(x=reorder(aisle,-Number_of_Product),y=Number_of_Product)) +geom_bar(position="dodge",fill="#56B4E6", stat = "identity") + theme(axis.text.x = element_text(angle =90)) + scale_x_discrete(name="Aisle")+ggtitle('Number of Product Sold by Aisle') 

rm(Product_number)


Bestseller Products

If we put top 15 products on the chart, we can see that Bananas are the most ordered products followed by Bag of Organic Bananas and Organic Strawberries.

#Sira halinde gösterim yapilsin
Product_number<-orders_full %>% filter(!is.na(product_name)) %>%  group_by(product_name) %>% summarise(Number_of_Product=n()) %>% arrange(desc(Number_of_Product)) 

ggplot(Product_number %>%  filter(row_number()<=15),aes(x=reorder(product_name,-Number_of_Product),y=Number_of_Product)) +geom_bar(fill="#56B4E6", stat = "identity") +scale_x_discrete(name="Product") + theme(axis.text.x = element_text(angle =90)) + ggtitle('Number of Product Sold') 

rm(Product_number)


Order Number according to Day Of Week

It can be said that customers order at weekends more. (In the row data days described by number from 0 to 6. We assume that 0 equals Saturday)

#ls(orders_full)
orders_full<-orders_full %>% mutate(order_day=ifelse(order_dow==0,"Sat",
ifelse(order_dow==1,"Sun",
ifelse(order_dow==2,"Man",
ifelse(order_dow==3,"Tue",
ifelse(order_dow==4,"Wed",
ifelse(order_dow==5,"Thu",
ifelse(order_dow==6,"Fri",NA))))))))


order_number_by_day<-orders_full %>%  group_by(order_day,order_dow) %>% summarise(number_of_order=n_distinct(order_id))
ggplot(order_number_by_day,aes(x=reorder(order_day,order_dow),y=number_of_order)) +geom_bar(fill="#56B4E6", stat = "identity")+ scale_x_discrete(name="Day") + ggtitle('Number of Order by Day of Week') 

rm(order_number_by_day)


Order Number according to Day Of Week

Most orders come during daytime, from 7 am to 7 pm, which is not very surprising.

#ls(orders_full)
#orders_full <- mutate(orders_full , order_day=derivedFactor('Sat'=order_dow==0,'Sun'=order_dow==1,'Man'=order_dow==2,'Tue'=order_dow==3,'Wed'=order_dow==4,'Thu'=order_dow==5,'Fri'=order_dow==6 ,.default = NA))
order_number_by_hour<-orders_full %>%  group_by(order_hour_of_day) %>% summarise(number_of_order=n_distinct(order_id))
ggplot(order_number_by_hour,aes(x=order_hour_of_day,y=number_of_order)) +geom_bar(fill="#56B4E6", stat = "identity")  + ggtitle('Number of Order by Department') 

rm(order_number_by_hour)


Item Number in the Orders

People mostly add to cart less than 10 items when they order. As it clear in the grap, after 6 item per order there is negative correlation between number of items in the order and order number.

Max_Products_by_Order_Id<-orders_full %>%  group_by(order_id) %>% summarise(max_product=max(add_to_cart_order)) %>% arrange(max_product)
ggplot(Max_Products_by_Order_Id,aes(x=max_product)) + geom_histogram(color="red", fill="#56B4E6" ,bins = 50)  + 
  theme(axis.text.x = element_text(angle =0), axis.title.x=element_blank()) +
  ggtitle('Add to Chart Order Histogram')
## Warning: Removed 75000 rows containing non-finite values (stat_bin).

rm(Max_Products_by_Order_Id) 


First Order Ratio by Product

People put those products first to cart if they buy them. Emergency contraceptive, Rethab Energy Iced Tea and California Champaigne around 80% of time added the cart first when they were bought.

first_add_product<-orders_full %>%  group_by(product_name) %>% summarise(first_percentage=round(n_distinct(order_id[add_to_cart_order==1]) /n(),2),count_product=n_distinct(order_id[add_to_cart_order==1] )) %>% arrange(desc(first_percentage)) %>% filter(count_product>10)

ggplot(first_add_product%>%filter(row_number()<=15) ,aes(x=reorder(product_name,-first_percentage),y=first_percentage))  +geom_bar(fill="#56B4E6", stat = "identity")+ scale_x_discrete(name="Product Name")  + ggtitle('First Order Percentage by Product') +theme(axis.text.x = element_text(angle =90)) +
rm(first_add_product) 



Day Since Prior Order

People tend to order more in 7 or less days after their prior order

days_since_prior_order_data<-orders_full %>%  group_by(order_id) %>% summarise(days_since_prior_order=max(days_since_prior_order))
ggplot(days_since_prior_order_data,aes(x=days_since_prior_order)) + geom_histogram(color="red", fill="#56B4E6" ,bins = 30) +  theme(axis.text.x = element_text(angle =0), axis.title.x=element_blank()) + ggtitle('Day Since Prior Order Histogram')
## Warning: Removed 206209 rows containing non-finite values (stat_bin).

rm(days_since_prior_order_data)


Relationship Add to Chart Order and Product Sales

The chart show that product between 50 and 100 add to chart order is the sold more.

#Sira halinde gösterim yapilsin
Product_number<-orders_full %>% filter(!is.na(product_name)) %>%  group_by(product_name) %>% summarise(Number_of_Product=n() , add_to_chart_order=max(add_to_cart_order)) %>% filter(Number_of_Product>=100) %>% arrange(desc(Number_of_Product)) 

ggplot(Product_number,aes(x=Number_of_Product,y=add_to_chart_order)) + geom_point(color="#56B4E6")

rm(Product_number)


Relationship Add to Chart Order and Day Since Prior Order

The chart show that as Day Since Prior Order increases, the number of product in order will increase.As it is seen below, In days between 1 and 5, there are few product in order.

#Sira halinde gösterim yapilsin
temp<-orders_full  %>%  group_by(product_name) %>% summarise(maximum_product=max(add_to_cart_order) , days_since_prior_order=max(days_since_prior_order))  %>% filter(days_since_prior_order>0)

ggplot(temp,aes(x=days_since_prior_order,y=maximum_product)) + geom_point(color="#56B4E6")
## Warning: Removed 1 rows containing missing values (geom_point).

rm(temp)

References

“The Instacart Online Grocery Shopping Dataset 2017”, Accessed from https://www.instacart.com/datasets/grocery-shopping-2017 on October 24,2017

“Isntacart Market Basket Analysis”, Accessed from https://www.kaggle.com/c/instacart-market-basket-analysis on October 25,2017