As group cleveR, we will work on the online order behaviours of Instacart customers with group members:
Instacart is a San Francisco based online shopping website that provide service for online grocery shopping orders and same-day delivery service from various local stores. Instacart has founded in San Francisco on June 2012 and operating in 39 states across United Satates of America today.
In this project, our work will be composed of 3 main areas.
1.Exploratory Analysis on the data
2.Segmentation
3.Recommendation
Our study will begin with exploratory data analysis. We will try to explain customer behaviours and pruchasing patterns.We will try to reach facts like “what time of the day customers order most”, “What is avarege basket size?”, “Which products ordered togather” etc.
After exploratory analysis, we will try to group customers according to their purchasing behaviours. We will exemine their orders according to product category, order time, order amount in order to find similarities and differences.
At the end, we will try to find what a customer most likely to buy in his/her next order and try to suggest them the products.
The dataset contains over 3 million grocery orders and consist of 6 csv files. One of them is a train dataset. Lets look at the row numbers and variables in each file:
products<-read.csv("data/products.csv")
Product has 4 attributes which are product_id, product_name, aisle_id, department_id
kable(head(products,5))
product_id | product_name | aisle_id | department_id |
---|---|---|---|
1 | Chocolate Sandwich Cookies | 61 | 19 |
2 | All-Seasons Salt | 104 | 13 |
3 | Robust Golden Unsweetened Oolong Tea | 94 | 7 |
4 | Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce | 38 | 1 |
5 | Green Chile Anytime Sauce | 5 | 13 |
aisles<-read.csv("data/aisles.csv")
Aisle has 4 attributes which are product_id, product_name, aisle_id, department_id
kable(head(aisles,5))
aisle_id | aisle |
---|---|
1 | prepared soups salads |
2 | specialty cheeses |
3 | energy granola bars |
4 | instant foods |
5 | marinades meat preparation |
departments<-read.csv("data/departments.csv")
Department has 4 attributes which are product_id, product_name, aisle_id, department_id
kable(head(departments,5))
department_id | department |
---|---|
1 | frozen |
2 | other |
3 | bakery |
4 | produce |
5 | alcohol |
orders<-read.csv("data/orders.csv")
Order has 4 attributes which are product_id, product_name, aisle_id, department_id
kable(head(orders,5))
order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order |
---|---|---|---|---|---|---|
2539329 | 1 | prior | 1 | 2 | 8 | NA |
2398795 | 1 | prior | 2 | 3 | 7 | 15 |
473747 | 1 | prior | 3 | 3 | 12 | 21 |
2254736 | 1 | prior | 4 | 4 | 7 | 29 |
431534 | 1 | prior | 5 | 4 | 15 | 28 |
op_train<-read.csv("data/order_products__train.csv")
Order product train has 4 attributes which are order_id, product_id, add_to_cart_order, reordered
kable(head(op_train,5))
order_id | product_id | add_to_cart_order | reordered |
---|---|---|---|
1 | 49302 | 1 | 1 |
1 | 11109 | 2 | 1 |
1 | 10246 | 3 | 0 |
1 | 49683 | 4 | 0 |
1 | 43633 | 5 | 1 |
op_prior<-read.csv("data/order_products__prior.csv")
Order product train has 4 attributes which are order_id, product_id, add_to_cart_order, reordered
kable(head(op_prior,5))
order_id | product_id | add_to_cart_order | reordered |
---|---|---|---|
2 | 33120 | 1 | 1 |
2 | 28985 | 2 | 1 |
2 | 9327 | 3 | 0 |
2 | 45918 | 4 | 1 |
2 | 30035 | 5 | 0 |
orders_product<-rbind(op_prior,op_train)
orders_full<-left_join(orders,orders_product,by="order_id")
orders_full<-left_join(orders_full,products,by="product_id")
orders_full<-left_join(orders_full,aisles,by="aisle_id")
orders_full<-left_join(orders_full,departments,by="department_id")
rm(orders_product, aisles,op_prior,op_train,orders,products,departments)
#orders_full <-orders_full %>% filter(row_number()<=1000000)
order_number_by_user_id<-orders_full %>% group_by(user_id) %>% summarise(number_of_order=n_distinct(order_id))
Data includes 3421083 order. There are 206209 users who placed on order. Minumum of order placed by the users is 4 . Maxiumum of order placed by the users is 100
#rm(order_number_by_user_id)
temp<- orders_full %>% group_by(department) %>% summarise(count=n()) %>% mutate(per=count/sum(count) ) %>% arrange(desc(per))
per<- temp %>% filter(row_number()<=4) %>% summarise(sum=round(sum(per),2)*100)
When we look at the number of products sold in each department, departmant produce and dairy eggs are obvious leaders in sales quantity. Produce departmant contain products such as vegatable and fruit in the form of fresh or package whereas dairy eggs contain yogurt, packaged cheese, milk and eggs, etc. Also we can see that 63 % of the sales are come from 4 departrments.
Product_number<-orders_full %>% filter(!is.na(department)) %>% group_by(department) %>% summarise(Number_of_Product=n()) %>% arrange(desc(Number_of_Product))
ggplot(Product_number,aes(x=reorder(department,-Number_of_Product),y=Number_of_Product)) +geom_bar(fill="#56B4E6", stat = "identity") + scale_x_discrete(name="Department") + theme( axis.text.x = element_text(angle =90)) + ggtitle('Number of Product Sold by Department')
rm(Product_number,per,temp)
Most ordered product’s aisles are looking in tandem with product’s departmant. Fresh vegatables and fruits aisles take the lead in sales quantity by far.
Product_number<-orders_full %>% filter(!is.na(aisle)) %>% group_by(aisle) %>% summarise(Number_of_Product=n()) %>% arrange(desc(Number_of_Product))
ggplot(Product_number %>% filter(row_number()<=15),aes(x=reorder(aisle,-Number_of_Product),y=Number_of_Product)) +geom_bar(position="dodge",fill="#56B4E6", stat = "identity") + theme(axis.text.x = element_text(angle =90)) + scale_x_discrete(name="Aisle")+ggtitle('Number of Product Sold by Aisle')
rm(Product_number)
If we put top 15 products on the chart, we can see that Bananas are the most ordered products followed by Bag of Organic Bananas and Organic Strawberries.
#Sira halinde gösterim yapilsin
Product_number<-orders_full %>% filter(!is.na(product_name)) %>% group_by(product_name) %>% summarise(Number_of_Product=n()) %>% arrange(desc(Number_of_Product))
ggplot(Product_number %>% filter(row_number()<=15),aes(x=reorder(product_name,-Number_of_Product),y=Number_of_Product)) +geom_bar(fill="#56B4E6", stat = "identity") +scale_x_discrete(name="Product") + theme(axis.text.x = element_text(angle =90)) + ggtitle('Number of Product Sold')
rm(Product_number)
It can be said that customers order at weekends more. (In the row data days described by number from 0 to 6. We assume that 0 equals Saturday)
#ls(orders_full)
orders_full<-orders_full %>% mutate(order_day=ifelse(order_dow==0,"Sat",
ifelse(order_dow==1,"Sun",
ifelse(order_dow==2,"Man",
ifelse(order_dow==3,"Tue",
ifelse(order_dow==4,"Wed",
ifelse(order_dow==5,"Thu",
ifelse(order_dow==6,"Fri",NA))))))))
order_number_by_day<-orders_full %>% group_by(order_day,order_dow) %>% summarise(number_of_order=n_distinct(order_id))
ggplot(order_number_by_day,aes(x=reorder(order_day,order_dow),y=number_of_order)) +geom_bar(fill="#56B4E6", stat = "identity")+ scale_x_discrete(name="Day") + ggtitle('Number of Order by Day of Week')
rm(order_number_by_day)
Most orders come during daytime, from 7 am to 7 pm, which is not very surprising.
#ls(orders_full)
#orders_full <- mutate(orders_full , order_day=derivedFactor('Sat'=order_dow==0,'Sun'=order_dow==1,'Man'=order_dow==2,'Tue'=order_dow==3,'Wed'=order_dow==4,'Thu'=order_dow==5,'Fri'=order_dow==6 ,.default = NA))
order_number_by_hour<-orders_full %>% group_by(order_hour_of_day) %>% summarise(number_of_order=n_distinct(order_id))
ggplot(order_number_by_hour,aes(x=order_hour_of_day,y=number_of_order)) +geom_bar(fill="#56B4E6", stat = "identity") + ggtitle('Number of Order by Department')
rm(order_number_by_hour)
People mostly add to cart less than 10 items when they order. As it clear in the grap, after 6 item per order there is negative correlation between number of items in the order and order number.
Max_Products_by_Order_Id<-orders_full %>% group_by(order_id) %>% summarise(max_product=max(add_to_cart_order)) %>% arrange(max_product)
ggplot(Max_Products_by_Order_Id,aes(x=max_product)) + geom_histogram(color="red", fill="#56B4E6" ,bins = 50) +
theme(axis.text.x = element_text(angle =0), axis.title.x=element_blank()) +
ggtitle('Add to Chart Order Histogram')
## Warning: Removed 75000 rows containing non-finite values (stat_bin).
rm(Max_Products_by_Order_Id)
People put those products first to cart if they buy them. Emergency contraceptive, Rethab Energy Iced Tea and California Champaigne around 80% of time added the cart first when they were bought.
first_add_product<-orders_full %>% group_by(product_name) %>% summarise(first_percentage=round(n_distinct(order_id[add_to_cart_order==1]) /n(),2),count_product=n_distinct(order_id[add_to_cart_order==1] )) %>% arrange(desc(first_percentage)) %>% filter(count_product>10)
ggplot(first_add_product%>%filter(row_number()<=15) ,aes(x=reorder(product_name,-first_percentage),y=first_percentage)) +geom_bar(fill="#56B4E6", stat = "identity")+ scale_x_discrete(name="Product Name") + ggtitle('First Order Percentage by Product') +theme(axis.text.x = element_text(angle =90)) +
rm(first_add_product)
People tend to order more in 7 or less days after their prior order
days_since_prior_order_data<-orders_full %>% group_by(order_id) %>% summarise(days_since_prior_order=max(days_since_prior_order))
ggplot(days_since_prior_order_data,aes(x=days_since_prior_order)) + geom_histogram(color="red", fill="#56B4E6" ,bins = 30) + theme(axis.text.x = element_text(angle =0), axis.title.x=element_blank()) + ggtitle('Day Since Prior Order Histogram')
## Warning: Removed 206209 rows containing non-finite values (stat_bin).
rm(days_since_prior_order_data)
The chart show that product between 50 and 100 add to chart order is the sold more.
#Sira halinde gösterim yapilsin
Product_number<-orders_full %>% filter(!is.na(product_name)) %>% group_by(product_name) %>% summarise(Number_of_Product=n() , add_to_chart_order=max(add_to_cart_order)) %>% filter(Number_of_Product>=100) %>% arrange(desc(Number_of_Product))
ggplot(Product_number,aes(x=Number_of_Product,y=add_to_chart_order)) + geom_point(color="#56B4E6")
rm(Product_number)
The chart show that as Day Since Prior Order increases, the number of product in order will increase.As it is seen below, In days between 1 and 5, there are few product in order.
#Sira halinde gösterim yapilsin
temp<-orders_full %>% group_by(product_name) %>% summarise(maximum_product=max(add_to_cart_order) , days_since_prior_order=max(days_since_prior_order)) %>% filter(days_since_prior_order>0)
ggplot(temp,aes(x=days_since_prior_order,y=maximum_product)) + geom_point(color="#56B4E6")
## Warning: Removed 1 rows containing missing values (geom_point).
rm(temp)
“The Instacart Online Grocery Shopping Dataset 2017”, Accessed from https://www.instacart.com/datasets/grocery-shopping-2017 on October 24,2017
“Isntacart Market Basket Analysis”, Accessed from https://www.kaggle.com/c/instacart-market-basket-analysis on October 25,2017