Exploratory Data Analysis

Dataset Overview

Let's first have a look at the provided datasets:

orders

This file lists all orders in the dataset, one row per order.

order_id user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order
2539329 1 prior 1 2 8 NA
2398795 1 prior 2 3 7 15
473747 1 prior 3 3 12 21
2254736 1 prior 4 4 7 29
431534 1 prior 5 4 15 28
Observations: 3,421,083
Variables: 7
$ order_id               <int> 2539329, 2398795, 473747, 2254736, 4315...
$ user_id                <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, ...
$ eval_set               <chr> "prior", "prior", "prior", "prior", "pr...
$ order_number           <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1, 2...
$ order_dow              <int> 2, 3, 3, 4, 4, 2, 1, 1, 1, 4, 4, 2, 5, ...
$ order_hour_of_day      <int> 8, 7, 12, 7, 15, 7, 9, 14, 16, 8, 8, 11...
$ days_since_prior_order <dbl> NA, 15, 21, 29, 28, 19, 20, 14, 0, 30, ...

order_products_train

This file records which products (product_id) were ordered in each order. It also gives the sequence (add_to_cart_order) in which the products were added to the cart and whether each product is a reorder (1) or not (0).

order_id product_id add_to_cart_order reordered
1 49302 1 1
1 11109 2 1
1 10246 3 0
1 49683 4 0
1 43633 5 1
Observations: 1,384,617
Variables: 4
$ order_id          <int> 1, 1, 1, 1, 1, 1, 1, 1, 36, 36, 36, 36, 36, ...
$ product_id        <int> 49302, 11109, 10246, 49683, 43633, 13176, 47...
$ add_to_cart_order <int> 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7,...
$ reordered         <int> 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1,...

products

This file contains the names of the products with their corresponding product_id.

product_id  product_name                                                       aisle_id  department_id
1           Chocolate Sandwich Cookies                                         61        19
2           All-Seasons Salt                                                   104       13
3           Robust Golden Unsweetened Oolong Tea                               94        7
4           Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce  38        1
5           Green Chile Anytime Sauce                                          5         13
Observations: 49,688
Variables: 4
$ product_id    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
$ product_name  <chr> "Chocolate Sandwich Cookies", "All-Seasons Salt"...
$ aisle_id      <int> 61, 104, 94, 38, 5, 11, 98, 116, 120, 115, 31, 1...
$ department_id <int> 19, 13, 7, 1, 13, 11, 7, 1, 16, 7, 7, 1, 11, 17,...

order_products_prior

This file has the same structure as order_products_train.csv.

order_id product_id add_to_cart_order reordered
2 33120 1 1
2 28985 2 1
2 9327 3 0
2 45918 4 1
2 30035 5 0
Observations: 32,434,489
Variables: 4
$ order_id          <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3,...
$ product_id        <int> 33120, 28985, 9327, 45918, 30035, 17794, 401...
$ add_to_cart_order <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6,...
$ reordered         <int> 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,...

aisles

This file contains the different aisles.

aisle_id  aisle
1         prepared soups salads
2         specialty cheeses
3         energy granola bars
4         instant foods
5         marinades meat preparation
Observations: 134
Variables: 2
$ aisle_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
$ aisle    <chr> "prepared soups salads", "specialty cheeses", "energy...

departments

department_id department
1 frozen
2 other
3 bakery
4 produce
5 alcohol
Observations: 21
Variables: 2
$ department_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
$ department    <chr> "frozen", "other", "bakery", "produce", "alcohol...
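The files link through shared keys: order_id ties orders to the two order_products files, product_id ties those to products, and aisle_id and department_id tie products to aisles and departments. A minimal Python sketch of that lookup follows, using only sample rows shown above. The dictionary layout and the describe_product helper are illustrative assumptions, not part of the original analysis.

```python
# Sample rows copied from the previews above; the real files are far larger.
products = {
    4: {"name": "Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce",
        "aisle_id": 38, "department_id": 1},
    5: {"name": "Green Chile Anytime Sauce", "aisle_id": 5, "department_id": 13},
}
aisles = {
    1: "prepared soups salads",
    2: "specialty cheeses",
    3: "energy granola bars",
    4: "instant foods",
    5: "marinades meat preparation",
}
departments = {1: "frozen", 2: "other", 3: "bakery", 4: "produce", 5: "alcohol"}


def describe_product(product_id):
    """Hypothetical helper: resolve a product's aisle and department names."""
    product = products[product_id]
    return {
        "product_name": product["name"],
        # .get returns None when the id is not among the sample rows above
        "aisle": aisles.get(product["aisle_id"]),
        "department": departments.get(product["department_id"]),
    }


print(describe_product(5))
print(describe_product(4))
```

With only the first five aisles and departments loaded, some lookups come back as None; on the full files every id should resolve.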

Order Frequency

When do people buy groceries online?

Hour of Day

Most orders are placed between 9:00 AM and 6:00 PM.
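A rough sketch of how such a count can be produced from order_hour_of_day, written in Python with only the standard library. The input is the handful of values visible in the orders preview, so it illustrates the tally and does not reproduce the chart. Treating 6:00 PM as an exclusive upper bound is an assumption.

```python
from collections import Counter

# First eleven order_hour_of_day values from the orders preview above
# (all from user_id 1); far too few rows to show the real distribution.
order_hours = [8, 7, 12, 7, 15, 7, 9, 14, 16, 8, 8]

orders_per_hour = Counter(order_hours)

# Orders placed from 9:00 up to, but not including, 18:00.
daytime_orders = sum(n for hour, n in orders_per_hour.items() if 9 <= hour < 18)

for hour in sorted(orders_per_hour):
    print(f"{hour:02d}:00  {orders_per_hour[hour]}")
print("share between 9:00 and 18:00:", daytime_orders / len(order_hours))
```

The same tally over order_dow gives the day-of-week view.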

Day of Week

Sunday and Monday are the days on which people order most on Instacart.

Re-order analysis

We find two categories of customers: one that reorders monthly and another that reorders weekly. This is based on the peaks at day 30 and day 7 in the distribution of days since the prior order.
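The peak reasoning can be sketched as follows: count how many orders fall on each value of days_since_prior_order, then flag the days whose count exceeds both neighbours. The toy_days values are invented purely to make the sketch runnable and are not taken from the dataset. The local_peaks helper is likewise hypothetical.

```python
from collections import Counter


def local_peaks(days):
    """Return the days whose order count is higher than both neighbouring days."""
    counts = Counter(d for d in days if d is not None)  # skip NA (first orders)
    peaks = []
    # Day 0 is skipped because it has no left neighbour.
    for day in range(1, max(counts) + 1):
        # A Counter returns 0 for days that never occur, including the day
        # just past the largest observed value.
        if counts[day] > counts[day - 1] and counts[day] > counts[day + 1]:
            peaks.append(day)
    return peaks


# Invented toy values, NOT real data: a bump around day 7 and another at day 30.
toy_days = ([None] + [6] * 2 + [7] * 5 + [8] * 2
            + [13, 14, 15] + [29] + [30] * 4)

print(local_peaks(toy_days))
```

On real data a smoothed histogram or a minimum-height threshold would probably be needed, since small random bumps also count as local peaks under this rule.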

Department

Which departments are purchased from most often on each day of the week? (Top 5 departments)

What are the top 10 departments purchased from in a customer's first order?

Items

How many items does a customer order contain?
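Because the order_products files hold one row per product in an order, the basket size is the number of rows sharing an order_id. A small Python sketch, assuming nothing beyond the order_id values visible in the previews above:

```python
from collections import Counter

# order_id values as they appear in the previews above: order 1 comes from
# order_products_train (8 rows), order 2 from order_products_prior (9 rows).
order_ids = [1] * 8 + [2] * 9

# One row per product, so counting rows per order_id gives the basket size.
items_per_order = Counter(order_ids)

# How many orders have each basket size (the quantity a histogram would show).
orders_per_basket_size = Counter(items_per_order.values())

print(dict(items_per_order))
print(dict(orders_per_basket_size))
```

The largest add_to_cart_order within an order gives the same number, which is a convenient cross-check.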

Items sold most often (top 10)

Banana is the most purchased item, followed by Organic Strawberries and Baby Spinach.

Product Portfolio

Visualizing the structure of Instacart's product portfolio

Market Basket Analysis

Rules: For Products

Rules generated with Support=0.1% and Confidence=50%

lhs rhs support confidence lift
{Organic Hass Avocado,Organic Raspberries,Organic Strawberries} {Bag of Organic Bananas} 0.0017377 0.5984252 5.072272
{Organic Cucumber,Organic Hass Avocado,Organic Strawberries} {Bag of Organic Bananas} 0.0010670 0.5468750 4.635331
{Organic Hass Avocado,Organic Kiwi} {Bag of Organic Bananas} 0.0014481 0.5459770 4.627720
{Organic Navel Orange,Organic Raspberries} {Bag of Organic Bananas} 0.0011508 0.5412186 4.587387
{Organic Hass Avocado,Organic Whole String Cheese} {Bag of Organic Bananas} 0.0011585 0.5314685 4.504745
{Organic Hass Avocado,Organic Navel Orange} {Bag of Organic Bananas} 0.0014938 0.5283019 4.477905
{Organic Hass Avocado,Organic Raspberries} {Bag of Organic Bananas} 0.0040470 0.5210991 4.416854
{Organic D’Anjou Pears,Organic Hass Avocado} {Bag of Organic Bananas} 0.0013871 0.5170455 4.382495
{Organic Hass Avocado,Organic Unsweetened Almond Milk} {Bag of Organic Bananas} 0.0012499 0.5141066 4.357585
{Organic Broccoli,Organic Hass Avocado} {Bag of Organic Bananas} 0.0011966 0.5348232 3.758898
Observations: 11
Variables: 5
$ lhs        <chr> "{Organic Hass Avocado,Organic Raspberries,Organic ...
$ rhs        <chr> "{Bag of Organic Bananas}", "{Bag of Organic Banana...
$ support    <dbl> 0.001737686, 0.001067000, 0.001448071, 0.001150836,...
$ confidence <dbl> 0.5984252, 0.5468750, 0.5459770, 0.5412186, 0.53146...
$ lift       <dbl> 5.072272, 4.635331, 4.627719, 4.587387, 4.504745, 4...
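For readers new to these columns: support is the share of orders containing every item of a rule, confidence is the share of orders with the lhs that also contain the rhs, and lift is confidence divided by the support of the rhs alone. The Python sketch below computes all three on a few made-up baskets. The basket contents and helper names are illustrative assumptions, and this is not the code that produced the rules above.

```python
def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def rule_metrics(lhs, rhs, baskets):
    """Support, confidence and lift of the rule lhs => rhs."""
    supp = support(lhs | rhs, baskets)        # lhs and rhs bought together
    conf = supp / support(lhs, baskets)       # how often rhs follows from lhs
    lift = conf / support(rhs, baskets)       # confidence relative to rhs's base rate
    return supp, conf, lift


# Made-up baskets for illustration only.
toy_baskets = [
    {"avocado", "raspberries", "bananas"},
    {"avocado", "raspberries"},
    {"milk"},
    {"bread"},
    {"bananas"},
]

supp, conf, lift = rule_metrics({"avocado", "raspberries"}, {"bananas"}, toy_baskets)
print(supp, conf, lift)

# The same relation read backwards on the first rule in the table:
# confidence / lift recovers the support of the rhs on its own.
implied_rhs_support = 0.5984252 / 5.072272
print(round(implied_rhs_support, 3))
```

Read this way, the first rule implies that Bag of Organic Bananas appears in roughly 11.8% of the orders the rules were mined from.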

Rules: Scatter Plot (For Products)

Rules: Network Graph Visual (For Products)

Rules: For Aisles

Rules generated with Support=0.7% and Confidence=40%

lhs rhs support confidence lift
{cereal,lunch meat} {bread} 0.0076595 0.4574420 2.793210
{chips pretzels,lunch meat,packaged cheese} {bread} 0.0073699 0.4472710 2.731105
{lunch meat,milk,yogurt} {bread} 0.0080406 0.4429051 2.704446
{lunch meat,milk,packaged cheese} {bread} 0.0084826 0.4427208 2.703320
{packaged cheese,preserved dips spreads} {chips pretzels} 0.0072708 0.4760479 2.694408
{cookies cakes,crackers} {chips pretzels} 0.0077129 0.4755639 2.691669
{eggs,fresh fruits,milk,packaged cheese} {bread} 0.0071413 0.4337963 2.648826
{lunch meat,pasta sauce} {packaged cheese} 0.0080254 0.6290323 2.645427
{fresh dips tapenades,ice cream ice} {chips pretzels} 0.0073699 0.4662488 2.638946
{fresh fruits,lunch meat,packaged cheese,yogurt} {bread} 0.0078348 0.4321143 2.638556
Observations: 10,022
Variables: 5
$ lhs        <chr> "{cereal,lunch meat}", "{chips pretzels,lunch meat,...
$ rhs        <chr> "{bread}", "{bread}", "{bread}", "{bread}", "{chips...
$ support    <dbl> 0.007659536, 0.007369921, 0.008040607, 0.008482650,...
$ confidence <dbl> 0.4574420, 0.4472710, 0.4429051, 0.4427208, 0.47604...
$ lift       <dbl> 2.793210, 2.731105, 2.704446, 2.703320, 2.694408, 2...

Rules: Scatter Plot (For Aisles)

Rules: Network Graph Visual (For Aisles)

Recommendations by Implicit Feedback (ALS)

User ID : 1

User ID : 38

User ID : 796

User ID : 1000

User ID : 421