Market Basket Analysis¶
This notebook performs market basket analysis to identify products frequently purchased together, revealing shopping patterns and cross-selling opportunities for strategic merchandising and promotional planning.
Setup and Data Loading¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from itertools import combinations
from collections import Counter
from completejourney_py import get_data
# Set plotting style
plt.style.use('default')
sns.set_palette("viridis")
# Set display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
# Load required datasets
data = get_data(["transactions", "products"])
transactions = data["transactions"]
products = data["products"]
print(f"Loaded {len(transactions):,} transaction records")
print(f"Loaded {len(products):,} product records")
# Show basket-level summary
basket_summary = (
    transactions
    .groupby('basket_id')
    .agg({
        'product_id': 'nunique',
        'sales_value': 'sum',
        'quantity': 'sum'
    })
    .rename(columns={'product_id': 'unique_products'})
)
print(f"\nBasket Analysis Overview:")
print(f"- Total baskets: {len(basket_summary):,}")
print(f"- Average products per basket: {basket_summary['unique_products'].mean():.1f}")
print(f"- Median products per basket: {basket_summary['unique_products'].median():.1f}")
print(f"- Max products in single basket: {basket_summary['unique_products'].max()}")
print(f"- Average basket value: ${basket_summary['sales_value'].mean():.2f}")
Loaded 1,469,307 transaction records
Loaded 92,331 product records

Basket Analysis Overview:
- Total baskets: 155,848
- Average products per basket: 9.4
- Median products per basket: 5.0
- Max products in single basket: 161
- Average basket value: $29.49
Data Preparation: Create Product Categories for Analysis¶
# Merge transactions with product information
basket_data = (
    transactions
    .merge(products[['product_id', 'product_category', 'department', 'brand']],
           on='product_id', how='left')
)
print(f"Combined dataset: {len(basket_data):,} transaction records")
print(f"- Unique product categories: {basket_data['product_category'].nunique():,}")
print(f"- Unique departments: {basket_data['department'].nunique()}")
# Filter out coupon/misc items for meaningful basket analysis
excluded_categories = [
    'COUPON/MISC ITEMS',
    'MISCELLANEOUS',
    'COUPON',
    'MISC ITEMS',
    'PHARMACEUTICAL',
    'RX CLAIMS'
]
basket_data_clean = basket_data[
    ~basket_data['product_category'].isin(excluded_categories)
].copy()
print(f"\nAfter filtering out coupons/misc: {len(basket_data_clean):,} transactions")
print(f"- Unique product categories: {basket_data_clean['product_category'].nunique():,}")
# Show distribution of basket sizes
clean_basket_summary = (
    basket_data_clean
    .groupby('basket_id')
    .agg({
        'product_category': 'nunique',
        'sales_value': 'sum'
    })
    .rename(columns={'product_category': 'unique_categories'})
)
print(f"\nClean Basket Statistics:")
print(f"- Average categories per basket: {clean_basket_summary['unique_categories'].mean():.1f}")
print(f"- Baskets with 2+ categories: {(clean_basket_summary['unique_categories'] >= 2).sum():,} ({(clean_basket_summary['unique_categories'] >= 2).mean()*100:.1f}%)")
print(f"- Baskets with 5+ categories: {(clean_basket_summary['unique_categories'] >= 5).sum():,} ({(clean_basket_summary['unique_categories'] >= 5).mean()*100:.1f}%)")
Combined dataset: 1,469,307 transaction records
- Unique product categories: 302
- Unique departments: 30

After filtering out coupons/misc: 1,451,904 transactions
- Unique product categories: 299

Clean Basket Statistics:
- Average categories per basket: 7.4
- Baskets with 2+ categories: 116,668 (81.8%)
- Baskets with 5+ categories: 69,258 (48.5%)
Approach 1: Category-Level Basket Analysis¶
# Create basket-category matrix for category-level analysis
basket_categories = (
    basket_data_clean
    .groupby(['basket_id', 'product_category'])
    .size()
    .reset_index(name='quantity')
    .assign(purchased=1)
    .pivot_table(index='basket_id', columns='product_category', values='purchased', fill_value=0)
)
print(f"Basket-Category Matrix: {basket_categories.shape[0]:,} baskets × {basket_categories.shape[1]} categories")
# Filter to baskets with multiple categories for meaningful associations
multi_category_baskets = basket_categories[basket_categories.sum(axis=1) >= 2]
print(f"Baskets with 2+ categories for analysis: {len(multi_category_baskets):,}")
# Calculate category support (frequency)
category_support = (multi_category_baskets.sum() / len(multi_category_baskets) * 100).sort_values(ascending=False)
print(f"\nTop 15 Most Popular Categories (% of multi-category baskets):")
for i, (category, support) in enumerate(category_support.head(15).items()):
    print(f"{i+1:2d}. {category}: {support:.1f}% ({int(support * len(multi_category_baskets) / 100):,} baskets)")
Basket-Category Matrix: 142,286 baskets × 299 categories
Baskets with 2+ categories for analysis: 116,668

Top 15 Most Popular Categories (% of multi-category baskets):
 1. FLUID MILK PRODUCTS: 32.7% (38,189 baskets)
 2. SOFT DRINKS: 31.7% (37,035 baskets)
 3. BAKED BREAD/BUNS/ROLLS: 28.8% (33,641 baskets)
 4. CHEESE: 22.7% (26,441 baskets)
 5. BAG SNACKS: 20.1% (23,485 baskets)
 6. BEEF: 17.5% (20,404 baskets)
 7. TROPICAL FRUIT: 15.8% (18,448 baskets)
 8. EGGS: 13.4% (15,576 baskets)
 9. COLD CEREAL: 12.1% (14,112 baskets)
10. REFRGRATD JUICES/DRNKS: 12.1% (14,094 baskets)
11. LUNCHMEAT: 11.2% (13,105 baskets)
12. SOUP: 11.2% (13,072 baskets)
13. CRACKERS/MISC BKD FD: 10.9% (12,714 baskets)
14. FRZN MEAT/MEAT DINNERS: 10.8% (12,608 baskets)
15. VEGETABLES - SHELF STABLE: 10.8% (12,551 baskets)
# Find category pairs with highest co-occurrence
def calculate_category_pairs(basket_matrix, min_support=0.01):
    """Calculate support, confidence, and lift for all ordered category pairs."""
    # Keep only categories that meet the minimum support threshold
    frequent_categories = (basket_matrix.sum() / len(basket_matrix)) >= min_support
    basket_frequent = basket_matrix.loc[:, frequent_categories]
    n_baskets = len(basket_frequent)
    # Calculate pairwise associations in both directions (confidence is asymmetric)
    associations = []
    for cat1 in basket_frequent.columns:
        for cat2 in basket_frequent.columns:
            if cat1 != cat2:
                # Support calculations
                support_cat1 = basket_frequent[cat1].sum() / n_baskets
                support_cat2 = basket_frequent[cat2].sum() / n_baskets
                count_both = ((basket_frequent[cat1] == 1) & (basket_frequent[cat2] == 1)).sum()
                support_both = count_both / n_baskets
                if count_both > 0:
                    # Confidence: P(cat2 | cat1)
                    confidence = support_both / support_cat1
                    # Lift: support(cat1, cat2) / (support(cat1) * support(cat2))
                    lift = support_both / (support_cat1 * support_cat2)
                    associations.append({
                        'category_1': cat1,
                        'category_2': cat2,
                        'support': support_both,
                        'confidence': confidence,
                        'lift': lift,
                        'basket_count': int(count_both)  # exact count, avoids float rounding
                    })
    return pd.DataFrame(associations)
# Calculate category associations
category_associations = calculate_category_pairs(multi_category_baskets, min_support=0.05)
print(f"Total category pair associations found: {len(category_associations):,}")
print(f"\nTop 10 Category Pairs by Support (frequency of co-occurrence):")
print("=" * 80)
top_support = category_associations.nlargest(10, 'support')
for n, (_, row) in enumerate(top_support.iterrows(), start=1):
    print(f"{n:2d}. {row['category_1']} → {row['category_2']}")
    print(f" Support: {row['support']:.1%} | Confidence: {row['confidence']:.1%} | Lift: {row['lift']:.2f} | Baskets: {row['basket_count']:,}")
print(f"\nTop 10 Category Pairs by Lift (strength of association):")
print("=" * 80)
top_lift = category_associations.nlargest(10, 'lift')
for n, (_, row) in enumerate(top_lift.iterrows(), start=1):
    print(f"{n:2d}. {row['category_1']} → {row['category_2']}")
    print(f" Support: {row['support']:.1%} | Confidence: {row['confidence']:.1%} | Lift: {row['lift']:.2f} | Baskets: {row['basket_count']:,}")
Total category pair associations found: 3,422
Top 10 Category Pairs by Support (frequency of co-occurrence):
================================================================================
1. BAKED BREAD/BUNS/ROLLS → FLUID MILK PRODUCTS
Support: 15.0% | Confidence: 52.2% | Lift: 1.59 | Baskets: 17,554
2. FLUID MILK PRODUCTS → BAKED BREAD/BUNS/ROLLS
Support: 15.0% | Confidence: 46.0% | Lift: 1.59 | Baskets: 17,554
3. FLUID MILK PRODUCTS → SOFT DRINKS
Support: 12.2% | Confidence: 37.3% | Lift: 1.17 | Baskets: 14,234
4. SOFT DRINKS → FLUID MILK PRODUCTS
Support: 12.2% | Confidence: 38.4% | Lift: 1.17 | Baskets: 14,234
5. CHEESE → FLUID MILK PRODUCTS
Support: 12.0% | Confidence: 53.0% | Lift: 1.62 | Baskets: 14,017
6. FLUID MILK PRODUCTS → CHEESE
Support: 12.0% | Confidence: 36.7% | Lift: 1.62 | Baskets: 14,017
7. BAKED BREAD/BUNS/ROLLS → SOFT DRINKS
Support: 11.7% | Confidence: 40.7% | Lift: 1.28 | Baskets: 13,703
8. SOFT DRINKS → BAKED BREAD/BUNS/ROLLS
Support: 11.7% | Confidence: 37.0% | Lift: 1.28 | Baskets: 13,703
9. BAKED BREAD/BUNS/ROLLS → CHEESE
Support: 11.7% | Confidence: 40.5% | Lift: 1.79 | Baskets: 13,614
10. CHEESE → BAKED BREAD/BUNS/ROLLS
Support: 11.7% | Confidence: 51.5% | Lift: 1.79 | Baskets: 13,614
Top 10 Category Pairs by Lift (strength of association):
================================================================================
1. DRY NOODLES/PASTA → PASTA SAUCE
Support: 3.0% | Confidence: 45.4% | Lift: 8.23 | Baskets: 3,528
2. PASTA SAUCE → DRY NOODLES/PASTA
Support: 3.0% | Confidence: 54.8% | Lift: 8.23 | Baskets: 3,528
3. CHEESES → DELI MEATS
Support: 3.5% | Confidence: 63.3% | Lift: 7.78 | Baskets: 4,059
4. DELI MEATS → CHEESES
Support: 3.5% | Confidence: 42.8% | Lift: 7.78 | Baskets: 4,059
5. BATH TISSUES → PAPER TOWELS
Support: 1.8% | Confidence: 27.9% | Lift: 5.57 | Baskets: 2,124
6. PAPER TOWELS → BATH TISSUES
Support: 1.8% | Confidence: 36.3% | Lift: 5.57 | Baskets: 2,124
7. TOMATOES → VEGETABLES SALAD
Support: 2.0% | Confidence: 25.9% | Lift: 5.11 | Baskets: 2,317
8. VEGETABLES SALAD → TOMATOES
Support: 2.0% | Confidence: 39.1% | Lift: 5.11 | Baskets: 2,317
9. BACON → BREAKFAST SAUSAGE/SANDWICHES
Support: 1.2% | Confidence: 23.8% | Lift: 4.62 | Baskets: 1,442
10. BREAKFAST SAUSAGE/SANDWICHES → BACON
Support: 1.2% | Confidence: 24.0% | Lift: 4.62 | Baskets: 1,442
Approach 2: Department-Level Basket Analysis¶
# Analyze basket patterns at department level for broader insights
basket_departments = (
    basket_data_clean
    .groupby(['basket_id', 'department'])
    .size()
    .reset_index(name='items')
    .assign(purchased=1)
    .pivot_table(index='basket_id', columns='department', values='purchased', fill_value=0)
)
print(f"Department Analysis: {basket_departments.shape[0]:,} baskets × {basket_departments.shape[1]} departments")
# Filter to multi-department baskets
multi_dept_baskets = basket_departments[basket_departments.sum(axis=1) >= 2]
print(f"Baskets with 2+ departments: {len(multi_dept_baskets):,} ({len(multi_dept_baskets)/len(basket_departments)*100:.1f}%)")
# Department popularity
dept_support = (multi_dept_baskets.sum() / len(multi_dept_baskets) * 100).sort_values(ascending=False)
print(f"\nTop 10 Departments in Multi-Department Baskets:")
for i, (dept, support) in enumerate(dept_support.head(10).items()):
    print(f"{i+1:2d}. {dept}: {support:.1f}% ({int(support * len(multi_dept_baskets) / 100):,} baskets)")
# Find strongest department associations
dept_associations = calculate_category_pairs(multi_dept_baskets, min_support=0.10)
print(f"\nTop 10 Department Pairs by Lift:")
print("=" * 70)
top_dept_lift = dept_associations.nlargest(10, 'lift')
for n, (_, row) in enumerate(top_dept_lift.iterrows(), start=1):
    print(f"{n:2d}. {row['category_1']} + {row['category_2']}")
    print(f" Support: {row['support']:.1%} | Confidence: {row['confidence']:.1%} | Lift: {row['lift']:.2f}")
Department Analysis: 142,320 baskets × 27 departments
Baskets with 2+ departments: 98,381 (69.1%)
Top 10 Departments in Multi-Department Baskets:
1. GROCERY: 95.4% (93,831 baskets)
2. DRUG GM: 57.8% (56,881 baskets)
3. PRODUCE: 50.4% (49,563 baskets)
4. MEAT-PCKGD: 32.7% (32,135 baskets)
5. MEAT: 30.1% (29,580 baskets)
6. DELI: 19.3% (19,011 baskets)
7. PASTRY: 16.5% (16,270 baskets)
8. NUTRITION: 9.6% (9,468 baskets)
9. SEAFOOD-PCKGD: 5.1% (5,019 baskets)
10. SALAD BAR: 4.3% (4,244 baskets)
Top 10 Department Pairs by Lift:
======================================================================
1. MEAT + MEAT-PCKGD
Support: 14.6% | Confidence: 48.7% | Lift: 1.49
2. MEAT-PCKGD + MEAT
Support: 14.6% | Confidence: 44.8% | Lift: 1.49
3. DELI + PASTRY
Support: 4.7% | Confidence: 24.3% | Lift: 1.47
4. PASTRY + DELI
Support: 4.7% | Confidence: 28.5% | Lift: 1.47
5. MEAT + PRODUCE
Support: 20.5% | Confidence: 68.1% | Lift: 1.35
6. PRODUCE + MEAT
Support: 20.5% | Confidence: 40.7% | Lift: 1.35
7. DELI + PRODUCE
Support: 12.0% | Confidence: 62.2% | Lift: 1.24
8. PRODUCE + DELI
Support: 12.0% | Confidence: 23.9% | Lift: 1.24
9. MEAT-PCKGD + PRODUCE
Support: 20.0% | Confidence: 61.1% | Lift: 1.21
10. PRODUCE + MEAT-PCKGD
Support: 20.0% | Confidence: 39.6% | Lift: 1.21
Approach 3: Individual Product Basket Analysis¶
# Analyze specific product combinations for detailed insights
# Focus on products with reasonable frequency to avoid sparse data
# Find top products by basket frequency
product_basket_freq = (
    basket_data_clean
    .groupby('product_id')
    .agg({
        'basket_id': 'nunique',
        'product_category': 'first',
        'department': 'first',
        'brand': 'first'
    })
    .rename(columns={'basket_id': 'basket_count'})
    .sort_values('basket_count', ascending=False)
)
# Filter to products appearing in at least 100 baskets for meaningful analysis
frequent_products = product_basket_freq[product_basket_freq['basket_count'] >= 100]
print(f"Products appearing in 100+ baskets: {len(frequent_products):,}")
print(f"\nTop 15 Most Frequent Products in Baskets:")
for i, (product_id, row) in enumerate(frequent_products.head(15).iterrows()):
    print(f"{i+1:2d}. Product {product_id} ({row['product_category']}) - {row['department']}: {row['basket_count']:,} baskets")
# Create product co-occurrence matrix for top products
top_products = frequent_products.head(50).index.tolist()
basket_products = (
    basket_data_clean[basket_data_clean['product_id'].isin(top_products)]
    .groupby(['basket_id', 'product_id'])
    .size()
    .reset_index(name='quantity')
    .assign(purchased=1)
    .pivot_table(index='basket_id', columns='product_id', values='purchased', fill_value=0)
)
print(f"\nProduct co-occurrence matrix: {basket_products.shape[0]:,} baskets × {basket_products.shape[1]} products")
# Find top product pairs
product_pairs = []
for i, prod1 in enumerate(basket_products.columns):
    for j, prod2 in enumerate(basket_products.columns):
        if i < j:  # Avoid duplicates and self-pairs
            # Calculate co-occurrence
            both_purchased = ((basket_products[prod1] == 1) & (basket_products[prod2] == 1)).sum()
            prod1_purchased = (basket_products[prod1] == 1).sum()
            prod2_purchased = (basket_products[prod2] == 1).sum()
            if both_purchased >= 10:  # Minimum co-occurrence threshold
                support = both_purchased / len(basket_products)
                confidence_1_to_2 = both_purchased / prod1_purchased
                confidence_2_to_1 = both_purchased / prod2_purchased
                # Get product details
                prod1_details = frequent_products.loc[prod1]
                prod2_details = frequent_products.loc[prod2]
                product_pairs.append({
                    'product_1': prod1,
                    'product_2': prod2,
                    'product_1_category': prod1_details['product_category'],
                    'product_2_category': prod2_details['product_category'],
                    'co_occurrence': both_purchased,
                    'support': support,
                    'confidence_1_to_2': confidence_1_to_2,
                    'confidence_2_to_1': confidence_2_to_1
                })
product_pairs_df = pd.DataFrame(product_pairs).sort_values('co_occurrence', ascending=False)
print(f"\nTop 10 Product Pairs by Co-occurrence:")
print("=" * 90)
for n, (_, row) in enumerate(product_pairs_df.head(10).iterrows(), start=1):
    print(f"{n:2d}. Product {row['product_1']} ({row['product_1_category']}) + Product {row['product_2']} ({row['product_2_category']})")
    print(f" Co-occurrence: {row['co_occurrence']} baskets | Support: {row['support']:.2%} | Confidence: {row['confidence_1_to_2']:.1%} / {row['confidence_2_to_1']:.1%}")
Products appearing in 100+ baskets: 2,766
Top 15 Most Frequent Products in Baskets:
1. Product 1082185 (TROPICAL FRUIT) - PRODUCE: 16,992 baskets
2. Product 1029743 (FLUID MILK PRODUCTS) - GROCERY: 7,874 baskets
3. Product 995242 (FLUID MILK PRODUCTS) - GROCERY: 7,441 baskets
4. Product 1106523 (FLUID MILK PRODUCTS) - GROCERY: 5,424 baskets
5. Product 981760 (EGGS) - GROCERY: 5,181 baskets
6. Product 1133018 (FLUID MILK PRODUCTS) - GROCERY: 4,231 baskets
7. Product 883404 (BAKED BREAD/BUNS/ROLLS) - GROCERY: 3,800 baskets
8. Product 1127831 (BERRIES) - PRODUCE: 3,551 baskets
9. Product 951590 (BAKED BREAD/BUNS/ROLLS) - GROCERY: 3,291 baskets
10. Product 826249 (BAKED BREAD/BUNS/ROLLS) - GROCERY: 3,192 baskets
11. Product 840361 (EGGS) - GROCERY: 3,039 baskets
12. Product 908531 (FLUID MILK PRODUCTS) - GROCERY: 3,001 baskets
13. Product 5978656 (None) - None: 2,932 baskets
14. Product 995785 (PEPPERS-ALL) - PRODUCE: 2,925 baskets
15. Product 1098066 (BAKED BREAD/BUNS/ROLLS) - GROCERY: 2,874 baskets
Product co-occurrence matrix: 70,521 baskets × 50 products
Top 10 Product Pairs by Co-occurrence:
==========================================================================================
1. Product 1029743 (FLUID MILK PRODUCTS) + Product 1082185 (TROPICAL FRUIT)
Co-occurrence: 1909 baskets | Support: 2.71% | Confidence: 24.2% / 11.2%
2. Product 995242 (FLUID MILK PRODUCTS) + Product 1082185 (TROPICAL FRUIT)
Co-occurrence: 1800 baskets | Support: 2.55% | Confidence: 24.2% / 10.6%
3. Product 1082185 (TROPICAL FRUIT) + Product 1127831 (BERRIES)
Co-occurrence: 1542 baskets | Support: 2.19% | Confidence: 9.1% / 43.4%
4. Product 981760 (EGGS) + Product 1082185 (TROPICAL FRUIT)
Co-occurrence: 1461 baskets | Support: 2.07% | Confidence: 28.2% / 8.6%
5. Product 1082185 (TROPICAL FRUIT) + Product 1106523 (FLUID MILK PRODUCTS)
Co-occurrence: 1147 baskets | Support: 1.63% | Confidence: 6.8% / 21.1%
6. Product 961554 (CARROTS) + Product 1082185 (TROPICAL FRUIT)
Co-occurrence: 1057 baskets | Support: 1.50% | Confidence: 39.0% / 6.2%
7. Product 860776 (VEGETABLES - ALL OTHERS) + Product 1082185 (TROPICAL FRUIT)
Co-occurrence: 1043 baskets | Support: 1.48% | Confidence: 36.8% / 6.1%
8. Product 1082185 (TROPICAL FRUIT) + Product 1126899 (FLUID MILK PRODUCTS)
Co-occurrence: 1008 baskets | Support: 1.43% | Confidence: 5.9% / 39.1%
9. Product 866211 (GRAPES) + Product 1082185 (TROPICAL FRUIT)
Co-occurrence: 961 baskets | Support: 1.36% | Confidence: 44.4% / 5.7%
10. Product 951590 (BAKED BREAD/BUNS/ROLLS) + Product 1082185 (TROPICAL FRUIT)
Co-occurrence: 953 baskets | Support: 1.35% | Confidence: 29.0% / 5.6%
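The product-pair loop reports support and both confidence directions but omits lift. Lift can be recovered after the fact from quantities the loop already touches; in the hedged sketch below, the `prod1_count`/`prod2_count` columns are hypothetical additions to `product_pairs_df` (the per-product basket counts), and the demo rows are synthetic:

```python
import pandas as pd

# Sketch: add a lift column to a product-pair table like product_pairs_df.
# Assumes each row also carries per-product basket counts (prod1_count,
# prod2_count); these columns and the demo rows are illustrative only.
def add_lift(pairs: pd.DataFrame, n_baskets: int) -> pd.DataFrame:
    """lift = P(both) / (P(prod1) * P(prod2)), recovered from counts."""
    out = pairs.copy()
    p1 = out['prod1_count'] / n_baskets
    p2 = out['prod2_count'] / n_baskets
    p_both = out['co_occurrence'] / n_baskets
    out['lift'] = p_both / (p1 * p2)
    return out

demo = pd.DataFrame({
    'co_occurrence': [1909, 953],
    'prod1_count': [7874, 3291],
    'prod2_count': [16992, 16992],
})
result = add_lift(demo, n_baskets=70_521)
print(result['lift'].round(2).tolist())
```

Adding lift would distinguish pairs that merely combine two popular products (lift near 1) from genuinely associated pairs.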
Visualization: Basket Analysis Insights¶
# Create comprehensive visualizations
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(20, 14))
# 1. Distribution of basket sizes (number of categories)
clean_basket_summary['unique_categories'].hist(bins=range(1, 21), alpha=0.7, edgecolor='black', ax=ax1)
ax1.set_xlabel('Number of Categories per Basket')
ax1.set_ylabel('Number of Baskets')
ax1.set_title('Distribution of Basket Sizes (Categories)', fontsize=14, fontweight='bold')
ax1.axvline(clean_basket_summary['unique_categories'].mean(), color='red', linestyle='--',
            label=f'Mean: {clean_basket_summary["unique_categories"].mean():.1f}')
ax1.legend()
ax1.grid(True, alpha=0.3)
# 2. Top category supports
category_support.head(12).sort_values().plot(kind='barh', ax=ax2, alpha=0.8)
ax2.set_xlabel('Support (% of Multi-Category Baskets)')
ax2.set_title('Top 12 Product Categories by Support', fontsize=14, fontweight='bold')
ax2.grid(True, alpha=0.3)
# 3. Top department supports
dept_support.head(10).plot(kind='bar', ax=ax3, alpha=0.8, color='green')
ax3.set_xlabel('Department')
ax3.set_ylabel('Support (% of Multi-Department Baskets)')
ax3.set_title('Top 10 Departments by Support', fontsize=14, fontweight='bold')
ax3.tick_params(axis='x', rotation=45)
ax3.grid(True, alpha=0.3)
# 4. Lift vs Support scatter for category pairs
if len(category_associations) > 0:
    ax4.scatter(category_associations['support'], category_associations['lift'],
                alpha=0.6, s=50, c='purple')
    ax4.set_xlabel('Support')
    ax4.set_ylabel('Lift')
    ax4.set_title('Category Association: Support vs Lift', fontsize=14, fontweight='bold')
    ax4.axhline(y=1, color='red', linestyle='--', alpha=0.5, label='Lift = 1 (No Association)')
    ax4.legend()
    ax4.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Summary statistics
print(f"\nBasket Analysis Summary:")
print(f"- Average categories per basket: {clean_basket_summary['unique_categories'].mean():.1f}")
print(f"- {(clean_basket_summary['unique_categories'] >= 5).mean()*100:.1f}% of baskets contain 5+ categories")
print(f"- {len(multi_dept_baskets)/len(basket_departments)*100:.1f}% of baskets span multiple departments")
print(f"- Found {len(category_associations):,} category pair associations")
print(f"- Found {len(product_pairs_df):,} frequent product pair associations")
Basket Analysis Summary:
- Average categories per basket: 7.4
- 48.5% of baskets contain 5+ categories
- 69.1% of baskets span multiple departments
- Found 3,422 category pair associations
- Found 1,209 frequent product pair associations
Cross-Department Shopping Patterns¶
# Analyze shopping patterns across departments for store layout insights
dept_combinations = []
# Focus on most popular departments for meaningful analysis
top_departments = dept_support.head(8).index.tolist()
for basket_id, row in multi_dept_baskets.iterrows():
    # Get departments purchased in this basket
    purchased_depts = [dept for dept in top_departments if row[dept] == 1]
    # Generate all pairs of departments in this basket
    if len(purchased_depts) >= 2:
        for dept1, dept2 in combinations(purchased_depts, 2):
            dept_combinations.append((min(dept1, dept2), max(dept1, dept2)))
# Count department pair frequencies
dept_pair_counts = Counter(dept_combinations)
print(f"Cross-Department Shopping Patterns:")
print(f"=" * 60)
print(f"Total department pair occurrences: {sum(dept_pair_counts.values()):,}")
print(f"\nTop 15 Department Combinations:")
for i, ((dept1, dept2), count) in enumerate(dept_pair_counts.most_common(15)):
    pct = count / len(multi_dept_baskets) * 100
    print(f"{i+1:2d}. {dept1} + {dept2}: {count:,} baskets ({pct:.1f}%)")
# Calculate department affinity (how often departments appear together)
print(f"\nDepartment Affinity Analysis:")
print(f"Departments most likely to be shopped together with GROCERY:")
grocery_pairs = [(pair, count) for pair, count in dept_pair_counts.items() if 'GROCERY' in pair]
grocery_pairs_sorted = sorted(grocery_pairs, key=lambda x: x[1], reverse=True)
for i, ((dept1, dept2), count) in enumerate(grocery_pairs_sorted[:8]):
    other_dept = dept2 if dept1 == 'GROCERY' else dept1
    affinity = count / multi_dept_baskets['GROCERY'].sum() * 100
    print(f"{i+1}. GROCERY + {other_dept}: {affinity:.1f}% affinity ({count:,} co-occurrences)")
Cross-Department Shopping Patterns:
============================================================
Total department pair occurrences: 410,513

Top 15 Department Combinations:
 1. DRUG GM + GROCERY: 53,836 baskets (54.7%)
 2. GROCERY + PRODUCE: 47,914 baskets (48.7%)
 3. GROCERY + MEAT-PCKGD: 31,615 baskets (32.1%)
 4. GROCERY + MEAT: 28,879 baskets (29.4%)
 5. DRUG GM + PRODUCE: 26,117 baskets (26.5%)
 6. MEAT + PRODUCE: 20,155 baskets (20.5%)
 7. MEAT-PCKGD + PRODUCE: 19,639 baskets (20.0%)
 8. DELI + GROCERY: 18,157 baskets (18.5%)
 9. DRUG GM + MEAT-PCKGD: 17,315 baskets (17.6%)
10. DRUG GM + MEAT: 15,535 baskets (15.8%)
11. GROCERY + PASTRY: 15,249 baskets (15.5%)
12. MEAT + MEAT-PCKGD: 14,399 baskets (14.6%)
13. DELI + PRODUCE: 11,834 baskets (12.0%)
14. DELI + DRUG GM: 10,160 baskets (10.3%)
15. PASTRY + PRODUCE: 9,363 baskets (9.5%)

Department Affinity Analysis:
Departments most likely to be shopped together with GROCERY:
1. GROCERY + DRUG GM: 57.4% affinity (53,836 co-occurrences)
2. GROCERY + PRODUCE: 51.1% affinity (47,914 co-occurrences)
3. GROCERY + MEAT-PCKGD: 33.7% affinity (31,615 co-occurrences)
4. GROCERY + MEAT: 30.8% affinity (28,879 co-occurrences)
5. GROCERY + DELI: 19.4% affinity (18,157 co-occurrences)
6. GROCERY + PASTRY: 16.3% affinity (15,249 co-occurrences)
7. GROCERY + NUTRITION: 9.5% affinity (8,888 co-occurrences)
Alternative Approaches to Basket Analysis¶
This analysis demonstrates several approaches to market basket analysis, each revealing different insights:
Approaches Demonstrated:¶
1. Category-Level Analysis
- Purpose: Understand product category relationships
- Metrics: Support, confidence, lift for category pairs
- Business Value: Category management, cross-merchandising strategies
2. Department-Level Analysis
- Purpose: Store layout and shopping flow optimization
- Metrics: Department co-occurrence, affinity analysis
- Business Value: Store design, staff allocation, inventory placement
3. Individual Product Analysis
- Purpose: Specific product recommendations and bundling
- Metrics: Product pair co-occurrence, confidence scores
- Business Value: Product recommendations, promotional bundles
4. Cross-Department Pattern Analysis
- Purpose: Understanding shopping journey across store sections
- Metrics: Department combination frequencies, affinity scores
- Business Value: Store layout optimization, traffic flow analysis
Additional Approaches to Explore:¶
Temporal Basket Analysis:
# Analyze how basket composition changes by:
# - Time of day
# - Day of week
# - Season/month
# - Before/after holidays
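A minimal sketch of the day-of-week cut, assuming the transactions table carries a `transaction_timestamp` column (as in completejourney_py); the tiny frame below is synthetic stand-in data:

```python
import pandas as pd

# Sketch: average basket size by day of week.
# Assumes a 'transaction_timestamp' column; these rows are synthetic.
tx = pd.DataFrame({
    'basket_id': [1, 1, 2, 3, 3, 3],
    'product_id': [10, 11, 10, 12, 13, 14],
    'transaction_timestamp': pd.to_datetime([
        '2017-01-02 10:00', '2017-01-02 10:00',  # Monday basket
        '2017-01-07 15:30',                      # Saturday basket
        '2017-01-07 16:00', '2017-01-07 16:00', '2017-01-07 16:00',
    ]),
})

basket_day = (
    tx.groupby('basket_id')
      .agg(items=('product_id', 'nunique'),
           day=('transaction_timestamp', lambda s: s.dt.day_name().iloc[0]))
)
avg_by_day = basket_day.groupby('day')['items'].mean()
print(avg_by_day)
```

The same pattern extends to hour of day (`dt.hour`) or month (`dt.month`) by swapping the grouping key.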
Brand-Level Analysis:
# Examine brand switching and loyalty patterns:
# - National vs private label combinations
# - Brand affinity within categories
# - Cross-brand purchasing patterns
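A sketch of the first brand question, assuming a `brand` column with 'National'/'Private' labels as in the products table; the rows below are synthetic:

```python
import pandas as pd

# Sketch: share of baskets that mix national and private-label items.
# Assumes a 'brand' column with 'National'/'Private' values; rows are synthetic.
bd = pd.DataFrame({
    'basket_id': [1, 1, 2, 2, 3],
    'brand': ['National', 'Private', 'National', 'National', 'Private'],
})
brands_per_basket = bd.groupby('basket_id')['brand'].nunique()
mixed_share = (brands_per_basket >= 2).mean()
print(f"Baskets mixing national and private label: {mixed_share:.0%}")
```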
Customer Segment Basket Analysis:
# Segment-specific basket patterns:
# - High-value vs low-value customers
# - Frequent vs occasional shoppers
# - Demographic-based basket differences
Sequential Pattern Mining:
# Analyze purchase sequences:
# - Order of items added to basket
# - Shopping path through store departments
# - Temporal patterns within shopping trips
Advanced Association Rule Mining:
# Implement algorithms like:
# - Apriori algorithm for frequent itemsets
# - FP-Growth for efficient pattern mining
# - Eclat algorithm for itemset mining
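As a starting point, here is a minimal pure-Python Apriori sketch (no external mining library); the `apriori` helper and the toy baskets are illustrative, and candidate generation is kept deliberately simple rather than using the textbook join step:

```python
from collections import defaultdict
from itertools import combinations

# Minimal Apriori sketch: find all itemsets meeting a support threshold.
def apriori(baskets, min_support=0.5):
    """Return {frozenset: support} for all frequent itemsets."""
    n = len(baskets)
    # Count 1-itemsets
    counts = defaultdict(int)
    for basket in baskets:
        for item in basket:
            counts[frozenset([item])] += 1
    frequent = {s: c / n for s, c in counts.items() if c / n >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        # Candidates: k-combinations of items still frequent at level k-1
        items = sorted({i for s in frequent for i in s})
        candidates = [frozenset(c) for c in combinations(items, k)]
        counts = {c: sum(1 for b in baskets if c <= b) for c in candidates}
        frequent = {s: c / n for s, c in counts.items() if c / n >= min_support}
        result.update(frequent)
        k += 1
    return result

baskets = [
    {'MILK', 'BREAD'}, {'MILK', 'BREAD', 'CHEESE'},
    {'MILK', 'CHEESE'}, {'BREAD'},
]
itemsets = apriori(baskets, min_support=0.5)
print(itemsets[frozenset({'MILK', 'BREAD'})])  # 0.5
```

For production-scale mining, a library implementation (e.g. mlxtend's Apriori/FP-Growth) will be far faster than this sketch.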
Basket Size and Value Analysis:
# Analyze basket characteristics:
# - Small vs large basket composition differences
# - Value-based basket segmentation
# - Impulse vs planned purchase patterns
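A sketch of size-band segmentation using `pd.cut`, applied to a synthetic stand-in for a basket summary like `clean_basket_summary` (the band edges are arbitrary choices):

```python
import pandas as pd

# Sketch: compare small vs large baskets via size bands.
# The frame below is a synthetic stand-in for a basket-level summary.
summary = pd.DataFrame({
    'unique_categories': [1, 2, 4, 8, 15, 22],
    'sales_value': [3.5, 8.0, 21.0, 40.0, 95.0, 160.0],
})
summary['size_band'] = pd.cut(
    summary['unique_categories'],
    bins=[0, 3, 10, float('inf')],      # arbitrary illustrative cut points
    labels=['small', 'medium', 'large'],
)
band_value = summary.groupby('size_band', observed=True)['sales_value'].mean()
print(band_value)
```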
Statistical Significance Testing:
# Test association strength:
# - Chi-square tests for independence
# - Confidence intervals for lift metrics
# - Bootstrap sampling for robust estimates
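A sketch of the chi-square test for one category pair, computed by hand from a 2×2 contingency table so it needs no SciPy (in practice `scipy.stats.chi2_contingency` does the same); the cell counts are illustrative, not actual values from this dataset:

```python
# Sketch: chi-square test of independence for one category pair.
# 2x2 table cells: both / only A / only B / neither (illustrative counts).
both, only_a, only_b, neither = 3_528, 4_244, 2_910, 105_986
n = both + only_a + only_b + neither
observed = [[both, only_a], [only_b, neither]]
row_totals = [sum(r) for r in observed]
col_totals = [sum(c) for c in zip(*observed)]
# chi2 = sum over cells of (observed - expected)^2 / expected
chi2 = sum(
    (observed[i][j] - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i in range(2) for j in range(2)
)
print(f"chi2 = {chi2:.1f}  (df = 1; values above ~3.84 reject independence at 5%)")
```

With basket counts this large, even modest lifts tend to be highly significant, so effect size (lift) matters more than the p-value alone.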
Experimentation Recommendations:¶
Try Different Granularities: Experiment with product categories, sub-categories, brands, and individual products
Adjust Thresholds: Modify minimum support and confidence thresholds to balance comprehensiveness vs statistical significance
Explore Time Dimensions: Add temporal analysis to understand how associations change over time
Consider Customer Context: Incorporate customer demographics and shopping behavior for personalized insights
Validate Business Impact: Test insights through A/B testing of product placements, recommendations, or promotions
Each approach reveals different aspects of customer shopping behavior and can inform various business decisions from store layout to marketing campaigns. The key is matching the analysis approach to your specific business questions and implementation capabilities.