Dataset Overview
The Complete Journey dataset provides a comprehensive view of grocery shopping behavior through eight interconnected datasets. This guide details each dataset's structure, key variables, and important considerations.
⚠️ Educational Data Notice
This package contains simulated data for educational purposes only. The data structure is based on real grocery shopping patterns, but all transaction records, household information, and shopping behaviors are artificially generated. This data is intended for learning data analysis techniques, not for research or commercial applications.
```python
# Import required libraries and load data
import pandas as pd
import numpy as np
from completejourney_py import get_data

# Set display options for better output
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

# Load all datasets
print("Loading Complete Journey datasets...")
data = get_data()
print(f"✅ Loaded {len(data)} datasets successfully!")
print(f"Available datasets: {list(data.keys())}")
```
```text
Loading Complete Journey datasets...
✅ Loaded 8 datasets successfully!
Available datasets: ['campaign_descriptions', 'coupons', 'promotions', 'campaigns', 'demographics', 'transactions', 'coupon_redemptions', 'products']
```
Core Transaction Data
Transactions
The heart of the Complete Journey dataset, containing individual product purchases from grocery shopping trips. Each record represents a single product purchase within a shopping basket, capturing the complete transaction history including pricing, discounts, and timing.
📝 Understanding Sales Value
The `sales_value` variable represents the amount of dollars received by the retailer on the sale of the specific product, taking the coupon match and loyalty card discount into account. It is not the actual price paid by the customer. If a customer uses a coupon, the actual price paid will be less than the `sales_value` because the manufacturer issuing the coupon will reimburse the retailer for the amount of the coupon.
To calculate actual product prices, use these formulas:
- Loyalty card price = (`sales_value` - (`retail_disc` + `coupon_match_disc`)) / `quantity`
- Non-loyalty card price = (`sales_value` - `coupon_match_disc`) / `quantity`
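The two price formulas can be sketched in pandas as follows. This is a minimal illustration that applies the formulas exactly as written above to a small hand-built frame; the last row (zero quantity) is hypothetical and only there to show one way of avoiding a division by zero.

```python
import pandas as pd

# Toy transaction rows; the zero-quantity row is a hypothetical edge case.
tx = pd.DataFrame({
    "quantity": [1, 1, 2, 0],
    "sales_value": [0.50, 0.99, 2.78, 0.00],
    "retail_disc": [0.00, 0.10, 0.80, 0.00],
    "coupon_match_disc": [0.0, 0.0, 0.0, 0.0],
})

# Turn non-positive quantities into NaN so the division yields NaN, not inf.
qty = tx["quantity"].where(tx["quantity"] > 0)

# Formulas as stated in the note above.
tx["loyalty_price"] = (
    tx["sales_value"] - (tx["retail_disc"] + tx["coupon_match_disc"])
) / qty
tx["non_loyalty_price"] = (tx["sales_value"] - tx["coupon_match_disc"]) / qty

print(tx[["quantity", "sales_value", "loyalty_price", "non_loyalty_price"]])
```

Whether rows with zero quantity occur in the full dataset is not stated here, so the guard is a precaution rather than a documented requirement.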
Dataset Size: 1,469,307 records representing individual product purchases
```python
# Load and examine transactions dataset structure
transactions = data["transactions"]
print("=== TRANSACTIONS DATASET ===")
print(f"Shape: {transactions.shape}")
print("\nSample data (first 5 rows):")
display(transactions.head())
```
```text
=== TRANSACTIONS DATASET ===
Shape: (1469307, 11)

Sample data (first 5 rows):
```
| | household_id | store_id | basket_id | product_id | quantity | sales_value | retail_disc | coupon_disc | coupon_match_disc | week | transaction_timestamp |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 900 | 330 | 31198570044 | 1095275 | 1 | 0.50 | 0.00 | 0.0 | 0.0 | 1 | 2017-01-01 11:53:26 |
| 1 | 900 | 330 | 31198570047 | 9878513 | 1 | 0.99 | 0.10 | 0.0 | 0.0 | 1 | 2017-01-01 12:10:28 |
| 2 | 1228 | 406 | 31198655051 | 1041453 | 1 | 1.43 | 0.15 | 0.0 | 0.0 | 1 | 2017-01-01 12:26:30 |
| 3 | 906 | 319 | 31198705046 | 1020156 | 1 | 1.50 | 0.29 | 0.0 | 0.0 | 1 | 2017-01-01 12:30:27 |
| 4 | 906 | 319 | 31198705046 | 1053875 | 2 | 2.78 | 0.80 | 0.0 | 0.0 | 1 | 2017-01-01 12:30:27 |
Key Variables:
| Column | Data Type | Description |
|---|---|---|
| `household_id` | int64 | Household identifier |
| `store_id` | int64 | Store location identifier |
| `basket_id` | int64 | Unique shopping trip identifier |
| `product_id` | int64 | Product identifier |
| `quantity` | int64 | Number of items purchased |
| `sales_value` | float64 | Dollar amount received by retailer |
| `retail_disc` | float64 | Retail discount applied |
| `coupon_disc` | float64 | Coupon discount applied |
| `coupon_match_disc` | float64 | Coupon match discount |
| `week` | int64 | Week number (1-53) |
| `transaction_timestamp` | datetime64[ns] | Date and time of purchase |
Common Use Cases:
- Market basket analysis
- Customer lifetime value calculation
- Shopping frequency analysis
- Seasonal trend identification
- Price elasticity studies
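Because each row is one product line within a basket, most of these use cases start by rolling rows up to the basket level. A minimal sketch, using a few rows modeled on the sample output above (the aggregate column names `line_items`, `units`, and `basket_value` are illustrative choices, not part of the dataset):

```python
import pandas as pd

# Four product lines spread across three baskets.
tx = pd.DataFrame({
    "household_id": [900, 900, 906, 906],
    "basket_id": [31198570044, 31198570047, 31198705046, 31198705046],
    "quantity": [1, 1, 1, 2],
    "sales_value": [0.50, 0.99, 1.50, 2.78],
})

# One row per basket: number of product lines, units bought, and basket total.
baskets = tx.groupby(["household_id", "basket_id"], as_index=False).agg(
    line_items=("quantity", "size"),
    units=("quantity", "sum"),
    basket_value=("sales_value", "sum"),
)

# Shopping frequency: distinct trips per household.
trips_per_household = baskets.groupby("household_id")["basket_id"].nunique()

print(baskets)
print(trips_per_household)
```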
Demographics
Comprehensive household-level demographic profiles that enable sophisticated customer segmentation and behavioral analysis. This dataset captures key socioeconomic indicators including age, income, family composition, and homeownership status.
Dataset Size: 801 records representing unique households
```python
# Load and examine demographics dataset structure
demographics = data["demographics"]
print("=== DEMOGRAPHICS DATASET ===")
print(f"Shape: {demographics.shape}")
print("\nSample data (first 5 rows):")
display(demographics.head())
```
```text
=== DEMOGRAPHICS DATASET ===
Shape: (801, 8)

Sample data (first 5 rows):
```
| | household_id | age | income | home_ownership | marital_status | household_size | household_comp | kids_count |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 65+ | 35-49K | Homeowner | Married | 2 | 2 Adults No Kids | 0 |
| 1 | 1001 | 45-54 | 50-74K | Homeowner | Unmarried | 1 | 1 Adult No Kids | 0 |
| 2 | 1003 | 35-44 | 25-34K | None | Unmarried | 1 | 1 Adult No Kids | 0 |
| 3 | 1004 | 25-34 | 15-24K | None | Unmarried | 1 | 1 Adult No Kids | 0 |
| 4 | 101 | 45-54 | Under 15K | Homeowner | Married | 4 | 2 Adults Kids | 2 |
Key Variables:
| Column | Data Type | Description |
|---|---|---|
| `household_id` | int64 | Household identifier |
| `age` | object | Age range of household head |
| `income` | object | Household income range |
| `home_ownership` | object | Home ownership status |
| `marital_status` | object | Marital status |
| `household_size` | object | Number of people in household |
| `household_comp` | object | Household composition |
| `kids_count` | object | Number of children |
Common Use Cases:
- Customer segmentation
- Demographic analysis of shopping patterns
- Income-based purchasing behavior
- Family composition impact on shopping
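Since the range columns are stored as `object` (plain strings), they sort alphabetically rather than by magnitude. One way to handle this, sketched below, is an ordered categorical. The band list contains only the five income values visible in the sample rows above; the full dataset may contain more bands, and any value missing from the list would become `NaN`.

```python
import pandas as pd

# Rows modeled on the sample output of the demographics table.
demo = pd.DataFrame({
    "household_id": [1, 1001, 1003, 1004, 101],
    "income": ["35-49K", "50-74K", "25-34K", "15-24K", "Under 15K"],
})

# Assumption: only the bands seen in the sample, listed from low to high.
income_order = ["Under 15K", "15-24K", "25-34K", "35-49K", "50-74K"]
demo["income"] = pd.Categorical(
    demo["income"], categories=income_order, ordered=True
)

# Sorting and comparisons now follow the income order, not the alphabet.
sorted_demo = demo.sort_values("income")
low_income = demo[demo["income"] <= "25-34K"]

print(sorted_demo)
print(low_income)
```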
Products
Detailed product master catalog containing comprehensive metadata for every item sold in the grocery store. This dataset provides the essential product hierarchy including departments, categories, brand types, and package sizes.
Dataset Size: 92,331 records representing unique products
```python
# Load and examine products dataset structure
products = data["products"]
print("=== PRODUCTS DATASET ===")
print(f"Shape: {products.shape}")
print("\nSample data (first 5 rows):")
display(products.head())
```
```text
=== PRODUCTS DATASET ===
Shape: (92331, 7)

Sample data (first 5 rows):
```
| | product_id | manufacturer_id | department | brand | product_category | product_type | package_size |
|---|---|---|---|---|---|---|---|
| 0 | 25671 | 2 | GROCERY | National | FRZN ICE | ICE - CRUSHED/CUBED | 22 LB |
| 1 | 26081 | 2 | MISCELLANEOUS | National | None | None | None |
| 2 | 26093 | 69 | PASTRY | Private | BREAD | BREAD:ITALIAN/FRENCH | None |
| 3 | 26190 | 69 | GROCERY | Private | FRUIT - SHELF STABLE | APPLE SAUCE | 50 OZ |
| 4 | 26355 | 69 | GROCERY | Private | COOKIES/CONES | SPECIALTY COOKIES | 14 OZ |
Key Variables:
| Column | Data Type | Description |
|---|---|---|
| `product_id` | int64 | Product identifier |
| `manufacturer_id` | int64 | Manufacturer identifier |
| `department` | object | Store department |
| `brand` | object | Brand type (National, Private) |
| `product_category` | object | Product category |
| `product_type` | object | Specific product type |
| `package_size` | object | Package size information |
Common Use Cases:
- Category performance analysis
- Brand loyalty studies
- Product assortment optimization
- Private label vs national brand comparison
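As an illustration of the private label versus national brand comparison, the sketch below computes the share of each brand type within a department. It counts catalog entries only, using the five sample rows shown above; weighting by sales would require joining to transactions first.

```python
import pandas as pd

# Rows modeled on the sample output of the products table.
products = pd.DataFrame({
    "product_id": [25671, 26081, 26093, 26190, 26355],
    "department": ["GROCERY", "MISCELLANEOUS", "PASTRY", "GROCERY", "GROCERY"],
    "brand": ["National", "National", "Private", "Private", "Private"],
})

# Row-normalized cross-tabulation: each department's shares sum to 1.
brand_share = pd.crosstab(
    products["department"], products["brand"], normalize="index"
)

print(brand_share)
```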
Marketing & Promotional Data
Campaigns
Records which marketing campaigns each household was exposed to during the study period. This dataset tracks campaign targeting and enables measurement of marketing reach across different customer segments.
Dataset Size: 6,589 records representing campaign-household combinations
```python
# Load and examine campaigns dataset structure
campaigns = data["campaigns"]
print("=== CAMPAIGNS DATASET ===")
print(f"Shape: {campaigns.shape}")
print("\nSample data (first 5 rows):")
display(campaigns.head())
```
```text
=== CAMPAIGNS DATASET ===
Shape: (6589, 2)

Sample data (first 5 rows):
```
| | campaign_id | household_id |
|---|---|---|
| 0 | 1 | 105 |
| 1 | 1 | 1238 |
| 2 | 1 | 1258 |
| 3 | 1 | 1483 |
| 4 | 1 | 2200 |
Key Variables:
| Column | Data Type | Description |
|---|---|---|
| `campaign_id` | int64 | Campaign identifier |
| `household_id` | int64 | Household identifier |
Common Use Cases:
- Campaign reach analysis
- Cross-campaign participation patterns
- Household marketing responsiveness
- Campaign targeting effectiveness
Campaign Descriptions
Comprehensive metadata about each marketing campaign including timing, type, and strategic focus. This dataset provides the business context needed to interpret campaign performance.
Dataset Size: 27 records representing unique campaigns
```python
# Load and examine campaign descriptions dataset structure
campaign_descriptions = data["campaign_descriptions"]
print("=== CAMPAIGN DESCRIPTIONS DATASET ===")
print(f"Shape: {campaign_descriptions.shape}")
print("\nSample data (first 5 rows):")
display(campaign_descriptions.head())
```
```text
=== CAMPAIGN DESCRIPTIONS DATASET ===
Shape: (27, 4)

Sample data (first 5 rows):
```
| | campaign_id | campaign_type | start_date | end_date |
|---|---|---|---|---|
| 0 | 1 | Type B | 2017-03-03 | 2017-04-09 |
| 1 | 2 | Type B | 2017-03-08 | 2017-04-09 |
| 2 | 3 | Type C | 2017-03-13 | 2017-05-08 |
| 3 | 4 | Type B | 2017-03-29 | 2017-04-30 |
| 4 | 5 | Type B | 2017-04-03 | 2017-05-07 |
Key Variables:
| Column | Data Type | Description |
|---|---|---|
| `campaign_id` | int64 | Campaign identifier |
| `campaign_type` | object | Type of campaign (Type A, Type B, Type C) |
| `start_date` | datetime64[ns] | Campaign start date |
| `end_date` | datetime64[ns] | Campaign end date |
Campaign Types:
- Type A: Personalized campaigns with targeted coupon selection
- Type B: Demographic or category-targeted campaigns
- Type C: Broader promotional campaigns
Common Use Cases:
- Campaign lifecycle analysis
- Performance comparison across campaign types
- Seasonal marketing pattern analysis
- Campaign duration effectiveness
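Campaign duration and reach can be derived by combining the campaign descriptions with the campaigns table. The sketch below uses the first three campaigns from the sample output; the household rows are a small illustrative subset, and the single campaign 3 row is hypothetical.

```python
import pandas as pd

# First three campaigns from the sample output above.
campaign_descriptions = pd.DataFrame({
    "campaign_id": [1, 2, 3],
    "campaign_type": ["Type B", "Type B", "Type C"],
    "start_date": pd.to_datetime(["2017-03-03", "2017-03-08", "2017-03-13"]),
    "end_date": pd.to_datetime(["2017-04-09", "2017-04-09", "2017-05-08"]),
})

# Campaign-household pairs; the campaign 3 row is made up for illustration.
campaigns = pd.DataFrame({
    "campaign_id": [1, 1, 1, 3],
    "household_id": [105, 1238, 1258, 105],
})

# Duration in whole days between start and end date.
campaign_descriptions["duration_days"] = (
    campaign_descriptions["end_date"] - campaign_descriptions["start_date"]
).dt.days

# Reach: distinct households per campaign; campaigns with no rows get 0.
reach = (
    campaigns.groupby("campaign_id")["household_id"]
    .nunique()
    .rename("households_reached")
    .reset_index()
)
summary = campaign_descriptions.merge(reach, on="campaign_id", how="left")
summary["households_reached"] = summary["households_reached"].fillna(0).astype(int)

print(summary[["campaign_id", "campaign_type", "duration_days", "households_reached"]])
```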
Coupons
Detailed inventory of all coupons distributed to customers, linking specific coupon offers to products and marketing campaigns.
📋 Understanding Coupon Distribution
This table lists all the coupons sent to customers as part of a campaign, as well as the products for which each coupon is redeemable. Some coupons are redeemable for multiple products.
Campaign-Specific Coupon Strategies:
- Campaign Type A: Each customer in the campaign received 16 coupons, selected based on prior purchase behavior
- Campaign Type B and Type C: Every customer in the campaign received all coupons pertaining to that campaign
Dataset Size: 116,204 records representing coupon-product-campaign combinations (a coupon UPC appears once for each product it can be redeemed on)
```python
# Load and examine coupons dataset structure
coupons = data["coupons"]
print("=== COUPONS DATASET ===")
print(f"Shape: {coupons.shape}")
print("\nSample data (first 5 rows):")
display(coupons.head())
```
```text
=== COUPONS DATASET ===
Shape: (116204, 3)

Sample data (first 5 rows):
```
| | coupon_upc | product_id | campaign_id |
|---|---|---|---|
| 0 | 10000085207 | 9676830 | 26 |
| 1 | 10000085207 | 9676943 | 26 |
| 2 | 10000085207 | 9676944 | 26 |
| 3 | 10000085207 | 9676947 | 26 |
| 4 | 10000085207 | 9677008 | 26 |
Key Variables:
| Column | Data Type | Description |
|---|---|---|
| `coupon_upc` | int64 | Coupon UPC identifier |
| `product_id` | int64 | Associated product |
| `campaign_id` | int64 | Associated campaign |
Common Use Cases:
- Coupon product targeting analysis
- Campaign-coupon relationships
- Product-specific promotion effectiveness
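Because one coupon can be redeemable for several products, `coupon_upc` is not unique in this table. The sketch below counts products per coupon within a campaign. The first three rows follow the sample output; the second coupon UPC is a hypothetical value added to show a single-product coupon.

```python
import pandas as pd

# Three sample rows plus one hypothetical single-product coupon.
coupons = pd.DataFrame({
    "coupon_upc": [10000085207, 10000085207, 10000085207, 10000099999],
    "product_id": [9676830, 9676943, 9676944, 9676830],
    "campaign_id": [26, 26, 26, 26],
})

# The coupon identifier repeats, so it cannot serve as a unique key on its own.
print("coupon_upc unique:", coupons["coupon_upc"].is_unique)

# Number of distinct products each coupon covers within its campaign.
products_per_coupon = coupons.groupby(["campaign_id", "coupon_upc"])[
    "product_id"
].nunique()
multi_product = products_per_coupon[products_per_coupon > 1]

print(products_per_coupon)
```

Keep this in mind when joining on `coupon_upc`: each redemption row will match every product the coupon covers.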
Coupon Redemptions
Actual coupon usage behavior capturing when, which, and by whom coupons were redeemed. This dataset reveals customer response to promotional offers.
Dataset Size: 2,102 records representing coupon redemptions
```python
# Load and examine coupon redemptions dataset structure
coupon_redemptions = data["coupon_redemptions"]
print("=== COUPON REDEMPTIONS DATASET ===")
print(f"Shape: {coupon_redemptions.shape}")
print("\nSample data (first 5 rows):")
display(coupon_redemptions.head())
```
```text
=== COUPON REDEMPTIONS DATASET ===
Shape: (2102, 4)

Sample data (first 5 rows):
```
| | household_id | coupon_upc | campaign_id | redemption_date |
|---|---|---|---|---|
| 0 | 1029 | 51380041013 | 26 | 2017-01-01 |
| 1 | 1029 | 51380041313 | 26 | 2017-01-01 |
| 2 | 165 | 53377610033 | 26 | 2017-01-03 |
| 3 | 712 | 51380041013 | 26 | 2017-01-07 |
| 4 | 712 | 54300016033 | 26 | 2017-01-07 |
Key Variables:
| Column | Data Type | Description |
|---|---|---|
| `household_id` | int64 | Household identifier |
| `coupon_upc` | int64 | Coupon UPC identifier |
| `campaign_id` | int64 | Associated campaign |
| `redemption_date` | datetime64[ns] | Date of coupon use |
Common Use Cases:
- Coupon redemption rate calculation
- Household coupon usage patterns
- Campaign effectiveness measurement
- Promotional ROI analysis
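One simple definition of a redemption rate is the share of households reached by a campaign that redeemed at least one of its coupons. The sketch below uses that definition, which is an assumption; other definitions (for example, per coupon distributed) are equally plausible. The redemption rows follow the sample output, while the campaign membership rows are hypothetical.

```python
import pandas as pd

# Hypothetical campaign membership: four households in 26, two in campaign 1.
campaigns = pd.DataFrame({
    "campaign_id": [26, 26, 26, 26, 1, 1],
    "household_id": [1029, 165, 712, 2200, 105, 1238],
})

# Redemption rows modeled on the sample output above.
coupon_redemptions = pd.DataFrame({
    "household_id": [1029, 1029, 165, 712, 712],
    "coupon_upc": [51380041013, 51380041313, 53377610033, 51380041013, 54300016033],
    "campaign_id": [26, 26, 26, 26, 26],
    "redemption_date": pd.to_datetime(
        ["2017-01-01", "2017-01-01", "2017-01-03", "2017-01-07", "2017-01-07"]
    ),
})

# Distinct households reached and distinct households redeeming, per campaign.
reached = campaigns.groupby("campaign_id")["household_id"].nunique()
redeemers = coupon_redemptions.groupby("campaign_id")["household_id"].nunique()

# Campaigns with no redemptions get a rate of 0 rather than NaN.
rate = (redeemers.reindex(reached.index, fill_value=0) / reached).rename(
    "household_redemption_rate"
)

print(rate)
```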
Promotions
Comprehensive record of promotional activities including in-store displays and mailer placements for products across all stores and time periods.
Dataset Size: 20,940,529 records representing promotional placements
```python
# Load and examine promotions dataset structure
# (only the first rows are displayed because the table is very large)
promotions = data["promotions"]
print("=== PROMOTIONS DATASET ===")
print(f"Shape: {promotions.shape}")
print("\nSample data (first 5 rows):")
display(promotions.head())
```
```text
=== PROMOTIONS DATASET ===
Shape: (20940529, 5)

Sample data (first 5 rows):
```
| | product_id | store_id | display_location | mailer_location | week |
|---|---|---|---|---|---|
| 0 | 1000050 | 316 | 9 | 0 | 1 |
| 1 | 1000050 | 337 | 3 | 0 | 1 |
| 2 | 1000050 | 441 | 5 | 0 | 1 |
| 3 | 1000092 | 292 | 0 | A | 1 |
| 4 | 1000092 | 293 | 0 | A | 1 |
Key Variables:
| Column | Data Type | Description |
|---|---|---|
| `product_id` | int64 | Product identifier |
| `store_id` | int64 | Store identifier |
| `display_location` | object | In-store display placement |
| `mailer_location` | object | Mailer placement code |
| `week` | int64 | Week of promotion |
Common Use Cases:
- Promotional impact analysis
- Display effectiveness measurement
- Mailer response analysis
- Store-level promotional performance
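Promotions are recorded per product, store, and week, so linking them to purchases means matching on all three columns. The sketch below flags transaction rows that have a matching promotions row. The promotion rows follow the sample output; the transaction rows are hypothetical, and the meaning of the individual location codes is not interpreted here.

```python
import pandas as pd

# Hypothetical purchases.
transactions = pd.DataFrame({
    "product_id": [1000050, 1000050, 1000092],
    "store_id": [316, 999, 292],
    "week": [1, 1, 2],
    "sales_value": [1.50, 1.50, 2.00],
})

# Promotion rows modeled on the sample output above.
promotions = pd.DataFrame({
    "product_id": [1000050, 1000050, 1000092],
    "store_id": [316, 337, 292],
    "display_location": ["9", "3", "0"],
    "mailer_location": ["0", "0", "A"],
    "week": [1, 1, 1],
})

keys = ["product_id", "store_id", "week"]

# Keep only the key columns and de-duplicate so the join cannot add rows.
promo_keys = promotions[keys].drop_duplicates()

flagged = transactions.merge(promo_keys, on=keys, how="left", indicator=True)
flagged["has_promo_row"] = flagged["_merge"] == "both"
flagged = flagged.drop(columns="_merge")

print(flagged)
```

With roughly 21 million promotion rows, filtering `promotions` to the weeks or products of interest before merging is likely to keep memory use manageable.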
Data Relationships and Joins
Understanding how datasets connect is crucial for comprehensive analysis:

Entity Relationship Diagram showing connections between all 8 datasets
```python
# Show key relationships between datasets
print("=== KEY DATASET RELATIONSHIPS ===")
print("\n1. Primary Keys and Connections:")
print(f" • Households: {demographics['household_id'].nunique():,} unique IDs")
print(f" • Products: {products['product_id'].nunique():,} unique IDs")
print(f" • Campaigns: {campaign_descriptions['campaign_id'].nunique()} unique IDs")
print(f" • Stores: {transactions['store_id'].nunique()} unique IDs")
print("\n2. Common Join Patterns:")
print(" • transactions ⟵ demographics (on household_id)")
print(" • transactions ⟵ products (on product_id)")
print(" • campaigns ⟵ campaign_descriptions (on campaign_id)")
print(" • coupon_redemptions ⟵ coupons (on coupon_upc)")
print(" • promotions ⟵ products (on product_id)")
print("\n3. Data Coverage:")
print(f" • Transactions cover {transactions['household_id'].nunique():,} households")
print(f" • Campaigns reach {campaigns['household_id'].nunique():,} households")
print(f" • Coupon redemptions by {coupon_redemptions['household_id'].nunique():,} households")
```
```text
=== KEY DATASET RELATIONSHIPS ===

1. Primary Keys and Connections:
 • Households: 801 unique IDs
 • Products: 92,331 unique IDs
 • Campaigns: 27 unique IDs
 • Stores: 457 unique IDs

2. Common Join Patterns:
 • transactions ⟵ demographics (on household_id)
 • transactions ⟵ products (on product_id)
 • campaigns ⟵ campaign_descriptions (on campaign_id)
 • coupon_redemptions ⟵ coupons (on coupon_upc)
 • promotions ⟵ products (on product_id)

3. Data Coverage:
 • Transactions cover 2,469 households
 • Campaigns reach 1,559 households
 • Coupon redemptions by 410 households
```
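The first two join patterns can be sketched as left merges from transactions. The counts above show demographics for 801 households while transactions cover 2,469, so a left join will leave the demographic columns empty for many rows; the example measures that coverage. All rows below are toy values: the product IDs and income bands are borrowed from the sample outputs, and everything else is hypothetical.

```python
import pandas as pd

# Toy tables with hypothetical rows.
transactions = pd.DataFrame({
    "household_id": [900, 900, 1228, 101],
    "product_id": [25671, 26093, 26190, 26190],
    "sales_value": [0.50, 0.99, 1.43, 2.00],
})
demographics = pd.DataFrame({
    "household_id": [1, 101],
    "income": ["35-49K", "Under 15K"],
})
products = pd.DataFrame({
    "product_id": [25671, 26093, 26190],
    "department": ["GROCERY", "PASTRY", "GROCERY"],
})

# validate="many_to_one" raises if a lookup table has duplicate keys,
# which protects against accidentally multiplying transaction rows.
enriched = transactions.merge(
    demographics, on="household_id", how="left", validate="many_to_one"
).merge(products, on="product_id", how="left", validate="many_to_one")

# Share of transaction rows that found a demographic profile.
demo_coverage = enriched["income"].notna().mean()

print(enriched)
print(f"Rows with demographics: {demo_coverage:.0%}")
```

A left join keeps every transaction; switching to `how="inner"` would restrict the analysis to households with a demographic profile.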
Next Steps
Now that you understand the dataset structures, explore these resources:
- Dataset Summary Analysis - High-level analysis of each dataset
- Getting Started - Basic usage patterns
- Cookbook Examples - Detailed analysis tutorials
- API Reference - Function documentation