Dataset Overview

The Complete Journey dataset provides a comprehensive view of grocery shopping behavior through eight interconnected datasets. This guide details each dataset's structure, key variables, and important considerations.

⚠️ Educational Data Notice

This package contains simulated data for educational purposes only. The data structure is based on real grocery shopping patterns, but all transaction records, household information, and shopping behaviors are artificially generated. This data is intended for learning data analysis techniques, not for research or commercial applications.

# Import required libraries and load data
import pandas as pd
import numpy as np
from completejourney_py import get_data

# Set display options for better output
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

# Load all datasets
print("Loading Complete Journey datasets...")
data = get_data()
print(f"✅ Loaded {len(data)} datasets successfully!")
print(f"Available datasets: {list(data.keys())}")

Loading Complete Journey datasets...
✅ Loaded 8 datasets successfully!
Available datasets: ['campaign_descriptions', 'coupons', 'promotions', 'campaigns', 'demographics', 'transactions', 'coupon_redemptions', 'products']

Core Transaction Data

Transactions

The heart of the Complete Journey dataset, containing individual product purchases from grocery shopping trips. Each record represents a single product purchase within a shopping basket, including its pricing, discounts, and timing.

📝 Understanding Sales Value

The sales_value variable is the dollar amount the retailer receives from the sale of the specific product, taking the coupon match and loyalty card discount into account. It is not the price the customer actually paid. If a customer uses a coupon, the price paid is less than sales_value, because the manufacturer that issued the coupon reimburses the retailer for the coupon amount.

To calculate actual product prices, use these formulas:

Loyalty card price = (sales_value - (retail_disc + coupon_match_disc)) / quantity

Non-loyalty card price = (sales_value - coupon_match_disc) / quantity
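
As a minimal sketch, the two formulas above can be applied column-wise in pandas. The frame below is hand-made for illustration (it is not package output), the helper name `add_unit_prices` is hypothetical, and the `coupon_match_disc` values are invented. The formulas are applied exactly as written, so the result depends on the sign convention of the discount columns, which is worth checking against your copy of the data.

```python
import pandas as pd


def add_unit_prices(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with per-unit price columns (hypothetical helper)."""
    out = df.copy()
    # Rows with zero quantity get NaN instead of a division-by-zero result
    qty = out["quantity"].where(out["quantity"] > 0)
    out["loyalty_price"] = (
        out["sales_value"] - (out["retail_disc"] + out["coupon_match_disc"])
    ) / qty
    out["non_loyalty_price"] = (out["sales_value"] - out["coupon_match_disc"]) / qty
    return out


# Hand-made rows; coupon_match_disc values are assumptions for the example
toy = pd.DataFrame({
    "quantity": [1, 2, 0],
    "sales_value": [0.99, 2.78, 0.0],
    "retail_disc": [0.10, 0.80, 0.0],
    "coupon_match_disc": [0.0, 0.0, 0.0],
})
priced = add_unit_prices(toy)
print(priced[["quantity", "loyalty_price", "non_loyalty_price"]])
```

Guarding against zero quantities is a design choice for this sketch; whether such rows exist in the data is not stated here.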

Dataset Size: 1,469,307 records representing individual product purchases

# Load and examine transactions dataset structure
transactions = data["transactions"]

print("=== TRANSACTIONS DATASET ===")
print(f"Shape: {transactions.shape}")

print(f"\nSample data (first 5 rows):")
display(transactions.head())

=== TRANSACTIONS DATASET ===
Shape: (1469307, 11)

Sample data (first 5 rows):

	household_id	store_id	basket_id	product_id	quantity	sales_value	retail_disc	week	transaction_timestamp
0	900	330	31198570044	1095275	1	0.50	0.00	1	2017-01-01 11:53:26
1	900	330	31198570047	9878513	1	0.99	0.10	1	2017-01-01 12:10:28
2	1228	406	31198655051	1041453	1	1.43	0.15	1	2017-01-01 12:26:30
3	906	319	31198705046	1020156	1	1.50	0.29	1	2017-01-01 12:30:27
4	906	319	31198705046	1053875	2	2.78	0.80	1	2017-01-01 12:30:27

Key Variables:

Column	Data Type	Description
`household_id`	int64	Household identifier
`store_id`	int64	Store location identifier
`basket_id`	int64	Unique shopping trip identifier
`product_id`	int64	Product identifier
`quantity`	int64	Number of items purchased
`sales_value`	float64	Dollar amount received by retailer
`retail_disc`	float64	Retail discount applied
`coupon_disc`	float64	Coupon discount applied
`coupon_match_disc`	float64	Coupon match discount
`week`	int64	Week number (1-53)
`transaction_timestamp`	datetime64[ns]	Date and time of purchase

Common Use Cases:

Market basket analysis
Customer lifetime value calculation
Shopping frequency analysis
Seasonal trend identification
Price elasticity studies
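
Because each transactions record is a single product within a basket, most trip-level questions start by aggregating to one row per `basket_id`. The sketch below does this on the five sample rows shown above (a subset of their columns); the output column names `items` and `basket_value` are chosen for the example.

```python
import pandas as pd

# The five sample rows from the transactions preview (subset of columns)
transactions_sample = pd.DataFrame({
    "household_id": [900, 900, 1228, 906, 906],
    "basket_id": [31198570044, 31198570047, 31198655051, 31198705046, 31198705046],
    "quantity": [1, 1, 1, 1, 2],
    "sales_value": [0.50, 0.99, 1.43, 1.50, 2.78],
})

# One row per shopping trip: total items and total sales value
baskets = (
    transactions_sample
    .groupby(["household_id", "basket_id"], as_index=False)
    .agg(items=("quantity", "sum"), basket_value=("sales_value", "sum"))
)

# Number of distinct trips per household
trips_per_household = baskets.groupby("household_id")["basket_id"].nunique()

print(baskets)
print(trips_per_household)
```

The same pattern should carry over to the full `transactions` frame, since it uses only columns listed in the Key Variables table.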

Demographics

Comprehensive household-level demographic profiles that enable sophisticated customer segmentation and behavioral analysis. This dataset captures key socioeconomic indicators including age, income, family composition, and homeownership status.

Dataset Size: 801 records representing unique households

# Load and examine demographics dataset structure
demographics = data["demographics"]

print("=== DEMOGRAPHICS DATASET ===")
print(f"Shape: {demographics.shape}")

print(f"\nSample data (first 5 rows):")
display(demographics.head())

=== DEMOGRAPHICS DATASET ===
Shape: (801, 8)

Sample data (first 5 rows):

	household_id	age	income	home_ownership	marital_status	household_size	household_comp	kids_count
0	1	65+	35-49K	Homeowner	Married	2	2 Adults No Kids	0
1	1001	45-54	50-74K	Homeowner	Unmarried	1	1 Adult No Kids	0
2	1003	35-44	25-34K	None	Unmarried	1	1 Adult No Kids	0
3	1004	25-34	15-24K	None	Unmarried	1	1 Adult No Kids	0
4	101	45-54	Under 15K	Homeowner	Married	4	2 Adults Kids	2

Key Variables:

Column	Data Type	Description
`household_id`	int64	Household identifier
`age`	object	Age range of household head
`income`	object	Household income range
`home_ownership`	object	Home ownership status
`marital_status`	object	Marital status
`household_size`	object	Number of people in household
`household_comp`	object	Household composition
`kids_count`	object	Number of children

Common Use Cases:

Customer segmentation
Demographic analysis of shopping patterns
Income-based purchasing behavior
Family composition impact on shopping
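
The demographic columns are stored as `object` (string) range labels, so sorting or comparing them alphabetically gives a misleading order. One option is an ordered categorical. The sketch below uses only the income bands visible in the sample rows above; the full dataset may contain more bands, and any label missing from the list would become NaN.

```python
import pandas as pd

# Sample rows from the demographics preview (subset of columns)
demo_sample = pd.DataFrame({
    "household_id": [1, 1001, 1003, 1004, 101],
    "income": ["35-49K", "50-74K", "25-34K", "15-24K", "Under 15K"],
})

# Assumed ordering, limited to the bands seen in the sample
income_order = ["Under 15K", "15-24K", "25-34K", "35-49K", "50-74K"]
demo_sample["income"] = pd.Categorical(
    demo_sample["income"], categories=income_order, ordered=True
)

# Sorting and comparisons now follow the band order, not the alphabet
sorted_demo = demo_sample.sort_values("income")
low_income = demo_sample[demo_sample["income"] < "25-34K"]

print(sorted_demo)
print(low_income)
```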

Products

Detailed product master catalog containing comprehensive metadata for every item sold in the grocery store. This dataset provides the essential product hierarchy including departments, categories, brand types, and package sizes.

Dataset Size: 92,331 records representing unique products

# Load and examine products dataset structure
products = data["products"]

print("=== PRODUCTS DATASET ===")
print(f"Shape: {products.shape}")

print(f"\nSample data (first 5 rows):")
display(products.head())

=== PRODUCTS DATASET ===
Shape: (92331, 7)

Sample data (first 5 rows):

	product_id	manufacturer_id	department	brand	product_category	product_type	package_size
0	25671	2	GROCERY	National	FRZN ICE	ICE - CRUSHED/CUBED	22 LB
1	26081	2	MISCELLANEOUS	National	None	None	None
2	26093	69	PASTRY	Private	BREAD	BREAD:ITALIAN/FRENCH	None
3	26190	69	GROCERY	Private	FRUIT - SHELF STABLE	APPLE SAUCE	50 OZ
4	26355	69	GROCERY	Private	COOKIES/CONES	SPECIALTY COOKIES	14 OZ

Key Variables:

Column	Data Type	Description
`product_id`	int64	Product identifier
`manufacturer_id`	int64	Manufacturer identifier
`department`	object	Store department
`brand`	object	Brand type (National, Private)
`product_category`	object	Product category
`product_type`	object	Specific product type
`package_size`	object	Package size information

Common Use Cases:

Category performance analysis
Brand loyalty studies
Product assortment optimization
Private label vs national brand comparison
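
A private label versus national brand comparison typically means attaching product attributes to purchases via `product_id`. The sketch below is illustrative only: the product rows come from the preview above, while the `purchases` frame is invented for the example.

```python
import pandas as pd

# Product rows taken from the products preview (subset of columns)
products_sample = pd.DataFrame({
    "product_id": [25671, 26093, 26190],
    "department": ["GROCERY", "PASTRY", "GROCERY"],
    "brand": ["National", "Private", "Private"],
})

# Made-up purchases, for illustration only
purchases = pd.DataFrame({
    "product_id": [25671, 26190, 26190, 26093],
    "sales_value": [2.00, 1.50, 1.50, 3.00],
})

# validate= raises if product_id is not unique on the products side
joined = purchases.merge(
    products_sample, on="product_id", how="left", validate="many_to_one"
)
sales_by_brand = joined.groupby("brand")["sales_value"].sum()

print(sales_by_brand)
```

Using `how="left"` keeps every purchase even if a product is missing from the catalog, which makes unmatched rows visible as NaN rather than silently dropping them.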

Marketing & Promotional Data

Campaigns

Records which marketing campaigns each household was exposed to during the study period. This dataset tracks campaign targeting and enables measurement of marketing reach across different customer segments.

Dataset Size: 6,589 records representing campaign-household combinations

# Load and examine campaigns dataset structure
campaigns = data["campaigns"]

print("=== CAMPAIGNS DATASET ===")
print(f"Shape: {campaigns.shape}")

print(f"\nSample data (first 5 rows):")
display(campaigns.head())

=== CAMPAIGNS DATASET ===
Shape: (6589, 2)

Sample data (first 5 rows):

	campaign_id	household_id
0	1	105
1	1	1238
2	1	1258
3	1	1483
4	1	2200

Key Variables:

Column	Data Type	Description
`campaign_id`	int64	Campaign identifier
`household_id`	int64	Household identifier

Common Use Cases:

Campaign reach analysis
Cross-campaign participation patterns
Household marketing responsiveness
Campaign targeting effectiveness

Campaign Descriptions

Comprehensive metadata about each marketing campaign including timing, type, and strategic focus. This dataset provides the business context needed to interpret campaign performance.

Dataset Size: 27 records representing unique campaigns

# Load and examine campaign descriptions dataset structure
campaign_descriptions = data["campaign_descriptions"]

print("=== CAMPAIGN DESCRIPTIONS DATASET ===")
print(f"Shape: {campaign_descriptions.shape}")

print(f"\nSample data (first 5 rows):")
display(campaign_descriptions.head())

=== CAMPAIGN DESCRIPTIONS DATASET ===
Shape: (27, 4)

Sample data (first 5 rows):

	campaign_id	campaign_type	start_date	end_date
0	1	Type B	2017-03-03	2017-04-09
1	2	Type B	2017-03-08	2017-04-09
2	3	Type C	2017-03-13	2017-05-08
3	4	Type B	2017-03-29	2017-04-30
4	5	Type B	2017-04-03	2017-05-07

Key Variables:

Column	Data Type	Description
`campaign_id`	int64	Campaign identifier
`campaign_type`	object	Type of campaign (Type A, Type B, Type C)
`start_date`	datetime64[ns]	Campaign start date
`end_date`	datetime64[ns]	Campaign end date

Campaign Types:

Type A: Personalized campaigns with targeted coupon selection
Type B: Demographic or category-targeted campaigns
Type C: Broader promotional campaigns

Common Use Cases:

Campaign lifecycle analysis
Performance comparison across campaign types
Seasonal marketing pattern analysis
Campaign duration effectiveness
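
Since `start_date` and `end_date` are datetime columns, campaign duration is a simple subtraction. The sketch below uses the first three rows of the preview above; the `duration_days` column name is chosen for the example, and the value is the difference between the two dates (the end day itself is not counted).

```python
import pandas as pd

# First three rows of the campaign_descriptions preview
campaign_sample = pd.DataFrame({
    "campaign_id": [1, 2, 3],
    "campaign_type": ["Type B", "Type B", "Type C"],
    "start_date": pd.to_datetime(["2017-03-03", "2017-03-08", "2017-03-13"]),
    "end_date": pd.to_datetime(["2017-04-09", "2017-04-09", "2017-05-08"]),
})

# Timedelta -> whole days between start and end
campaign_sample["duration_days"] = (
    campaign_sample["end_date"] - campaign_sample["start_date"]
).dt.days

# Average duration per campaign type
mean_duration = campaign_sample.groupby("campaign_type")["duration_days"].mean()

print(campaign_sample)
print(mean_duration)
```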

Coupons

Detailed inventory of all coupons distributed to customers, linking specific coupon offers to products and marketing campaigns.

📋 Understanding Coupon Distribution

This table lists all the coupons sent to customers as part of a campaign, as well as the products for which each coupon is redeemable. Some coupons are redeemable for multiple products.

Campaign-Specific Coupon Strategies:

Campaign Type A: Each customer received 16 coupons selected based on prior purchase behavior

Campaign Type B and Type C: All customers receive all coupons pertaining to that campaign
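
Because one coupon can be redeemable for several products, the same `coupon_upc` appears on multiple rows. A quick way to see this is to count products per coupon. In the sketch below, the first five rows come from the coupons preview in this section; the sixth row is an invented pairing added only to show a single-product coupon.

```python
import pandas as pd

coupons_sample = pd.DataFrame({
    # Last coupon/product pairing is made up for illustration
    "coupon_upc": [10000085207] * 5 + [51380041013],
    "product_id": [9676830, 9676943, 9676944, 9676947, 9677008, 1000050],
    "campaign_id": [26] * 6,
})

# Number of distinct products each coupon can be redeemed for, per campaign
products_per_coupon = (
    coupons_sample.groupby(["campaign_id", "coupon_upc"])["product_id"].nunique()
)

# Number of distinct coupons, as opposed to number of rows
distinct_coupons = coupons_sample["coupon_upc"].nunique()

print(products_per_coupon)
print(f"{len(coupons_sample)} rows, {distinct_coupons} distinct coupons")
```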

Dataset Size: 116,204 records, one per coupon-product-campaign combination (a coupon redeemable for several products appears once per product)

# Load and examine coupons dataset structure
coupons = data["coupons"]

print("=== COUPONS DATASET ===")
print(f"Shape: {coupons.shape}")

print(f"\nSample data (first 5 rows):")
display(coupons.head())

=== COUPONS DATASET ===
Shape: (116204, 3)

Sample data (first 5 rows):

	coupon_upc	product_id	campaign_id
0	10000085207	9676830	26
1	10000085207	9676943	26
2	10000085207	9676944	26
3	10000085207	9676947	26
4	10000085207	9677008	26

Key Variables:

Column	Data Type	Description
`coupon_upc`	int64	Coupon UPC identifier
`product_id`	int64	Associated product
`campaign_id`	int64	Associated campaign

Common Use Cases:

Coupon product targeting analysis
Campaign-coupon relationships
Product-specific promotion effectiveness

Coupon Redemptions

Actual coupon usage, recording which coupons were redeemed, by which households, under which campaign, and on what date. This dataset reveals customer response to promotional offers.

Dataset Size: 2,102 records representing coupon redemptions

# Load and examine coupon redemptions dataset structure
coupon_redemptions = data["coupon_redemptions"]

print("=== COUPON REDEMPTIONS DATASET ===")
print(f"Shape: {coupon_redemptions.shape}")

print(f"\nSample data (first 5 rows):")
display(coupon_redemptions.head())

=== COUPON REDEMPTIONS DATASET ===
Shape: (2102, 4)

Sample data (first 5 rows):

	household_id	coupon_upc	campaign_id	redemption_date
0	1029	51380041013	26	2017-01-01
1	1029	51380041313	26	2017-01-01
2	165	53377610033	26	2017-01-03
3	712	51380041013	26	2017-01-07
4	712	54300016033	26	2017-01-07

Key Variables:

Column	Data Type	Description
`household_id`	int64	Household identifier
`coupon_upc`	int64	Coupon UPC identifier
`campaign_id`	int64	Associated campaign
`redemption_date`	datetime64[ns]	Date of coupon use

Common Use Cases:

Coupon redemption rate calculation
Household coupon usage patterns
Campaign effectiveness measurement
Promotional ROI analysis

Promotions

Comprehensive record of promotional activities including in-store displays and mailer placements for products across all stores and time periods.

Dataset Size: 20,940,529 records representing promotional placements

# Load and examine promotions dataset structure (full table; only head() is displayed)
promotions = data["promotions"]

print("=== PROMOTIONS DATASET ===")
print(f"Shape: {promotions.shape}")

print(f"\nSample data (first 5 rows):")
display(promotions.head())

=== PROMOTIONS DATASET ===
Shape: (20940529, 5)

Sample data (first 5 rows):

	product_id	store_id	display_location	mailer_location	week
0	1000050	316	9	0	1
1	1000050	337	3	0	1
2	1000050	441	5	0	1
3	1000092	292	0	A	1
4	1000092	293	0	A	1

Key Variables:

Column	Data Type	Description
`product_id`	int64	Product identifier
`store_id`	int64	Store identifier
`display_location`	object	In-store display placement
`mailer_location`	object	Mailer placement code
`week`	int64	Week of promotion

Common Use Cases:

Promotional impact analysis
Display effectiveness measurement
Mailer response analysis
Store-level promotional performance
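
Promotions are recorded per product, store, and week, so linking them to purchases plausibly requires all three keys rather than `product_id` alone. The sketch below assumes that three-key join; the promotion rows come from the preview above, and the `purchases` frame is invented for the example.

```python
import pandas as pd

# Rows taken from the promotions preview
promos = pd.DataFrame({
    "product_id": [1000050, 1000050, 1000092],
    "store_id": [316, 337, 292],
    "display_location": ["9", "3", "0"],
    "mailer_location": ["0", "0", "A"],
    "week": [1, 1, 1],
})

# Made-up purchases, for illustration only (store 999 has no promotion row)
purchases = pd.DataFrame({
    "product_id": [1000050, 1000092, 1000092],
    "store_id": [316, 292, 999],
    "week": [1, 1, 1],
    "sales_value": [2.50, 1.25, 1.25],
})

# Left join keeps every purchase; _merge shows which ones found a promotion row
merged = purchases.merge(
    promos, on=["product_id", "store_id", "week"], how="left", indicator=True
)
merged["had_promo_record"] = merged["_merge"] == "both"

print(merged[["product_id", "store_id", "week", "had_promo_record"]])
```

With roughly 20.9 million promotion rows, filtering `promotions` to the products, stores, or weeks of interest before merging is likely to keep memory use manageable.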

Data Relationships and Joins

Understanding how datasets connect is crucial for comprehensive analysis:

[Figure: Complete Journey Dataset ERD, an entity relationship diagram showing connections between all 8 datasets]

# Show key relationships between datasets
print("=== KEY DATASET RELATIONSHIPS ===")
print("\n1. Primary Keys and Connections:")
print(f"   • Households: {demographics['household_id'].nunique():,} unique IDs")
print(f"   • Products: {products['product_id'].nunique():,} unique IDs")
print(f"   • Campaigns: {campaign_descriptions['campaign_id'].nunique()} unique IDs")
print(f"   • Stores: {transactions['store_id'].nunique()} unique IDs")

print("\n2. Common Join Patterns:")
print("   • transactions ⟵ demographics (on household_id)")
print("   • transactions ⟵ products (on product_id)")
print("   • campaigns ⟵ campaign_descriptions (on campaign_id)")
print("   • coupon_redemptions ⟵ coupons (on coupon_upc)")
print("   • promotions ⟵ products (on product_id)")

print("\n3. Data Coverage:")
print(f"   • Transactions cover {transactions['household_id'].nunique():,} households")
print(f"   • Campaigns reach {campaigns['household_id'].nunique():,} households")
print(f"   • Coupon redemptions by {coupon_redemptions['household_id'].nunique():,} households")

=== KEY DATASET RELATIONSHIPS ===

1. Primary Keys and Connections:
   • Households: 801 unique IDs
   • Products: 92,331 unique IDs
   • Campaigns: 27 unique IDs
   • Stores: 457 unique IDs

2. Common Join Patterns:
   • transactions ⟵ demographics (on household_id)
   • transactions ⟵ products (on product_id)
   • campaigns ⟵ campaign_descriptions (on campaign_id)
   • coupon_redemptions ⟵ coupons (on coupon_upc)
   • promotions ⟵ products (on product_id)

3. Data Coverage:
   • Transactions cover 2,469 households
   • Campaigns reach 1,559 households
   • Coupon redemptions by 410 households
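
The coverage numbers above matter for joins: transactions cover 2,469 households while demographics describe only 801, so an inner join on `household_id` would drop every purchase from households without a demographic profile. A left join with `indicator=True` makes that gap explicit. The sketch below uses small hand-made frames; the income values are invented.

```python
import pandas as pd

# Hand-made frames for illustration only
tx = pd.DataFrame({
    "household_id": [900, 900, 1228, 906],
    "sales_value": [0.50, 0.99, 1.43, 1.50],
})
demo = pd.DataFrame({
    "household_id": [900, 906],
    "income": ["35-49K", "50-74K"],  # made-up values
})

# Keep all transactions; flag which ones matched a demographics row
merged = tx.merge(
    demo, on="household_id", how="left", validate="many_to_one", indicator=True
)

# Share of transaction rows with demographic information
coverage = (merged["_merge"] == "both").mean()

# Households that appear in transactions but not in demographics
missing = set(merged.loc[merged["_merge"] == "left_only", "household_id"])

print(f"Coverage: {coverage:.0%}, households without demographics: {missing}")
```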

Next Steps

Now that you understand the dataset structures, explore these resources:

Dataset Summary Analysis - High-level analysis of each dataset
Getting Started - Basic usage patterns
Cookbook Examples - Detailed analysis tutorials
API Reference - Function documentation