API Reference: get_data
The get_data function is the primary interface for loading Complete Journey datasets.
get_data(which: Optional[Union[str, Sequence[str]]] = None) -> Dict[str, pd.DataFrame]
Load datasets from the Complete Journey grocery transaction data.
The Complete Journey dataset contains grocery store shopping transactions from 2,469 households over one year, provided by 84.51°. Includes transaction records, demographics, marketing campaigns, and coupon data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
which
|
Optional[Union[str, Sequence[str]]]
|
Specify which dataset(s) to load. Options: - None (default): Load all 8 datasets - str: Load single dataset by name - Sequence[str]: Load multiple datasets by name Available datasets:
- 'transactions': Purchase records (1.47M records)
- 'demographics': Household demographic information
- 'products': Product metadata and categories |
None
|
Returns:
| Type | Description |
|---|---|
Dict[str, DataFrame]
|
Dict[str, pd.DataFrame]: Dictionary mapping dataset names to pandas DataFrames. Each DataFrame contains the requested dataset with appropriate column names and data types. |
Raises:
| Type | Description |
|---|---|
FileNotFoundError
|
If specified dataset name doesn't exist. |
TypeError
|
If 'which' parameter is not None, str, or sequence of strings. |
Examples:
Load all datasets:
Load single dataset:
>>> transactions = get_data("transactions")["transactions"]
>>> transactions.shape[0] > 1000000
True
Load multiple datasets:
>>> sales_data = get_data(["transactions", "products", "demographics"])
>>> list(sales_data.keys())
['transactions', 'products', 'demographics']
Note
Data source: 84.51° "The Complete Journey" dataset Available at: http://www.8451.com/area51/
Usage Examples
Load All Datasets
from completejourney_py import get_data
# Load all 8 datasets
data = get_data()
print(f"Available datasets: {list(data.keys())}")
# Output: ['campaign_descriptions', 'campaigns', 'coupons', 'coupon_redemptions',
# 'demographics', 'products', 'promotions', 'transactions']
Load Single Dataset
# Load only transactions
transactions_dict = get_data("transactions")
transactions = transactions_dict["transactions"]
print(f"Transactions shape: {transactions.shape}")
print(f"Columns: {list(transactions.columns)}")
Load Multiple Datasets
# Load specific datasets for sales analysis
sales_data = get_data(["transactions", "products", "demographics"])
transactions = sales_data["transactions"]
products = sales_data["products"]
demographics = sales_data["demographics"]
# Join for comprehensive analysis
transaction_details = (transactions
.merge(products, on='product_id')
.merge(demographics, on='household_id'))
Load with Different Input Types
# Using list
data1 = get_data(["transactions", "demographics"])
# Using tuple
data2 = get_data(("transactions", "demographics"))
# Using single string
data3 = get_data("transactions")
# Empty list returns empty dictionary
data4 = get_data([])
assert len(data4) == 0
Available Datasets
| Dataset Name | Records | Description |
|---|---|---|
transactions |
1,469,307 | Individual product purchases with prices, quantities, and timing |
demographics |
801 | Household characteristics including age, income, and composition |
products |
92,353 | Product metadata with categories, brands, and descriptions |
campaigns |
6,627 | Marketing campaigns received by households |
campaign_descriptions |
27 | Campaign metadata and details |
promotions |
20,940,529 | Product placement in mailers and stores |
coupons |
116,364 | Coupon metadata with values and restrictions |
coupon_redemptions |
2,318 | Individual coupon usage records |
Error Handling
The function will raise exceptions for invalid inputs:
# Invalid dataset name
try:
get_data("invalid_dataset")
except FileNotFoundError:
print("Dataset not found")
# Invalid parameter type
try:
get_data(123)
except (TypeError, AttributeError):
print("Invalid parameter type")
# Mixed valid/invalid names
try:
get_data(["transactions", "invalid_dataset"])
except FileNotFoundError:
print("One or more datasets not found")
Return Value Details
The function always returns a dictionary where:
- Keys: Dataset names (strings)
- Values: pandas DataFrames containing the dataset
- Structure: Each DataFrame has appropriate column names and data types
data = get_data(["transactions", "demographics"])
# Check return structure
assert isinstance(data, dict)
assert isinstance(data["transactions"], pd.DataFrame)
assert isinstance(data["demographics"], pd.DataFrame)
# Verify data quality
transactions = data["transactions"]
assert len(transactions) > 0 # Has data
assert len(transactions.columns) > 0 # Has columns
assert transactions.index is not None # Has valid index
Performance Considerations
- Large datasets: The
promotionsdataset is very large (20.9M records). Consider filtering after loading if memory is limited. - Selective loading: Load only the datasets you need to optimize memory usage and loading time.
- Caching: Consider storing frequently used datasets locally for repeated analysis.
# Memory-efficient approach for large analysis
# Load core datasets first
core_data = get_data(["transactions", "products", "demographics"])
# Load additional datasets only if needed
if analysis_requires_promotions:
promotions = get_data("promotions")["promotions"]
Common Patterns
Basic Analysis Setup
# Standard setup for transaction analysis
data = get_data(["transactions", "products", "demographics"])
transactions = data["transactions"]
products = data["products"]
demographics = data["demographics"]
# Create enriched transaction dataset
enriched_transactions = (transactions
.merge(products[['product_id', 'commodity_desc', 'department']], on='product_id')
.merge(demographics[['household_id', 'income_desc', 'household_size_desc']], on='household_id'))
Campaign Analysis Setup
# Setup for marketing campaign analysis
campaign_data = get_data([
"campaigns",
"campaign_descriptions",
"demographics",
"transactions"
])
# Create campaign analysis dataset
campaign_analysis = (campaign_data["campaigns"]
.merge(campaign_data["campaign_descriptions"], on='campaign')
.merge(campaign_data["demographics"], on='household_id'))
Coupon Analysis Setup
# Setup for coupon effectiveness analysis
coupon_data = get_data([
"coupons",
"coupon_redemptions",
"products",
"demographics"
])
# Create coupon analysis dataset
coupon_analysis = (coupon_data["coupon_redemptions"]
.merge(coupon_data["coupons"], on='coupon_upc')
.merge(coupon_data["products"], on='product_id')
.merge(coupon_data["demographics"], on='household_id'))
Data Source Information
All datasets are sourced from 84.51° "The Complete Journey" project:
- Provider: 84.51°
- Format: Parquet files for efficient loading and cross-platform compatibility
- Scope: One year of grocery shopping data from 801 households
- License: Available for research and educational purposes
See Also
- Dataset Overview - Detailed information about each dataset
- Getting Started - Basic usage patterns
- Cookbook Examples - Step-by-step analysis tutorials