Getting Started
This guide walks through installing the Complete Journey Python package, loading its datasets, and running a first analysis.
Installation
Standard Installation
Install the package using pip:
pip install completejourney_py
Development Installation
To work on the package itself or to try the latest features, install from a clone of the repository:
# Clone the repository
git clone https://github.com/cunningjames/completejourney_py.git
cd completejourney_py
# Install in development mode
pip install -e .
# Or with development dependencies
pip install -e ".[dev]"
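After either installation route, it can help to confirm that the package is visible to the interpreter you are actually running. The following is a minimal sketch using only the standard library. It assumes the installed distribution is named completejourney_py, matching the pip command above; installed_version is a hypothetical helper written for this guide, not part of the package.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string for dist_name, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Assumption: the distribution name matches the name used with pip above.
found = installed_version("completejourney_py")
print(found if found is not None else "completejourney_py is not installed")
```

If this reports that the package is missing right after a successful install, the most likely cause is that pip and the interpreter belong to different environments.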
Loading Data and Working with Specific Datasets
The main interface is the get_data() function. Called with no arguments, it loads every dataset and returns a dictionary keyed by dataset name:
In [1]:
from completejourney_py import get_data
# Load all datasets (returns a dictionary)
data = get_data()
print(f"Available datasets: {list(data.keys())}")
Available datasets: ['campaign_descriptions', 'coupons', 'promotions', 'campaigns', 'demographics', 'transactions', 'coupon_redemptions', 'products']
Working with Specific Datasets
In [2]:
# Load just the transactions
transactions = get_data("transactions")["transactions"]
print(f"Transactions shape: {transactions.shape}")
print(f"Transactions columns: {list(transactions.columns)}")
Transactions shape: (1469307, 11)
Transactions columns: ['household_id', 'store_id', 'basket_id', 'product_id', 'quantity', 'sales_value', 'retail_disc', 'coupon_disc', 'coupon_match_disc', 'week', 'transaction_timestamp']
In [3]:
# Load multiple specific datasets
sales_data = get_data(["transactions", "products", "demographics"])
transactions = sales_data["transactions"]
products = sales_data["products"]
demographics = sales_data["demographics"]
print("Loaded datasets:")
for name, df in sales_data.items():
    print(f" {name}: {df.shape[0]:,} rows, {df.shape[1]} columns")
Loaded datasets:
 transactions: 1,469,307 rows, 11 columns
 products: 92,331 rows, 7 columns
 demographics: 801 rows, 8 columns
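When several datasets are loaded at once, the same shape information can be collected into a single table instead of printed line by line. This is a sketch under the assumption, shown in the cells above, that get_data() returns a dictionary of pandas DataFrames keyed by dataset name. summarize_datasets is a hypothetical helper, and the frames below are tiny stand-ins rather than real data.

```python
import pandas as pd


def summarize_datasets(datasets):
    """Build one row per dataset with its row and column counts."""
    rows = [
        {"dataset": name, "rows": df.shape[0], "columns": df.shape[1]}
        for name, df in datasets.items()
    ]
    return pd.DataFrame(rows, columns=["dataset", "rows", "columns"])


# Stand-in frames; in practice pass the dictionary returned by get_data().
example = {
    "transactions": pd.DataFrame(
        {"household_id": [1, 2, 3], "sales_value": [1.0, 2.5, 4.0]}
    ),
    "products": pd.DataFrame({"product_id": [10, 11]}),
}
summary = summarize_datasets(example)
print(summary)
```

Returning a DataFrame rather than printing keeps the summary sortable and easy to export.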
Your First Analysis
Here's a simple analysis to get you started:
In [4]:
import pandas as pd
from completejourney_py import get_data
# Load the data we need
data = get_data(["transactions", "products"])
transactions = data["transactions"]
products = data["products"]
print("Data loaded successfully!")
print(f"Transactions: {transactions.shape}")
print(f"Products: {products.shape}")
Data loaded successfully!
Transactions: (1469307, 11)
Products: (92331, 7)
In [5]:
# Basic transaction summary
print("=== Transaction Summary ===")
print(f"Total transactions: {len(transactions):,}")
print(f"Unique households: {transactions['household_id'].nunique():,}")
# Extract date from transaction_timestamp for date range
transactions['date'] = transactions['transaction_timestamp'].dt.date
print(f"Date range: {transactions['date'].min()} to {transactions['date'].max()}")
print(f"Total sales value: ${transactions['sales_value'].sum():,.2f}")
print(f"Average transaction value: ${transactions['sales_value'].mean():.2f}")
=== Transaction Summary ===
Total transactions: 1,469,307
Unique households: 2,469
Date range: 2017-01-01 to 2018-01-01
Total sales value: $4,596,039.58
Average transaction value: $3.13
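The average above is taken over rows of the transactions table. If each row records one product within a shopping trip, as the basket_id column suggests (an assumption worth confirming in the Dataset Overview), then a per-trip average needs the rows grouped by basket first. The sketch below shows one way to do that; basket_summary is a hypothetical helper and the sample rows are made up for illustration.

```python
import pandas as pd


def basket_summary(transactions):
    """Return the number of baskets and the mean total sales value per basket."""
    # Sum line-item sales within each basket, then average across baskets.
    basket_totals = transactions.groupby("basket_id")["sales_value"].sum()
    return {
        "baskets": int(basket_totals.size),
        "avg_basket_value": float(basket_totals.mean()),
    }


# Made-up rows: three baskets with two, one, and three line items.
sample = pd.DataFrame(
    {
        "basket_id": [100, 100, 101, 102, 102, 102],
        "sales_value": [2.0, 3.0, 5.0, 1.0, 1.5, 2.5],
    }
)
result = basket_summary(sample)
print(f"Baskets: {result['baskets']:,}")
print(f"Average basket value: ${result['avg_basket_value']:.2f}")
```

On the real data, passing the transactions DataFrame loaded earlier should work unchanged, provided the column names match those listed above.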
In [6]:
# Find top-selling products by revenue
top_products = (
    transactions
    .groupby('product_id', as_index=False)
    .agg({'sales_value': 'sum'})
    .sort_values(by='sales_value', ascending=False)
    .head(10)
)
print("Top 10 Product IDs by Sales Value:")
top_products
Top 10 Product IDs by Sales Value:
Out[6]:
| | product_id | sales_value |
|---|---|---|
| 42369 | 6534178 | 303116.02 |
| 42340 | 6533889 | 27467.61 |
| 23098 | 1029743 | 22729.71 |
| 42365 | 6534166 | 20477.54 |
| 42333 | 6533765 | 19451.66 |
| 27971 | 1082185 | 17219.59 |
| 12526 | 916122 | 16120.01 |
| 30193 | 1106523 | 15629.95 |
| 19879 | 995242 | 15602.59 |
| 39480 | 5569230 | 13410.46 |
In [7]:
# Join with product information to get meaningful names
top_products_info = (
    top_products
    .reset_index(drop=True)  # discard the old row labels instead of keeping them as a column
    .merge(products[['product_id', 'product_category', 'brand']],
           on='product_id')
)
print("\n=== Top 10 Products by Sales Value ===")
for i, row in top_products_info.iterrows():
    print(f"{i+1:2d}. ${row['sales_value']:>8,.0f} - {row['product_category']} ({row['brand']})")
=== Top 10 Products by Sales Value ===
 1. $ 303,116 - COUPON/MISC ITEMS (Private)
 2. $  27,468 - COUPON/MISC ITEMS (Private)
 3. $  22,730 - FLUID MILK PRODUCTS (Private)
 4. $  20,478 - COUPON/MISC ITEMS (Private)
 5. $  19,452 - FUEL (Private)
 6. $  17,220 - TROPICAL FRUIT (National)
 7. $  16,120 - CHICKEN (National)
 8. $  15,630 - FLUID MILK PRODUCTS (Private)
 9. $  15,603 - FLUID MILK PRODUCTS (Private)
10. $  13,410 - SOFT DRINKS (National)
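The join above uses pandas' default inner merge, which silently drops any sales row whose product_id has no match in the products table. A more defensive variant is sketched below using the how, validate, and indicator parameters of DataFrame.merge. attach_product_info is a hypothetical helper, and the two small frames are invented so the example runs on its own.

```python
import pandas as pd


def attach_product_info(sales, products, columns):
    """Left-join product columns onto sales, keeping unmatched sales rows."""
    lookup = products[["product_id", *columns]]
    return sales.merge(
        lookup,
        on="product_id",
        how="left",              # keep sales rows with no product metadata
        validate="many_to_one",  # raise MergeError if product_id repeats in lookup
        indicator=True,          # adds a _merge column: "both" or "left_only"
    )


# Invented data: product 3 has sales but no catalog entry.
sales = pd.DataFrame({"product_id": [1, 2, 3], "sales_value": [30.0, 20.0, 10.0]})
catalog = pd.DataFrame(
    {
        "product_id": [1, 2],
        "product_category": ["FUEL", "CHICKEN"],
        "brand": ["Private", "National"],
    }
)
joined = attach_product_info(sales, catalog, ["product_category", "brand"])
unmatched = joined[joined["_merge"] == "left_only"]
print(joined)
print(f"Rows without product info: {len(unmatched)}")
```

Checking the unmatched count before reporting results makes it clear whether any revenue is missing descriptive metadata.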
In [8]:
# Let's also look at the product information to understand what we're working with
print("Sample of products data:")
products[['product_id', 'product_category', 'brand', 'department']].head(10)
Sample of products data:
Out[8]:
| | product_id | product_category | brand | department |
|---|---|---|---|---|
| 0 | 25671 | FRZN ICE | National | GROCERY |
| 1 | 26081 | None | National | MISCELLANEOUS |
| 2 | 26093 | BREAD | Private | PASTRY |
| 3 | 26190 | FRUIT - SHELF STABLE | Private | GROCERY |
| 4 | 26355 | COOKIES/CONES | Private | GROCERY |
| 5 | 26426 | SPICES & EXTRACTS | Private | GROCERY |
| 6 | 26540 | COOKIES/CONES | Private | GROCERY |
| 7 | 26601 | VITAMINS | Private | DRUG GM |
| 8 | 26636 | BREAKFAST SWEETS | Private | PASTRY |
| 9 | 26691 | PNT BTR/JELLY/JAMS | Private | GROCERY |
In [9]:
# Quick exploration: What departments are represented in our top products?
top_products_detailed = top_products_info.merge(
    products[['product_id', 'department']],
    on='product_id'
)
print("\nDepartments represented in top 10 products:")
dept_counts = top_products_detailed['department'].value_counts()
for dept, count in dept_counts.items():
    print(f" {dept}: {count} products")
Departments represented in top 10 products:
 GROCERY: 4 products
 FUEL: 2 products
 MISCELLANEOUS: 2 products
 PRODUCE: 1 products
 MEAT: 1 products
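Counting departments among ten products says little about where revenue comes from overall. As a possible next step, the same join can be applied to every transaction row and the sales summed per department. This is a sketch only: sales_by_department is a hypothetical helper, the sample frames are invented, and the column names are assumed to match those shown in the outputs above.

```python
import pandas as pd


def sales_by_department(transactions, products):
    """Total sales value per department, largest first."""
    # Inner join: rows whose product_id is missing from products are dropped.
    merged = transactions.merge(
        products[["product_id", "department"]], on="product_id"
    )
    return (
        merged.groupby("department")["sales_value"]
        .sum()
        .sort_values(ascending=False)
    )


# Invented data: products 1 and 3 are GROCERY, product 2 is FUEL.
tx = pd.DataFrame(
    {"product_id": [1, 1, 2, 3], "sales_value": [2.0, 3.0, 10.0, 4.0]}
)
catalog = pd.DataFrame(
    {"product_id": [1, 2, 3], "department": ["GROCERY", "FUEL", "GROCERY"]}
)
totals = sales_by_department(tx, catalog)
for dept, value in totals.items():
    print(f" {dept}: ${value:,.2f}")
```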
Next Steps
Now that you're familiar with the basics, explore these resources:
- Dataset Overview - Learn about the structure and variables in each dataset
- Cookbook Examples - Step-by-step analysis tutorials covering shopping frequency, coupon analysis, and traffic patterns
- API Reference - Complete function documentation
Getting Help
If you encounter issues:
- Check the API documentation
- Look at the cookbook examples
- Visit the GitHub repository
- Open an issue for bugs or feature requests