MMM Example Notebook#
In this notebook we work out a simulated example to showcase the Media Mix Model (MMM) API from pymc-marketing. This package provides a pymc implementation of the MMM presented in the paper Jin, Yuxue, et al. “Bayesian methods for media mix modeling with carryover and shape effects.” (2017). We work with synthetic data as we want to do parameter recovery to better understand the model assumptions. That is, we explicitly set values for our adstock and saturation parameters (see model specification below) and recover them back from the model. The data generation process is an adaptation of the blog post “Media Effect Estimation with PyMC: Adstock, Saturation & Diminishing Returns” by Juan Orduz.
Business Problem#
Before jumping into the data, let’s first define the business problem we are trying to solve. We are a marketing agency and we want to optimize the marketing budget of a client. We have access to the following data:
Sales data: weekly sales of the client.
Media spend data: weekly spend on different media channels (e.g. TV, radio, online, etc.). In this example we consider 2 media channels: \(x_{1}\) and \(x_{2}\).
Domain knowledge:
We know that there has been a positive sales trend, which we believe comes from strong economic growth.
We also know that there is a yearly seasonality effect.
In addition, we were informed about two outliers in the data during the weeks of 2019-05-13 and 2020-09-14.
What do we mean by optimize the marketing budget? We want to find the optimal media mix that maximizes sales. In order to do so, we need to understand the mechanism by which the media spend on each channel affects sales. In other words, we need to understand the media contribution. The main challenge is that the direct cost signal does not translate into a linear contribution. For example, a \(10\%\) increase in channel \(x_{1}\) spend does not necessarily translate into a \(10\%\) increase in sales. This can be explained by two phenomena:
On the one hand, there is a carry-over effect. That is, the effect of spend on sales is not instantaneous but accumulates over time.
In addition, there is a saturation effect. That is, the effect of spend on sales is not linear but saturates at some point.
In this example we will illustrate how we can use pymc-marketing
to model these effects.
In the next section we describe a general framework for modeling media effects.
Model Specification#
In pymc-marketing
we provide an API for a Bayesian media mix model (MMM) specification following Jin, Yuxue, et al. “Bayesian methods for media mix modeling with carryover and shape effects.” (2017). Concretely, given a time series target variable \(y_{t}\) (e.g. sales or conversions), media variables \(x_{m, t}\) (e.g. impressions, clicks or costs) and a set of control covariates \(z_{c, t}\) (e.g. holidays, special events) we consider a linear model of the form

\[
y_{t} = \alpha + \sum_{m=1}^{M}\beta_{m}f(x_{m, t}) + \sum_{c=1}^{C}\gamma_{c}z_{c, t} + \varepsilon_{t},
\]
where \(\alpha\) is the intercept, \(f\) is a media transformation function and \(\varepsilon_{t}\) is the error term, which we assume is normally distributed. The function \(f\) encodes the contribution of media on the target variable. Typically we consider two types of transformation: adstock (carry-over) and saturation effects.
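For reference, here is a sketch of the functional forms used in this notebook, which correspond (up to the optional normalization) to the geometric_adstock and logistic_saturation functions from pymc_marketing.mmm.transformers:

\[
\text{adstock}(x_{m, t}; \alpha_{m}) = \frac{\sum_{l=0}^{L - 1} \alpha_{m}^{l}\, x_{m, t - l}}{\sum_{l=0}^{L - 1} \alpha_{m}^{l}},
\qquad
\text{saturation}(x; \lambda_{m}) = \frac{1 - e^{-\lambda_{m} x}}{1 + e^{-\lambda_{m} x}},
\]

where \(0 < \alpha_{m} < 1\) is the adstock (carry-over) rate, \(L\) is the maximum lag and \(\lambda_{m} > 0\) controls the saturation.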
References:#
Jin, Yuxue, et al. “Bayesian methods for media mix modeling with carryover and shape effects.” (2017).
Orduz, Juan. “Media Effect Estimation with PyMC: Adstock, Saturation & Diminishing Returns.”
Part I: Data Generation Process#
In Part I of this notebook we focus on the data generating process. That is, we want to construct the target variable \(y_{t}\) (sales) by adding each of the components described in the Business Problem section.
Prepare Notebook#
import warnings
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm
import seaborn as sns
from pymc_marketing.mmm.transformers import geometric_adstock, logistic_saturation
from pymc_marketing.mmm.delayed_saturated_mmm import DelayedSaturatedMMM
warnings.filterwarnings("ignore")
az.style.use("arviz-darkgrid")
plt.rcParams["figure.figsize"] = [12, 7]
plt.rcParams["figure.dpi"] = 100
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = "retina"
Generate Data#
1. Date Range#
First we set a time range for our data. We consider about three and a half years of data at weekly granularity.
seed: int = sum(map(ord, "mmm"))
rng: np.random.Generator = np.random.default_rng(seed=seed)
# date range
min_date = pd.to_datetime("2018-04-01")
max_date = pd.to_datetime("2021-09-01")
df = pd.DataFrame(
data={"date_week": pd.date_range(start=min_date, end=max_date, freq="W-MON")}
).assign(
year=lambda x: x["date_week"].dt.year,
month=lambda x: x["date_week"].dt.month,
dayofyear=lambda x: x["date_week"].dt.dayofyear,
)
n = df.shape[0]
print(f"Number of observations: {n}")
Number of observations: 179
2. Media Costs Data#
Now we generate synthetic data from two channels \(x_1\) and \(x_2\). We refer to it as the raw signal as it is going to be the input at the modeling phase. We expect the contribution of each channel to be different, based on the carryover and saturation parameters.
Raw Signal
# media data
x1 = rng.uniform(low=0.0, high=1.0, size=n)
df["x1"] = np.where(x1 > 0.9, x1, x1 / 2)
x2 = rng.uniform(low=0.0, high=1.0, size=n)
df["x2"] = np.where(x2 > 0.8, x2, 0)
fig, ax = plt.subplots(
nrows=2, ncols=1, figsize=(10, 7), sharex=True, sharey=True, layout="constrained"
)
sns.lineplot(x="date_week", y="x1", data=df, color="C0", ax=ax[0])
sns.lineplot(x="date_week", y="x2", data=df, color="C1", ax=ax[1])
ax[1].set(xlabel="date")
fig.suptitle("Media Costs Data", fontsize=16);

Remark: By design, \(x_{1}\) should resemble a typical paid social channel and \(x_{2}\) an offline (e.g. TV) spend time series.
Effect Signal
Next, we pass the raw signal through the two transformations: first the geometric adstock (carryover effect) and then the logistic saturation. Note that we set the parameters ourselves, but we will recover them back from the model.
Let’s start with the adstock transformation. We set the adstock parameter \(0 < \alpha < 1\) to be \(0.4\) and \(0.2\) for \(x_1\) and \(x_2\) respectively. We set a maximum lag effect of \(8\) weeks.
# apply geometric adstock transformation
alpha1: float = 0.4
alpha2: float = 0.2
df["x1_adstock"] = (
geometric_adstock(x=df["x1"].to_numpy(), alpha=alpha1, l_max=8, normalize=True)
.eval()
.flatten()
)
df["x2_adstock"] = (
geometric_adstock(x=df["x2"].to_numpy(), alpha=alpha2, l_max=8, normalize=True)
.eval()
.flatten()
)
Next, we compose the resulting adstock signals with the logistic saturation function. We set the parameter \(\lambda > 0\) to be \(4\) and \(3\) for \(x_{1}\) and \(x_{2}\) respectively.
# apply saturation transformation
lam1: float = 4.0
lam2: float = 3.0
df["x1_adstock_saturated"] = logistic_saturation(
x=df["x1_adstock"].to_numpy(), lam=lam1
).eval()
df["x2_adstock_saturated"] = logistic_saturation(
x=df["x2_adstock"].to_numpy(), lam=lam2
).eval()
We can now visualize the effect signal for each channel after each transformation:
fig, ax = plt.subplots(
nrows=3, ncols=2, figsize=(16, 9), sharex=True, sharey=False, layout="constrained"
)
sns.lineplot(x="date_week", y="x1", data=df, color="C0", ax=ax[0, 0])
sns.lineplot(x="date_week", y="x2", data=df, color="C1", ax=ax[0, 1])
sns.lineplot(x="date_week", y="x1_adstock", data=df, color="C0", ax=ax[1, 0])
sns.lineplot(x="date_week", y="x2_adstock", data=df, color="C1", ax=ax[1, 1])
sns.lineplot(x="date_week", y="x1_adstock_saturated", data=df, color="C0", ax=ax[2, 0])
sns.lineplot(x="date_week", y="x2_adstock_saturated", data=df, color="C1", ax=ax[2, 1])
fig.suptitle("Media Costs Data - Transformed", fontsize=16);

3. Trend & Seasonal Components#
Now we add synthetic trend and seasonal components to the effect signal.
df["trend"] = (np.linspace(start=0.0, stop=50, num=n) + 10) ** (1 / 4) - 1
df["cs"] = -np.sin(2 * 2 * np.pi * df["dayofyear"] / 365.5)
df["cc"] = np.cos(1 * 2 * np.pi * df["dayofyear"] / 365.5)
df["seasonality"] = 0.5 * (df["cs"] + df["cc"])
fig, ax = plt.subplots()
sns.lineplot(x="date_week", y="trend", color="C2", label="trend", data=df, ax=ax)
sns.lineplot(
x="date_week", y="seasonality", color="C3", label="seasonality", data=df, ax=ax
)
ax.legend(loc="upper left")
ax.set(title="Trend & Seasonality Components", xlabel="date", ylabel=None);

4. Control Variables#
We add two events where there was a remarkable peak in our target variable. We assume they are independent and not seasonal (e.g. launch of a particular product).
df["event_1"] = (df["date_week"] == "2019-05-13").astype(float)
df["event_2"] = (df["date_week"] == "2020-09-14").astype(float)
5. Target Variable#
Finally, we define the target variable (sales) \(y\). We assume it is a linear combination of the effect signal, the trend and the seasonal components, plus the two events and an intercept. We also add some Gaussian noise.
df["intercept"] = 2.0
df["epsilon"] = rng.normal(loc=0.0, scale=0.25, size=n)
amplitude = 1
beta_1 = 3.0
beta_2 = 2.0
betas = [beta_1, beta_2]
df["y"] = amplitude * (
df["intercept"]
+ df["trend"]
+ df["seasonality"]
+ 1.5 * df["event_1"]
+ 2.5 * df["event_2"]
+ beta_1 * df["x1_adstock_saturated"]
+ beta_2 * df["x2_adstock_saturated"]
+ df["epsilon"]
)
fig, ax = plt.subplots()
sns.lineplot(x="date_week", y="y", color="black", data=df, ax=ax)
ax.set(title="Sales (Target Variable)", xlabel="date", ylabel="y (thousands)");

6. Media Contribution Interpretation#
From the data generating process we can compute the relative contribution of each channel to the target variable. We will recover these values back from the model.
contribution_share_x1: float = (beta_1 * df["x1_adstock_saturated"]).sum() / (
beta_1 * df["x1_adstock_saturated"] + beta_2 * df["x2_adstock_saturated"]
).sum()
contribution_share_x2: float = (beta_2 * df["x2_adstock_saturated"]).sum() / (
beta_1 * df["x1_adstock_saturated"] + beta_2 * df["x2_adstock_saturated"]
).sum()
print(f"Contribution Share of x1: {contribution_share_x1:.2f}")
print(f"Contribution Share of x2: {contribution_share_x2:.2f}")
Contribution Share of x1: 0.81
Contribution Share of x2: 0.19
We can obtain the contribution plots for each channel where we clearly see the effect of the adstock and saturation transformations.
fig, ax = plt.subplots(
nrows=2, ncols=1, figsize=(12, 8), sharex=True, sharey=False, layout="constrained"
)
for i, x in enumerate(["x1", "x2"]):
sns.scatterplot(
x=df[x],
y=amplitude * betas[i] * df[f"{x}_adstock_saturated"],
color=f"C{i}",
ax=ax[i],
)
ax[i].set(
title=f"$x_{i + 1}$ contribution",
ylabel=f"$\\beta_{i + 1} \\cdot x_{i + 1}$ adstocked & saturated",
xlabel="x",
)

This plot shows some interesting aspects of the media contribution:
The adstock effect is reflected in the non-zero contribution of the channel even when the spend is zero.
One can clearly see the saturation effect, as the contribution growth (slope) decreases as the spend increases.
As we will see in Part II of this notebook, we can recover these plots from the model!
We see that channel \(x_{1}\) has a higher contribution than \(x_{2}\). This could be explained by the fact that there was more spend in channel \(x_{1}\) than in channel \(x_{2}\):
fig, ax = plt.subplots(figsize=(7, 5))
df[["x1", "x2"]].sum().plot(kind="bar", color=["C0", "C1"], ax=ax)
ax.set(title="Total Media Spend", xlabel="Media Channel", ylabel="Costs (thousands)");

However, one is usually interested not only in the contribution itself but also in the Return on Ad Spend (ROAS), that is, the contribution divided by the cost. We can compute the ROAS for each channel as follows:
roas_1 = (amplitude * beta_1 * df["x1_adstock_saturated"]).sum() / df["x1"].sum()
roas_2 = (amplitude * beta_2 * df["x2_adstock_saturated"]).sum() / df["x2"].sum()
fig, ax = plt.subplots(figsize=(7, 5))
(
pd.Series(data=[roas_1, roas_2], index=["x1", "x2"]).plot(
kind="bar", color=["C0", "C1"]
)
)
ax.set(title="ROAS (Approximation)", xlabel="Media Channel", ylabel="ROAS");

That is, channel \(x_{1}\) seems to be more efficient than channel \(x_{2}\).
Remark: We recommend reading Section 4.1 in Jin, Yuxue, et al. “Bayesian methods for media mix modeling with carryover and shape effects.” (2017) for a detailed explanation of the ROAS (and mROAS). In particular:
If we transform our target variable \(y\) (e.g. with a log transformation), one needs to be careful with the ROAS computation as setting the spend to zero does not commute with the transformation.
One has to be careful with the adstock effect: a carryover period must be included to fully account for the effect of the spend. The ROAS estimation above is an approximation.
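As a rough sketch of the idea in Section 4.1 (with simplified notation, not the paper’s exact formulation), the ROAS of channel \(m\) over a change period \([t_{0}, t_{1}]\) compares the model predictions with and without that channel’s spend, including \(L\) extra weeks to capture the carryover:

\[
\text{ROAS}_{m} \approx \frac{\sum_{t = t_{0}}^{t_{1} + L} \left( \hat{y}_{t}(x_{m}) - \hat{y}_{t}(x_{m} \to 0) \right)}{\sum_{t = t_{0}}^{t_{1}} x_{m, t}},
\]

where \(x_{m} \to 0\) denotes setting the spend of channel \(m\) to zero during the change period. The computation above instead divides the total fitted contribution by the total spend over the whole period, which is why we treat it as an approximation.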
7. Data Output#
We of course will not have all of these features in our real data. Let's keep only the features we will use for modeling:
columns_to_keep = [
"date_week",
"y",
"x1",
"x2",
"event_1",
"event_2",
"dayofyear",
]
data = df[columns_to_keep].copy()
data.head()
| | date_week | y | x1 | x2 | event_1 | event_2 | dayofyear |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2018-04-02 | 3.984662 | 0.318580 | 0.0 | 0.0 | 0.0 | 92 |
| 1 | 2018-04-09 | 3.762872 | 0.112388 | 0.0 | 0.0 | 0.0 | 99 |
| 2 | 2018-04-16 | 4.466967 | 0.292400 | 0.0 | 0.0 | 0.0 | 106 |
| 3 | 2018-04-23 | 3.864219 | 0.071399 | 0.0 | 0.0 | 0.0 | 113 |
| 4 | 2018-04-30 | 4.441625 | 0.386745 | 0.0 | 0.0 | 0.0 | 120 |
Part II: Modeling#
In this second part, we focus on the modeling process. We will use the data generated in Part I.
1. Feature Engineering#
Assuming we did an EDA and have a good understanding of the data (we did not do it here as we generated the data ourselves, but please never skip the EDA!), we can start building our model. Two things we immediately see are the seasonality and the trend components. We can generate features ourselves as control variables, for example using a uniformly increasing straight line to model the trend component. In addition, we include dummy variables to encode the event_1
and event_2
contributions.
For the seasonality component we use Fourier modes (as in Prophet). We do not need to add the Fourier modes by hand as they are handled by the model API through the yearly_seasonality
argument (see below). We use 4 modes for the seasonality component.
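For intuition only, here is a minimal sketch of what such yearly Fourier features could look like (the actual construction, including the exact periodicity constant, is handled internally by the API and may differ slightly):
# Sketch (illustration only): two sine/cosine pairs (4 Fourier modes) built from the day of the year.
def yearly_fourier_features(dayofyear, n_order: int = 2) -> pd.DataFrame:
    periods = dayofyear / 365.25
    return pd.DataFrame(
        {
            f"{func}_order_{order}": getattr(np, func)(2 * np.pi * periods * order)
            for order in range(1, n_order + 1)
            for func in ("sin", "cos")
        }
    )

yearly_fourier_features(data["dayofyear"]).head()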
# trend feature
data["t"] = range(n)
data.head()
| | date_week | y | x1 | x2 | event_1 | event_2 | dayofyear | t |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2018-04-02 | 3.984662 | 0.318580 | 0.0 | 0.0 | 0.0 | 92 | 0 |
| 1 | 2018-04-09 | 3.762872 | 0.112388 | 0.0 | 0.0 | 0.0 | 99 | 1 |
| 2 | 2018-04-16 | 4.466967 | 0.292400 | 0.0 | 0.0 | 0.0 | 106 | 2 |
| 3 | 2018-04-23 | 3.864219 | 0.071399 | 0.0 | 0.0 | 0.0 | 113 | 3 |
| 4 | 2018-04-30 | 4.441625 | 0.386745 | 0.0 | 0.0 | 0.0 | 120 | 4 |
2. Model Specification#
We can specify the model structure using the DelayedSaturatedMMM
class. This class handles a lot of internal boilerplate code for us, such as scaling the data (see details below), and provides handy diagnostics and reporting plots. One great feature is that we can specify the channel prior distributions ourselves, which is a fundamental component of the Bayesian workflow as we can incorporate our prior knowledge into the model. This is one of the most important advantages of using a Bayesian approach. Let's see how we can do it.
As we do not know much more about the channels, we start with a simple heuristic:
The channel contributions should be positive, so we can for example use a HalfNormal distribution as prior. We need to set the sigma parameter per channel. The higher the sigma, the more “freedom” it has to fit the data. To specify sigma we can use the following point:
We expect channels where we spend the most to have more attributed sales, before seeing the data. This is a very reasonable assumption (note that we are not imposing anything at the level of efficiency!).
How to incorporate this heuristic into the model? To begin with, it is important to note that the DelayedSaturatedMMM class scales the target and input variables through a MaxAbsScaler transformer from scikit-learn, so it is important to specify the priors in the scaled space (i.e. between 0 and 1). One way to do it is to use the spend share as the sigma parameter for the HalfNormal distribution. We can actually add a scaling factor to take into account the support of the distribution.
First, let’s compute the share of spend per channel:
total_spend_per_channel = data[["x1", "x2"]].sum(axis=0)
spend_share = total_spend_per_channel / total_spend_per_channel.sum()
spend_share
x1 0.65632
x2 0.34368
dtype: float64
Next, we specify the sigma
parameter per channel:
# The scale necessary to make a HalfNormal distribution have unit variance
HALFNORMAL_SCALE = 1 / np.sqrt(1 - 2 / np.pi)
n_channels = 2
prior_sigma = HALFNORMAL_SCALE * n_channels * spend_share.to_numpy()
prior_sigma.tolist()
[2.1775326025486734, 1.1402608773919387]
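As a quick sanity check (a small sketch, not part of the original workflow), we can verify numerically that a HalfNormal distribution with scale HALFNORMAL_SCALE has approximately unit standard deviation, using the fact that the absolute value of a zero-mean normal variable is half-normal:
# Sanity check (sketch): |N(0, HALFNORMAL_SCALE)| is HalfNormal(HALFNORMAL_SCALE),
# whose standard deviation should be close to 1 by construction.
halfnormal_samples = np.abs(rng.normal(loc=0.0, scale=HALFNORMAL_SCALE, size=100_000))
print(f"Empirical std: {halfnormal_samples.std():.3f}")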
DelayedSaturatedMMM follows the sklearn convention, so we need to split our data into X (predictors) and y (target value).
X = data.drop("y",axis=1)
y = data["y"]
model_config is an optional parameter that allows user-defined priors to be applied to the model. Each model_config entry contains a string key which corresponds to the name of a distribution registered inside our model. Its value is a dictionary containing key-value pairs describing the input parameters of the specific distribution, and a 'dims' field which allows specifying the dimensions of that distribution. In case you wish to apply your own priors, just copy the default model_config shown below, modify its content and pass it to DelayedSaturatedMMM.
If you're not sure how to define your own priors, you can use the default_model_config property of DelayedSaturatedMMM to see the required structure:
dummy_model = DelayedSaturatedMMM(date_column = "", channel_columns= "", adstock_max_lag = 4)
dummy_model.default_model_config
{'intercept': {'mu': 0, 'sigma': 2},
'beta_channel': {'sigma': 2, 'dims': ('channel',)},
'alpha': {'alpha': 1, 'beta': 3, 'dims': ('channel',)},
'lam': {'alpha': 3, 'beta': 1, 'dims': ('channel',)},
'sigma': {'sigma': 2},
'gamma_control': {'mu': 0, 'sigma': 2, 'dims': ('control',)},
'mu': {'dims': ('date',)},
'likelihood': {'dims': ('date',)},
'gamma_fourier': {'mu': 0, 'b': 1, 'dims': 'fourier_mode'}}
You can change only the prior parameters that you wish; there is no need to alter all of them, unless you'd like to!
custom_beta_channel_prior = {"beta_channel": {"sigma": prior_sigma, "dims": ("channel",)}}
my_model_config = {**dummy_model.default_model_config, **custom_beta_channel_prior}
Remark: For the prior specification there is no right or wrong answer. It all depends on the data, the context and the assumptions you are willing to make. It is always recommended to do some prior predictive sampling and sensitivity analysis to check the impact of the priors on the posterior. We skip this here for the sake of simplicity. If you are not sure about specific priors, the DelayedSaturatedMMM
class has some default priors that you can use as a starting point.
The sampler_config argument allows specifying a set of parameters that will be passed to fit, in the same way as the kwargs have been passed so far. It does not disable the fit kwargs, but rather extends them, to enable a customizable and preservable configuration. By default the sampler_config for DelayedSaturatedMMM is empty, but if you'd like to use it, you can define it as shown below:
sampler_config= {"progressbar": True}
Now we are ready to use the DelayedSaturatedMMM
class to define the model.
mmm = DelayedSaturatedMMM(
model_config = my_model_config,
sampler_config = sampler_config,
date_column="date_week",
channel_columns=["x1", "x2"],
control_columns=[
"event_1",
"event_2",
"t",
],
adstock_max_lag=8,
yearly_seasonality=2,
)
Observe how the media transformations were handled by the class DelayedSaturatedMMM
.
3. Model Fitting#
We can now fit the model:
mmm.fit(X=X, y=y, target_accept=0.95, chains=4, random_seed=rng)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [intercept, beta_channel, alpha, lam, sigma, gamma_control, gamma_fourier]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 116 seconds.
-
<xarray.Dataset> Dimensions: (chain: 4, draw: 1000, control: 3, fourier_mode: 4, channel: 2, date: 179) Coordinates: * chain (chain) int64 0 1 2 3 * draw (draw) int64 0 1 2 3 4 5 ... 995 996 997 998 999 * control (control) <U7 'event_1' 'event_2' 't' * fourier_mode (fourier_mode) <U11 'sin_order_1' ... 'cos_ord... * channel (channel) <U2 'x1' 'x2' * date (date) datetime64[ns] 2018-04-02 ... 2021-08-30 Data variables: (12/13) intercept (chain, draw) float64 0.3538 0.376 ... 0.3656 gamma_control (chain, draw, control) float64 0.2764 ... 0.00... gamma_fourier (chain, draw, fourier_mode) float64 0.0003154 ... beta_channel (chain, draw, channel) float64 0.3712 ... 0.277 alpha (chain, draw, channel) float64 0.3645 ... 0.2283 lam (chain, draw, channel) float64 3.77 ... 2.304 ... ... channel_adstock (chain, draw, date, channel) float64 0.2032 ..... channel_adstock_saturated (chain, draw, date, channel) float64 0.3654 ..... channel_contributions (chain, draw, date, channel) float64 0.1356 ..... control_contributions (chain, draw, date, control) float64 0.0 ... 0... fourier_contributions (chain, draw, date, fourier_mode) float64 0.00... mu (chain, draw, date) float64 0.4856 ... 0.5994 Attributes: created_at: 2023-09-07T09:50:18.680236 arviz_version: 0.16.1 inference_library: pymc inference_library_version: 5.8.0 sampling_time: 116.42352199554443 tuning_steps: 1000
-
<xarray.Dataset> Dimensions: (chain: 4, draw: 1000) Coordinates: * chain (chain) int64 0 1 2 3 * draw (draw) int64 0 1 2 3 4 5 ... 994 995 996 997 998 999 Data variables: (12/17) process_time_diff (chain, draw) float64 0.02843 0.02464 ... 0.03612 n_steps (chain, draw) float64 63.0 63.0 63.0 ... 63.0 63.0 reached_max_treedepth (chain, draw) bool False False False ... False False perf_counter_diff (chain, draw) float64 0.02874 0.02504 ... 0.03758 step_size (chain, draw) float64 0.07101 0.07101 ... 0.0765 acceptance_rate (chain, draw) float64 0.9429 0.9502 ... 0.9918 0.8556 ... ... smallest_eigval (chain, draw) float64 nan nan nan nan ... nan nan nan energy (chain, draw) float64 -350.3 -343.0 ... -349.1 -348.3 largest_eigval (chain, draw) float64 nan nan nan nan ... nan nan nan lp (chain, draw) float64 355.0 350.0 ... 355.6 353.6 max_energy_error (chain, draw) float64 0.1337 0.1356 ... -0.163 0.3243 index_in_trajectory (chain, draw) int64 27 30 -36 11 ... 14 -30 -28 47 Attributes: created_at: 2023-09-07T09:50:18.715842 arviz_version: 0.16.1 inference_library: pymc inference_library_version: 5.8.0 sampling_time: 116.42352199554443 tuning_steps: 1000
-
<xarray.Dataset> Dimensions: (date: 179) Coordinates: * date (date) datetime64[ns] 2018-04-02 2018-04-09 ... 2021-08-30 Data variables: likelihood (date) float64 0.4794 0.4527 0.5374 ... 0.4978 0.5388 0.5625 Attributes: created_at: 2023-09-07T09:50:18.724888 arviz_version: 0.16.1 inference_library: pymc inference_library_version: 5.8.0
-
<xarray.Dataset> Dimensions: (date: 179, channel: 2, control: 3, fourier_mode: 4) Coordinates: * date (date) datetime64[ns] 2018-04-02 2018-04-09 ... 2021-08-30 * channel (channel) <U2 'x1' 'x2' * control (control) <U7 'event_1' 'event_2' 't' * fourier_mode (fourier_mode) <U11 'sin_order_1' ... 'cos_order_2' Data variables: channel_data (date, channel) float64 0.3196 0.0 0.1128 ... 0.0 0.4403 0.0 target (date) float64 0.4794 0.4527 0.5374 ... 0.4978 0.5388 0.5625 control_data (date, control) float64 0.0 0.0 0.0 0.0 ... 0.0 0.0 178.0 fourier_data (date, fourier_mode) float64 0.9999 -0.01183 ... -0.4547 Attributes: created_at: 2023-09-07T09:50:18.732679 arviz_version: 0.16.1 inference_library: pymc inference_library_version: 5.8.0
-
<xarray.Dataset> Dimensions: (index: 179) Coordinates: * index (index) int64 0 1 2 3 4 5 6 7 ... 171 172 173 174 175 176 177 178 Data variables: date_week (index) datetime64[ns] 2018-04-02 2018-04-09 ... 2021-08-30 x1 (index) float64 0.3186 0.1124 0.2924 ... 0.1719 0.2803 0.4389 x2 (index) float64 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.8633 0.0 0.0 0.0 event_1 (index) float64 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 event_2 (index) float64 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 dayofyear (index) int32 92 99 106 113 120 127 ... 207 214 221 228 235 242 t (index) int64 0 1 2 3 4 5 6 7 ... 171 172 173 174 175 176 177 178 y (index) float64 3.985 3.763 4.467 3.864 ... 4.138 4.479 4.676
You can access the underlying pymc model as mmm.model.
type(mmm.model)
pymc.model.core.Model
We can easily see the explicit model structure:
pm.model_to_graphviz(model=mmm.model)
4. Model Diagnostics#
The fit_result
attribute contains the pymc
trace object.
mmm.fit_result
<xarray.Dataset> Dimensions: (chain: 4, draw: 1000, control: 3, fourier_mode: 4, channel: 2, date: 179) Coordinates: * chain (chain) int64 0 1 2 3 * draw (draw) int64 0 1 2 3 4 5 ... 995 996 997 998 999 * control (control) <U7 'event_1' 'event_2' 't' * fourier_mode (fourier_mode) <U11 'sin_order_1' ... 'cos_ord... * channel (channel) <U2 'x1' 'x2' * date (date) datetime64[ns] 2018-04-02 ... 2021-08-30 Data variables: (12/13) intercept (chain, draw) float64 0.3538 0.376 ... 0.3656 gamma_control (chain, draw, control) float64 0.2764 ... 0.00... gamma_fourier (chain, draw, fourier_mode) float64 0.0003154 ... beta_channel (chain, draw, channel) float64 0.3712 ... 0.277 alpha (chain, draw, channel) float64 0.3645 ... 0.2283 lam (chain, draw, channel) float64 3.77 ... 2.304 ... ... channel_adstock (chain, draw, date, channel) float64 0.2032 ..... channel_adstock_saturated (chain, draw, date, channel) float64 0.3654 ..... channel_contributions (chain, draw, date, channel) float64 0.1356 ..... control_contributions (chain, draw, date, control) float64 0.0 ... 0... fourier_contributions (chain, draw, date, fourier_mode) float64 0.00... mu (chain, draw, date) float64 0.4856 ... 0.5994 Attributes: created_at: 2023-09-07T09:50:18.680236 arviz_version: 0.16.1 inference_library: pymc inference_library_version: 5.8.0 sampling_time: 116.42352199554443 tuning_steps: 1000
We can therefore use all the pymc
machinery to run model diagnostics. First, let’s see the summary of the trace:
az.summary(
data=mmm.fit_result,
var_names=[
"intercept",
"beta_channel",
"alpha",
"lam",
"gamma_control",
"gamma_fourier",
"sigma",
],
)
| | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| intercept | 0.350 | 0.013 | 0.326 | 0.374 | 0.000 | 0.000 | 2886.0 | 2965.0 | 1.0 |
| beta_channel[x1] | 0.362 | 0.019 | 0.326 | 0.396 | 0.000 | 0.000 | 2665.0 | 2514.0 | 1.0 |
| beta_channel[x2] | 0.264 | 0.063 | 0.197 | 0.373 | 0.002 | 0.001 | 1845.0 | 1649.0 | 1.0 |
| alpha[x1] | 0.397 | 0.030 | 0.340 | 0.451 | 0.001 | 0.000 | 3232.0 | 2661.0 | 1.0 |
| alpha[x2] | 0.191 | 0.039 | 0.117 | 0.265 | 0.001 | 0.001 | 2167.0 | 2039.0 | 1.0 |
| lam[x1] | 4.059 | 0.361 | 3.397 | 4.739 | 0.006 | 0.004 | 3623.0 | 2592.0 | 1.0 |
| lam[x2] | 3.197 | 1.146 | 1.252 | 5.357 | 0.028 | 0.019 | 1708.0 | 1711.0 | 1.0 |
| gamma_control[event_1] | 0.247 | 0.030 | 0.188 | 0.303 | 0.000 | 0.000 | 4952.0 | 2722.0 | 1.0 |
| gamma_control[event_2] | 0.326 | 0.031 | 0.264 | 0.380 | 0.000 | 0.000 | 4965.0 | 2924.0 | 1.0 |
| gamma_control[t] | 0.001 | 0.000 | 0.001 | 0.001 | 0.000 | 0.000 | 4030.0 | 3270.0 | 1.0 |
| gamma_fourier[sin_order_1] | 0.003 | 0.003 | -0.003 | 0.010 | 0.000 | 0.000 | 3982.0 | 2855.0 | 1.0 |
| gamma_fourier[cos_order_1] | 0.063 | 0.003 | 0.057 | 0.069 | 0.000 | 0.000 | 6221.0 | 3045.0 | 1.0 |
| gamma_fourier[sin_order_2] | -0.057 | 0.003 | -0.064 | -0.051 | 0.000 | 0.000 | 5297.0 | 3010.0 | 1.0 |
| gamma_fourier[cos_order_2] | 0.002 | 0.003 | -0.005 | 0.008 | 0.000 | 0.000 | 4317.0 | 2785.0 | 1.0 |
| sigma | 0.030 | 0.002 | 0.027 | 0.033 | 0.000 | 0.000 | 4688.0 | 2902.0 | 1.0 |
Observe that the estimated parameters for \(\alpha\) and \(\lambda\) are very close to the ones we set in the data generation process! Let’s plot the trace for the parameters:
_ = az.plot_trace(
data=mmm.fit_result,
var_names=[
"intercept",
"beta_channel",
"alpha",
"lam",
"gamma_control",
"gamma_fourier",
"sigma",
],
compact=True,
backend_kwargs={"figsize": (12, 10), "layout": "constrained"},
)
plt.gcf().suptitle("Model Trace", fontsize=16);

Now we sample from the posterior predictive distribution:
mmm.sample_posterior_predictive(X, extend_idata=True, combined=True)
Sampling: [likelihood]
<xarray.Dataset> Dimensions: (date: 179, sample: 4000) Coordinates: * date (date) datetime64[ns] 2018-04-02 2018-04-09 ... 2021-08-30 * sample (sample) object MultiIndex * chain (sample) int64 0 0 0 0 0 0 0 0 0 0 0 0 ... 3 3 3 3 3 3 3 3 3 3 3 * draw (sample) int64 0 1 2 3 4 5 6 7 ... 993 994 995 996 997 998 999 Data variables: likelihood (date, sample) float64 0.4486 0.4685 0.5064 ... 0.5766 0.6128 Attributes: created_at: 2023-09-07T09:50:26.604359 arviz_version: 0.16.1 inference_library: pymc inference_library_version: 5.8.0
We can now plot the posterior predictive distribution for the target variable:
mmm.plot_posterior_predictive(original_scale=True);

The fit looks very good (as expected)!
We can decompose the posterior predictive distribution into the different components:
mmm.plot_components_contributions();

Remark: This plot shows the decomposition of the normalized target variable, i.e. the target divided by its maximum value. Do not forget that internally we are scaling the variables to make the model sample more efficiently. You can recover the transformations from the API methods, e.g.
mmm.get_target_transformer()
Pipeline(steps=[('scaler', MaxAbsScaler())])
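For example, a minimal sketch (assuming, as scikit-learn transformers require, a 2D array input) of mapping values from the scaled space back to the original scale of the target:
# Sketch: invert the MaxAbsScaler-based target transformation,
# mapping values in the scaled space back to the original scale of y.
scaled_values = np.array([[0.5], [1.0]])
mmm.get_target_transformer().inverse_transform(scaled_values)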
A similar decomposition can be achieved using an area plot:
groups = {
"Base": [
"intercept",
"event_1",
"event_2",
"t",
"sin_order_1",
"sin_order_2",
"cos_order_1",
"cos_order_2",
],
"Channel 1": ["x1"],
"Channel 2": ["x2"],
}
fig = mmm.plot_grouped_contribution_breakdown_over_time(
stack_groups=groups,
original_scale=True,
area_kwargs={
"color": {
"Channel 1": "C0",
"Channel 2": "C1",
"Base": "gray",
"Seasonality": "black",
},
"alpha": 0.7,
},
)
fig.suptitle("Contribution Breakdown over Time", fontsize=16);

Note that this only works if the contributions of the channel or control variable are strictly positive.
We can extract all of the input variables' contributions over time, i.e. the regression coefficients times the feature values, as follows:
get_mean_contributions_over_time_df = mmm.compute_mean_contributions_over_time(
original_scale=True
)
get_mean_contributions_over_time_df.head()
| date | x1 | x2 | event_1 | event_2 | t | sin_order_1 | cos_order_1 | sin_order_2 | cos_order_2 | intercept |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2018-04-02 | 1.116220 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.026912 | -0.006193 | 0.011273 | -0.014745 | 2.907973 |
| 2018-04-09 | 0.854733 | 0.0 | 0.0 | 0.0 | 0.005113 | 0.026679 | -0.069050 | 0.124597 | -0.014236 | 2.907973 |
| 2018-04-16 | 1.326907 | 0.0 | 0.0 | 0.0 | 0.010226 | 0.026059 | -0.130906 | 0.230729 | -0.012905 | 2.907973 |
| 2018-04-23 | 0.809113 | 0.0 | 0.0 | 0.0 | 0.015338 | 0.025062 | -0.190867 | 0.323543 | -0.010830 | 2.907973 |
| 2018-04-30 | 1.577096 | 0.0 | 0.0 | 0.0 | 0.020451 | 0.023702 | -0.248063 | 0.397682 | -0.008130 | 2.907973 |
5. Media Parameters#
We can deep-dive into the media transformation parameters. We want to compare the posterior distributions against the true values.
fig = mmm.plot_channel_parameter(param_name="alpha", figsize=(9, 5))
ax = fig.axes[0]
ax.axvline(x=alpha1, color="C0", linestyle="--", label=r"$\alpha_1$")
ax.axvline(x=alpha2, color="C1", linestyle="--", label=r"$\alpha_2$")
ax.legend(loc="upper right");

fig = mmm.plot_channel_parameter(param_name="lam", figsize=(9, 5))
ax = fig.axes[0]
ax.axvline(x=lam1, color="C0", linestyle="--", label=r"$\lambda_1$")
ax.axvline(x=lam2, color="C1", linestyle="--", label=r"$\lambda_2$")
ax.legend(loc="upper right");

We indeed see that our media parameters were successfully recovered!
6. Media Deep-Dive#
First we can compute the relative contribution of each channel to the target variable. Note that we recover the true values!
fig = mmm.plot_channel_contribution_share_hdi(figsize=(7, 5))
ax = fig.axes[0]
ax.axvline(
x=contribution_share_x1,
color="C1",
linestyle="--",
label="true contribution share ($x_1$)",
)
ax.axvline(
x=contribution_share_x2,
color="C2",
linestyle="--",
label="true contribution share ($x_2$)",
)
ax.legend(loc="upper center", bbox_to_anchor=(0.5, -0.05), ncol=1);

Next, we can plot the direct contribution of each channel to the target variable as a function of the spend. Again, we get values very close to the ones obtained in Part I.
fig = mmm.plot_direct_contribution_curves()
[ax.set(xlabel="x") for ax in fig.axes];

Note that trying to get the delayed cumulative contribution is not that easy, as contributions from the past leak into the future. Specifically, note that we apply the saturation function to the aggregation, and since the saturation function is non-linear, this is not the same as taking the sum of the saturated contributions. Hence, it is very hard to reverse engineer the contribution after the carryover and saturation composition this way.
A more transparent alternative is to evaluate the channel contribution at different spend share levels for the complete training period. Concretely, if we denote by \(\delta\) the input channel data percentage level, so that \(\delta = 1\) corresponds to the model input spend data and \(\delta = 1.5\) corresponds to a \(50\%\) increase in the spend, then we can compute the channel contribution over a grid of \(\delta\)-values and plot the results:
mmm.plot_channel_contributions_grid(start=0, stop=1.5, num=12);

This plot does account for carryover (adstock) and saturation effects.
We see that when we have no spend, the contribution is zero (assuming there was no spend in the past, otherwise the carryover effect would be non-zero).
Observe that these grid values serve as inputs for an optimization step.
We can also plot the same contribution using the x-axis as the total channel input (e.g. total spend in EUR).
mmm.plot_channel_contributions_grid(start=0, stop=1.5, num=12, absolute_xrange=True);

7. Contribution Recovery#
Next, we can plot the direct contribution of each channel to the target variable over time.
channels_contribution_original_scale = mmm.compute_channel_contribution_original_scale()
channels_contribution_original_scale_hdi = az.hdi(
ary=channels_contribution_original_scale
)
fig, ax = plt.subplots(
nrows=2, figsize=(15, 8), ncols=1, sharex=True, sharey=False, layout="constrained"
)
for i, x in enumerate(["x1", "x2"]):
# Estimate true contribution in the original scale from the data generating process
sns.lineplot(
x=df["date_week"],
y=amplitude * betas[i] * df[f"{x}_adstock_saturated"],
color="black",
label=f"{x} true contribution",
ax=ax[i],
)
# HDI estimated contribution in the original scale
ax[i].fill_between(
x=df["date_week"],
y1=channels_contribution_original_scale_hdi.sel(channel=x)["x"][:, 0],
y2=channels_contribution_original_scale_hdi.sel(channel=x)["x"][:, 1],
color=f"C{i}",
label=f"{x} $94%$ HDI contribution",
alpha=0.4,
)
# Mean estimated contribution in the original scale
sns.lineplot(
x=df["date_week"],
y=get_mean_contributions_over_time_df[x].to_numpy(),
color=f"C{i}",
label=f"{x} posterior mean contribution",
alpha=0.8,
ax=ax[i],
)
ax[i].legend(loc="center left", bbox_to_anchor=(1, 0.5))
ax[i].set(title=f"Channel {x}")

The results look great! We therefore successfully recovered the true values from the data generation process. We have also seen how easy it is to use the DelayedSaturatedMMM
class to fit media mix models! It takes over the model specification and the media transformations, while having all the flexibility of pymc
!
8. ROAS#
Finally, we can compute the (approximate) ROAS posterior distribution for each channel.
channel_contribution_original_scale = mmm.compute_channel_contribution_original_scale()
roas_samples = (
channel_contribution_original_scale.stack(sample=("chain", "draw")).sum("date")
/ data[["x1", "x2"]].sum().to_numpy()[..., None]
)
fig, ax = plt.subplots(figsize=(10, 6))
sns.histplot(
roas_samples.sel(channel="x1").to_numpy(), binwidth=0.05, alpha=0.3, kde=True, ax=ax
)
sns.histplot(
roas_samples.sel(channel="x2").to_numpy(), binwidth=0.05, alpha=0.3, kde=True, ax=ax
)
ax.axvline(x=roas_1, color="C0", linestyle="--", label=r"true ROAS $x_{1}$")
ax.axvline(x=roas_2, color="C1", linestyle="--", label=r"true ROAS $x_{2}$")
ax.legend(loc="center left", bbox_to_anchor=(1, 0.5))
ax.set(title="Posterior ROAS distribution", xlabel="ROAS");

We see that the ROAS posterior distributions are centered around the true values! We also see that, even considering the uncertainty, channel \(x_{1}\) is more efficient than channel \(x_{2}\).