rfm_train_test_split

pymc_marketing.clv.utils.rfm_train_test_split(transactions, customer_id_col, datetime_col, train_period_end, test_period_end=None, time_unit='D', time_scaler=1, datetime_format=None, monetary_value_col=None, include_first_transaction=False, sort_transactions=True)

Summarize transaction data and split it into training and test datasets for CLV modeling. This can also be used to evaluate the impact of a time-based intervention such as a marketing campaign.

This transforms a DataFrame of transaction data of the form:

customer_id, datetime [, monetary_value]

to a DataFrame of the form:

customer_id, frequency, recency, T [, monetary_value], test_frequency [, test_monetary_value], test_T

Note this function will exclude new customers whose first transactions occurred during the test period.
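The transformation and the exclusion rule above can be illustrated with a small standard-library sketch. This is not the library's implementation: the helper `split_summary` and the sample data are hypothetical, and the column definitions (repeat-purchase counts at day granularity, following the convention of the lifetimes package) are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transactions: (customer_id, purchase date)
transactions = [
    ("a", date(2024, 1, 1)),
    ("a", date(2024, 1, 10)),
    ("a", date(2024, 2, 5)),
    ("b", date(2024, 1, 20)),
    ("c", date(2024, 2, 3)),  # first purchase falls in the test period
]
train_period_end = date(2024, 1, 31)
test_period_end = date(2024, 2, 29)


def split_summary(transactions, train_end, test_end):
    """Summarize purchases per customer, split at train_end (assumed semantics)."""
    purchase_days = defaultdict(set)
    for customer_id, day in transactions:
        if day <= test_end:  # events after the study end are truncated
            purchase_days[customer_id].add(day)

    summary = {}
    for customer_id, days in purchase_days.items():
        train_days = sorted(d for d in days if d <= train_end)
        if not train_days:
            continue  # new customer in the test period: excluded
        test_days = [d for d in days if d > train_end]
        first, last = train_days[0], train_days[-1]
        summary[customer_id] = {
            "frequency": len(train_days) - 1,  # repeat purchases only
            "recency": (last - first).days,
            "T": (train_end - first).days,
            "test_frequency": len(test_days),
            "test_T": (test_end - train_end).days,
        }
    return summary


summary = split_summary(transactions, train_period_end, test_period_end)
for customer_id, row in summary.items():
    print(customer_id, row)
```

In this sketch customer "c" does not appear in the result because their first purchase happens after `train_period_end`, while customer "b" is kept with a test frequency of zero.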

Adapted from the lifetimes package (CamDavidsonPilon/lifetimes).

Parameters:
  • transactions (DataFrame) – A Pandas DataFrame that contains the customer_id col and the datetime col.

  • customer_id_col (string) – Column in the transactions DataFrame that denotes the customer_id.

  • datetime_col (string) – Column in the transactions DataFrame that denotes the datetime the purchase was made.

  • train_period_end (Union[str, pd.Period, datetime]) – A string, period, or datetime marking the end of the training period. Events after this point are used for the test data.

  • test_period_end (Union[str, pd.Period, datetime], optional) – A string or datetime to denote the final time period of the study. Events after this date are truncated. If not given, defaults to the max of ‘datetime_col’.

  • time_unit (string, optional) – Time granularity for study. Default: ‘D’ for days. Possible values listed here: https://numpy.org/devdocs/reference/arrays.datetime.html#datetime-units

  • time_scaler (int, optional) – Default: 1. Useful for scaling recency and T to a different time granularity. Example: with time_unit=’D’ and time_scaler=1, we get recency=591 and T=632; with time_unit=’h’ and time_scaler=24, we get recency=590.125 and T=631.375. This is useful if predictions in months or years are desired, and can also help with model convergence for study periods of many years.

  • datetime_format (string, optional) – A string that represents the timestamp format. Useful if Pandas can’t understand the provided format.

  • monetary_value_col (string, optional) – Column in the transactions DataFrame that denotes the monetary value of the transaction. Optional; only needed for spend estimation models like the Gamma-Gamma model.

  • include_first_transaction (bool, optional) – Default: False. For predictive CLV modeling, this should be False. Set to True if performing RFM segmentation.

  • sort_transactions (bool, optional) – Default: True. If the raw data is already sorted in chronological order, set to False to improve computational efficiency.
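The arithmetic behind time_unit and time_scaler can be sketched as follows. This is a minimal illustration, assuming that timestamps are first floored to whole time units and that the elapsed count is then divided by time_scaler; the helper `elapsed` and the timestamps are hypothetical, chosen so that the results match the numbers quoted in the time_scaler description.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def elapsed(start, stop, unit, time_scaler=1):
    """Whole `unit`s between two timestamps, divided by `time_scaler`."""
    # Floor each timestamp to the start of its unit before differencing
    # (assumed behavior, for illustration only).
    start_units = (start - EPOCH) // unit
    stop_units = (stop - EPOCH) // unit
    return (stop_units - start_units) / time_scaler


# Hypothetical timestamps for one customer
first_purchase = datetime(2020, 1, 1, 21, 0)
last_purchase = first_purchase + timedelta(hours=14163)
period_end = first_purchase + timedelta(hours=15153)

day, hour = timedelta(days=1), timedelta(hours=1)

# Day granularity, no scaling
recency_days = elapsed(first_purchase, last_purchase, day, time_scaler=1)
T_days = elapsed(first_purchase, period_end, day, time_scaler=1)

# Hour granularity, scaled back to days
recency_hours = elapsed(first_purchase, last_purchase, hour, time_scaler=24)
T_hours = elapsed(first_purchase, period_end, hour, time_scaler=24)

print(recency_days, T_days)    # 591.0 632.0
print(recency_hours, T_hours)  # 590.125 631.375
```

Measuring in hours and scaling by 24 keeps the result in days while preserving the fractional part that day-level flooring discards.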

Returns:

customer_id, frequency, recency, T, test_frequency, test_T [, monetary_value, test_monetary_value]

Return type:

DataFrame