How do we extract meaningful information from (potential) clients' data to drive personalized marketing strategies that improve client acquisition/revenue by an order of magnitude?
Every (good) technology company now faces the challenge of distilling information from a plethora of data. Consumer tech products collect such rich & intricate user information that it’s often hard to know what to do with all of it. Everything from user demographics to network & IP characteristics is logged & stored.
How do they make sense of all this information?
Uber aims to design targeted marketing campaigns by building models that classify users by the likelihood of 1 of 3 events:
- User acquisition: how likely a new individual is to purchase a product.
- Cross- or up-selling: how likely an existing user is to purchase a related product.
- User churn: how likely an existing user is to cancel or stop purchasing a product.
Presented with thousands of features per individual and exceedingly large datasets, this seems a natural task for deep learning. In this application, however, that approach falls short for several reasons:
- Parameters are indiscernible from one another.
- The feature space is sparse, leading to overfitting.
- Compute needed grows exponentially.
- Model interpretation & diagnostics are cumbersome.
Marketers need to use the data to design succinct, actionable strategies based on individual attributes. Verbose, superfluous information ought to be refined.
mRMR: Maximum Relevance & Minimum Redundancy
A simple solution to this problem can be derived from information theory.
mRMR is a simple formulation that allows us to approximate the marginal explanatory value of a feature. Consider:
- A matrix of m features: X
- A mutual information function: I(.)
- A response class vector: Y
mRMR is then given by:

$$f^{mRMR}(X_i) = \frac{I(Y, X_i)}{\frac{1}{|S|} \sum_{X_s \in S} I(X_s, X_i)}$$

where |S| is the size of the feature space & I(.) is expressed as:

$$I(X, Y) = \int \int p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}\, dx\, dy$$
This formulation is highly intuitive: we’re taking the ratio of the information captured by the marginal predictor Xi to a proxy for the information already captured by the rest of the set S.
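To make the formulation concrete, here is a minimal NumPy sketch of greedy mRMR selection over discrete features using the ratio form described above. This is a toy illustration, not Uber’s implementation; `mutual_information` & `mrmr_select` are names of my own choosing.

```python
import numpy as np

def mutual_information(x, y):
    """Discrete mutual information I(X; Y) in nats, estimated from joint counts."""
    xs, x_idx = np.unique(x, return_inverse=True)
    ys, y_idx = np.unique(y, return_inverse=True)
    joint = np.zeros((len(xs), len(ys)))
    np.add.at(joint, (x_idx, y_idx), 1)        # joint frequency table
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)      # marginal p(x)
    py = joint.sum(axis=0, keepdims=True)      # marginal p(y)
    nz = joint > 0                             # skip zero cells (0 log 0 = 0)
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

def mrmr_select(X, y, k):
    """Greedily pick k columns of X, maximizing relevance to y divided by
    mean redundancy against the already-selected features (ratio form)."""
    m = X.shape[1]
    relevance = np.array([mutual_information(X[:, i], y) for i in range(m)])
    selected = [int(np.argmax(relevance))]     # seed with the most relevant feature
    while len(selected) < k:
        best, best_score = None, -np.inf
        for i in range(m):
            if i in selected:
                continue
            redundancy = np.mean([mutual_information(X[:, i], X[:, j])
                                  for j in selected])
            score = relevance[i] / (redundancy + 1e-12)
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected
```

Note that the greedy loop re-scores every remaining feature against the growing selected set, which is what keeps redundant near-duplicates from being picked back-to-back.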
A number of variants of these base equations may be desirable:
- Replacing integrals with summations for discrete features.
- Replacing the quotient with a subtraction where scale-invariant features are used.
- Replacing the probability density estimates p(x,y), p(y) or p(x) with statistical tests (t-tests &/or F-tests) where severe computational limitations arise.
- Non-linear feature dependency can be achieved by mapping the data to some linearly separable space with kernel functions.
- If the downstream classification framework is pre-determined it may be pragmatic to incorporate the related accuracy metric as a proxy for mutual information.
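Two of the variants above lend themselves to very short sketches: the F-test as a cheap stand-in for mutual information, and the quotient-vs-subtraction scoring schemes. Assuming SciPy is available, they might look like the following (illustrative names, not Uber’s code):

```python
import numpy as np
from scipy.stats import f_oneway

def f_test_relevance(x, y):
    """Cheap proxy for I(Y; X_i): the one-way ANOVA F-statistic of
    feature values x grouped by class label y."""
    groups = [x[y == c] for c in np.unique(y)]
    return float(f_oneway(*groups).statistic)

def mrmr_score(relevance, redundancy, scheme="quotient"):
    """Combine relevance & redundancy under the two base schemes."""
    if scheme == "quotient":          # ratio form from the base equations
        return relevance / (redundancy + 1e-12)
    return relevance - redundancy     # subtraction form, for scale-invariant features
```

A feature whose values separate cleanly by class yields a large F-statistic, so ranking by it roughly tracks ranking by mutual information at a fraction of the cost of density estimation.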
Many variants were tested, and it was shown that feature relevance decayed exponentially, albeit at varied rates, across different models.
Post-implementation, feature correlation & statistical differences are also tested in the paper.
Uber then utilized this methodology by implementing the underlying algorithm as a module in one of their automated machine learning pipelines.
The YAML configuration can be modularized & readily incorporated. The implementation is also relatively straightforward: Uber disclosed using Scala Spark to handle the ETL process at scale.
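For flavor, a configuration fragment for such a module might look like the following. Every key here is hypothetical, as Uber has not published its actual schema:

```yaml
feature_selection:
  method: mrmr
  scheme: quotient            # or "difference" for scale-invariant features
  max_features: 50
  relevance_estimator: mutual_information   # or "f_test" under compute limits
```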
Next time you order your Uber, or dig into Uber Eats, just remember it probably has less to do with your decision-making & more to do with your mutual information with sales channels 😉.