# How Uber uses machine learning to achieve hyper-growth in sales

Personalized Marketing

How do we extract meaningful information from (potential) clients' data to drive personalized marketing strategies that improve client acquisition/revenue by an order of magnitude?

Every (good) technology company now faces the challenge of distilling information from a plethora of data. Consumer tech products collect such rich & intricate user information that it's often hard to know what to do with all of it. Everything from user demographics to network & IP characteristics is logged & stored.

How do they make sense of all this information?

# Application

Uber aims to design targeted marketing campaigns by building models that classify users by the likelihood of 1 of 3 events:

- **User acquisition**: how likely a new individual is to purchase a product.
- **Cross- or up-selling**: the propensity of an existing user to purchase a related product.
- **User churn**: how likely an existing user is to cancel a purchase.

# Problem

Presented with thousands of features per individual and exceedingly large data volumes, this seems like a natural task for deep learning. In this application, however, that approach falls short for several reasons:

## Mathematically

- Individual parameters are indiscernible from one another.
- The feature space is sparse, leading to overfitting.
- The compute required grows exponentially with dimensionality.
- Model interpretation & diagnostics are cumbersome.

## Pragmatically

Marketers need to use the data to design succinct, actionable strategies based on individual attributes. Verbose, superfluous information ought to be refined away.

# Solution: mRMR

mRMR: Maximum Relevance & Minimum Redundancy

A simple solution to this problem can be derived from information theory.

mRMR is a simple formulation that allows us to approximate the marginal explanatory value of a feature. Consider:

- A matrix of *m* features: *X*
- A mutual information function: *I(·)*
- A response class vector: *Y*

mRMR is then given by:

$$f^{mRMR}(X_i) = \frac{I(X_i, Y)}{\frac{1}{|S|} \sum_{X_s \in S} I(X_i, X_s)}$$

where *|S|* is the size of the selected feature set *S* & *I(·)* is expressed as:

$$I(X, Y) = \int \int p(x, y) \, \log \frac{p(x, y)}{p(x)\,p(y)} \, dx \, dy$$

This formulation is highly intuitive: we're taking the ratio of the information captured by the marginal predictor *Xᵢ* to a proxy for the information already captured by the rest of the selected set *S*.
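To make the quotient form concrete, here is a hypothetical greedy forward-selection sketch built on scikit-learn's mutual information estimators. The function name `mrmr_select` and the small epsilon guarding the denominator are assumptions for illustration; this is not Uber's production code.

```python
# Greedy mRMR (quotient form): at each step, pick the unselected feature with
# the highest ratio of relevance I(X_i; Y) to mean redundancy against the
# already-selected set S. Hypothetical sketch, not Uber's implementation.
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_select(X, y, k):
    """Return indices of k features chosen by the mRMR quotient criterion."""
    n_features = X.shape[1]
    relevance = mutual_info_classif(X, y, random_state=0)  # I(X_i; Y)
    selected = [int(np.argmax(relevance))]                 # most relevant first
    while len(selected) < k:
        best_score, best_j = -np.inf, None
        for j in range(n_features):
            if j in selected:
                continue
            # Mean redundancy of candidate j against the selected set S:
            # (1/|S|) * sum_{X_s in S} I(X_j; X_s)
            redundancy = np.mean([
                mutual_info_regression(X[:, [s]], X[:, j], random_state=0)[0]
                for s in selected
            ])
            score = relevance[j] / (redundancy + 1e-12)    # quotient form, guarded
            if score > best_score:
                best_score, best_j = score, j
        selected.append(best_j)
    return selected
```

The epsilon avoids division by zero when a candidate is estimated as fully independent of the selected set; the subtraction variant mentioned below sidesteps that guard entirely.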

A number of variants of these base equations may be desirable:

- Replacing integrals with summations for discrete features.
- Replacing the quotient with a subtraction where scale-invariant features are used.
- Replacing the probability density estimates *p(x, y)*, *p(y)*, or *p(x)* with statistical tests (t-tests &/or F-tests) where severe computational limitations arise.
- Capturing non-linear feature dependency by mapping the data to some linearly separable space with kernel functions.
- If the downstream classification framework is pre-determined, it may be pragmatic to incorporate the related accuracy metric as a proxy for mutual information.
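The statistical-test substitution is especially cheap to sketch. Below is a hypothetical scoring pass in that spirit, using scikit-learn's `f_classif` F-statistic as the relevance proxy and mean absolute Pearson correlation as the redundancy proxy; the function name `fcq_scores` and the epsilon guard are assumptions, not anything Uber published.

```python
# Hypothetical statistical-test variant of the mRMR score: F-statistic for
# relevance, mean |Pearson r| against already-selected features for redundancy.
import numpy as np
from sklearn.feature_selection import f_classif

def fcq_scores(X, y, selected):
    """Score every feature as F-test relevance over mean absolute correlation
    redundancy with the already-selected index list `selected`."""
    f_stat, _ = f_classif(X, y)                  # relevance proxy per feature
    corr = np.abs(np.corrcoef(X, rowvar=False))  # pairwise |Pearson r|
    redundancy = corr[:, selected].mean(axis=1)  # mean redundancy vs. selected set
    return f_stat / (redundancy + 1e-12)         # quotient form, guarded
```

In a greedy loop one would mask out the already-selected indices before taking the argmax of these scores, exactly as in the mutual-information version.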

# Empirical Findings

Many variants were tested; feature relevance was shown to decay exponentially, albeit at varied rates, across different models.

Post-implementation, feature correlation & statistical differences are also tested in the paper.

# Production Implementation

Uber then utilized this methodology by implementing the underlying algorithm as a module in one of their automated machine learning pipelines.

The YAML configuration can be modularized & readily incorporated. The implementation is also relatively straightforward: Uber disclosed using Scala Spark to handle the ETL process at scale.
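For intuition, a modular feature-selection step in such a pipeline might be declared with a small YAML block like the one below. This schema is entirely hypothetical, with field names invented for illustration; Uber has not published its actual configuration format.

```yaml
# Hypothetical pipeline-step config; all field names are invented for illustration.
feature_selection:
  method: mrmr
  variant: quotient        # or: difference, f-test
  max_features: 50
  input_table: user_features
  label_column: converted
```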

# Conclusion

Next time you order your Uber, or dig into Uber Eats, just remember: it probably has less to do with your decision-making & more to do with your mutual information with sales channels 😉.