Modeling Search Term Revenue: Using Embedding Layers to Manage High Cardinality Categorical Data


At System1, data scientists are tasked with predicting revenue and cost per click across millions of unique keywords that drive traffic to our sites or monetize on a pay-per-click basis. This talk will show a variety of techniques we use to extract the most information from categorical variables, especially anonymized, sparse, high-cardinality categorical variables such as search terms.

Categorical variables are easy for data scientists and non-technical people to interpret, but they can be difficult to translate into inputs for machine learning algorithms. They must be converted to quantitative values before a model can use them, and that conversion can quickly explode the feature space, add noise or unintended signals to the data, or fail to capture all the meaning and predictive power the feature holds for the dependent variable. There are many popular and effective libraries that abstract categorical feature creation. However, if a model is sensitive from a financial, data ethics, or public visibility standpoint, or is simply prone to overfitting, it is vital to understand how the model captures each feature and how to tune model parameters or input data. Furthermore, when dealing with personal or sensitive data, machines need to be able to handle anonymized categories while still allowing a human to interpret the source data.

One of the problems we face at System1 is that individual keywords can receive very little traffic, sometimes less than one click per day; however, across millions of keywords, these long-tail keywords account for significant revenue. In addition, data science models need to be proactive and adjust bids and traffic based on seasonal components even if there is no data from the prior season.

This talk will present a variety of practical techniques to extract and retain information and predictive power for categorical variables. We will talk about model selection, feature creation, and techniques for converting categorical variables to quantitative values for modeling. Finally, the talk will present an interesting technique that uses embedding layers and transfer learning in a neural network framework to predict cost per click values on search terms.
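
As a rough illustration of why embedding layers help with high-cardinality categories, the sketch below contrasts a one-hot encoding with an embedding lookup. It is not System1's implementation: the keyword list, the `embedding_dim` value, and the helper names `one_hot` and `embed` are all hypothetical, and a random NumPy matrix stands in for the trainable weights a neural network framework would learn.

```python
import numpy as np

# Hypothetical toy vocabulary; real search-term vocabularies run to millions.
keywords = ["cheap flights", "car insurance", "running shoes", "flight deals"]
keyword_to_id = {kw: i for i, kw in enumerate(keywords)}

vocab_size = len(keywords)
embedding_dim = 3  # assumed value; in practice far smaller than vocab_size

# In a neural network this table is a trainable weight matrix.
# Here it is random and untrained, purely to show the shapes involved.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, embedding_dim))


def one_hot(keyword):
    """Sparse encoding: one column per distinct keyword."""
    vec = np.zeros(vocab_size)
    vec[keyword_to_id[keyword]] = 1.0
    return vec


def embed(keyword):
    """Dense encoding: look up the keyword's row in the embedding table."""
    return embedding_table[keyword_to_id[keyword]]


# The one-hot width grows with the vocabulary; the embedding width does not.
print(one_hot("cheap flights").shape)  # (vocab_size,)
print(embed("cheap flights").shape)    # (embedding_dim,)
```

An embedding lookup is equivalent to multiplying the one-hot vector by the table, so the model receives a fixed-width dense vector per keyword regardless of how many keywords exist. Because the table is just a weight matrix, it could in principle be trained on one task and reused on another, which is the general idea behind transfer learning with embeddings.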


