Reinforcement Learning

Content

  1. Content
  2. Beginner’s Guide:
  3. Multi-Armed Bandit
  4. How can we design a systematic strategy that adapts to these stochastic rewards?
  5. Bandit Problem
  6. Applications of Bandit Algorithms
  7. Online Learning
    1. Similar Problems
    2. How is online learning different from traditional machine learning?
    3. What is Regret?
    4. Prediction with expert advice
    5. Conclusion
  8. Counterfactual Regret Minimization
  9. Reinforcement Learning
  10. CTR
    1. What is Click Through Rate?
    2. CTR Prediction
      1. Data pattern
      2. Useful definitions:
    3. Modelling
      1. Simple Model
  11. Exercises:

Beginner’s Guide:



Multi-Armed Bandit

The multi-armed bandit problem is a classic problem that nicely demonstrates the exploration vs. exploitation dilemma. Imagine you are in a casino facing multiple slot machines, each configured with an unknown probability of yielding a reward on a single play. The question is: what is the best strategy to achieve the highest long-term reward?


A naive approach is to keep playing one machine for many, many rounds so as to eventually estimate the “true” reward probability according to the law of large numbers. However, this is quite wasteful and certainly does not guarantee the best long-term reward.



How can we design a systematic strategy that adapts to these stochastic rewards?

This is our goal for the multi-armed bandit problem, and having such a strategy would prove very useful in many real-world situations where one would like to select the “best” arm out of a group of arms, e.g., A/B testing, line-up optimization, or evaluating social media influence.

We approach the multi-armed bandit problem with a classical reinforcement learning technique: an epsilon-greedy agent that uses reward-average sampling to compute the action value $Q(a)$, which helps the agent improve its future action decisions for long-term reward maximization.
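
A minimal sketch of such an agent is given below; the arms' reward probabilities are made-up placeholders, and $Q(a)$ is maintained as an incremental sample average.

```python
import random

def epsilon_greedy_bandit(true_probs, n_rounds=10000, epsilon=0.1):
    """Epsilon-greedy agent with sample-average action-value estimates Q(a)."""
    k = len(true_probs)
    Q = [0.0] * k          # estimated value of each arm
    counts = [0] * k       # number of times each arm was pulled
    total_reward = 0.0
    for _ in range(n_rounds):
        # Explore with probability epsilon, otherwise exploit the current best estimate
        if random.random() < epsilon:
            a = random.randrange(k)
        else:
            a = max(range(k), key=lambda i: Q[i])
        # Bernoulli reward drawn from the (unknown to the agent) true probability
        r = 1.0 if random.random() < true_probs[a] else 0.0
        counts[a] += 1
        # Incremental sample-average update: Q(a) <- Q(a) + (r - Q(a)) / N(a)
        Q[a] += (r - Q[a]) / counts[a]
        total_reward += r
    return Q, total_reward

# Example: three slot machines with hidden reward probabilities
Q, total = epsilon_greedy_bandit([0.2, 0.5, 0.7])
print(Q, total)
```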



Bandit Problem

This tutorial intends to be an introduction to stochastic and adversarial multi-armed bandit algorithms and to survey some of the recent advances. In the multi-armed bandit problem, at each stage, an agent (or decision maker) chooses one action (or arm) and receives a reward from it. The agent aims to maximize its rewards. Since it does not know the process generating the rewards, it needs to explore (try) the different actions and yet exploit (concentrate its draws on) the seemingly most rewarding arms.

Let’s say the bandit has $K$ arms and is played for $N$ rounds.

  • if $K \lt N$: Small set of actions
  • if $K \gt N$: Large set of actions



Applications of Bandit Algorithms

  • Clinical trials/dose discovery
  • Recommendation systems (movies/news/etc)
  • Advert placement
  • A/B testing
  • Network routing
  • Dynamic pricing (e.g., for Amazon products)
  • Waiting problems (when to auto-logout your computer)
  • Ranking (e.g., for search)
  • A component of game-playing algorithms (MCTS)
  • Resource allocation
  • A way of isolating one interesting part of reinforcement learning



Online Learning

Online learning is an important area with a rich literature and multiple connections to game theory and optimization, and it is increasingly influencing theoretical and algorithmic advances in machine learning.

  • The online scenario involves multiple rounds where training and testing phases are intermixed.
  • At each round, the learner receives an unlabeled training point, makes a prediction, receives the true label, and incurs a loss.
  • The objective in the on-line setting is to minimize the cumulative loss over all rounds, or to minimize the regret, that is, the difference between the cumulative loss incurred by the algorithm and that of the best expert in hindsight.
  • Unlike the traditional ML settings, no distributional assumption is made in on-line learning. In fact, instances and their labels may be chosen adversarially within this scenario.
  • These algorithms process one sample at a time with an update per iteration that is often computationally cheap and easy to implement.

In short:

  • No distributional assumption.
  • Worst-case (adversarial) analysis.
  • Mixed training and test phases.
  • Performance measures: mistake model, regret.

Similar Problems

  • Route selection (internet, traffic).
  • Games (chess, backgammon).
  • Stock value prediction.
  • Decision making.

How is online learning different from traditional machine learning?

Traditional machine learning (PAC learning) follows the key assumption that the distribution over data points is fixed over time, both for training and test points, and that points are sampled in an i.i.d. fashion. Under this assumption, the natural goal is to learn a hypothesis with a small expected loss or generalization error.

In contrast, with on-line learning, no distributional assumption is made, and thus there is no notion of generalization. Instead, the performance of on-line learning algorithms is measured using a mistake model and the notion of regret.

What is Regret?

Definition: The regret at time $T$ is the difference between the loss incurred up to $T$ by the algorithm and that of the best expert in hindsight:

$ R_T = L_T - L_T^{min} $

For the best regret minimization algorithms: $R_T \leq O(\sqrt{T \log N})$, where $N$ is the number of experts.

*Best expert in hindsight: the single expert whose cumulative loss is smallest when evaluated retrospectively, after all $T$ rounds have been observed.

Prediction with expert advice

In this setting, at the $t$-th round, in addition to receiving $x_t \in X$, the algorithm also receives advice $y_{t,i} \in Y$, $i \in [N]$, from $N$ experts. Following the general framework of on-line algorithms, it then makes a prediction, receives the true label, and incurs a loss. After $T$ rounds, the algorithm has incurred a cumulative loss. The objective in this setting is to minimize the regret $R_T$, also called external regret, which compares the cumulative loss of the algorithm to that of the best expert in hindsight after $T$ rounds:

$ R_T = \sum_{t=1}^T L(\hat{y}_t, y_t) - \min_{i \in [N]} \sum_{t=1}^T L(y_{t,i}, y_t) $
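
A minimal sketch of one classical algorithm for this setting, the exponentially weighted average (Hedge) forecaster, is shown below; the expert losses are made-up placeholders, and the learning rate $\eta$ is a free parameter (choosing $\eta \approx \sqrt{8 \ln N / T}$ recovers the $O(\sqrt{T \log N})$ regret bound quoted above).

```python
import math
import random

def hedge(expert_losses, eta=0.1):
    """Exponentially weighted average forecaster over N experts.

    expert_losses: a T x N list of lists, expert_losses[t][i] is the loss of
    expert i at round t. Returns the algorithm's cumulative (expected) loss
    and its regret R_T against the best expert in hindsight.
    """
    T, N = len(expert_losses), len(expert_losses[0])
    w = [1.0] * N
    alg_loss = 0.0
    for t in range(T):
        total = sum(w)
        p = [wi / total for wi in w]              # current distribution over experts
        alg_loss += sum(pi * li for pi, li in zip(p, expert_losses[t]))
        # Exponential weight update: experts with larger loss get downweighted
        w = [wi * math.exp(-eta * li) for wi, li in zip(w, expert_losses[t])]
    best = min(sum(expert_losses[t][i] for t in range(T)) for i in range(N))
    return alg_loss, alg_loss - best              # second value is the regret R_T

# Toy run: 3 experts, 100 rounds of random 0/1 losses
losses = [[random.randint(0, 1) for _ in range(3)] for _ in range(100)]
print(hedge(losses))
```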

Conclusion

On-line learning, regret minimization:

  • rich branch of machine learning.
  • connections with game theory.
  • simple and minimal assumptions.
  • algorithms easy to implement.
  • scale to very large data sets.



Counterfactual Regret Minimization



Reinforcement Learning

  • The training and testing phases are also intermixed in reinforcement learning.
  • To collect information, the learner actively interacts with the environment and in some cases affects the environment, and receives an immediate reward for each action.
  • The objective of the learner is to maximize its reward over a course of actions and interactions with the environment.
  • However, no long-term reward feedback is provided by the environment, and the learner is faced with the exploration versus exploitation dilemma, since it must choose between exploring unknown actions to gain more information and exploiting the information already collected.



CTR

What is Click Through Rate?

It can be thought of as the number of times the link is clicked by the user divided by the total number of times it is displayed to (viewed by) the user.

$ CTR = \frac{n_{clicks}}{n_{display}} $



CTR Prediction

Motivation: The key task for a search engine advertising system is to

  • Determine what advertisements should be displayed from a pool of advertisements
  • In what order

for each query that the search engine receives.

Typically, advertisers have already specified the circumstances under which their ads may be shown (e.g., only for certain queries, or when certain words appear in a query), so the search engine only needs to rank the reduced set of ads that are matches [1].

As with search results, the probability that a user clicks on an advertisement declines rapidly with display position, by as much as $90\%$.


The number of eligible advertisements matching a given query usually far exceeds the number of valuable slots.

As per [13],

  • The core of the CTR prediction problem is a simple regression: we have a feature vector, usually sparse, that we use to predict the outcome of an interaction, in this case the probability of a click.
  • An ad’s CTR is essentially the average probability of a click per impression of the ad. A binary click/no-click classification is not enough: in a real-world system, the problem we are solving is selecting an ad from a set of candidate ads, and the reasonable way to make that selection is to predict how likely each candidate is to yield a click and pick the one with the highest probability.
  • A simple model is logistic regression, which outputs a probability between 0 and 1. Models of this kind were standard around 2007 [1], but are rarely used on their own today.
  • Currently, more meaningful models are based on online learning or reinforcement learning.

Data pattern

  • Historical bidding and Engagement Data
    • Semantic analysis of the campaign materials
    • Campaign id
    • Ad’s past performance
    • Ad id
  • Semantic profile of the ad campaign material
    • Timestamp
    • Ad placement id
    • Site domain
    • Media name / ID
    • Page URL
    • Bid amount
  • User context data
    • Country where the request originated from
    • City where the request originated from
    • The type of device used

Highly imbalanced data:

  • Roughly 1 clicked example for every 2,000 non-clicked examples

Useful definition:

  • Buyer: Someone who has ads to show and money to pay for it. [Example: Nike, Adidas]
  • Clear Price: The final price for the impression (i.e., for displaying the ad).
    • First-price auction: the clearing price is the bid of the auction winner.
    • Second-price auction: the clearing price is the second-highest bid plus one cent.
  • Impression: The displaying of the ad in the user’s browser.
  • Click Through Rate: The rate at which users click an ad. Nowadays it is usually around 0.06% or so.
  • Cost-Per-Acquisition (CPA): The price the advertiser pays for each customer acquired.
  • Cost-Per-Impression (CPI): The most common basic pricing method of online advertising. The advertiser pays for each impression of their ad shown to a user. The price is usually quoted per thousand impressions, i.e., cost per mille (CPM).
  • Cost-Per-Click (CPC): The price the advertiser pays for every user clicking on their ad. $CPC = \frac{CPM}{1000 \cdot CTR}$ (see the worked example after this list).
  • Pay-Per-Click (PPC): A billing method where the advertiser pays for each time a user clicks the ad. In the pay-per-click (PPC) model, the click price is predetermined and is indicated as CPC (Cost Per Click).
  • Placement: The actual ad frame or similar. Also known as the tag. This is usually a piece of JavaScript code that adds an iframe element to the page in question. The iframe content is the actual ad, loaded from the ad exchange.
  • Exchange: An ad server that can do an auction between different buyers/DSPs (demand side platform) and select the ad to send to seller. Some notable exchanges are AppNexus and Google DoubleClick.
  • Real Time Bidding (RTB): A digital ad delivery system where the ad impression on a webpage is sold in an auction just as it is being displayed (the “Real Time” in RTB), and the buyers in the auction decide the price they will bid as it happens, based on any information they have about the user and placement of the particular impression.
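
As a quick sanity check of the CPC relation above (the numbers are purely illustrative): with $CPM = 2$ and $CTR = 0.1\% = 0.001$, one thousand impressions cost 2 and yield one click on average, so

$ CPC = \frac{CPM}{1000 \cdot CTR} = \frac{2}{1000 \times 0.001} = 2 $

i.e., the cost per click equals the cost of the thousand impressions needed, on average, to obtain one click.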

The RTB process in brief:

  • A user browses a page with ads
  • Ad placement code loads and requests an ad from an exchange
  • Exchange requests bids from DSPs (demand side platform, i.e. from different buyers)
  • Each buyer decides what they want to bid for the impression based on different kinds of criteria including, but not limited to, user profile, ad profile, placement profile, etc.
  • Exchange picks the highest bid(s) that meet some internal criteria (such as image size, text length, text content etc.) and sends it forward to the placement
  • Ad is shown in the placement and a beacon pixel is fired to notify the buyer, usually with the final impression price.

Modelling

In order to maximize ad quality (as measured by user clicks) and total revenue, most search engines today order their ads primarily based on expected revenue:

$ E_{ad} [revenue] = p_{ad}(click)* CPC_{ad} $

Thus, to ideally order a set of ads, it is important to be able to accurately estimate $p(click)$ (the CTR) for a given ad. For ads that have been shown to users many times (ads that have many impressions), this estimate is simply the binomial MLE (maximum likelihood estimate), $n_{clicks} / n_{impressions}$.
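
A minimal sketch of this ranking rule is shown below; the candidate ads, their estimated click probabilities, and their CPCs are made-up placeholders.

```python
# Rank candidate ads by expected revenue E[revenue] = p(click) * CPC
candidates = [
    {"ad_id": "a1", "p_click": 0.031, "cpc": 0.40},
    {"ad_id": "a2", "p_click": 0.012, "cpc": 1.20},
    {"ad_id": "a3", "p_click": 0.055, "cpc": 0.15},
]
ranked = sorted(candidates, key=lambda ad: ad["p_click"] * ad["cpc"], reverse=True)
for ad in ranked:
    print(ad["ad_id"], round(ad["p_click"] * ad["cpc"], 4))
```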

Whenever an ad is displayed on the search results page, it has some chance of being seen by the user. The farther down the page an ad is displayed, the less likely it is to be viewed. As a simplification, we consider the probability that an ad is clicked on to be dependent on two factors:

  • The probability that it is viewed
  • The probability that it is clicked on, given that it is viewed:
$p(click \vert ad, pos) = p(click \vert seen) * p(seen \vert ad, pos)$

Goal: To create a model which can predict this $p(click \vert ad, pos)$ for new ads.

Simple Model

According to [1], we can model this probability with a simple logistic regression as follows:

$ CTR = \frac{1}{1+e^{-Z}} $

where $Z = \sum_i w_i f_i(ad)$, $f_i(ad)$ is the value of the $i^{th}$ feature for the ad, and $w_i$ is the learned weight for that feature. Features may be anything, such as the number of words in the title, the presence of a particular word, etc.

In [1], the logistic regression was trained using the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) method with a cross-entropy loss function and zero-mean Gaussian weight priors with standard deviation $\sigma$. For more details, read the paper [1].
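
A rough sketch of such a model using scikit-learn is given below; the feature values and labels are made-up placeholders, the solver is L-BFGS, and the L2 penalty plays the role of the zero-mean Gaussian prior on the weights (with $C$ acting like $\sigma^2$).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical ad features f_i(ad): e.g. number of words in the title,
# presence of a particular keyword, landing-page quality score
X = np.array([
    [4, 1, 0.8],
    [9, 0, 0.3],
    [6, 1, 0.5],
    [3, 0, 0.9],
    [7, 1, 0.2],
    [5, 0, 0.6],
])
y = np.array([1, 0, 1, 0, 0, 0])   # click / no-click labels

# L-BFGS solver with cross-entropy loss; the L2 penalty corresponds to a
# zero-mean Gaussian prior on the weights.
model = LogisticRegression(solver="lbfgs", penalty="l2", C=1.0)
model.fit(X, y)

# Predicted CTR = 1 / (1 + exp(-Z)) for a new ad
new_ad = np.array([[5, 1, 0.7]])
print(model.predict_proba(new_ad)[0, 1])
```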

:books: Reference:

  • [1] ** Base Paper 2007: Predicting Clicks: Estimating the Click-Through Rate for New Ads
  • [2] An Effective Budget Management Framework for Real-time Bidding in Online Advertising
  • [3] A CTR prediction model based on user interest via attention mechanism
  • [4] A Context-Aware Click Model for Web Search
  • [5] WSDM 2020: Interpretable Click-Through Rate Prediction through Hierarchical Attention
  • [6] Learning Compositional, Visual and Relational Representations for CTR Prediction in Sponsored Search
  • [7] TOIS 2019: Constructing Click Model for Mobile Search with Viewport Time
  • [8] Improving Ad Click Prediction by Considering Non-displayed Events
  • [9] KDD 2019: Generating Better Search Engine Text Advertisements with Deep Reinforcement Learning
  • [10] ** Selecting multiple web adverts: A contextual multiarmed bandit with state uncertainty
  • [11] KDD 2013: Ad Click Prediction: a View from the Trenches - from **Google
  • [12] ADKDD 2014: Practical Lessons from Predicting Clicks on Ads at Facebook
  • [13] Thesis 2019: Click-Through Rate Prediction in Practice by Kurt-Eerik Ståhlberg
  • [14] ** JMLR 2013: Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising

** must read papers



Exercises:

  1. What is PAC learning? [Foundations of Machine Learning by Mehryar Mohri]
  2. What is Counterfactual Regret Minimization? (Source)
  3. How are online learning algorithms connected to game theory?
  4. What are game theory and the min-max algorithm?


Published on August 27, 2019