What is Reinforcement Learning?
Reinforcement Learning (RL) is a branch of machine learning that deals with the problem of sequential decision making.
In its most general form, it studies the problem of an agent interacting with the outside world (i.e. the environment) by taking an action at each step. The choice of action has consequences: first, it leads the agent to a new state; and second, the agent receives a reward signal from the environment, telling it how good or bad the action was. The goal of the agent is to figure out how to behave, that is, which action is best in each state, such that in the long term it collects the highest possible total reward.
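The interaction loop described above can be sketched in a few lines of Python. Everything here is illustrative: the toy environment and the random placeholder policy are assumptions made for the sake of the example.

```python
import random

class ToyEnvironment:
    """Made-up environment: the 'state' is a position on a line,
    and the agent is rewarded for staying close to the origin."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        """Apply the agent's action; return the new state and a reward signal."""
        self.state += action                          # the action changes the state
        reward = 1.0 if abs(self.state) <= 2 else -1.0
        return self.state, reward

random.seed(0)
env = ToyEnvironment()
total_reward = 0.0
for t in range(10):
    action = random.choice([-1, +1])   # placeholder policy: act at random
    state, reward = env.step(action)   # consequences: a new state and a reward
    total_reward += reward
print(total_reward)
```

A learning agent would replace the random choice with a policy that improves from experience; the surrounding loop stays the same.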
The key phrase here is ‘long-term’: the action that brings immediate satisfaction is not necessarily good for long-term success. That is part of the complexity RL algorithms try to address: initially the agent has no idea how good or bad an action is (e.g. how much immediate reward it will generate) or which next state it will produce. The agent has to explore the action space, in a balanced way, to experience the effects of each action, while at the same time figuring out the action strategy that leads to the highest possible long-term reward. Learning the best action policy from a training set of such agent experiences is the ultimate goal of Reinforcement Learning.
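To make the explore/exploit balance concrete, here is a minimal sketch of one classic RL algorithm, tabular Q-learning with epsilon-greedy exploration, on a made-up “corridor” task. The environment and all constants are our own toy assumptions, not part of any particular application.

```python
import random

N_STATES = 5
ACTIONS = (-1, +1)                  # move left / move right along the corridor
alpha, gamma, eps = 0.5, 0.9, 0.1   # learning rate, discount factor, exploration rate
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(s, a):
    """Toy dynamics: move along the corridor; reward 1 only at the right end."""
    s2 = max(0, min(N_STATES - 1, s + a))
    reward = 1.0 if s2 == N_STATES - 1 else 0.0
    return s2, reward

random.seed(0)
for episode in range(200):
    s = 0
    while s != N_STATES - 1:
        if random.random() < eps:      # explore: try a random action
            a = random.choice(ACTIONS)
        else:                          # exploit: act greedily (random tie-break)
            a = max(ACTIONS, key=lambda x: (Q[(s, x)], random.random()))
        s2, r = step(s, a)
        # Q-learning update: nudge the estimate toward reward + discounted future value
        best_next = max(Q[(s2, x)] for x in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2

# After training, the greedy policy should move right in every state.
greedy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)]
print(greedy)
```

The point of the sketch: early on, the agent only stumbles onto the reward by exploring; over time, the learned values let it exploit what it knows, even though the reward is several steps away from the start.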
Think of a self-driving car, for example: while driving, at every moment it needs to know what action to take (should I brake? Should I turn right? Should I keep going straight? Why not just accelerate?). To get started, we need to define a reward value that tells us how good or bad each action was (am I closer to my destination? Am I safe enough? Did I avoid the curb while parking?). RL algorithms can teach the car how to maximize its long-term reward by taking optimal actions (that is, how to drive).
Reinforcement Learning has a long history as a branch of computer science and machine learning. Its core ideas were developed over the past 30 years, but given the computational demands involved, for a long time it could only be applied to problems with fairly small state and action spaces.
Incorporating deep learning into RL opened the door to solving real-world problems, where state and action spaces can be very large.
DeepMind was the first group to demonstrate the power of deep RL: in 2016, its AlphaGo agent defeated world champion Lee Sedol at the game of Go. Beyond games, there is a vast set of use cases for RL across industries such as finance, health care, and digital advertising.
A STRATEGY FOR BIDDING
Real-time bidding (RTB) is a mechanism for connecting advertisers with online publishers. The goal of publishers is to monetize the content they generate; the goal of advertisers is to spend their budgets optimally, so that pre-specified goals are reached. Advertising budgets are allocated at a highly granular level, impression by impression, through real-time auctions that take place billions of times per day.
An advertising campaign’s bidding strategy determines, in real time, how much to bid for the chance to show an advertiser’s message to a particular user. The bid has to be set based on all sorts of features of the ad opportunity: What is the web page or app? What is the geographical location? What is the time of day or day of week?
And as if that were not complex enough, the bidding strategy must also deliver pre-defined Key Performance Indicator (KPI) targets (think of total budget, or performance goals expressed as CPA, CPC, etc.) set on behalf of the advertiser.
We at the Copilot group at Xaxis use machine learning/AI at the core of our bidding strategies: it helps us learn, from historical data, how to set bid values. Our view is that an optimal bidding strategy should get as close as possible to the advertiser’s pre-defined goals. To do this, we need to dynamically adjust the parameters of the bidding strategy so that the KPIs keep moving in the right direction.
It turns out that RL is an ideal tool for the challenge of dynamically managing an advertising campaign. This problem, at its core, is a sequential decision-making problem: how should one adjust the campaign attributes, step by step, so that full KPI delivery is achieved?
A rough setup of dynamic bidding-strategy adjustment as an RL problem looks like this:
- State: current campaign KPIs
- Action: change in bidding strategy parameters
- Reward: how ‘well’ a campaign is performing
- Optimal Policy: how to set bidding strategy parameters such that the campaign successfully delivers its KPI targets.
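The setup above can be sketched as a toy simulation. Everything below is illustrative: the state, action, and reward definitions (and names like `CampaignEnv`) are simplified stand-ins of our own invention, not a description of any production bidding system.

```python
import random

class CampaignEnv:
    """Toy model of the RL setup above.
    State: the current KPI (here, a single number relative to its target).
    Action: a fractional adjustment to a bidding-strategy parameter.
    Reward: higher when the KPI is closer to its target."""
    def __init__(self, target_kpi=1.0):
        self.target = target_kpi
        self.bid_param = 2.0            # hypothetical bid multiplier, starts too high
        self.kpi = self.bid_param       # toy assumption: the KPI tracks the bid level

    def step(self, action):
        # action: fractional change, e.g. -0.1 lowers bids by 10%
        self.bid_param *= (1.0 + action)
        self.kpi = self.bid_param + random.gauss(0, 0.05)   # noisy campaign response
        reward = -abs(self.kpi - self.target)               # closer to target = better
        return self.kpi, reward

random.seed(0)
env = CampaignEnv()
for t in range(50):
    # placeholder policy: nudge bids toward the KPI target by 10% per step
    action = -0.1 if env.kpi > env.target else +0.1
    kpi, reward = env.step(action)
print(round(env.kpi, 2))
```

A real formulation would carry much richer state (spend, pacing, time remaining, per-KPI readings) and would learn the policy with an RL algorithm rather than the hand-written rule used here; the sketch only shows how state, action, and reward fit together.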
Our Copilot team at Xaxis is actively researching and testing the application of Deep RL for training campaigns. We will present further results in an upcoming post.