13. Is production RL at a tipping point? Waleed Kadous, Anyscale

There's a very standard way of doing supervised ML, but reinforcement learning (RL) challenges that workflow.
Research has shown that RL does very well on real-world tasks.
Yet we rarely see RL in production. Why? Sometimes we get freaked out by RL because it's new, but we'll show that it's a natural extension of what we already do.
I’ll give some tips and traps to watch out for
Using RLlib, a popular open-source distributed RL library
Understanding RL structure
How to escalate from bandits
Deployment will be discussed
Bandit
You have slot machines giving a payout based on an unknown probability. A good example is UI treatment: 5 different ways of showing the UI, and you don't want to test them all uniformly.
The challenge with bandits in production is the explore/exploit tradeoff. How do you balance it? The epsilon-greedy algorithm is one answer: with probability epsilon you act randomly (explore), otherwise you pick whichever arm looks best so far (exploit). In the talk's example, 50% of the time you act randomly and 50% you maximize your gain based on what you know (a minimal sketch follows).
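A minimal sketch of epsilon-greedy in plain Python; the five payout probabilities are made up to stand in for the five UI treatments:

```python
import random

# Hypothetical payout probabilities for 5 UI treatments (unknown in real life)
TRUE_PAYOUTS = [0.02, 0.05, 0.03, 0.08, 0.04]
EPSILON = 0.5  # the 50/50 explore/exploit split used in the talk's example

counts = [0] * len(TRUE_PAYOUTS)    # pulls per arm
values = [0.0] * len(TRUE_PAYOUTS)  # running mean reward per arm

for _ in range(10_000):
    if random.random() < EPSILON:
        arm = random.randrange(len(TRUE_PAYOUTS))                         # explore
    else:
        arm = max(range(len(TRUE_PAYOUTS)), key=lambda a: values[a])      # exploit
    reward = 1.0 if random.random() < TRUE_PAYOUTS[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean update

print("Estimated payouts:", [round(v, 3) for v in values])
```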
Contextual bandits leverage metadata: is it sunny? A very natural extension of the bandit: based on this user's profile and the episode they watched last, I'll suggest what they should watch next (sketch below).
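A minimal sketch of a contextual bandit as per-context epsilon-greedy; the contexts, arm names, and payouts are invented for illustration (a real system would use richer features and a learned model):

```python
import random
from collections import defaultdict

CONTEXTS = ["sunny", "rainy"]
ARMS = ["show_A", "show_B", "show_C"]
# Hypothetical true payout per (context, arm) pair
TRUE_PAYOUT = {("sunny", "show_A"): 0.10, ("sunny", "show_B"): 0.03, ("sunny", "show_C"): 0.05,
               ("rainy", "show_A"): 0.02, ("rainy", "show_B"): 0.09, ("rainy", "show_C"): 0.04}

EPSILON = 0.1
counts = defaultdict(int)
values = defaultdict(float)  # running mean reward per (context, arm)

for _ in range(50_000):
    ctx = random.choice(CONTEXTS)
    if random.random() < EPSILON:
        arm = random.choice(ARMS)                             # explore
    else:
        arm = max(ARMS, key=lambda a: values[(ctx, a)])       # exploit, per context
    reward = 1.0 if random.random() < TRUE_PAYOUT[(ctx, arm)] else 0.0
    counts[(ctx, arm)] += 1
    values[(ctx, arm)] += (reward - values[(ctx, arm)]) / counts[(ctx, arm)]

for ctx in CONTEXTS:
    print(ctx, "->", max(ARMS, key=lambda a: values[(ctx, a)]))
```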
RL is a bandit with states (i.e., sequential decisions).
Think of the order of actions in chess.
There's a sequence of steps before a payout: how do you distribute the reward across the moves, rather than only the last one?
You need temporal credit assignment.
There is also a delay problem: what happens if the reward is delayed? It gets even more complex with something like a sacrifice in chess: a negative reward in the short term, but a payoff 50 moves later (see the discounted-return sketch below).
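A minimal sketch of temporal credit assignment via discounted returns; the reward sequence is a made-up stand-in for the chess sacrifice (lose a little now, win much more 50 moves later):

```python
# Discounted return G_t = r_t + gamma * G_{t+1}, computed for every timestep
GAMMA = 0.99

rewards = [0.0] * 60
rewards[5] = -1.0    # short-term loss (the sacrifice)
rewards[55] = 5.0    # delayed payoff, 50 steps later

def discounted_returns(rewards, gamma):
    """Work backwards so each step gets credit for everything that follows it."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

G = discounted_returns(rewards, GAMMA)
print(f"Return at the sacrifice move (t=5): {G[5]:.2f}")  # positive despite the immediate -1
```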
If we expand the state and action spaces:
Instead of 4 bandit machines, I have 32.
The state space grows exponentially.
For so many real-world applications, you have a log (here's where people clicked yesterday) and you want to learn an RL policy from it: how do you do that without running live experiments?
When you move to offline RL, you are stuck with the historical data you have (see the off-policy evaluation sketch below).
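A minimal sketch of the kind of off-policy estimate you lean on when all you have is a log: an inverse-propensity-score estimate of a new policy's value. The data, the uniform logging policy, and the candidate policy are all invented for illustration:

```python
import numpy as np

# Hypothetical logged bandit data: (context, action, reward, prob of action under the logging policy)
rng = np.random.default_rng(0)
n, n_actions = 10_000, 5
contexts = rng.normal(size=(n, 3))
logged_actions = rng.integers(0, n_actions, size=n)
logged_probs = np.full(n, 1.0 / n_actions)               # logging policy was uniform
rewards = rng.binomial(1, 0.1 + 0.05 * logged_actions)   # toy reward model

def new_policy(context):
    """Some candidate policy we want to evaluate without running it live."""
    return int(np.argmax(context)) % n_actions

# Inverse-propensity-score estimate of the new policy's value, using only the log
matches = np.array([new_policy(c) == a for c, a in zip(contexts, logged_actions)])
ips_value = np.mean(matches * rewards / logged_probs)
print(f"Estimated reward per decision for the new policy: {ips_value:.3f}")
```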
Multi-agent RL (MRL)
How do you share a policy between the different users (agents)?
Is it a cooperative or competitive scenario?
The stock market is the sum of many very small players.
MRL gets way more complex.
How do you model reward across all the actors?
Is RL at a tipping point?
Companies already use RL for recommenders, so it has started to cross the boundary, but why isn't it more popular? Four factors have to be addressed before RL is production-ready, with recent progress on each:
Huge amount of training: AlphaGo played 5 million games, and only huge tech companies can afford that. We've started to see transfer learning, so you don't always have to start from scratch, and imitation learning mimics human behaviour (see the behaviour-cloning sketch after this list).
The default formulation is online: it's designed to learn live. Changing a dynamic model in production is pretty scary, and it's hard to get data to train it. Offline learning can help.
Temporal credit assignment: which action gets the reward? A contextual bandit is RL without temporal credit assignment: limited, but simple to deploy, and it's starting to get adopted.
Large action and state spaces. Recent answers: high-fidelity simulators, deep learning approaches to learning the state space, embedding approaches to learn the action space (candidate selection then ranking), and offline learning that doesn't require relearning from scratch.
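A minimal sketch of imitation learning as behaviour cloning: supervised learning on logged (state, action) pairs. The data is synthetic and the logistic-regression policy is just one simple choice of model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic log of an "expert" (e.g., human operators): state features -> chosen action
rng = np.random.default_rng(1)
states = rng.normal(size=(5_000, 4))
expert_actions = (states[:, 0] + 0.5 * states[:, 1] > 0).astype(int)  # hidden expert rule

# Behaviour cloning: fit a classifier that maps states to the expert's actions
policy = LogisticRegression().fit(states, expert_actions)

# The cloned policy can then warm-start an RL agent instead of starting from scratch
new_state = np.array([[0.3, -0.1, 0.0, 1.2]])
print("Cloned policy picks action:", policy.predict(new_state)[0])
```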
3 Common patterns in successful RL applications
A good simulator is 50% of the work:
running a lot of simulations at once using distributed RL (RLlib); see the sketch after the examples below
batching by merging results from many experiments
getting close with a simulator, then fine-tuning in the real world
ex1: games are good simulators!
ex2: markets; simulations don't need to be perfect
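A minimal sketch of the "many simulations at once" pattern using Ray's core API; the simulator, policy, and reward are toy stand-ins, and RLlib packages this kind of distributed rollout collection (plus the learning loop) for you:

```python
import random
import ray

ray.init()

@ray.remote
def run_episode(policy_params, seed):
    """Toy simulator: one rollout of a made-up environment, returns total reward."""
    rng = random.Random(seed)
    total, state = 0.0, 0.0
    for _ in range(100):
        action = 1 if state * policy_params["w"] > 0 else -1   # trivial linear policy
        state = state + 0.1 * action + rng.gauss(0, 0.1)
        total += -abs(state)                                   # reward: stay near zero
    return total

# Fan out many simulations in parallel, then merge (batch) the results
params = {"w": 1.0}
futures = [run_episode.remote(params, seed) for seed in range(200)]
returns = ray.get(futures)
print("Mean return over 200 parallel episodes:", sum(returns) / len(returns))
```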
Low temporality: do you really need temporal credit assignment?
ex1: last played game + user profile (the contextual part)
what if there are millions of users and hundreds of games? Use one embedding to reduce the dimensionality of users and another to find candidate games (see the sketch below)
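A minimal sketch of embedding-based candidate selection; the embeddings here are random stand-ins for vectors a real model would learn:

```python
import numpy as np

rng = np.random.default_rng(2)
DIM = 16
game_embeddings = rng.normal(size=(300, DIM))   # one vector per game (stand-in)
user_embedding = rng.normal(size=DIM)           # this user's vector (stand-in)

# Candidate selection: top-k games by dot-product similarity in the shared space
scores = game_embeddings @ user_embedding
top_k = np.argsort(scores)[::-1][:10]
print("Candidate game ids:", top_k.tolist())
# A contextual bandit / RL policy then only has to rank these 10, not all 300 games
```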
Optimization: the next generation. Linear programming can be combined with RL.
RL is optimization in a data-driven way: it doesn't require explicit modelling, but it does require many experiments. Obviously it takes a lot more computation, but it's often plug-and-play with classical optimization.
2 tips for production RL
Keep it simple. Start stateless, then add context, and work your way up:
online vs offline
small discrete state and action spaces vs large and continuous
single-agent vs multi-agent
shared policy vs true multi-agent
RLOps? The workflow is different. How do you validate? Update? Monitor? Retrain when it's offline data? These are real problems with RL in production.
Conclusion: RL is at a tipping point in some areas, with some early adopters.
Q&A
How much training data does a simple recommender system need?
⇒ Think about embeddings: you don't need RL to build them. 10k examples can be enough (ideally 100k or 1M). Contextual bandits are quite off-the-shelf now.