26. Evaluating RecSys in production, Federico, Tooso
https://www.youtube.com/watch?v=FfJKFbgdSSo (opens in a new tab)
- 
Issue: RecSys are vital for the user, but classic evaluation metrics are not enough Guessing when a recommendation is not relevant is tricky  
- 
RecSys fails silently, with dramatic consequences and pseudo-feedback is tricky 
- 
Do your test capture these issues? Is there a way to evaluate RecSys? 
- 
Introducing behavioural testing - 
Doesn’t focus on a single metrics - Inherent properties of uses cases, enforce input/output pairs
- Similar items are symmetric
- Complementary items are asymmetric
  
- 
Quantify mistakes  “Terminator” is a much worse mistake than “Got Mail” as a recommendation 
- 
Slice the data Imbalanced datapoint (power log distribution), good aggregate measure (like MRR) even if low represented data are worse 
 
- 
- 
Solution: RecList is a behavioural test at scale Main abstractions - RecDataset: wraps dataset
- RecModel: wrap predictions from models, allowing “black box” testing of APIs
- RecList: declarative style, made of- RecTests
 from reclist.metrics.standard_metrics import statistics class MyRecList(RecList): @rec_test(test_type="stats") def basic_stats(self): return statistics( self._x_train, self._y_train, self._x_test, self._y_test, self._y_preds)- Lot of pre-made behavioural tests to use
  - Can be use for research with a new model
- and also for prod system, using CI/CD (check if behavioral symmetry is verified)
 
- 
Q&A - 
End game with reclist? a company? Keep it open-source, make it useful for research and also for production tool 
- 
Is it for recommender system only? Based mainly on rec for now 
 
- 
Behavioral testing of recommender systems with RecList