On the Factory Floor ML Engineering for industrial-scale Ads Recommendation Models

2022-01-02

Online Optimization Single pass of training data, the metric calculated before training.
ML efficiency Bandwidth(number of models can train concurrently); Latency(end-to-end evaluation time for a new model), throughput(models that can be trained per unit time)
1. bottlenecks: wider is better, but reduce the embedding dimension is enough. Replace HW to HUV matrix factorization.
2. AutoML. Weight-sharing network, RL controller, constraints.
3. Data sampling. Re-balancing/Loss-based sampling.
Loss Engineering Distillatiion/Shampoo/DCN.
1. Rankloss.Pariwise logic; combining rankloss with logitic loss.
2. Distillation.
3. Curriculums of losses. 2nd order optimization: Shampoo.
Irreproducibility ReLU -> SmeLU