For a model-based RL algorithm, even if the learned environment model makes very accurate predictions, its error can still be amplified through the so-called value-aware model error [2] when the agent's value function has a poor Lipschitz condition (i.e., a large Lipschitz constant).
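As a rough illustration of this amplification (a sketch based on the Kantorovich-Rubinstein duality, not a result taken from the poster): if the value function $V$ is $L_V$-Lipschitz, the one-step value-aware gap between a learned model $\hat{P}$ and the true dynamics $P$ satisfies

$$\Big|\,\mathbb{E}_{s'\sim \hat{P}(\cdot\mid s,a)}[V(s')] - \mathbb{E}_{s'\sim P(\cdot\mid s,a)}[V(s')]\,\Big| \;\le\; L_V \cdot W_1\big(\hat{P}(\cdot\mid s,a),\, P(\cdot\mid s,a)\big),$$

so even a small Wasserstein model error $W_1$ is multiplied by $L_V$: a value function with a large Lipschitz constant amplifies the model's prediction error.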
- Value functions trained on data generated from a probabilistic ensemble model have much smaller Lipschitz constants (the figure on the left).
- We can view the data generated by the probabilistic ensemble model as an implicit augmentation. The augmentation comes from two sources: the variation of predictions across different models in the ensemble, and the noise introduced by the variance of each model's probabilistic output distribution (e.g., the Gaussian distribution in MBPO [1]); see the sketch after this list.
- We verified this hypothesis empirically: the Lipschitz constant of a value function trained with a probabilistic ensemble model is indeed significantly smaller (the figure on the right).
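The sketch below illustrates the two noise sources in ensemble-generated rollouts. The function and variable names are hypothetical, and we assume each ensemble member outputs the mean and log-std of a Gaussian over the next state, as in MBPO [1].

```python
import torch

def sample_next_state(ensemble, state, action):
    """Illustrative sketch of implicit augmentation in ensemble rollouts.

    `ensemble` is assumed to be a list of networks, each mapping a
    (state, action) pair to the mean and log-std of a Gaussian over
    the next state (the MBPO-style probabilistic model).
    """
    # Source 1: variation across ensemble members -- pick one at random.
    model = ensemble[torch.randint(len(ensemble), (1,)).item()]
    mean, log_std = model(torch.cat([state, action], dim=-1)).chunk(2, dim=-1)
    # Source 2: noise from the member's own predictive variance.
    next_state = mean + log_std.exp() * torch.randn_like(mean)
    return next_state
```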
Our Proposed Methods
A probabilistic model ensemble achieves strong empirical performance, but it is expensive in both computation time and memory. Given that the key role of the model ensemble is to regularize the Lipschitz constant of the value function, we devise two approaches that regularize this Lipschitz constant directly.
- Spectral Normalization: We apply spectral normalization to every layer of the value network to control an upper bound on its global Lipschitz constant (see the first sketch below).
- Robust Regularization: We add a robust loss with adversarial perturbation to ensure that the local variation of the value function is small (see the second sketch below).
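A minimal sketch of the spectral normalization variant, assuming a standard PyTorch MLP value network; the hidden sizes and architecture here are illustrative, not necessarily those used in our experiments.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

def make_value_network(state_dim, action_dim, hidden=256):
    # Wrapping every linear layer with spectral normalization caps each
    # layer's spectral norm at 1, which upper-bounds the network's global
    # Lipschitz constant.
    return nn.Sequential(
        spectral_norm(nn.Linear(state_dim + action_dim, hidden)), nn.ReLU(),
        spectral_norm(nn.Linear(hidden, hidden)), nn.ReLU(),
        spectral_norm(nn.Linear(hidden, 1)),
    )
```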
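And a hedged sketch of one way such a robust regularization loss could look, using a single FGSM-style gradient step on the state to find the adversarial perturbation; `q_net(state, action)` is a hypothetical interface, and the exact perturbation scheme may differ from the one used in our method.

```python
import torch

def robust_regularizer(q_net, state, action, eps=0.01):
    """Sketch of a robust loss with adversarial state perturbation.

    Penalizes how much the value changes under a small worst-case
    perturbation of the state, found here with one gradient-sign step.
    """
    state_adv = state.clone().detach().requires_grad_(True)
    q_adv = q_net(state_adv, action)
    grad = torch.autograd.grad(q_adv.sum(), state_adv)[0]
    state_adv = (state + eps * grad.sign()).detach()
    # Penalize the local variation of the value function around the state.
    return ((q_net(state_adv, action) - q_net(state, action)) ** 2).mean()
```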
Results
Improved Asymptotic Performance
Using just a single deterministic model, MBPO with either of our two Lipschitz regularization mechanisms matches, and sometimes exceeds, the performance of MBPO with a probabilistic ensemble model across all five tasks. In particular, the proposed robust regularization shows a larger advantage on the three more sophisticated tasks: Humanoid, Ant, and Walker.
Improved Time Efficiency
Compared with MBPO using an ensemble of probabilistic models, our proposed mechanisms, especially robust regularization, are more time-efficient.
Poster
References
[1] Janner et al. When to Trust Your Model: Model-Based Policy Optimization. NeurIPS 2019
[2] Farahmand et al. Iterative Value-Aware Model Learning. NeurIPS 2018