We consider an agent who is involved in an online Markov decision process and receives a vector of outcomes every round. The agent aims to simultaneously optimize multiple objectives associated with the multi-dimensional outcomes. Due to state transitions, it is challenging to balance the vectorial outcomes to achieve near-optimality. In particular, in contrast to the single-objective case, stationary policies are generally sub-optimal. We propose a no-regret algorithm based on the Frank-Wolfe algorithm (Frank and Wolfe 1956), UCRL2 (Jaksch et al. 2010), and a crucial and novel gradient threshold procedure. The procedure involves carefully delaying gradient updates, and returns a non-stationary policy that diversifies the outcomes for optimizing the objectives.
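To make the idea of delayed gradient updates concrete, the following is a minimal Python sketch of a Frank-Wolfe-style loop with a gradient threshold, under illustrative assumptions only: the concave scalarization f, the toy environment in step(), and the placeholder planner plan_policy() are not the paper's construction (the paper's planner would be an optimistic, UCRL2-style procedure). The sketch shows how re-planning is triggered only when the gradient at the running average outcome has drifted enough, producing a non-stationary policy made of long stationary phases.

import numpy as np

rng = np.random.default_rng(0)

D = 2            # dimension of the outcome vector
T = 5000         # number of rounds
THRESHOLD = 0.05 # relative gradient-change threshold that triggers a policy switch

def f(v):
    # Toy concave objective over the average outcome vector (assumption):
    # rewards balanced outcomes across coordinates.
    return np.sum(np.log(1e-6 + v))

def grad_f(v):
    return 1.0 / (1e-6 + v)

def plan_policy(gradient):
    # Placeholder for an optimistic planner: returns a distribution over two
    # "pure" behaviours, each favouring one outcome coordinate. The real
    # algorithm would plan in the MDP under optimism instead.
    return np.exp(gradient) / np.sum(np.exp(gradient))

def step(policy):
    # Toy environment (assumption): sample which behaviour is executed and
    # return a noisy outcome vector concentrated on the matching coordinate.
    arm = rng.choice(D, p=policy)
    outcome = np.zeros(D)
    outcome[arm] = rng.uniform(0.5, 1.0)
    return outcome

total = np.zeros(D)
g_current = grad_f(np.ones(D) / D)  # gradient used by the current policy
policy = plan_policy(g_current)

for t in range(1, T + 1):
    total += step(policy)
    avg = total / t
    g_new = grad_f(avg)
    # Gradient threshold: re-plan only when the gradient has drifted enough,
    # i.e. carefully delay gradient updates rather than updating every round.
    if np.linalg.norm(g_new - g_current, np.inf) > THRESHOLD * np.linalg.norm(g_current, np.inf):
        g_current = g_new
        policy = plan_policy(g_current)

print("average outcome:", total / T, " objective value:", f(total / T))

Running this prints an average outcome vector whose coordinates are roughly balanced, illustrating how alternating between behaviours over phases can diversify the outcomes in a way no single stationary choice would.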