Optimal Scheduling of Entropy Regularizer for Continuous-Time Linear-Quadratic Reinforcement Learning

From MaRDI portal
Publication:6180253

DOI: 10.1137/22M1515744 · arXiv: 2208.04466 · Wikidata: Q129754889 · Scholia: Q129754889 · MaRDI QID: Q6180253 · FDO: Q6180253

Lukasz Szpruch, Tanut Treetanthiploet, Yu-Fei Zhang

Publication date: 19 January 2024

Published in: SIAM Journal on Control and Optimization

Abstract: This work uses the entropy-regularised relaxed stochastic control perspective as a principled framework for designing reinforcement learning (RL) algorithms. Herein, the agent interacts with the environment by generating noisy controls distributed according to the optimal relaxed policy. The noisy policies, on the one hand, explore the space and hence facilitate learning, but, on the other hand, introduce bias by assigning positive probability to non-optimal actions. This exploration-exploitation trade-off is determined by the strength of the entropy regularisation. We study algorithms resulting from two entropy regularisation formulations: the exploratory control approach, where entropy is added to the cost objective, and the proximal policy update approach, where entropy penalises the divergence between the policies of two consecutive episodes. We analyse the finite-horizon continuous-time linear-quadratic (LQ) RL problem, for which both algorithms yield a Gaussian relaxed policy. We quantify the precise difference between the value functions of a Gaussian policy and its noisy evaluation, and show that the execution noise must be independent across time. By tuning the frequency of sampling from the relaxed policies and the parameter governing the strength of the entropy regularisation, we prove that the regret of both learning algorithms is of order $\mathcal{O}(\sqrt{N})$ (up to a logarithmic factor) over $N$ episodes, matching the best known result from the literature.
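For orientation, the following is a minimal sketch of the two regularisation formulations mentioned in the abstract, written for a scalar LQ problem; the symbols $A, B, Q, R, G, \sigma, \tau$ and the exact functional forms are illustrative assumptions, not the paper's own notation. Under a relaxed (randomised) policy $\pi$, the state evolves as

\[
\mathrm{d}X_t = \Big(A X_t + B \textstyle\int_{\mathbb{R}} a\,\pi_t(\mathrm{d}a)\Big)\,\mathrm{d}t + \sigma\,\mathrm{d}W_t .
\]

The exploratory control formulation adds an entropy bonus to the quadratic cost,

\[
J^{\tau}(\pi) = \mathbb{E}\!\left[\int_0^T \Big( Q X_t^2 + \textstyle\int_{\mathbb{R}} R a^2\,\pi_t(\mathrm{d}a) - \tau\,\mathcal{H}(\pi_t) \Big)\,\mathrm{d}t + G X_T^2 \right],
\qquad
\mathcal{H}(\pi_t) = -\int_{\mathbb{R}} \pi_t(a)\ln \pi_t(a)\,\mathrm{d}a ,
\]

while the proximal policy update formulation instead penalises the divergence from the previous episode's policy $\pi^{(n-1)}$,

\[
J_n^{\tau}(\pi) = \mathbb{E}\!\left[\int_0^T \Big( Q X_t^2 + \textstyle\int_{\mathbb{R}} R a^2\,\pi_t(\mathrm{d}a) + \tau\,\mathrm{KL}\big(\pi_t \,\big\|\, \pi_t^{(n-1)}\big) \Big)\,\mathrm{d}t + G X_T^2 \right].
\]

In both cases the optimal relaxed policy is Gaussian, with mean given by the classical LQ feedback and variance scaling with the temperature $\tau$; this is why scheduling $\tau$ across episodes trades exploration against the bias from non-optimal actions.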


Full work available at URL: https://arxiv.org/abs/2208.04466






