Learning in structured MDPs with convex cost functions: improved regret bounds for inventory management
From MaRDI portal
Publication:5095166
Abstract: We consider a stochastic inventory control problem under censored demands, lost sales, and positive lead times. This is a fundamental problem in inventory management, with significant literature establishing near-optimality of a simple class of policies called "base-stock policies" for the underlying Markov Decision Process (MDP), as well as convexity of the long-run average cost under those policies. We consider the relatively less studied problem of designing a learning algorithm for this problem when the underlying demand distribution is unknown. The goal is to bound the regret of the algorithm when compared to the best base-stock policy. We utilize the convexity properties and a newly derived bound on the bias of base-stock policies to establish a connection to stochastic convex bandit optimization. Our main contribution is a learning algorithm with a regret bound of \(\tilde{O}(L\sqrt{T}+D)\) for the inventory control problem. Here \(L\) is the fixed and known lead time, and \(D\) is an unknown parameter of the demand distribution, described roughly as the number of time steps needed to generate enough demand to deplete one unit of inventory. Notably, even though the state space of the underlying MDP is continuous and \(L\)-dimensional, our regret bounds depend linearly on \(L\). Our results significantly improve the previously best known regret bounds for this problem, where the dependence on \(L\) was exponential and many further assumptions on the demand distribution were required. The techniques presented here may be of independent interest for other settings that involve large structured MDPs but with convex cost functions.
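The policy class studied in the abstract can be illustrated with a short simulation. The sketch below is not the paper's learning algorithm; it only shows what a base-stock policy with lead time and lost sales looks like, and how its long-run average cost (the quantity the regret bound compares against) can be estimated by simulation. The cost parameters `h` (holding) and `p` (lost-sales penalty), the starting inventory, and the function names are illustrative assumptions.

```python
import random

def simulate_base_stock(S, L, demand_sampler, T, h=1.0, p=2.0, seed=0):
    """Estimate the average per-period cost of a base-stock policy.

    S: base-stock level (order up to S units of inventory position).
    L: fixed, known lead time in periods.
    demand_sampler: callable taking an RNG and returning one period's demand.
    T: number of simulated periods.
    h, p: holding cost per unit held and penalty per unit of lost sales
    (illustrative values, not from the paper).
    """
    rng = random.Random(seed)
    on_hand = S                # assumed starting inventory
    pipeline = [0] * L         # outstanding orders, oldest first
    total_cost = 0.0
    for _ in range(T):
        if L > 0:
            on_hand += pipeline.pop(0)       # order placed L periods ago arrives
        # order up to S based on inventory position = on-hand + pipeline
        order = max(0, S - (on_hand + sum(pipeline)))
        if L > 0:
            pipeline.append(order)
        else:
            on_hand += order                 # zero lead time: arrives immediately
        d = demand_sampler(rng)
        sales = min(on_hand, d)              # censoring: only sales are observable
        lost = d - sales                     # lost sales (unmet demand vanishes)
        on_hand -= sales
        total_cost += h * on_hand + p * lost
    return total_cost / T

# Example: compare two base-stock levels under unit deterministic demand.
cost_s0 = simulate_base_stock(0, 1, lambda rng: 1, 10)  # never stocks: all demand lost
cost_s2 = simulate_base_stock(2, 1, lambda rng: 1, 10)  # covers demand over the lead time
```

A learning algorithm in the paper's setting would tune \(S\) online from censored sales observations; here the level is fixed and the demand law is known to the simulator.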
Recommendations
- Technical note: Perishable inventory systems: convexity results for base-stock policies and learning algorithms under censored demand
- Dynamic Inventory Control with Fixed Setup Costs and Unknown Discrete Demand Distribution
- Near-optimal regret bounds for reinforcement learning
- Regret bounds for restless Markov bandits
- Nonparametric learning algorithms for joint pricing and inventory control with lost sales and censored demand
Cites work
- A nonparametric asymptotic analysis of inventory planning with censored demand
- A note on the convexity of performance measures of M/M/c queueing systems
- An adaptive algorithm for finding the optimal base-stock policy in lost sales inventory systems with censored demand
- Asymptotic optimality of order-up-to policies in lost sales inventory systems
- Foundations of inventory management
- Lost-Sales Problems with Stochastic Lead Times: Convexity Results for Base-Stock Policies
- Lost-sales inventory theory: a review
- Near-optimal regret bounds for reinforcement learning
- Non-stationary stochastic optimization
- Note—On the Marginal Benefit of Adding Servers to G/GI/m Queues
- Old and New Methods for Lost-Sales Inventory Systems
- Optimal Server Allocation in a System of Multi-Server Stations
- Partial monitoring -- classification, regret bounds, and algorithms
Cited in (5)
- Deep reinforcement learning for inventory control: a roadmap
- UCB-type learning algorithms with Kaplan-Meier estimator for lost-sales inventory models with lead times
- A least squares temporal difference actor–critic algorithm with applications to warehouse management
- scientific article (zbMATH DE number 7626725)
- Reward shaping to improve the performance of deep reinforcement learning in perishable inventory management