Abstract
Suppose that π is a policy for resource allocation in a stochastic environment and π* is an optimal policy. Two existing procedures for policy evaluation are described and compared. Both of these evaluate π by means of upper bounds on R(π*) − R(π), the total reward lost when making resource allocations according to π rather than π*. The bounds developed by these two methods are called Type 1 and Type 2. We demonstrate by example that neither of these procedures dominates the other in the sense of always yielding tighter bounds. A modification to Type 2 bounds is proposed, resulting in an improved procedure which always dominates the Type 1 approach.
Publisher
Cambridge University Press (CUP)
Subject
Statistics, Probability and Uncertainty; General Mathematics; Statistics and Probability
Cited by
1 article.