A Universal Scheme for Learning

Vivek Farias -- Stanford University

Consider an agent interacting with an unmodeled environment. At each time, the agent makes an observation, takes an action and incurs a cost. Its actions can influence future observations and costs. The goal is to choose actions so as to minimize the long- run average cost. We propose an algorithm based on ideas from the Lempel-Ziv scheme for universal data compression combined with dynamic programming. We establish that, if there exists an integer K such that the future is conditionally independent of the past given a window of K consecutive actions and observations, the average cost under this algorithm converges to the optimum.

This is joint work with Ciamac Moallemi, Ben Van Roy and Tsachy Weissman at Stanford University.