In [1]:
from IPython.display import HTML

# Apply a simple CSS style to the notebook's Markdown text (larger, blue).
style = "<style> p,ul { font-size: 18px; color: blue;}</style>"
HTML(style)
Out[1]:

Markov Notebooks

Chris Connolly, SRI International

1. Introduction

This set of notebooks explores the connections among Markov chains, potential theory, operant conditioning, and reinforcement learning. In particular, the harmonic functions for a Markov chain provide the backbone for these connections: they answer the question of long-term ("infinite horizon") behavior. While Markov chains are defined over graphs, the notebooks provide specific examples using grid worlds and binary trees. These notebooks are motivated by the following issues:
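
For readers who want something concrete before diving in, the cell below is a minimal sketch (illustrative only, not the notebooks' own code) of a grid-world Markov chain: an agent steps to a uniformly random neighbor, two cells are absorbing, and the probability of reaching the goal before the trap is found by solving the linear system that defines a discrete harmonic function, whose value at each non-absorbing cell is the transition-probability-weighted average of its neighbors' values. All names (GOAL, TRAP, etc.) are invented for this example.

In [ ]:
import numpy as np

# A small grid world: from each non-absorbing cell the chain moves to a
# uniformly random 4-neighbor (staying put when it would step off the grid).
n = 5
GOAL, TRAP = (0, 0), (n - 1, n - 1)          # absorbing cells
cells = [(r, c) for r in range(n) for c in range(n)]
index = {cell: i for i, cell in enumerate(cells)}

P = np.zeros((n * n, n * n))
for (r, c) in cells:
    i = index[(r, c)]
    if (r, c) in (GOAL, TRAP):
        P[i, i] = 1.0                        # absorbing state
        continue
    nbrs = [(r + dr, c + dc) for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]]
    nbrs = [(rr, cc) if 0 <= rr < n and 0 <= cc < n else (r, c)
            for rr, cc in nbrs]              # bounce off walls
    for nb in nbrs:
        P[i, index[nb]] += 0.25

# Boundary-value problem: h = P h on non-absorbing states,
# h(GOAL) = 1, h(TRAP) = 0.  Restrict the system to the transient states.
transient = [i for i, cell in enumerate(cells) if cell not in (GOAL, TRAP)]
Q = P[np.ix_(transient, transient)]          # transient-to-transient block
b = P[np.ix_(transient, [index[GOAL]])]      # one-step probability into GOAL
h_transient = np.linalg.solve(np.eye(len(transient)) - Q, b).ravel()

h = np.zeros(n * n)
h[index[GOAL]] = 1.0
h[transient] = h_transient
print(h.reshape(n, n).round(3))              # hitting probabilities on the grid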

  • Perceived discrepancies between machine and biological reinforcement learning: Machine reinforcement learning (RL) appears to violate behavioral observations. Specifically, temporal-difference (TD) learning is an exponentially discounted credit-assignment procedure, whereas behavioral studies indicate that mammalian behavior is better modeled by hyperbolic discounting (a brief numerical sketch follows this list). Potential theory could help to reconcile this discrepancy.
  • The connection between probability theory and potential theory raises design possibilities for analog or other specialized hardware for reinforcement learning. How practical is this possibility?
  • Folding these two themes together, what are some realistic biological substrates for harmonic function (or equivalently, value function) computation?
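
As a quick illustration of the first point above, the cell below contrasts the two discounting schemes numerically; the parameter values gamma and k are arbitrary and chosen only to show the qualitative difference at long delays.

In [ ]:
import numpy as np

# Present value of a unit reward delayed by t steps under the two schemes.
# gamma (exponential) and k (hyperbolic) are illustrative parameters only.
t = np.arange(51)
gamma, k = 0.9, 0.2

exp_discount = gamma ** t             # TD-style exponential discounting
hyp_discount = 1.0 / (1.0 + k * t)    # hyperbolic discounting, 1 / (1 + k t)

for delay in (0, 5, 20, 50):
    print(f"t={delay:2d}  exponential={exp_discount[delay]:.4f}  "
          f"hyperbolic={hyp_discount[delay]:.4f}")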

A larger question is whether potential theory helps to develop useful intuitions about value and Q functions in machine RL. This particular notebook serves as a tour through the correspondences between probability theory and potential theory, since each harmonic function is a kind of "classical potential" as defined using the Laplace operator. Going further, deep RL appears to encode (after learning) the value function corresponding to a given problem / task. Does potential theory offer any insight into deep RL behavior or design? Harmonic functions obey certain hard constraints. Is it possible to exploit these for constructing formal abstractions of neural processing?
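
To make those hard constraints concrete, one standard statement (consistent with the Laplace-operator view mentioned above) is the mean-value condition: for a chain with transition matrix $P$, a function $h$ is harmonic when

$$ (\Delta h)(x) \;=\; \sum_{y} P(x,y)\,h(y) \;-\; h(x) \;=\; 0 $$

at every non-absorbing state $x$, with $h$ pinned to prescribed boundary values on the absorbing states. The grid-world cell above solves exactly this boundary-value problem.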

Note: The potential-theory viewpoint may provide an RL update rule that converges to the theoretical value function more rapidly than TD. The RL notebook (section 5) explores this.

One major goal of these notebooks is to arrive at a mathematical / computational abstraction for biological trial-and-error learning that, in principle, describes aspects of reward prediction error and action selection. The mathematical object that appears to serve as a good abstraction is the Markov chain. As an abstract model of biological learning, Markov chains could correspond to electrically coupled cell networks in the striatum (the parvalbumin interneurons, or the astrocytes). The reward prediction error, of course, is signaled by the dopamine neurons, which project broadly to the striatum and supply a kind of training signal.

Contents

  • Markov Chains & hitting probabilities
  • Tree worlds
  • Agents
  • Reinforcement Learning:
    • The Absorbing Case
    • The Non-Absorbing Case
  • Biological Relevance
  • Note: A comparison of algorithm behavior can be found here, contrasting eligibility traces / TD with eligibility blankets / potential-theory methods for learning value functions.