Learning in Hybrid Active Inference Models (2024)


11institutetext: School of Engineering and Informatics, University of Sussex, Brighton, UK
11email: {pzc20, rs773, p.kinghorn, c.l.buckley}@sussex.ac.uk,
22institutetext: VERSES AI Research Lab, Los Angeles, California, USA

Poppy Collis¹ (corresponding author), Ryan Singh¹²†, Paul F Kinghorn¹, Christopher L Buckley¹²

Abstract

An open problem in artificial intelligence is how systems can flexibly learn discrete abstractions that are useful for solving inherently continuous problems. Previous work in computational neuroscience has considered this functional integration of discrete and continuous variables during decision-making under the formalism of active inference [13, 29]. However, the focus of that work is on the expressive physical implementation of categorical decisions, and the hierarchical mixed generative model is assumed to be known. As a consequence, it is unclear how this framework might be extended to the learning of appropriate coarse-grained variables for a given task. In light of this, we present a novel hierarchical hybrid active inference agent in which a high-level discrete active inference planner sits above a low-level continuous active inference controller. We make use of recent work on recurrent switching linear dynamical systems (rSLDS), which learn meaningful discrete representations of complex continuous dynamics via piecewise linear decomposition [22]. The representations learnt by the rSLDS inform the structure of the hybrid decision-making agent and allow us to (1) lift decision-making into the discrete domain, enabling us to exploit information-theoretic exploration bonuses, (2) specify temporally-abstracted sub-goals in a manner reminiscent of the options framework [34], and (3) ‘cache’ the approximate solutions to low-level problems in the discrete planner. We apply our model to the sparse Continuous Mountain Car task, demonstrating fast system identification via enhanced exploration and successful planning through the delineation of abstract sub-goals.

Keywords:

hybrid state-space models, decision-making, piecewise affine systems

† Equal contribution

1 Introduction

In a world that is inherently high-dimensional and continuous, the brain’s capacity to distil and reason about discrete concepts represents a highly desirable feature in the design of autonomous systems. Humans are able to flexibly specify abstract sub-goals during planning, thereby reducing complex problems into manageable chunks [26, 16]. Indeed, translating problems into discrete space offers distinct advantages in decision-making systems. For one, discrete states admit the direct implementation of classical techniques from decision theory, such as dynamic programming [21]. Furthermore, discrete spaces make the application of information-theoretic measures (e.g. information gain) computationally feasible: such measures generally require approximations in continuous settings but have closed-form solutions in the discrete case [12]. While the prevailing method for translating continuous variables into discrete representations involves simple grid-based discretisation of the state-space, this becomes extremely costly as the dimensionality increases [7, 24]. We therefore seek to develop a framework which smoothly handles the presence of continuous variables whilst maintaining the benefits of decision-making in the discrete domain.

[Figure 1]

1.1 Hybrid Active Inference

Here, we draw on recent work in active inference (AIF) which has foregrounded the utility of decision-making in discrete state-spaces [8, 12]. Additionally, discrete AIF has been successfully combined with low-level continuous representations and used to model a range of complex behaviours including speech production, oculomotion and tool use [13, 29, 14, 30, 31]. As detailed in [13], such mixed generative models focus on the physical implementation of categorical decisions. This treatment begins with the premise that the world can be described by a set of discrete states evolving autonomously and driving the low-level continuous states by indexing a set of attractors (cf. sub-goals) encoded through priors built into the model (see Fig.1). While the emphasis of the above work is on mapping categorical decision-making to the continuous physical world, here we approach the question of learning the generative model. Specifically, we seek to learn, in full, appropriate discrete representations of the underlying dynamics and their manifestation in continuous space. Importantly, unlike the previous work mentioned here, we focus on instances in which the mapping between the discrete and continuous states is not assumed to be known. In this case, however, the assumption that higher-level discrete states autonomously drive lower-level continuous states (i.e. downward causation) becomes problematic: any failure of the continuous system to carry out a discrete preference must be treated as an autonomous failure at the discrete level. Although useful for planning, this decoupling of the discrete from the continuous components makes it difficult to represent complex dynamics, which in turn creates difficulties in learning.

[Figure 2]

1.2 Recurrent Switching Systems

Previous work has demonstrated that models involving autonomous switching systems are often not sufficiently expressive to approximate realistic generative processes [22]. The authors study this problem in the context of a class of hybrid state-space models known as switching linear dynamical systems (SLDS). These models have been shown to discover meaningful behavioural modes and their causal states via the piecewise linear decomposition of complex continuous dynamics [15, 11]. The authors of [22] remedy the problem of limited expressivity by introducing recurrent switching linear dynamical systems (rSLDS) (see Fig.2). Importantly, these models include a dependency of the high-level discrete transition probabilities on the underlying continuous variables. By capturing the continuous latent causes of switches between discrete states via this additional dependency, the authors demonstrate improved generative capacity and predictive performance. We propose that this richer representation is also useful for decision-making and control: the recurrent transition structure can be exploited such that continuous priors can be flexibly specified for a low-level controller in order to drive the system into a desired region of the state space. Using statistical methods to fit these models not only liberates us from the need to explicitly specify a mapping between discrete and continuous states a priori, but enables effective online discovery of useful non-grid discretisations of the state-space.

1.3 Emergent descriptions for planning

Unfortunately, the inclusion of recurrent dependencies also destroys the neat separation of discrete planning from continuous control, creating unique challenges in performing roll-outs. Our central insight is to re-instate the separation by lifting the dynamical system into the discrete domain only during planning. We do this by approximately integrating out the continuous variables, naturally leading to spatio-temporally abstracted actions and sub-goals. Our discrete planner therefore operates purely at the level of a re-description of the discrete latents, modelling nothing of the autonomous transition probabilities but rather reflecting the transitions that are possible given the discretisation of the continuous state-space. In short, we describe a novel hybrid hierarchical active inference agent [28] in which a discrete Markov decision process (MDP), informed by the representations of an rSLDS, interfaces with a continuous active inference controller implementing closed-loop control. We demonstrate the efficacy of this algorithm by applying it to the classic control task of Continuous Mountain Car [27]. We show that the exploratory bonuses afforded by the emergent discrete piecewise description of the task-space facilitate fast system identification. Moreover, the learnt representations enable the agent to successfully solve this non-trivial planning problem by specifying a series of abstract sub-goals.

2 Related work

Such temporal abstractions are the focus of hierarchical reinforcement learning (HRL), where high-level controllers provide the means for reasoning beyond the clock-rate of the low-level controller’s primitive actions [10, 34, 9, 18]. The majority of HRL methods, however, depend on domain expertise to construct tasks, often through manually predefined subgoals as seen in [35]. Further, efforts to learn hierarchies directly in a sparse environment have typically been unsuccessful [36]. In contrast, our abstractions are a natural consequence of lifting the problem into the discrete domain and can be learnt independently of reward. In the context of control, hybrid models in the form of piecewise affine (PWA) systems have been rigorously examined and are widely applied in real-world scenarios [33, 3, 6]. Previous work has applied a variant of the rSLDS (recurrent autoregressive hidden Markov models) to the optimal control of general nonlinear systems [2, 1]. The authors use these models to approximate expert controllers in a closed-loop behavioural cloning context. While their algorithm focuses on value function approximation, we, in contrast, learn online without expert data and focus on flexible discrete planning.

3 Framework

The following sections detail the components of our Hierarchical Hybrid Agent (HHA). For additional information, please refer to Appendix 0.A.

3.1 Generative Model: rSLDS(ro)

In the recurrent-only (ro) formulation of the rSLDS (see Fig.2), the discrete latent states $z_t \in \{1, 2, \dots, K\}$ are generated as a function of the continuous latents $x_t \in \mathbb{R}^M$ and the control input $u_t \in \mathbb{R}^N$ (specified by some controller) via a softmax regression model,

$$P(z_{t+1} \mid x_t, u_t) = \mathrm{softmax}(W_x x_t + W_u u_t + r) \quad (1)$$

whereby $W_x \in \mathbb{R}^{K \times M}$ and $W_u \in \mathbb{R}^{K \times N}$ are weight matrices and $r \in \mathbb{R}^K$ is a bias. The continuous latent states $x_t$ evolve according to a linear dynamical system indexed by the current discrete state $z_t$:

$$x_{t+1} \mid x_t, u_t, z_t = A_{z_t} x_t + B_{z_t} u_t + b_{z_t} + \nu_t, \quad \nu_t \sim \mathcal{N}(0, Q_{z_t}) \quad (2)$$
$$y_t \mid x_t = C_{z_t} x_t + \omega_t, \quad \omega_t \sim \mathcal{N}(0, S_{z_t}) \quad (3)$$

$A_{z_t}$ is the state transition matrix, which defines how the state $x_t$ evolves in the absence of input. $B_{z_t}$ is the control matrix, which defines how external inputs influence the state of the system, while $b_{z_t}$ is an offset vector. At each time-step $t$, we receive an observation $y_t \in \mathbb{R}^M$ produced by a simple linear-Gaussian emission model with an identity matrix $C_{z_t}$. Both the dynamics of the continuous latents and the observations are perturbed by zero-mean Gaussian noise with covariance matrices $Q_{z_t}$ and $S_{z_t}$ respectively.
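To make the generative model concrete, the following is a minimal NumPy sketch of Eqs. (1)–(2), simulating the recurrent-only rSLDS forward in time. All parameter values here (number of modes, dimensions, random weights) are hypothetical placeholders for illustration, not the fitted values used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
K, M, N = 3, 2, 1   # modes, continuous latent dim, control dim (hypothetical)

# Hypothetical, randomly drawn parameters purely for illustration
Wx = rng.normal(size=(K, M)); Wu = rng.normal(size=(K, N)); r = rng.normal(size=K)
A = [np.eye(M) + 0.01 * rng.normal(size=(M, M)) for _ in range(K)]
B = [0.1 * rng.normal(size=(M, N)) for _ in range(K)]
b = [0.01 * rng.normal(size=M) for _ in range(K)]
Q = [1e-3 * np.eye(M) for _ in range(K)]

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def step(x, u, z):
    """One generative step: Eq. (1) samples the next mode from the
    recurrent softmax; Eq. (2) propagates the continuous latents."""
    z_next = rng.choice(K, p=softmax(Wx @ x + Wu @ u + r))
    x_next = A[z] @ x + B[z] @ u + b[z] + rng.multivariate_normal(np.zeros(M), Q[z])
    return x_next, z_next

x, z = np.zeros(M), 0
for t in range(100):
    u = rng.normal(size=N)   # stand-in for the controller's output
    x, z = step(x, u, z)
```

Note the key structural feature: the discrete transition at each step depends on the current continuous state, which is precisely the recurrence that later lets the planner pick continuous priors that drive mode switches.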

Inference requires approximate methods given that the recurrent connections break conjugacy, rendering the conditional likelihoods non-Gaussian. Therefore, a Laplace variational expectation-maximisation (EM) algorithm is used to approximate the posterior distribution over the latent variables by a mean-field factorisation into separate distributions for the discrete states $q(z)$ and the continuous states $q(x)$. The discrete state distribution is updated via a coordinate ascent variational inference (CAVI) approach by leveraging the forward-backward algorithm. The continuous state distribution is updated using a Laplace approximation around the mode of the expected log joint probability. This involves finding the most likely continuous latent states by maximising the expected log joint probability and computing the Hessian to approximate the posterior. Full details of the Laplace variational EM used for learning are given in [37].
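As an illustration of the discrete half of this mean-field update, the following sketches a standard log-space forward-backward pass of the kind leveraged by the CAVI step. It is a simplification: it takes fixed (expected) log transition scores and log-likelihoods as inputs, whereas in the rSLDS these are themselves expectations under $q(x)$ because of the recurrent dependency.

```python
import numpy as np
from scipy.special import logsumexp

def forward_backward(log_pi0, log_T, log_lik):
    """Log-space forward-backward smoothing for a discrete chain.
    log_T[i, j] = log P(z_{t+1}=j | z_t=i); log_lik[t, k] is the
    (expected) log-likelihood of mode k at time t."""
    Tn, K = log_lik.shape
    la = np.zeros((Tn, K))
    lb = np.zeros((Tn, K))
    la[0] = log_pi0 + log_lik[0]
    for t in range(1, Tn):                        # forward pass
        la[t] = log_lik[t] + logsumexp(la[t - 1][:, None] + log_T, axis=0)
    for t in range(Tn - 2, -1, -1):               # backward pass
        lb[t] = logsumexp(log_T + (log_lik[t + 1] + lb[t + 1])[None, :], axis=1)
    log_post = la + lb
    log_post -= logsumexp(log_post, axis=1, keepdims=True)
    return np.exp(log_post)                       # smoothed marginals q(z_t)

post = forward_backward(np.log(np.array([0.5, 0.5])),
                        np.log(0.5 * np.ones((2, 2))),
                        np.zeros((3, 2)))
```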

The rSLDS is initialised according to the procedure outlined in [22]. In order to learn the rSLDS parameters using Bayesian updates, conjugate matrix normal inverse Wishart (MNIW) priors are placed on the parameters of the dynamical system and recurrence weights. We learn the parameters online via observing the behavioural trajectories of the agent and updating the parameters in batches (every 1000 timesteps of the environment).

3.2 Active Inference

Equipped with a generative model, active inference specifies how an agent can solve decision-making tasks [28]. Policy selection is formulated as a search procedure in which a free energy functional of predicted states is evaluated for each possible policy. Formally, we use an upper bound on the expected free energy ($\mathcal{G}$) given by:

$$\mathcal{G}_{1:T}(\pi) \le \underbrace{-\mathbb{E}_{Q(\mathbf{o}\mid\pi)}\big[D_{KL}[Q(\mathbf{s}\mid\mathbf{o},\pi) \parallel Q(\mathbf{s}\mid\pi)]\big]}_{\text{State Information Gain}} \; \underbrace{-\,\mathbb{E}_{Q(\mathbf{o}\mid\pi)}\big[D_{KL}[Q(\theta\mid\mathbf{o},\pi) \parallel Q(\theta\mid\pi)]\big]}_{\text{Parameter Information Gain}} \; \underbrace{-\,\mathbb{E}_{Q(\mathbf{o}\mid\pi)}[\ln \tilde{p}(o)]}_{\text{Utility}} \quad (4)$$

where $\mathbf{s} = \{s_1, \dots, s_T\}$ and $\mathbf{o} = \{o_1, \dots, o_T\}$ are the states and observations evaluated under a particular policy, or sequence of actions, $\pi = a_{1:T}$. Rewards are integrated into the inference procedure by biasing the agent’s generative model with an optimistic prior over observing desired outcomes, $\tilde{p}(o)$. Action selection then involves converting this into a probability distribution over the set of policies and sampling from that distribution accordingly.
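For a single-step categorical model, the state information gain and utility terms of Eq. (4) can be computed in closed form. The sketch below is illustrative only (parameter information gain is omitted and the function name is our own): information gain is computed as the mutual information between states and observations under the predictive joint.

```python
import numpy as np

def expected_free_energy(qs, A, log_C):
    """One-step state information gain and utility terms of Eq. (4) for a
    categorical model. qs: predicted states Q(s|pi); A[o, s] = P(o|s);
    log_C: log-preferences ln p~(o). Returns G (lower is better)."""
    qo = A @ qs                                   # predicted observations Q(o|pi)
    utility = qo @ log_C                          # E_Q(o|pi)[ln p~(o)]
    # State information gain as the mutual information between s and o
    joint = A * qs                                # joint[o, s] = P(o|s) Q(s)
    with np.errstate(divide="ignore", invalid="ignore"):
        info_gain = np.nansum(joint * (np.log(joint) - np.log(np.outer(qo, qs))))
    return -info_gain - utility

G = expected_free_energy(np.array([0.5, 0.5]), np.eye(2),
                         np.log(np.array([0.5, 0.5])))
```

With an identity likelihood the information gain reduces to the entropy of $Q(s)$, which is consistent with the maximum-entropy regulariser discussed in Sec. 3.3.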

[Figure 3]

3.3 Discrete Planner

In order to create approximate plans at the discrete level, we derive a high-level planner based on a re-description of the discrete latent states found by the rSLDS by approximately ‘integrating out’ the continuous variables and the continuous prior. This process involves calculating the expected free energy ($\mathcal{G}$) for a continuous controller to drive the system from one mode to another. Importantly, the structure of the lifted discrete state transition model is constrained by the polyhedral partition of the continuous state space extracted from the parameters of the rSLDS (for a visualisation of this partitioning, see Fig.4(a)): invalid transitions are assigned zero probability while valid transitions are assigned a high probability. In order to generate the possible transitions from the rSLDS, we calculate the set of active constraints for each region from the softmax representation, $p(z \mid x) = \sigma(Wx + b)$. Specifically, to check that region $i$ is adjacent to region $j$, we verify the solution using a linear program,

$$-b_j = \min_x \,(W_i - W_j)x \quad (5)$$
$$\text{s.t. } (W_i - W_k)x \le (b_i - b_k) \;\; \forall k \in [K] \quad (6)$$
$$\phantom{\text{s.t. }} x \in (x_{lb}, x_{ub}) \quad (7)$$

where $(x_{lb}, x_{ub})$ are bounds chosen to reflect realistic values for the problem. This ensures we only lift transitions into the discrete model if they are possible. After integration, we are left with a discrete MDP which contains averaged information about all of the underlying continuous quantities, including the transitions that the structure of the task space allows and their corresponding approximate control costs (see 0.A.2). Note that after each batch update of the rSLDS parameters, this discrete planner must be refitted accordingly.
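An adjacency check in the spirit of Eqs. (5)–(7) can be written with `scipy.optimize.linprog`: minimise the margin towards the $j$-boundary over region $i$ and declare the regions adjacent if that boundary value is attained within the bounds. This is our reading of the linear program, with sign conventions chosen so that region $i$ is the set where mode $i$ maximises the softmax score; the exact formulation used in the paper's appendix may differ.

```python
import numpy as np
from scipy.optimize import linprog

def regions_adjacent(i, j, W, b, lb, ub, tol=1e-6):
    """Check whether softmax region i borders region j within box bounds.
    Region i is where mode i maximises W_k x + b_k, i.e.
    (W_k - W_i) x <= b_i - b_k for all k."""
    K = W.shape[0]
    A_ub = np.array([W[k] - W[i] for k in range(K) if k != i])
    b_ub = np.array([b[i] - b[k] for k in range(K) if k != i])
    c = W[i] - W[j]   # minimise the margin towards the j-boundary
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=list(zip(lb, ub)))
    # Within region i the objective is bounded below by b_j - b_i;
    # adjacency holds iff that boundary value is attained.
    return bool(res.success and res.fun <= (b[j] - b[i]) + tol)

# 1-D example with three modes: region 1 separates regions 0 and 2
W = np.array([[0.0], [1.0], [2.0]])
b = np.array([0.0, 1.0, 0.0])
```

On this example, regions 0 and 1 share a boundary (at $x = -1$) while regions 0 and 2 do not, so only the former transition would be lifted into the discrete model.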

The lifted discrete generative model has all the components of a standard POMDP in the active inference framework:

$$P(o_{1:T}, s_{1:T}, A, B, \pi) = P(\pi)P(A)P(B)P(s_0)\prod_t P(s_t \mid s_{t-1}, B, \pi)\, P(o_t \mid s_t, A) \quad (8)$$

along with a prior over policies, $P(\pi) = \mathrm{Cat}(E)$, and a preference distribution, $\tilde{P}(o_t) = \mathrm{Cat}(C)$. Specifically, our lifted $P(\pi)$ reflects the approximate control costs of each continuous transition and $\tilde{P}(o_t)$ reflects the reward available in each mode. We assume an identity mapping between states and observations, meaning the state information gain term in Eq.4 collapses into a maximum-entropy regulariser, while we maintain Dirichlet priors over the transition parameters $B$, facilitating directed exploration. Due to the conjugate structure, Bayesian updates amount to a simple count-based update of the Dirichlet parameters [25]. At each time step, the discrete planner selects a policy by sampling from the following distribution:

$$Q(\pi) = \mathrm{softmax}(-G(\pi) + \ln P(\pi)) \quad (9)$$
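Both policy selection via Eq. (9) and the count-based Dirichlet update over $B$ are simple in practice; a minimal sketch (function names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)

def select_policy(G, log_prior):
    """Eq. (9): sample a policy index from softmax(-G(pi) + ln P(pi))."""
    logits = -np.asarray(G) + np.asarray(log_prior)
    q = np.exp(logits - logits.max())
    q /= q.sum()
    return rng.choice(len(q), p=q), q

def update_dirichlet(alpha, s_prev, s_next):
    """Count-based Bayesian update of the Dirichlet parameters over B
    after observing the discrete transition s_prev -> s_next."""
    alpha = alpha.copy()
    alpha[s_next, s_prev] += 1.0                        # one observed transition
    B_mean = alpha / alpha.sum(axis=0, keepdims=True)   # posterior mean of B
    return alpha, B_mean

idx, q = select_policy([0.0, 10.0], [0.0, 0.0])
alpha, B_mean = update_dirichlet(np.ones((2, 2)), 0, 1)
```

Low Dirichlet counts translate into high parameter information gain for the corresponding transitions, which is what drives the directed exploration described above.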

The policy is then communicated to the continuous controller. Specifically, the first action of the selected policy is a requested transition $i \rightarrow j$ and is translated into a continuous control prior $\tilde{p}(x) \sim N(x_j, \Sigma_j)$ via the following link function,

$$x_j = \underset{x}{\arg\max}\; P(z = j \mid x, u) \quad (10)$$

whereby we numerically optimise for a point in space up to some probability threshold, $T$ (for details on this optimisation, see 0.A.6). These priors represent an approximately central point in the desired discrete region $j$ requested by the action $a^j$. Note that these priors only need to be calculated once per refit of the rSLDS. The discrete planner infers its current state $s_\tau$ from observing $z_t$. Importantly, the discrete planner is only triggered when the system switches into a new mode, or when a maximum dwell-time (a hyperparameter) is reached. In this sense, discrete actions are temporally abstracted and decoupled from continuous clock-time in a method reminiscent of the options framework [34].
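One way to realise the link function of Eq. (10) is gradient ascent on $\ln P(z = j \mid x)$, stopping once the probability of mode $j$ exceeds the threshold $T$. The sketch below follows that idea; the actual optimisation is described in 0.A.6, and the learning rate and iteration budget here are arbitrary choices of ours.

```python
import numpy as np

def mode_centre(j, Wx, r, x0, lr=0.1, thresh=0.9, max_iter=500):
    """Eq. (10): ascend log P(z=j | x) = log softmax(Wx x + r)_j from x0
    until the probability of mode j exceeds the threshold."""
    x = np.array(x0, dtype=float)
    p = None
    for _ in range(max_iter):
        logits = Wx @ x + r
        p = np.exp(logits - logits.max())
        p /= p.sum()
        if p[j] >= thresh:
            break
        # gradient of log p_j with respect to x: (e_j - p)^T Wx
        x += lr * (np.eye(len(p))[j] - p) @ Wx
    return x, p[j]

# Two modes split along one dimension: the centre of mode 0 lies at x > 0
x_j, p_j = mode_centre(0, np.array([[1.0], [-1.0]]), np.zeros(2), np.zeros(1))
```

The returned point serves as the mean of the Gaussian control prior $\tilde{p}(x)$ handed to the continuous controller.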


[Figure 4]

3.4 Continuous controller

Continuous closed-loop control is handled by a set of continuous active inference controllers. For controlling the transition from mode $i$ to mode $j$ (from $x_i$ to $x_j$), the objective of the controller is to minimise the following (discrete-time) expected free energy functional (as shown in [20], linear state-space models preclude state information gain terms, leaving the simplified form seen here):

$$G_{ij}(\pi) = \mathbb{E}_{q(\cdot \mid x_0 = x_i, \pi)}\Big[(x_S - x_j)^T Q_f (x_S - x_j) + \sum_{t=0}^{S} u_t^T (R - \Pi_t^u)\, u_t\Big] + \ln \det \Pi \quad (11)$$

where $S$ is the finite time horizon and the quadratic terms derive from Gaussian preferences over the final state, $\tilde{p}_j(x_S) \sim N(x_j, Q_f^{-1})$, and a time-invariant control input prior, $p(u_t) \sim N(0, R^{-1})$ (0.A.4). Importantly, we design the control priors such that the controller only provides solutions within the environment’s given constraints (for further discussion, see Sec.5). The approximate closed-loop solution to each of these sub-problems is computed offline each time the rSLDS is refitted (see 0.A.3), using the updated parameters of the linear dynamical systems, allowing for fast discrete-only planning online.
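If the precision corrections $\Pi_t^u$ and $\ln\det\Pi$ in Eq. (11) are dropped, minimising the remaining quadratic objective in error coordinates $e_t = x_t - x_j$ is a standard finite-horizon LQR problem, whose feedback gains can be cached offline by a backward Riccati recursion. A sketch under that simplifying assumption (and ignoring the affine offset $b_{z_t}$, which in practice requires a feedforward term):

```python
import numpy as np

def finite_horizon_lqr(A, B, Qf, R, S):
    """Backward Riccati recursion for the quadratic part of Eq. (11),
    written in error coordinates e_t = x_t - x_j (no running state cost,
    terminal cost Qf, control cost R). Returns gains K_t for t = 0..S-1,
    applied as u_t = -K_t e_t."""
    P = Qf.copy()
    gains = []
    for _ in range(S):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = A.T @ P @ (A - B @ K)    # cost-to-go update, zero running state cost
        gains.append(K)
    return gains[::-1]               # reverse so gains[0] is the gain for t = 0

# Scalar example: drive an initial error of 1.0 towards zero over S = 5 steps
A = np.array([[1.0]]); B = np.array([[1.0]])
Qf = np.array([[1.0]]); R = np.array([[1e-4]])
gains = finite_horizon_lqr(A, B, Qf, R, 5)
e = np.array([1.0])
for K in gains:
    e = (A - B @ K) @ e
```

Because the gains are closed-loop, the controller can absorb in-region disturbances without re-planning, which is the caching behaviour exploited by the discrete planner.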

4 Results


[Figure 5]

To evaluate the performance of our HHA model, we applied it to the classic control problem of Continuous Mountain Car. This problem is particularly relevant for our purposes due to the sparse nature of the rewards, necessitating effective exploration strategies to achieve good performance. We find that the HHA finds piecewise affine approximations of the task-space and uses these discrete modes effectively to solve the task. Fig.4 shows that while the rSLDS has divided up the space according to position, velocity and control input, the modes useful for solving the task are those found in the position space. Once the goal and a good approximation to the system have been found, the HHA successfully and consistently navigates to the reward. This can be seen in the example trajectories (Fig.4b), where the agent starts at the central position $[0, 0]$ and proceeds to rock back and forth within the valley until enough momentum is gained for the car to reach the flag at a position of 0.5. The episode terminates once the reward has been reached and the agent is re-spawned at the origin before repeating the same successful solution.

Fig.5 shows that the HHA performs a comprehensive exploration of the state-space, and significant gains in state-space coverage are observed when information gain is included in policy selection compared to when it is not. Indeed, our model competes with the state-space coverage achieved by model-based algorithms with exploratory enhancements in the discrete Mountain Car task, which is inherently easier to solve.

We compare the performance of the HHA to model-free reinforcement learning baselines (Actor-Critic and Soft Actor-Critic) and find that the HHA both finds the reward and capitalises on its experience significantly more quickly than the other models (see Fig. 6). Given both the sparse nature of the task and the poor exploratory performance of random actions in the continuous space, these RL baselines struggle to find the goal within 20 episodes without the implementation of reward-shaping techniques. Given the high sample complexity of these algorithms, our model significantly outperforms these baselines in this task.


5 Discussion

The emergence of non-grid discretisations of the state-space allows us to perform fast system identification via enhanced exploration, and successful non-trivial planning through the delineation of abstract sub-goals. The time spent exploring each region is not based on Euclidean volume, which helps mitigate the curse of dimensionality that grid-based methods suffer from. Interestingly, even without information-gain, the area covered by our hybrid hierarchical agent is still notably better than that of random continuous action control (see Fig. 5c). This is because the agent still operates at the level of the non-grid discretisation of the state-space, which significantly reduces the dimensionality of the search space in a behaviourally relevant way.

Such a piecewise affine approximation of the space will incur some loss of optimality in the long run when pitted against black-box approximators. This is because we cache only approximate closed-loop solutions to control within each piecewise region, whilst the discrete planner implements open-loop control. However, this approach eases the online computational burden for flexible re-planning: in the presence of noise or perturbations within a region, the controller may adapt without any new computation. This is in contrast to other nonlinear model-based algorithms like model predictive control, where reacting to disturbances requires expensive trajectory optimisation at every step [32]. By using the piecewise affine framework, we maintain functional simplicity and interpretability through structured representation. We therefore suggest that this method is amenable to future alignment with a control-theoretic approach to safety guarantees for ensuring robust system performance and reliability. Indeed, such use of discrete approximations to continuous trajectories has been shown to improve the ability to handle uncertainty. Evidence of the efficacy of this kind of approach in machine learning applications has been exhibited in recent work by [5], which examined the problem of compounding error in imitation learning from expert demonstration. The authors demonstrated that applying a set of primitive controllers to discrete approximations of the expert trajectory effectively mitigated the accumulation of error by ensuring local stability within each chunk.

We acknowledge there may be better solutions for dealing with control input constraints than the one given in Sec. 3.4. Different approaches have been taken to the problem of implementing constrained LQR control, such as further piecewise approximation based on defining reachability regions for the controller [4].

6 Conclusion

In summary, the successful application of our hybrid hierarchical active inference agent in the Continuous Mountain Car problem showcases the potential of recurrent switching linear dynamical systems (rSLDS) for enhancing decision-making and control in complex environments. By leveraging rSLDS to discover meaningful coarse-grained representations of continuous dynamics, our approach facilitates efficient system identification and the formulation of abstract sub-goals that drive effective planning. This method reveals a promising pathway for the end-to-end learning of hierarchical mixed generative models for active inference, providing a framework for tackling a broad range of decision-making tasks that require the integration of discrete and continuous variables. The success of our agent in this control task demonstrates the value of such hybrid models in achieving both computational efficiency and flexibility in dynamic, high-dimensional settings.

Acknowledgements

This work was supported by The Leverhulme Trust through the be.AI Doctoral Scholarship Programme in Biomimetic Embodied AI. Additionally, this research received funding from the European Innovation Council via the UKRI Horizon Europe Guarantee scheme as part of the MetaTool project. We gratefully acknowledge both funding sources for their support.

Disclosure of Interests

The authors have no competing interests to declare that are relevant to the content of this article.

Appendix 0.A Appendix

0.A.1 Framework

Optimal Control To motivate our approximate hierarchical decomposition, we adopt the optimal control framework; specifically, we consider discrete-time state-space dynamics of the form:

$x_{t+1} = f(x_t, u_t, \eta_t)$ (12)

with known initial condition $x_0$ and noise $\eta_t$ drawn from some time-invariant distribution, $\eta_t \sim D$, where we assume $f$ induces a transition density $p(x_{t+1} \mid x_t, u_t)$ that is a valid probability density throughout. We write $c_t : X \times U \rightarrow \mathbb{R}$ for the control cost function at time $t$ and let $\mathbb{U}$ be the set of admissible (non-anticipative, continuous) feedback control laws, possibly restricted by affine constraints. The optimal control law for the finite-horizon problem is given as:

$J(\pi) = \mathbb{E}_{x_0, \pi}\left[\sum_{t=0}^{T} c_t(x_t, u_t)\right]$ (13)
$\pi^* = \arg\min_{\pi \in \mathbb{U}} J(\pi)$ (14)

PWA Optimal Control The fact that we do not have access to the true dynamical system $f$ motivates the use of a piecewise affine (PWA) approximation, also known as a hybrid system:

$x_{t+1} = A_i x_t + B_i u_t + \epsilon_t$ (15)
$\text{when } (x_t, u_t) \in H_i$ (16)

where $\mathbb{H} = \{H_i : i \in [K]\}$ is a polyhedral partition of the space $X \times U$. In the case of a quadratic cost function, it can be shown that the optimal control law for such a system is piecewise linear. Further, there exist many completeness (universal approximation) theorems for piecewise linear approximations, implying that if the original system is controllable, there will exist a piecewise affine approximation through which the system remains controllable [3, 6].
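As a toy illustration of Eqs. 15–16 (an invented one-dimensional example, not the system used in the paper), consider a PWA map with $K = 2$ regions split at the origin. Each affine map pushes the state across the boundary, producing a stable period-2 limit cycle — a minimal analogue of the mode-switching trajectories the rSLDS recovers.

```python
# Toy piecewise affine (hybrid) system with K = 2 modes partitioned at x = 0:
#   mode 0 (x <  0): x' = 0.5 x + 1
#   mode 1 (x >= 0): x' = 0.5 x - 1
# The two-step composition x -> 0.25 x + 0.5 is a contraction with fixed
# point 2/3, so trajectories settle into a cycle between +2/3 and -2/3.
MODES = {0: (0.5, 1.0), 1: (0.5, -1.0)}

def region(x):
    """Discrete label i such that x lies in H_i."""
    return 0 if x < 0 else 1

def simulate(x0, n):
    """Return the visited (mode, state) pairs for n steps of the PWA map."""
    x, traj = x0, []
    for _ in range(n):
        i = region(x)
        a, b = MODES[i]
        x = a * x + b
        traj.append((i, x))
    return traj
```

Running `simulate(0.0, 100)` shows the modes alternating on every step while the state converges to the $\pm 2/3$ cycle, i.e. the continuous dynamics induce a perfectly predictable discrete mode sequence.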

Relationship to rSLDS We perform a canonical decomposition of the control objective $J$ in terms of the components or modes of the system. By slight abuse of notation, let $[x_t = i] := [(x_t, u_t) \in H_i]$ denote the Iverson bracket.

$J(\pi) = \sum_t \int p_\pi(x_t \mid x_{t-1}, u_t)\, c_t(x_t, u_t)\, dx_t\, dx_{t-1}$ (17)
$\phantom{J(\pi)} = \sum_t \int \sum_{i \in [K]} [x_{t-1} = i]\, p_\pi(x_t \mid x_{t-1}, u_t)\, c_t(x_t, u_t)\, dx_t\, dx_{t-1}$ (18)

Now let $z_t$ be the random variable on $[K]$ induced by $z_t = i$ if $[x_t = i]$; we can then rewrite the above more concisely as

$J(\pi) = \sum_t \int \sum_{i \in [K]} p_\pi(x_t, z_{t-1} = i \mid x_{t-1}, u_t)\, c_t(x_t, u_t)\, dx_t\, dx_{t-1}$ (19)
$\phantom{J(\pi)} = \sum_{i \in [K]} \sum_t \int p_\pi(x_t, z_{t-1} = i \mid x_{t-1}, u_t)\, c_t(x_t, u_t)\, dx_t\, dx_{t-1}$ (20)
$\phantom{J(\pi)} = \sum_{i \in [K]} \sum_t \mathbb{E}_{\pi_i}[c_t(x_t, u_t)]$ (21)

which is just the expectation under a recurrent dynamical system with deterministic switches. Later (see 0.A.5), we exploit the non-deterministic switches of the rSLDS in order to drive exploration. Eq. 21 demonstrates that the global problem can be partitioned into solving problems within each region (inner expectation) and a global discrete problem which decides which sequence of regions to visit. In the next section, we introduce a new set of variables which allows us to approximately decouple these problems.

0.A.2 Hierarchical Decomposition

Our aim is to decouple the discrete planning problem from the fast low-level controller. In order to break down the control objective in this manner, we first create a new discrete variable which simply tracks the transitions of $z$; this allows the discrete planner to function in a temporally abstracted manner.

Decoupling from clock time Let the random variable $(\zeta_s)_{s>0}$ record the transitions of $(z_t)_{t>0}$, i.e. let

$\tau_s(\tau_{s-1}) = \min\{t : z_{t+1} \neq z_t,\ t > \tau_{s-1}\}, \quad \tau_0 = 0$ (22)

be the sequence of first exit times; then $\zeta$ is given by $\zeta_s = z_{\tau_s}$. With these variables in hand, we frame a small section of the global problem as a first-exit problem.
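The construction of $\zeta$ from $z$ in Eq. 22 amounts to a run-length encoding of the mode sequence; a minimal sketch (function name hypothetical):

```python
def first_exit_times(z):
    """Return (tau, zeta) for a discrete mode sequence z:
    tau_s are the first exit times (t > tau_{s-1} with z[t+1] != z[t])
    and zeta_s = z[tau_s] is the mode occupied at each exit."""
    tau = [t for t in range(1, len(z) - 1) if z[t + 1] != z[t]]
    zeta = [z[t] for t in tau]
    return tau, zeta
```

For example, the sequence `[0, 0, 1, 1, 1, 2, 2, 0]` yields exit times `[1, 4, 6]` and the temporally abstracted mode sequence `[0, 1, 2]`, which is what the discrete planner operates on.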

Low level problems Consider the first-exit problem for exiting region $i$ and entering region $j$, defined by:

$\pi_{ij}(x_0) = \arg\min_{\pi, S} J_{ij}(\pi, x_0, S)$ (23)
$J_{ij}(\pi, x_0, S) = \mathbb{E}_{\pi, x_0}\left[\sum_{t=0}^{S} c(x_t, u_t)\right]$ (24)
$\text{s.t. } (x_t, u_t) \in H_i$ (25)
$\text{s.t. } c(x, u) = 0 \text{ when } (x, u) \in \partial H_{ij}$ (26)

where $\partial H_{ij}$ is the boundary $H_i \cap H_j$. Due to the convexity of the polyhedral partition, the full objective admits a decomposition in terms of these subproblems,

$J(\pi) = \sum_s J_{\zeta(s+1), \zeta(s)}(\pi, x_{t_s}, t_{s+1} - t_s)$ (27)

Ideally, we would like to simply solve all possible subproblems $\{J_{ij}^*(x) : i, j \in [K] \times [K]\}$ and then find a sequence of discrete states, $\zeta(1), \ldots, \zeta(S)$, which minimises the sum of the sub-costs. However, notice that each sub-cost depends on the starting state, which is in turn determined by the final state of the previous problem. A pure separation into discrete and continuous problems is therefore not possible without a simplifying assumption.

Slow and fast mode assumption The goal is to tackle the decomposed objectives individually; however, the hidden constraint that the trajectories line up presents a computational challenge. Here we make the assumption that the difference in cost induced by different starting positions within a region is much less than the expected difference in cost of starting in a different region. This assumption justifies using an average cost for the low-level problems when constructing the high-level problem.

High level problem We let $J_{ij}^* = \min_\pi \int_{x_0} J_{ij}(\pi, x_0)\, p(x_0)\, dx_0$ be the average cost of each low-level problem. We form a Markov decision process by introducing abstract actions $a \in [K]$:

$p_{ik}(a) = \mathbb{P}(\zeta_{s+1} = k \mid \zeta_s = i, a = j, \pi_{ij}^*)$ (28)

and let $p_{\pi_d}$ be the associated distribution over trajectories induced by some discrete state-feedback policy. Along with the discrete state-action cost $c_d(a = j, \zeta = i) = J_{ij}^*$, we may write the high-level problem:

$\pi_d^* = \arg\min_{\pi_d} J_d(\pi_d, \zeta_0)$ (29)
$J_d(\pi_d, \zeta_0) = \mathbb{E}_{p_{\pi_d}, \zeta_0}\left[\sum_{s=0}^{S} c_d(a_s, \zeta_s)\right]$ (30)
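Eqs. 29–30 define a standard finite-horizon discrete MDP, which can be solved by backward induction. A minimal sketch follows; the three-mode transition and cost tables are an invented example (not the Mountain Car abstraction), chosen so the optimal policy is obvious.

```python
def backward_induction(P, c, S):
    """Solve the finite-horizon discrete problem by dynamic programming.

    P[a][i][k] = p_ik(a): probability action a moves mode i to mode k.
    c[i][a]    = c_d(a, zeta=i): discrete state-action cost.
    Returns the greedy policy at s = 0 and the optimal values V_0.
    """
    K = len(c)
    V = [0.0] * K                       # terminal values V_S = 0
    policy = [0] * K
    for _ in range(S):                  # sweep s = S-1, ..., 0
        Q = [[c[i][a] + sum(P[a][i][k] * V[k] for k in range(K))
              for a in range(K)] for i in range(K)]
        V = [min(Q[i]) for i in range(K)]
        policy = [min(range(K), key=lambda a, i=i: Q[i][a]) for i in range(K)]
    return policy, V

# Invented 3-mode example: action a deterministically enters mode a;
# every transition costs 1 except remaining in mode 2, which is free.
K = 3
P = [[[1.0 if k == a else 0.0 for k in range(K)] for i in range(K)]
     for a in range(K)]
c = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 0.0]]
```

With a horizon of a few steps, the planner learns to head to mode 2 from everywhere and stay there, i.e. `policy == [2, 2, 2]` with values `[1, 1, 0]`.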

Our overall approximate control law is then given by choosing the action of the continuous controller $\pi_{ij}(x)$ suggested by the discrete policy $\pi_d(i(x))$, or more concisely, $\pi(x) = \pi_{i(x), \pi_d^* \circ i(x)}(x)$, where $i$ calculates the discrete label (MAP estimate) for the continuous state $x$. In the next sections we describe the methods used to solve the high- and low-level problems.
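Operationally, the composed control law is a table lookup followed by a cached linear feedback. A sketch with hypothetical names (one continuous dimension, two regions, hand-picked gains and targets):

```python
# Hypothetical hybrid dispatch: region_of plays the role of i(x), pi_d maps
# each region to the sub-goal region chosen by the discrete planner, and
# gains caches the (K, x*) pair of the offline LQR solution for each (i, j).
def make_hybrid_policy(region_of, pi_d, gains):
    def policy(x):
        i = region_of(x)                  # discrete label (MAP estimate)
        j = pi_d[i]                       # sub-goal region from the planner
        K, x_star = gains[(i, j)]         # cached LQR gain and target state
        return -K * (x - x_star)          # continuous feedback control
    return policy

# Two regions split at x = 0; the planner always aims at the other region.
pi = make_hybrid_policy(
    region_of=lambda x: 0 if x < 0 else 1,
    pi_d={0: 1, 1: 0},
    gains={(0, 1): (0.8, 0.5), (1, 0): (0.8, -0.5)},
)
```

Because only `pi_d` is recomputed online, re-planning reduces to swapping entries in a small table while the continuous feedback remains reactive at every step.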

0.A.3 Offline Low Level Problems: Linear Quadratic Regulator (LQR)

Rather than solve the first-exit problem directly, we formulate an approximate problem by finding trajectories that end at specific ‘control priors’ (see 0.A.6). Recall the low-level problem given by:

$\pi_{ij}(x_0) = \arg\min_{\pi, S} J_{ij}(\pi, x_0, S)$ (31)
$J_{ij}(\pi, x_0, S) = \mathbb{E}_{\pi, x_0}\left[\sum_{t=0}^{S} c(x_t, u_t)\right]$ (32)
$\text{s.t. } (x_t, u_t) \in H_i$ (33)
$\text{s.t. } c(x, u) = 0 \text{ when } (x, u) \in \partial H_{ij}$ (34)

In order to approximate this problem with one solvable by a finite-horizon LQR controller, we adopt a fixed goal state $x^* \in H_j$ and impose costs $c_t(x_t, u_t) = u_t^T R u_t$ and $c_S(x_S, u_S) = (x_S - x^*)^T Q_f (x_S - x^*)$. Formally, we solve

$\pi_{ij}(x_0) = \arg\min_{\pi, S} J_{ij}(\pi, x_0, S)$ (35)
$J_{ij}(\pi, x_0, S) = \mathbb{E}_{\pi, x_0}\left[(x_S - x^*)^T Q_f (x_S - x^*) + \sum_{t=0}^{S-1} u_t^T R u_t\right]$ (36)

by integrating the discrete Riccati equation backwards. Numerically, we found that optimising over different time horizons made little difference to the solution, so we opted instead to specify a fixed horizon (hyperparameter). These solutions are recomputed offline every time the linear system matrices change.

Designing the cost matrices Instead of imposing the state constraints explicitly, we record a high cost which informs the discrete controller to avoid them. In order to approximate the constrained input, we choose a suitably large control cost $R = rI$. We adopted this approach for the sake of simplicity, potentially accepting a good deal of sub-optimality. However, we believe more involved methods for solving input-constrained LQR could be used in future, e.g. [3], especially because we compute these solutions offline.

0.A.4 Active Inference Interpretation

0.A.4.1 Expected Free Energy

Here we express the fully-observed continuous (discrete-time) active inference controller, without mean-field assumptions, and show that it reduces to a continuous quadratic regulator. Suppose we have a linear state-space model:

$x_{t+1} = A x_t + B u_t + \epsilon_t$ (38)

and a prior preference over trajectories $\tilde{p}(x_{1:T}) \sim N(x_T; x_f, Q_f^{-1})$. Active inference specifies that the agent minimises

G(\pi) = \mathbb{E}_{q(x_{1:T}, u_{1:T}; \pi)}\left[ -\ln \tilde{p}(x_{1:T}, u_{1:T}) + \ln q(x_{1:T}, u_{1:T}; \pi) \right] \qquad (39)

Note that, since all states are fully observed, there is no ambiguity term. Here $\tilde{p}(x_{1:T}, u_{1:T}) \propto \tilde{p}(x_{1:T}) \, p(x_{1:T} \mid u_{1:T}) \, p(u_{1:T})$, where the central term is the dynamics model and the prior over controls is also Gaussian, $p(u_{1:T}) = \prod_t N(u_t; 0, R^{-1})$. Finally, we adopt $q(x_{1:T}, u_{1:T}; \pi) = p(x_{1:T} \mid u_{1:T}) \prod_t \pi_t(u_t \mid x_t)$, where we parametrise the variational distributions as $\pi_t \sim N(u_t; K_t x, \Sigma_t^q)$ (here $K_t, \Sigma_t^q$ are parameters to be optimised). The expected free energy thus simplifies to:

G(\pi) = \mathbb{E}_{q(x_{1:T}, u_{1:T}; \pi)}\left[ (x_T - x_f)^T Q_f (x_T - x_f) + \sum_t u_t^T (R + \Pi_t^u) u_t \right] + \ln \det \Pi \qquad (40)

0.A.4.2 Dynamic Programming (HJB)

We proceed by dynamic programming. Let the ‘value’ function be

V(x_k) = \min_{\pi} \mathbb{E}_{q(x_{k+1:T}, u_{k:T} \mid x_k, \pi)}\left[ (x_T - x_f)^T Q_f (x_T - x_f) + \sum_{t=k}^{T} u_t^T (R + \Pi_t^u) u_t \right] + \ln \det \Pi \qquad (41)

As usual, the value function satisfies a recursive property:

V(x_k) = \min_{\pi} \mathbb{E}_{q(x_{k+1}, u_k \mid x_k, \pi)}\left[ u_k^T (R + \Pi_k^u) u_k + V(x_{k+1}) \right] + \ln \det \Pi \qquad (42)

We introduce the ansatz $V(x_k) = x_k^T S_k x_k$, leading to

x_k^T S_k x_k = \min_{\pi} \mathbb{E}_{q(x_{k+1}, u_k \mid x_k, \pi)}\left[ u_k^T (R + \Pi_k^u) u_k + x_{k+1}^T S_{k+1} x_{k+1} \right] + \ln \det \Pi \qquad (43)

Finally, we take expectations, which are available in closed form, and solve for $\Sigma_k$ and $K_k$:

x_k^T S_k x_k = \min_{K_k, \Pi_k} \; x_k^T K_k^T R K_k x_k + \operatorname{tr}(\Sigma_k^u (R + \Pi_k^u)) \qquad (44)
\qquad\qquad + \; x_k^T (A + B K_k)^T S_{k+1} (A + B K_k) x_k + \operatorname{tr}(\Sigma_k^x S_{k+1}) + \ln \det \Pi \qquad (45)

Solving for $\Sigma_k$ and substituting,

\Sigma_k^q = (R + \Pi_k^u)^{-1} \qquad (46)
\implies S_k = \min_{K_k} K_k^T R K_k + (A + B K_k)^T S_{k+1} (A + B K_k) \qquad (47)
K_k = -(R + B^T S_{k+1} B)^{-1} B^T S_{k+1} A \qquad (48)

where $S_k$ follows the discrete algebraic Riccati equation (DARE).

Thus we recover $\pi_t(u \mid x) \sim N(K_t x, \Sigma_t)$, where $K_t$ is the traditional LQR gain and $\Sigma_t$ solves $\Sigma_t = (R + \Pi_t)^{-1}$. Here we use the deterministic maximum a posteriori (MAP) controller $K_t x$. However, the collection of posterior variance estimates adds a different total cost depending on the variance inherent in the dynamics, which can be lifted to the discrete controller.
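A minimal numerical check of the recursion above can be written in a few lines; the scalar system and the control precision $\Pi^u$ are illustrative choices, not values from the paper:

```python
import numpy as np

# Scalar check of Eqs. (46)-(48): backward Riccati sweep plus the
# active-inference posterior policy variance Sigma_q = (R + Pi_u)^-1.
A = np.array([[1.0]]); B = np.array([[1.0]])
Qf = np.array([[1.0]]); R = np.array([[1.0]])
Pi_u = np.array([[10.0]])            # assumed control noise precision

S = Qf.copy()
gains = []
for _ in range(4):                   # horizon T = 4, iterating k = T-1, ..., 0
    K = -np.linalg.solve(R + B.T @ S @ B, B.T @ S @ A)    # Eq. (48)
    S = K.T @ R @ K + (A + B @ K).T @ S @ (A + B @ K)     # Eq. (47)
    gains.append(K)
gains.reverse()                      # now ordered K_0, ..., K_{T-1}
Sigma_q = np.linalg.inv(R + Pi_u)    # Eq. (46): posterior policy variance
u_map = gains[0] @ np.array([[1.0]]) # MAP control K_0 x at x_0 = 1
```

For this scalar system the recursion satisfies $1/S_k = 1/Q_f + (T - k)/R$, so with $T = 4$ we expect $S_0 = 0.2$ and $K_0 = -0.2$, while the policy variance is simply $1/(R + \Pi^u) = 1/11$.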

0.A.4.3 As Belief Propagation

A different perspective is message passing: we wish to calculate the marginals $p(x_k)$ and $p(x_k, u_k)$, tilted by the preference distribution $\tilde{p}(x_k)$ and the control prior $p(u)$. For this we can integrate backwards using the recursive formula

b(x_k) = \int b(x_k, u_k) \, du_k \qquad (49)
b(x_k, u_k) = \int \tilde{p}(x_k) \, p(x_{k+1} \mid x_k, u_k) \, p(u_k) \, b(x_{k+1}) \, dx_{k+1} \qquad (50)

from which we can extract the control law $p(u_k \mid x_k) = b(x_k, u_k)/b(x_k)$. To proceed, we use the variational method to marginalise:

-\ln b(x_k) = \min_q \mathbb{E}_q\left[ -\ln \tilde{p}(x_k, u_k) \, p(x_{k+1} \mid x_k, u_k) \, b(x_{k+1}) + \ln q(x_{k+1}, u_k \mid x_k) \right] \qquad (51)

Making the same assumption as above about the variational distributions, and introducing the ansatz $b(x_k) \sim N(x_k; 0, S_k)$, leads to the same equation as Eq. 43 up to irrelevant constants.

0.A.5 Online high level problem

The high-level problem is a discrete MDP with a ‘known’ model, so the usual RL techniques (approximate dynamic programming, policy iteration) apply. Here, however, we choose to use a model-based algorithm with a receding horizon, inspired by active inference, allowing us to easily incorporate exploration bonuses.

Let the Bayesian MDP be given by $\mathcal{M}_B = (S, A, P_a, R, P_\theta)$, where $p_a(s_{t+1} \mid s_t, a_t, \theta) \sim Cat(\theta_{as})$ and $p(\theta_{as}) \sim Dir(\alpha)$. We estimate the open-loop reward plus optimistic information-theoretic exploration bonuses.

Active Inference conversion. We adopt the active inference framework for dealing with exploration. Accordingly, we adopt the notation $\ln \tilde{p}(s_t, a_t) = R(s_t, a_t)$, refer to this ‘distribution’ as the goal prior [23], and optimise over open-loop policies $\pi = (a_0, \ldots, a_T)$.

G(a_{1:T}, s_0) = \mathbb{E}\left[ \sum_{t=0}^{T} R(s_t, a_t) + IG_p + IG_s \;\middle|\; s_0, a_{1:T} \right] \qquad (52)

where the parameter information gain is given by $IG_p = D_{KL}[p_{t+1}(\theta) \mid\mid p_t(\theta)]$, with $p_t(\theta) = p(\theta \mid s_{0:t})$. In other words, we add a bonus when we expect the posterior to diverge from the prior, which happens exactly for the transitions we have observed least [19].
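For a single Dirichlet row of pseudo-counts $\alpha$, the information gain from observing outcome $j$ once has the closed form $D_{KL}[Dir(\alpha + e_j) \mid\mid Dir(\alpha)] = \ln \alpha_0 - \ln \alpha_j + \psi(\alpha_j + 1) - \psi(\alpha_0 + 1)$, where $\alpha_0 = \sum_i \alpha_i$ and $\psi$ is the digamma function. A small stand-alone sketch (not the paper's implementation, which uses the pymdp conventions of [19]):

```python
import math

def digamma(x):
    # series approximation of the digamma function psi(x), x > 0
    r = 0.0
    while x < 6.0:                     # recurrence: psi(x) = psi(x + 1) - 1/x
        r -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)               # asymptotic expansion at large x
    return r + math.log(x) - 0.5 / x - inv2 * (1/12 - inv2 * (1/120 - inv2 / 252))

def parameter_info_gain(alpha, j):
    """KL[Dir(alpha + e_j) || Dir(alpha)]: the parameter information
    gain if transition outcome j is observed once."""
    a0 = sum(alpha)
    return math.log(a0) - math.log(alpha[j]) + digamma(alpha[j] + 1) - digamma(a0 + 1)

# rarely observed transitions yield a larger exploration bonus
rare = parameter_info_gain([10.0, 1.0], j=1)
frequent = parameter_info_gain([10.0, 1.0], j=0)
```

With symmetric counts $\alpha = (1, 1)$ the bonus is $\ln 2 - 1/2 \approx 0.193$, and it shrinks as the corresponding count grows, which is the behaviour the planner exploits.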

We also have a state information-gain term, $IG_s = D_{KL}[p_{t+1}(s_{t+1}) \mid\mid p_t(s_{t+1})]$. In this (fully observed) case, $p_{t+1}(s_{t+1}) = \delta_s$ is a one-hot vector, leaving the term $\mathbb{E}_t[-\ln p_t(s_{t+1})]$, which leads to a maximum entropy term [19].

We calculate the above with Monte Carlo sampling, which is feasible due to the relatively small number of modes. Local approximations such as Monte Carlo Tree Search could easily be integrated in order to scale up to more realistic problems. Alternatively, for relatively stationary environments, we could instead adopt approximate dynamic programming methods for more habitual actions.
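The receding-horizon evaluation can be sketched as enumerating open-loop policies and scoring each by Monte Carlo rollouts. The toy three-state MDP below is purely illustrative (it stands in for the learned discrete mode dynamics, and the information-gain bonuses are omitted for brevity):

```python
import itertools
import random

random.seed(0)

def step(s, a):
    # toy mode dynamics: action 1 advances one state with probability 0.9,
    # action 0 (and failed advances) stays put; states are 0, 1, 2
    if a == 1 and random.random() < 0.9:
        return min(s + 1, 2)
    return s

def reward(s):
    return 1.0 if s == 2 else 0.0      # goal prior concentrated on state 2

def evaluate(policy, s0=0, n_samples=500):
    """Monte Carlo estimate of the expected return of an open-loop policy."""
    total = 0.0
    for _ in range(n_samples):
        s = s0
        for a in policy:
            s = step(s, a)
            total += reward(s)
    return total / n_samples

# enumerate all open-loop policies of length 2 and pick the best
best = max(itertools.product([0, 1], repeat=2), key=evaluate)
```

Exhaustive enumeration is only viable because the discrete state and action spaces are small; this is precisely where MCTS-style local approximations would be substituted for larger problems.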

0.A.6 Generating continuous control priors

In order to generate control priors for the LQR controller corresponding to each of the discrete states, we must find a continuous state $x_i$ which maximises the probability of being in a desired $z$:

x_i = \underset{x}{\arg\max} \; P(z = i \mid x, u) \qquad (53)

For this we perform a numerical optimisation to maximise this probability. This probability distribution $P(z = i \mid x)$ is a softmax function, which for the $i$-th class is defined as:

\sigma(v_i) = \frac{\exp(v_i)}{\sum_j \exp(v_j)}, \qquad v_i = w_i \cdot x + r_i \qquad (54)

where $w_i$ is the $i$-th row of the weight matrix, $x$ is the input and $r_i$ is the $i$-th bias term. The update used in the gradient ascent optimisation is:

x \leftarrow x + \eta \nabla_x \sigma(v_i) \qquad (55)

where $\eta$ is the learning rate and the gradient of the softmax function with respect to the input vector $x$ is given by:

\nabla_x \sigma(v_i) = \frac{\partial \sigma(v_i)}{\partial v} \cdot \frac{\partial v}{\partial x} = \sigma(v_i)\,(e_i - \sigma(v)) \cdot W \qquad (56)

in which $\sigma(v)$ is the vector of softmax probabilities, and $e_i$ is the standard basis vector with 1 in the $i$-th position and 0 elsewhere. The gradient ascent continues until the probability $P(z = i \mid x)$ exceeds a specified threshold $\theta$, which we set to 0.7. This threshold enforces a stopping criterion, which is required in cases where the region $z$ is unbounded.
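The procedure of Eqs. (53)-(56) amounts to a short gradient-ascent loop. A minimal sketch, with illustrative rSLDS-style hyperplanes (the weights, biases, and learning rate are assumptions, not the paper's values):

```python
import math

def softmax(v):
    m = max(v)                                   # subtract max for stability
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]

def find_region_target(W, r, i, eta=0.5, threshold=0.7, max_iters=5000):
    """Gradient ascent on P(z = i | x), Eqs. (53)-(56), starting from x = 0
    and stopping once the probability exceeds the threshold."""
    x = [0.0] * len(W[0])
    p = softmax([sum(w * xd for w, xd in zip(wi, x)) + ri
                 for wi, ri in zip(W, r)])
    for _ in range(max_iters):
        if p[i] > threshold:
            break
        # Eq. (56): grad_x sigma_i = sigma_i * (w_i - sum_j sigma_j w_j)
        mean_w = [sum(p[j] * W[j][d] for j in range(len(W)))
                  for d in range(len(x))]
        x = [xd + eta * p[i] * (W[i][d] - mean_w[d])
             for d, xd in enumerate(x)]
        p = softmax([sum(w * xd for w, xd in zip(wi, x)) + ri
                     for wi, ri in zip(W, r)])
    return x, p[i]

# three hypothetical linear regions partitioning a 2-D state space
W = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]
r = [0.0, 0.0, 0.0]
x_target, prob = find_region_target(W, r, i=0)
```

Without the 0.7 stopping threshold the iterates would keep drifting outward whenever region $i$ is unbounded, which is exactly the failure mode the stopping criterion guards against.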

0.A.7 Model-free RL baselines


Component                              Value
Q-network                              3 × 256 × 256 × 256 × 2
Policy network                         2 × 256 × 256 × 256 × 2
Entropy regularization coeff           0.2
Learning rates (Q-net + policy-net)    3e-4
Batch size                             60

Component                              Value
Feature processing                     StandardScaler, RBF kernels (4 × 100)
Value network                          4001 parameters (1 dense layer)
Policy network                         802 parameters (2 dense layers)
Gamma                                  0.95
Lambda                                 1e-5
Learning rates (policy + value)        0.01

0.A.8 Model-based RL baseline


Component                                          Value
Q-network                                          1 hidden layer, 48 units, ReLU
Dynamics predictor network (fully connected)       2 hidden layers (each 24 units), ReLU
ε minimum                                          0.01
ε decay                                            0.9995
Reward discount                                    0.99
Learning rates (Q-net / dynamics-net)              0.05 / 0.02
Target Q-network update interval                   8
Initial exploration-only steps                     10000
Minibatch size (Q-network)                         16
Minibatch size (dynamics predictor network)        64
Number of recent states to fit probability model   50

References

  • [1] Abdulsamad, H., Peters, J.: Hierarchical decomposition of nonlinear dynamics and control for system identification and policy distillation. In: Bayen, A.M., Jadbabaie, A., Pappas, G., Parrilo, P.A., Recht, B., Tomlin, C., Zeilinger, M. (eds.) Proceedings of the 2nd Conference on Learning for Dynamics and Control. Proceedings of Machine Learning Research, vol. 120, pp. 904–914. PMLR (10–11 Jun 2020)
  • [2]Abdulsamad, H., Peters, J.: Model-based reinforcement learning via stochastic hybrid models. IEEE Open Journal of Control Systems 2, 155–170 (2023)
  • [3] Bemporad, A., Borrelli, F., Morari, M.: Piecewise linear optimal controllers for hybrid systems. In: Proceedings of the 2000 American Control Conference. ACC (IEEE Cat. No. 00CH36334), vol. 2, pp. 1190–1194 (2000)
  • [4]Bemporad, A., Morari, M., Dua, V., Pistikopoulos, E.N.: The explicit linear quadratic regulator for constrained systems. Automatica 38(1), 3–20 (2002)
  • [5]Block, A., Jadbabaie, A., Pfrommer, D., Simchowitz, M., Tedrake, R.: Provable guarantees for generative behavior cloning: Bridging low-level stability and high-level behavior (2023)
  • [6]Borrelli, F., Bemporad, A., Fodor, M., Hrovat, D.: An mpc/hybrid system approach to traction control. IEEE Transactions on Control Systems Technology 14(3), 541–552 (2006)
  • [7] Coulom, R.: Efficient selectivity and backup operators in Monte-Carlo tree search. In: van den Herik, H.J., Ciancarini, P., Donkers, H.H.L.M.J. (eds.) Computers and Games. pp. 72–83. Springer Berlin Heidelberg, Berlin, Heidelberg (2007)
  • [8]Da Costa, L., Parr, T., Sajid, N., Veselic, S., Neacsu, V., Friston, K.: Active inference on discrete state-spaces: A synthesis. Journal of Mathematical Psychology 99, 102447 (2020)
  • [9]Daniel, C., van Hoof, H., Peters, J., Neumann, G.: Probabilistic inference for determining options in reinforcement learning. Machine Learning 104(2), 337–357 (Sep 2016)
  • [10] Dayan, P., Hinton, G.E.: Feudal reinforcement learning. In: Hanson, S., Cowan, J., Giles, C. (eds.) Advances in Neural Information Processing Systems. vol. 5. Morgan-Kaufmann (1992)
  • [11] Fox, E., Sudderth, E., Jordan, M., Willsky, A.: Nonparametric Bayesian learning of switching linear dynamical systems. In: Koller, D., Schuurmans, D., Bengio, Y., Bottou, L. (eds.) Advances in Neural Information Processing Systems. vol. 21. Curran Associates, Inc. (2008)
  • [12] Friston, K., Da Costa, L., Tschantz, A., Kiefer, A., Salvatori, T., Neacsu, V., Koudahl, M., Heins, R., Sajid, N., Markovic, D., Parr, T., Verbelen, T., Buckley, C.: Supervised structure learning (Dec 2023)
  • [13] Friston, K.J., Parr, T., de Vries, B.: The graphical brain: belief propagation and active inference. Network Neuroscience 1(4), 381–414 (2017)
  • [14] Friston, K.J., Sajid, N., Quiroga-Martinez, D.R., Parr, T., Price, C.J., Holmes, E.: Active listening. Hearing Research 399, 107998 (2021)
  • [15]Ghahramani, Z., Hinton, G.E.: Variational Learning for Switching State-Space Models. Neural Computation 12(4), 831–864 (04 2000)
  • [16] Gobet, F., Lane, P., Croker, S., Cheng, P., Jones, G., Oliver, I., Pine, J.: Chunking mechanisms in human learning. Trends in Cognitive Sciences 5, 236–243 (Jul 2001)
  • [17]Gou, S.Z., Liu, Y.: DQN with model-based exploration: efficient learning on environments with sparse rewards. CoRR abs/1903.09295 (2019)
  • [18]Hafner, D., Lee, K.H., Fischer, I., Abbeel, P.: Deep hierarchical planning from pixels (2022)
  • [19]Heins, C., Millidge, B., Demekas, D., Klein, B., Friston, K., Couzin, I., Tschantz, A.: pymdp: A python library for active inference in discrete state spaces. arXiv preprint arXiv:2201.03904 (2022)
  • [20]Koudahl, M.T., Kouw, W.M., de Vries, B.: On Epistemics in Expected Free Energy for Linear Gaussian State Space Models. Entropy 23(12), 1565 (Dec 2021)
  • [21] LaValle, S.M.: Planning Algorithms, chap. 2. Cambridge University Press, Cambridge (2006)
  • [22]Linderman, S.W., Miller, A.C., Adams, R.P., Blei, D.M., Paninski, L., Johnson, M.J.: Recurrent switching linear dynamical systems (2016)
  • [23]Millidge, B., Tschantz, A., Seth, A.K., Buckley, C.L.: On the relationship between active inference and control as inference (2020)
  • [24]Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M.A., Fidjeland, A.K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., Hassabis, D.: Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015)
  • [25]Murphy, K.P.: Machine learning: a probabilistic perspective. MIT press (2012)
  • [26]Newell, A., Simon, H.A.: Human Problem Solving. Prentice-Hall, Englewood Cliffs, NJ (1972)
  • [27] OpenAI: Continuous Mountain Car environment (2021), accessed: 2024-05-25
  • [28]Parr, T., Pezzulo, G., Friston, K.: Active Inference: The Free Energy Principle in Mind, Brain, and Behavior. MIT Press (2022)
  • [29] Parr, T., Friston, K.J.: The discrete and continuous brain: from decisions to movement—and back again. Neural Computation 30(9), 2319–2347 (2018)
  • [30]Parr, T., Friston, K.J.: The computational pharmacology of oculomotion. Psychopharmacology 236(8), 2473–2484 (2019)
  • [31]Priorelli, M., Stoianov, I.P.: Hierarchical hybrid modeling for flexible tool use (2024)
  • [32]Schwenzer, M., Ay, M., Bergs, T., Abel, D.: Review on model predictive control: an engineering perspective. The International Journal of Advanced Manufacturing Technology 117(5), 1327–1349 (Nov 2021)
  • [33]Sontag, E.: Nonlinear regulation: The piecewise linear approach. IEEE Transactions on Automatic Control 26(2), 346–358 (1981)
  • [34] Sutton, R.S., Precup, D., Singh, S.: Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112(1), 181–211 (1999)
  • [35]Tessler, C., Givony, S., Zahavy, T., Mankowitz, D., Mannor, S.: A deep hierarchical approach to lifelong learning in minecraft. Proceedings of the AAAI Conference on Artificial Intelligence 31(1) (Feb 2017)
  • [36] Vezhnevets, A.S., Osindero, S., Schaul, T., Heess, N., Jaderberg, M., Silver, D., Kavukcuoglu, K.: FeUdal networks for hierarchical reinforcement learning. In: Precup, D., Teh, Y.W. (eds.) Proceedings of the 34th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 70, pp. 3540–3549. PMLR (06–11 Aug 2017)
  • [37]Zoltowski, D.M., Pillow, J.W., Linderman, S.W.: Unifying and generalizing models of neural dynamics during decision-making (2020)