Action-consequence as a learning policy

23 Feb 2026 · Jack Fan

I've always been deeply interested in psychology, particularly its behavioural, social, and biological perspectives, and I came to AI/ML and CS from a computational neuroscience research background. As a result, ideas from both fields tend to carry over when I brainstorm about problems I encounter in ML. Since I also consider myself reasonably good at learning and adapting to new scenarios and material (and enjoy thinking metacognitively about that process), it was a natural progression to wonder how biologically-inspired ideas about learning might apply to the current zeitgeist of continual learning.

Setting up the environment

If you've ever seen a Markov decision process (MDP) or a partially observable MDP (POMDP), the definition of the action-consequence environment will look very familiar, the latter in particular. Specifically, we begin by considering a simple closed system that consists only of an actor and an environment, with no external actions, updates, or stimuli.

Recall first that the set of possible actions $\mathcal{A}_\text{possible}$ is conditioned on the actor's current state. For the purposes of our action-consequence model, we introduce the additional idea that any action the actor chooses to execute immediately generates a consequence: information conditioned on the transition $\mathcal{T}$ from the current state to the new state that occurs as a direct result of the chosen action. Since all consequences are direct functions of the transition between states, we can write $\mathcal{T}_\text{complete} = \{ \mathcal{C}, \mathcal{T} \}$ for any transition. Under this formalisation, one can consider any task as a sequence of states, transitions, and actions: $(s_0, \{ c_0, t_0 \}, a_0), (s_1, \{ c_1, t_1 \}, a_1), (s_2, \{ c_2, t_2 \}, a_2), \dots$
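The setup above can be sketched in code. This is a minimal, illustrative implementation (none of these class or function names come from an existing library): `transition` plays the role of $\mathcal{T}$, `consequence_fn` derives $c_t$ from the transition so that the pair forms $\mathcal{T}_\text{complete}$, and `actions_fn` gives $\mathcal{A}_\text{possible}$ conditioned on the current state.

```python
from dataclasses import dataclass
from typing import Callable, Hashable

State = Hashable
Action = Hashable

@dataclass
class Step:
    """One element of the trajectory: (s_t, {c_t, t_t}, a_t)."""
    state: State
    action: Action
    next_state: State
    consequence: dict  # information derived from the transition

class ActionConsequenceEnv:
    """Closed system of actor + environment; no external stimuli.

    Illustrative sketch: every executed action immediately yields a
    consequence computed from the state transition it caused.
    """
    def __init__(self,
                 transition: Callable[[State, Action], State],
                 consequence_fn: Callable[[State, Action, State], dict],
                 actions_fn: Callable[[State], list]):
        self.transition = transition            # T
        self.consequence_fn = consequence_fn    # C, a function of the transition
        self.actions_fn = actions_fn            # A_possible, conditioned on state

    def step(self, state: State, action: Action) -> Step:
        # Only actions available in the current state may be executed
        assert action in self.actions_fn(state), "action not in A_possible(state)"
        next_state = self.transition(state, action)
        consequence = self.consequence_fn(state, action, next_state)
        return Step(state, action, next_state, consequence)

# Toy usage: states are integers, actions move the actor by ±1,
# and the consequence simply reports the displacement.
env = ActionConsequenceEnv(
    transition=lambda s, a: s + a,
    consequence_fn=lambda s, a, s2: {"delta": s2 - s},
    actions_fn=lambda s: [-1, 1],
)
step = env.step(0, 1)
```

Rolling `step` forward repeatedly produces exactly the $(s_t, \{c_t, t_t\}, a_t)$ sequence described above.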

Taking this one step further, we can consider the human learning process as an update to the actor's general action policy $\pi$ supported on $\mathcal{A}$, based on the consequence $c_t$ at time $t$.

Considering human learning

One can consider a general heuristic of human learning where, given an action and a consequence, the actor can glean some sort of lesson from their combination (whether it's useful or relevant is another question). The simplest example is a baby putting their hand on a hot stove: because humans have an inherent, evolved understanding that pain is negative, and the action of touching the hot stove produces the consequence of pain, the brain quickly and instinctively concludes that the stove is not viable to touch and should not be touched.
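The hot-stove story can be made concrete as a policy update driven by consequence valence. This is a deliberately simple sketch (a tabular softmax policy with a hand-picked learning rate, not a standard named algorithm): a single strongly negative consequence should be enough to make the offending action very unlikely.

```python
import math
from collections import defaultdict

class ConsequencePolicy:
    """Tabular softmax policy over actions, updated from consequence valence.

    `valence` is a signed scalar the actor extracts from a consequence c_t
    (e.g. pain -> strongly negative). `lr` controls how sharply one
    consequence moves the action preferences. Both are illustrative choices.
    """
    def __init__(self, actions, lr=1.0):
        self.actions = list(actions)
        self.prefs = defaultdict(float)  # preference (logit) per action
        self.lr = lr

    def probs(self):
        """Softmax over preferences: the current policy pi."""
        z = [math.exp(self.prefs[a]) for a in self.actions]
        total = sum(z)
        return {a: v / total for a, v in zip(self.actions, z)}

    def update(self, action, valence):
        """Shift the executed action's preference by the consequence's valence."""
        self.prefs[action] += self.lr * valence

# One painful consequence is enough to suppress the action.
policy = ConsequencePolicy(["touch_stove", "avoid_stove"], lr=5.0)
policy.update("touch_stove", valence=-1.0)  # pain: strongly negative
p = policy.probs()
```

After the single update, `p["touch_stove"]` collapses to a small probability, mirroring how instinctively negative consequences reshape behaviour in one shot.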

A quick primer on associative learning

The idea of associative learning is something I've written about in a past post as another aspect of the human learning paradigm. Specifically, it addresses the idea that any new learning can be distilled into an association between two pieces of information, each of which is either already within the learner's field of awareness or partially or completely outside of it. In this sense, as inspired by this blogpost on continual learning paradigms, learning can essentially be reduced to effectively extracting information, searching across your previous atoms of information, and making associations either between them or to the new extract from the current episode.

Associating consequences

Taking this idea of associative learning together with action-consequence distills learning from actions and consequences into two problems: first, ingesting a consequence and evaluating it to extract as much information as possible; and second, correctly searching through past knowledge and actions to figure out how to assign credit to them, then integrating these new credits either as updates to past learnings or inductive biases, or as completely new information.
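The second problem, searching past atoms of information and deciding whether to update an existing association or store a new one, can be sketched as follows. The Jaccard similarity over feature sets and the matching threshold are illustrative stand-ins for whatever real similarity structure a learner uses; nothing here comes from an existing library.

```python
def similarity(a: frozenset, b: frozenset) -> float:
    """Toy Jaccard similarity between two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

class AssociativeMemory:
    """Sketch of consequence integration over past atoms of information.

    On each new (features, credit) pair: search existing atoms; if any is
    similar enough, fold the credit into it (an update to past learning),
    otherwise store it as completely new information.
    """
    def __init__(self, threshold: float = 0.5):
        self.atoms = []          # list of (features: frozenset, credit: float)
        self.threshold = threshold

    def integrate(self, features, credit: float):
        features = frozenset(features)
        matched = False
        for i, (f, c) in enumerate(self.atoms):
            if similarity(features, f) >= self.threshold:
                self.atoms[i] = (f, c + credit)  # update past learning
                matched = True
        if not matched:
            self.atoms.append((features, credit))  # new association

# Two similar painful episodes merge into one atom; an unrelated
# positive episode is stored separately.
mem = AssociativeMemory()
mem.integrate({"stove", "touch"}, credit=-1.0)
mem.integrate({"stove", "touch", "kitchen"}, credit=-1.0)  # similar: merged
mem.integrate({"apple", "eat"}, credit=+1.0)               # unrelated: new atom
```

The point of the sketch is the branching itself: credit assignment is a search-then-associate operation, not a blind append.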

On existing ML and RL

I believe that it is no secret that today's models suck at reasoning within an action-consequence paradigm.

One pressing example, from the research I'm doing on compositional and self-differentiating agents, can be found in this paper about collaborative LLM assistants. The idea is that because LLMs are optimised at each response to align as closely as possible with the "humanlike speech" RLHF policy they're given, they have trouble reasoning over longer-horizon tasks and multiple turns (they are never explicitly tuned for those behaviours). In my opinion, this is a direct consequence of their struggle to reason correctly across multiple turns and to leverage all of those tokens to assign credit accurately; as a result, they're unable to reason about past and external actions (whether or not these are explicitly introduced within their context window), and they struggle to integrate consequences in the right way and produce the right actions accordingly.

In some sense, this claim is just a more nuanced version of the hypothesis that LLMs are simply "stochastic parrots": because they lack the ability to understand the world they describe with their tokens, they rely on the learned statistical associations in their training data, plus the sparse reward from post-training, to pattern-match to something resembling effective first-principles reasoning. Indeed, this is part of the core argument I personally believe regarding model architectures at this point (and also why I lean towards recurrent/SSM-based/non-embedding architectures that can represent the world more powerfully): how can we build a system that effectively distills, explores, and betters our world if it doesn't have even a simple understanding of how effects propagate within it in the first place?

Beyond

Perhaps this is why a large contingent of researchers today is strongly focussed on dynamics modelling and world modelling (a focus I tend to agree with). I use the action-consequence paradigm as a vehicle to inject one more idea that I believe is relevant to the model improvement policy: planning, and the "chess game of the world".

To elucidate this final claim, return again to a metacognitive reflection: we humans tend to decide by imagining ourselves in a particular scenario and, based on our internal models, planning around what we think will happen. Yet human planning and information ingestion is imperfect at best. So does it not make sense, from first principles, that the capability we seek in AI systems will come from a more thorough, in-depth understanding of the actions and consequences we seek to propagate in the world?