Two weeks ago, I was in Seattle for a workshop hosted by the Gates Foundation at the AWS Skills Centre, focussed on the discussion and prototyping of novel memory architectures, specifically for use cases in education and healthcare. The workshop was a callback to some problems I had been thinking about around memory as a whole towards the end of 2025: as we move away from static, modular software into dynamic, self-evolving systems, efficient memory, personalisation, and evolution policies will increasingly take centre stage. This post, then, is essentially a cleaned-up, formalised version of a draft I wrote back in November but never got around to properly publishing. It represents a postmortem of both the prototype that we built (linked in this repo) and of the learnings from the workshop's discussions as a whole. It's also a fun comparison between my writing ~4 months ago (lightly edited) and now.
Introduction: on personalised software in the age of AI
Software in the status quo has developed to the point where it can, on its own, generate new content to entertain its users, reason effectively to complete complex tasks that push the envelope of human expertise, and, in some limited capacities, learn and improve from itself. Indeed, it is becoming increasingly apparent that the growing capabilities of AI have made it possible to personalise any given user’s experience to an unprecedented degree.
In this vein, it follows that any human-AI interaction, in order to obey this trend towards personalisation and individuality, must index strongly on specific, granular context that surrounds each and every user in their everyday workflows. This idea of “continuous holistic context” is one of the core principles that I believe will inform agentic assistants and applications in the future, and is consistent with the trends at large in the field of AI development (see e.g. memory-as-a-service, self-evolving and even self-training systems).
These ideas suggest that some sort of dedicated, per-user memory system is needed: machinery that allows agents to carefully understand and curate each of your actions into a more complete picture of your computer use. The rest of this post is essentially a first-principles derivation of one memory architecture that allows today's systems to extract and utilise the most useful context from a human's computer use.
Our motivation: true, explicit adaptation of user preferences
We begin by providing a generalised picture of the architecture of agents and agentic systems that exist today; we will also go a bit more in-depth in regards to the role that user preferences play in the performance of these systems.
A standard agent architecture
In most basic agent architectures today, there are 3 major subsystems to highlight within the larger system that each play a part in ensuring the system is able to accomplish its goal overall:
- The executor: Traditionally, all agents require a specific executor, which is essentially an oracle that generates the responses and, if necessary, executes any actions that the user requests. Almost all executor systems can also be further decomposed into the actual generative large language model (e.g. GPT-5.1-mini, Claude 4.5 Sonnet, or any custom fine-tuned local model the agent chooses to implement) and the system prompts and guardrails in place to dictate that large language model’s behaviour.
- The tools: In order to be able to generate side effects outside of the executor inference environment (i.e. in order to be able to “do” as the executor “says”), executors are often given tools. Though these tools come in various forms, such as direct function calls or MCPs, the principle is the same: tools provide a structured interface between the executor and the outside world that allows the executor to produce the actions the user might request.
- The execution loop: Today’s more powerful models have in-built reasoning and planning capabilities and often, like GPT-5.1, expose controls over how much they reason (e.g. configurable reasoning effort). Despite this, most real-world tasks require more complex interleaving between reasoning, acting, observing, and replanning, requiring executors to operate within specific execution loop formats like ReAct in order to complete tasks that are more long-horizon or that require taking the side effects of previous actions into account.
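A loop of this shape can be sketched in a few lines. This is a minimal, hypothetical sketch: `call_llm` is a stub standing in for whatever executor a real agent uses, and `search_web` is an invented example tool, not any particular provider's API.

```python
def search_web(query: str) -> str:
    """Hypothetical tool: returns canned search results for a query."""
    return f"results for {query!r}"

TOOLS = {"search_web": search_web}

def call_llm(messages: list[dict]) -> dict:
    """Hypothetical executor stub: returns either a tool call
    ({"type": "tool", "name": ..., "args": ...}) or a final answer."""
    return {"type": "final", "content": "done"}

def run_agent(task: str, max_steps: int = 8) -> str:
    """A minimal ReAct-style loop: reason, act, observe, repeat."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = call_llm(messages)            # reason + decide on an action
        if step["type"] == "final":          # the model chose to answer
            return step["content"]
        observation = TOOLS[step["name"]](**step["args"])  # act + observe
        messages.append({"role": "tool", "content": observation})
    return "max steps exceeded"
```

Varying which model backs `call_llm`, which tools populate `TOOLS`, and how the loop interleaves steps is exactly the composition described above.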
By varying the construction and composition of each of these three subsystems, we can create an extremely wide range of agents with different specialties and areas of expertise, all obeying the same general schema.
This simple agent topology has become the foundation for a large number of extremely effective agents, including chat assistants like ChatGPT, coding agents like Cursor and Claude Code, and numerous other app layer products. Despite the power of this simple construction, however, it is important to note that in more domain- or user-specific tasks, even well-known or LLM-targeted ones like drafting an email or summarising a web search, there will exist a myriad of idiosyncrasies that often provide important information about the context behind the task and how it should be executed. If these idiosyncrasies are ignored, agents’ accuracy and utility become significantly hindered, and it is this key observation that underscores the importance of memory in the efficacy of agentic assistants.
Where agents begin to break down
Within any particular domain, a highly effective and well-versed expert maintains their standing in the field because of their capability to make accurate decisions based on both extensive learned knowledge and deep intuition informed by past experience. In short, it is the expert’s context that makes them an expert. It is thus a relatively simple step to posit that any expert agent deployed in a similar environment should be afforded the same context in order to serve most effectively as an assistant within that task. In this vein, there are two major arguments that inform the importance of user-specific context and “stateful” (memory-based) agents, both of which take as a ground truth the idea that the foundational goal of any agent is to understand and become an extension of its user, which we will henceforth refer to as the User Environment Hypothesis.
Firstly, we examine the idea of learned knowledge in the context of this User Environment Hypothesis. Each time the agent goes through an execution loop to complete a task, it is forced to make a series of decisions, both implicit and explicit, that ultimately inform the output. It is also a reasonable assumption that within each of these decisions, as a person would execute them, there are direct, episodic pieces of information (i.e. clear, discrete facts) that inform the decisions and how they are executed. For a more concrete example, consider writing an email: any particular user will have a specific tone and style, outside context on why this email is being written and to whom, and other relevant information about the reasoning and process behind the email as a whole. Combining these observations, we can deduce that any agent that seeks to assist someone or autonomously take over for them must, at the very least, have access to the same discrete, episodic context that the user has in order to make the same decisions, informed in the same way.
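As an illustration, the episodic context behind the email example might be captured as discrete facts and injected into the executor's prompt. Everything here, the field names and the example facts alike, is made up for the sketch:

```python
from dataclasses import dataclass

@dataclass
class EpisodicFact:
    """One discrete piece of context that informs a decision."""
    subject: str   # what the fact is about (tone, recipient, ...)
    content: str   # the fact itself
    source: str    # where it was observed (past drafts, calendar, ...)

# Hypothetical facts behind the email-drafting example in the text.
email_context = [
    EpisodicFact("tone", "prefers short, informal sentences", "past drafts"),
    EpisodicFact("recipient", "writing to their PhD advisor", "calendar event"),
    EpisodicFact("purpose", "requesting feedback on a draft", "conversation"),
]

def to_prompt(facts: list[EpisodicFact]) -> str:
    """Render episodic context for injection into the executor's prompt."""
    return "\n".join(f"- {f.subject}: {f.content} (from {f.source})" for f in facts)
```

An agent with access to these facts can make the same tone and framing decisions the user would; one without them has to guess.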
Secondly, regarding intuition and “long-horizon preferences”, we note that most general pre-training and post-training processes for any out-of-the-box executor language model are sufficiently vague in principle so as to introduce minimal inductive bias into the model. Because the model is trained generally on a corpus of text from the Internet and optimised on the policy of being a helpful, human-sounding assistant, there are no specific intuitions that the model is given about any particular use case or user, short of unreliable, uninterpretable latent biases that may exist within the training set. From this claim, it is once again a short step to conclude that any model that is a sufficient expert under the User Environment Hypothesis should have some sort of context, whether through prompt or through post-training, on the user that it currently serves.
Memory systems in the present day
As a result of the recent “stateful agents” thesis, AI labs and agent providers have taken a myriad of different approaches to providing each agent with the context that it needs, and there are a few overall themes worth exploring in order to understand the limitations of existing memory systems in understanding the user.
OpenAI/ChatGPT
ChatGPT’s paradigm on memory follows a more proactive pattern than most other applications and systems. The system focuses on the idea of context and, based on multiple layers of user knowledge memories, user conversation summaries, explicit memory management, and interaction metadata, tailors the responses to the user aggressively.
Anthropic/Claude
In a stark contrast to the explicit, structured outputs that OpenAI has implemented, Anthropic’s system is a simple extension of an existing framework: the file system. Through tool calls provided to the agent, Claude is encouraged to modify its own memories based on instructions provided to it through the system prompt, making the system extremely conscious and explicit and relying on the model’s ability to filter out the user inputs it is given.
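A file-system-backed memory tool surface might look roughly like the following. The function names and on-disk layout are assumptions for illustration, not Anthropic's actual implementation:

```python
from pathlib import Path

MEMORY_DIR = Path("agent_memory")  # hypothetical on-disk memory root

def read_memory(name: str) -> str:
    """Tool: read a memory file, or return "" if it doesn't exist yet."""
    path = MEMORY_DIR / name
    return path.read_text() if path.exists() else ""

def write_memory(name: str, content: str) -> str:
    """Tool: overwrite a memory file with model-curated content."""
    MEMORY_DIR.mkdir(exist_ok=True)
    (MEMORY_DIR / name).write_text(content)
    return f"wrote {name}"
```

Because the model itself decides when to call these tools and what to write, the quality of the memory store is bounded by the model's ability to curate it, which is exactly the reliance noted above.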
Google DeepMind/Gemini
Finally, Gemini chooses to use a more subconscious, minimally invasive structured system built on top of a grounded set of memories about the user in general, supplemented with short-term working memory based on past conversations. Notably, each memory is tied to a specific user conversation, timestamp, and rationale as to why the memory remains valid, meaning that while this drastically blows up context usage, memories remain up to date and clearly interpretable for both the model and the user.
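A memory record of this grounded shape might be sketched as follows. The field names and example values are illustrative, not Gemini's actual schema:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class MemoryRecord:
    """A memory grounded in its provenance, as described above."""
    content: str          # the memory itself
    conversation_id: str  # which conversation produced it
    created_at: datetime  # when it was formed
    rationale: str        # why it is believed to still be valid

# Hypothetical example record.
mem = MemoryRecord(
    content="user is vegetarian",
    conversation_id="conv-0142",
    created_at=datetime(2025, 11, 3),
    rationale="stated directly while planning a dinner reservation",
)
```

Carrying the `rationale` and provenance alongside every memory is what makes each one auditable and invalidatable, at the cost of context length.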
Limitations of existing memory systems
One interesting observation that holds for all three labs, despite their diverse takes on how memory should work, is that these systems all rely on the performance of the models themselves as well as on growing context windows. This is because all three labs’ systems involve either some sort of long-form context notepad, an atomic but agent-driven system, or a combination of both; while the sample size is small, it suggests that industry leaders are still favouring the scaling of models and their context windows over a more specific deep dive into the memory architecture powering them.
In more general agentic applications, while it is not out of the question to bet on the hyperscaling of LLM capabilities, an agent-first approach quickly becomes unreliable and nondeterministic. Even when introducing a myriad of user conversations, user-agent interaction metadata, and more general context across the user’s history as a whole, the decisions about what is important will still be informed by the intuition for what the user is trying to accomplish, as described above. Because, by our User Environment Hypothesis, the LLM does not yet have a sufficiently curated context by which to make these decisions, its performance at extracting memories out of large blocks of context will be unreliable and nondeterministic. Especially in settings where context and instructions blow up quickly (such as multiagent or complex agentic loops), these memory systems become extremely complex for the agent to maintain.
One other major problem that was highlighted is that these models and their memories only have access to “what they’re told”, i.e. what is in the conversation context. For a concrete example, assume that a user is looking for a new pair of headphones. If the user enters the first half of their preferences in a discussion with an agent, then switches to a Google search and an Amazon browsing session to fine-tune the rest of their preferences, further conversations with the agent will lose out on the second half of the information. This becomes apparent when interacting with systems like ChatGPT, which often surface old, irrelevant, or since-invalidated memories that were never corrected because no updated conversation context was ever provided.
Takeaways from the Gates Workshop
Our Gates Workshop prototype was built on a human-centric design composing sharded, agent-managed memory modules into an overall decentralised architecture that allows each agent to make its own decisions about the memory it controls based on the current user input (see our slides outlining the architecture for more detail). From this, we can see a number of the most notable takeaways from the Workshop that, in general, align with most of the premises we've outlined above surrounding the questions of performance, governance, and accessibility/portability:
- Differential privacy and hierarchical portability. In regards to governance and portability, we note that, especially in fields like law and healthcare where differential information sharing and operating on a “need-to-know” basis are extremely important, not every single piece of knowledge an agent has should (a) be stored the same way or (b) be shared. In particular, one major heuristic raised for agent memory today was that “bots are (Google) Docs”: sharing anything means the whole doc/memory repository becomes visible. In order to maximise modularity of control and privacy, especially for distributed or edge compute systems, we chose to maintain a more decentralised, self-organising architecture that can function whether or not the whole of a user’s memory repository is online or visible at any one point.
- Minimise agent decision fatigue. One large idea in performance that was touched upon was the idea of finding the right needles in a haystack. Specifically, in the age of RAG, vector databases, and long-context agent-managed memory, is perfect precision and recall really the metric that we should be optimising on? Can we make more strides toward some sort of distillation, which at least in human memory carries implicit value (i.e. the summaries that you make and the things that you choose to remember are generally more important/weighted more heavily)?
- Biologically-inspired next steps. Finally, our team in particular (myself, Prof. Yilun Du, and Zhenting Qi from the Embodied Minds Lab) has been talking at length about some sort of self-evolving and self-assembling decentralised agent system. Though our prototype is still really only the basic architectural prior behind the creation of the evolution and assembly algorithms, it was extremely interesting to consider the idea of hierarchical, self-determined primitives (as memories are made in the brain) and also think about a stem-cell-esque composition of "superneuron networks" (or, in this case, memory shards).
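The sharded, agent-managed design above can be sketched minimally as follows. The keyword-based relevance check is a stand-in for what would, in the prototype, be a model-driven decision, and none of these names come from the actual prototype code:

```python
class MemoryShard:
    """One agent-managed memory module; decides locally what it stores."""
    def __init__(self, topic: str):
        self.topic = topic
        self.entries: list[str] = []

    def relevant(self, user_input: str) -> bool:
        # Stand-in relevance check; a real shard would ask a model.
        return self.topic in user_input.lower()

    def update(self, user_input: str) -> None:
        self.entries.append(user_input)

class ShardedMemory:
    """Routes each input only to the shards that claim it, so the whole
    repository never needs to be online or visible at once."""
    def __init__(self, shards: list[MemoryShard]):
        self.shards = shards

    def observe(self, user_input: str) -> list[str]:
        handled = [s for s in self.shards if s.relevant(user_input)]
        for shard in handled:
            shard.update(user_input)
        return [s.topic for s in handled]
```

Because each shard decides independently whether an input concerns it, shards can be taken offline, shared, or withheld individually, which is the need-to-know property discussed above.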
Open questions and further exploration
We believe that the current iteration of this memory framework is only the beginning in the development process for a more well-established continual learning pipeline for agents. Specifically, we seek to explore the following questions and would welcome further exploration and collaboration in these areas:
- What is the most effective vehicle by which we can manifest stored user preference and agent adaptations? For example:
- Simple and direct injection of preferences into action
- Generating environments or building world models on the user action space
- Continuous learning and online RL on users
- Next-action prediction and “action autocomplete”
- How can we most effectively transform the fire hose of user actions, tool calls, and state changes into crisp, direct training signals for any system to continually learn, evolve, and eventually self-assemble (a la Society of Mind) from each interaction and observation?
- How can we create generalisations in the action space that create more deterministic, accurate representations of human-machine interactions?
- How can we more effectively encode user experience data into a repository of RL environments that inform both general task and domain-specific learning?