Systems Design · 2026 · Independent Research · Ongoing

Clarence:
Designing an Autonomous AI Collaborator

Hermes · GPT-5.4 Primary Runtime · Claude Opus 4.6 · Gemini Flash · MCP Bridge Architecture · Claude Code Specialist Lane · Knowledge Base Architecture · Memory Hardening · Retrieval Feedback Loop · Semantic Vector Search (RAG) · Obsidian Sync + Vault Indexing · Telegram Interface · Discord Reporting · Tailscale VPN · Human-in-the-Loop Design · Platform Migration Under Constraint
Live system: interactive public-safe surfaces from the current Clarence stack.

  • 20 active cron jobs
  • 4,266 memories in knowledge DB
  • 62% bootstrap memory reduction
  • 14,887 indexed facts with full embedding coverage

Live Knowledge Graph: interactive public slice of the Clarence entity graph.

Documentation and System Notes

Clarence now has a dedicated system wiki because too much durable system knowledge was trapped in chat history, repo docs, and a fragile note surface. I wrote a companion article about why that became necessary.

Read it here: Why Clarence Needed a System Wiki.

The public-safe architecture notes currently live in the public repo at github.com/nomadjames/clarence-architecture. The raw internal system wiki stays private, which is the correct boundary.

The Design Question

Most AI tools are responsive. You ask, they answer. The interaction ends and nothing persists except a chat log. That model is useful, but it is not collaboration.

The question I set out to explore: what does it take to make an AI system useful across sessions instead of only within one chat? It needs durable context, explicit boundaries, and enough judgment to stay useful without pretending it should decide everything on its own.

Clarence is my attempt to answer that question in practice. It is not a chatbot. It is an autonomous system that runs 20 scheduled cron jobs overnight, manages a knowledge database of over 4,200 memories and nearly 15,000 indexed facts, routes tasks across models based on what each job actually needs, distills every conversation into durable memory, and writes nightly reports that feed into what it does while I sleep. The system was originally built on OpenClaw, a different orchestration platform. In April 2026, the platform it depended on changed its access rules with minimal notice. I migrated the entire runtime to Hermes, a new orchestrator running GPT-5.4, in under 48 hours. The memory layer, the cron architecture, and the collaboration patterns survived. The infrastructure underneath them did not.

I am both the designer and the primary user of this system. That dual position is unusual, and worth examining explicitly as part of this case study.

What Changed After the OpenAI Switch

The switch to OpenAI was not just a provider swap. It forced a cleanup of assumptions. I stopped treating Clarence as a clever prompt chain and started treating it like infrastructure. That meant defining a memory contract, deciding what belonged in hot injected memory versus structured database memory versus transcript recall, and then making the retrieval layer honest enough to trust.

Since the switch, I hardened the system in ways that matter for real use: semantic retrieval now returns stable IDs; feedback can mark results as useful or noise; ranking adjusts based on that feedback; duplicate conversation memories can be archived conservatively instead of multiplying forever; and the health checks now report freshness instead of flattering coverage numbers. Obsidian indexing was repaired too, with real YAML parsing, sync failure detection, stale path repair, and Chroma cleanup for deleted files.
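A minimal sketch of how feedback-adjusted ranking can work, assuming a simple useful/noise counter per stable memory ID. The names and the weighting scheme here are illustrative, not the actual Clarence implementation:

```python
def feedback_weight(useful: int, noise: int, step: float = 0.05) -> float:
    """Convert accumulated useful/noise marks into a bounded multiplier."""
    raw = 1.0 + step * (useful - noise)
    return max(0.5, min(1.5, raw))  # clamp so feedback never swamps similarity

def rank(results, feedback):
    """results: [(memory_id, similarity)]; feedback: {memory_id: (useful, noise)}."""
    scored = [
        (mid, sim * feedback_weight(*feedback.get(mid, (0, 0))))
        for mid, sim in results
    ]
    return sorted(scored, key=lambda r: r[1], reverse=True)
```

Clamping the multiplier is the conservative part: feedback nudges ranking, but a memory marked useful many times still cannot outrank a far better semantic match.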

That work is less glamorous than a multi-agent demo, but it is what makes Clarence feel like a collaborator instead of a novelty. The value is not that it can answer a question once. The value is that it can still be coherent next week, after corrections, migrations, overnight jobs, and hundreds of accumulated decisions.

Public Read-Only Access

The hard part is not getting a good answer once. It is carrying context forward after the chat ends. If memory ends at the context window, every new session starts with reconstruction. I built Clarence because I wanted memory that survives that reset. The current knowledge layer holds 3,395 active memories, 10,351 active facts, and 2,310 indexed vault notes.

The recent milestone was putting that memory on a public read-only MCP path without opening writes. The endpoint lives at clarence-memory.nomadjames.com/mcp. Cloudflare fronts it. A local Streamable HTTP bridge serves it on port 8765. The public health check is at /healthz. Claude Code reaches the same read-only server through a local alias in ~/.claude.json. Hermes still owns writes.
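A client-side probe of that health endpoint can be sketched with the standard library alone. The URL is the one above, but the JSON response shape (`{"status": "ok"}`) is an assumption for illustration, not the documented /healthz contract:

```python
import json
import urllib.request

HEALTH_URL = "https://clarence-memory.nomadjames.com/healthz"

def is_healthy(payload: bytes) -> bool:
    """Treat the service as healthy only on an explicit ok status."""
    try:
        body = json.loads(payload)
    except json.JSONDecodeError:
        return False
    return body.get("status") == "ok"

def check(url: str = HEALTH_URL, timeout: float = 5.0) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200 and is_healthy(resp.read())
    except OSError:
        # Network failure reads as unhealthy, not as an exception:
        # silence must never be ambiguous in a monitoring path.
        return False
```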

The real work was the boundary, not the endpoint. The main 384-dimensional memory and fact indexes now line up with the active tables, and the missing read methods are exposed through MCP. Claude gets bounded retrieval instead of raw database access. I also set up a second NVIDIA embedding lane for hybrid retrieval experiments, but writes still go through Hermes only.

The work was messy. I had to patch a vendored MCP SDK inside a pinned supergateway install so duplicate SSE GET requests would stop killing the child process on reconnect. I also had to anonymize the public graph export so I could show the structure on the site without exposing entity names. None of that is glamorous. It is what turned this from a local demo into something I can expose with a straight face.

That matters because Clarence is not the thing I am trying to show off. It is the layer under the next round of work, especially SensorSynthFM. When I come back to that project, I do not want to reconstruct old decisions, blockers, and experiments from scratch. I want Clarence to remember where the work stopped and hand that context over safely when a specialist model needs it.

Agentic UX: Where This Fits in the Industry

In 2025 and 2026, a new design discipline is forming around multi-agent AI systems. Job titles like “Agentic UX Designer” and “Agent Experience Designer” are appearing at companies like Deloitte, Anthropic, and Microsoft. Research teams at Microsoft (Magentic-UI), Salesforce (SLDS 2), and Google (A2UI) are defining the design patterns for human-agent collaboration. I built Clarence by following operational need, not by reading the literature. Mapping what I built to the emerging vocabulary, after the fact, is useful because it shows the reasoning was design-driven rather than borrowed from a framework.

Orchestrator-Specialist Pattern

The dominant architecture in multi-agent systems: a supervisor agent delegates to specialized agents. In Clarence, the orchestrator routes to named agents, each with a fixed role, a designated model tier, and a constrained scope. The pattern lines up with what Microsoft Research describes in Magentic-UI, which I learned about only after building the system.

Transparent Planning

Microsoft's design guidance for agents emphasizes that users need to see what the agent is planning before it acts. Clarence implements this through WORKING.md (a live state file that any agent can read), the knowledge directory (structured files visible from any device via Brain Reader), the Acknowledge First rule (every task gets a plan and time estimate before execution begins), and HANDOFF.md (session continuity notes written at session end, loaded at session start). These are not dashboards. They are working documents that serve both the human and the agents.
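The HANDOFF.md pattern reduces to a small renderer: at session end, write the session's state as a markdown note the next session loads first. The file name comes from the text above; the note structure is illustrative:

```python
from datetime import datetime, timezone
from pathlib import Path

def render_handoff(done: list[str], next_up: list[str], stamp: str) -> str:
    """Render a session handoff note as plain markdown."""
    lines = [f"# Handoff ({stamp})", "", "## Completed this session"]
    lines += [f"- {item}" for item in done]
    lines += ["", "## Next session should start with"]
    lines += [f"- {item}" for item in next_up]
    return "\n".join(lines) + "\n"

def write_handoff(path: Path, done: list[str], next_up: list[str]) -> str:
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    text = render_handoff(done, next_up, stamp)
    path.write_text(text)
    return text
```

The point of keeping it plain markdown is the one the section makes: the same file serves the human reading it on a phone and the agent loading it at session start.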

Memory Management as UX

The AgentOps and AutoGen ecosystems are building toward “memory cards” and preference editors for agent systems. Clarence's memory architecture is an implementation of the same principle, arrived at independently: the conversation distillation pipeline, the entity-fact-memory schema, the Obsidian vault sync, and the RAG retrieval layer all serve the same goal. The user's corrections and preferences should persist across sessions without the user having to re-state them. The design question is not whether to give agents memory. It is how to make that memory visible, editable, and trustworthy.

Adaptive Model Routing

Industry patterns describe routing between fast/cheap agents and powerful/expensive agents based on task complexity. Clarence now uses a stricter split: Hermes stays hard-pinned to GPT-5.4 for the main runtime, while cheaper or specialized support paths are used only when the task is clearly bounded. The lesson from operating this is that routing should be boring. Ambitious dynamic routing sounds elegant, but a stable primary brain with explicit support paths is easier to trust, debug, and maintain.

Structured Multi-Lens Review

One useful historical pattern in the system was structured review through multiple fixed lenses: market analysis, UX research, technical architecture, product strategy, and adversarial critique. The value was not the theater of many personalities. The value was forcing disagreement into the open. When every lens agreed, the answer was usually obvious. When they conflicted, the synthesis work got interesting.

System Architecture

The system has five layers. Each one reflects a design decision, not just a technical choice.

The interface layer runs through Telegram, Discord, and now claude.ai via a remote MCP bridge tunneled through Cloudflare. All device access goes over either Tailscale (private mesh) or the Cloudflare tunnel (for external AI clients). I move between Linux, iPhone, and iPad throughout the day. The system has to be reachable from all of them without exposing anything to the public internet.

The orchestration layer is Hermes, running GPT-5.4 via OpenAI. It schedules 20 cron jobs in the overnight window between 11 PM and 4:45 AM ET, plus a bridge health watchdog that runs every five minutes around the clock. Each job is isolated with its own context scope and delivery target. The system was originally orchestrated by OpenClaw. When that platform became unavailable, Hermes replaced it. The migration preserved everything above the orchestrator: memory, crons, collaboration patterns. What changed was the engine, not the car.

The model layer routes work based on capability, not cost. All 20 cron jobs currently run on GPT-5.4 or as direct script executions. Ollama cloud models (DeepSeek V3.2, Qwen3 Coder, Gemma 4, MiniMax, and others) are available for supporting work and research. Model switching is immediate via config. The routing philosophy shifted over time from cost-first (when the system ran on free-tier models to keep the overnight loop at zero cost) to capability-first (use the model that produces the best result for each task).

The memory layer is a single SQLite database (clarence.db) holding 4,266 memories, 14,887 facts, 2,475 entities, and 283 entity relations. All agents access it through MCP servers: 16 read-only tools for retrieval, plus a separate write-capable ops server for task dispatch and system management. Two embedding sets run in production: MiniLM 384-dimensional and NVIDIA 2048-dimensional, both at 100% coverage. Agents query by meaning, not keywords. A conversation distillation pipeline processes conversations nightly, extracting decisions, corrections, and preferences into the database automatically. An Obsidian vault syncs bidirectionally: what I write, agents can read.
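The core of "query by meaning" is cosine similarity between a query embedding and the stored vectors. A stdlib-only sketch; the table and column names are illustrative, not the real clarence.db schema, and at a few thousand rows a brute-force scan is acceptable, with an ANN index as the upgrade path:

```python
import json
import math
import sqlite3

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def nearest_memories(conn, query_vec, k=5):
    """Return the top-k memory IDs by similarity to the query embedding."""
    rows = conn.execute("SELECT id, embedding FROM memory_vectors").fetchall()
    scored = [(mid, cosine(query_vec, json.loads(blob))) for mid, blob in rows]
    scored.sort(key=lambda r: r[1], reverse=True)
    return scored[:k]
```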

The session layer handles continuity. Every session starts warm via auto-loaded context. Stop hooks write handoff notes so the next session knows what happened. Nightly rotation archives old session files. Memory files were trimmed from 106KB to 40KB so every session starts with less noise and more signal.

The Agent Crew

Naming agents was a deliberate design choice, not decoration. Names create accountability. When a named agent produces output, I read it differently than output from an anonymous function call. The names also make role boundaries explicit across the codebase and the cron config.

The current system runs 20 jobs across distinct roles: system health monitoring, knowledge synchronization, cost tracking, portfolio drift auditing, research briefings, conversation distillation, memory consolidation, calendar briefings, model roster evaluation, and a bridge health watchdog added in April 2026 after a day of diagnosing reliability problems in real time with Claude as a peer debugger.

The delegation architecture follows hard rules. Clarence orchestrates. Subagents execute. Every task gets immediate acknowledgment with a plan and time estimate. Multiple independent tasks run in parallel. Clarence never blocks on a subagent. It stays available to coordinate while work runs in the background.

The earlier version of this system used named characters for each role: Felix as chief of staff, Bruno for security audits, Ada for memory consolidation, a five-agent R&D Council that held structured debates. Those names reflected a phase where the system's personality was part of the design experiment. The current architecture is leaner. The roles persist. The theater around them has been stripped back in favor of reliability and clarity.

The Overnight Loop

The most consequential design element is what happens while I sleep. Twenty cron jobs run between 11 PM and 4:45 AM ET, staggered to avoid collisions. They are sequenced deliberately: the conversation distillation pipeline processes the day's conversations first, so the memory database is current before the research and reporting jobs run against it. A health check job at 3 AM verifies database integrity. An overnight summary at 4:23 AM compiles everything into a report ready when I wake up. A daily executive brief at 4:30 AM synthesizes priorities for the day ahead.
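The staggering constraint can be made mechanically checkable: every job start time must fall inside the overnight window, and no two jobs may share a slot. A sketch with illustrative job names and times, not the real cron config:

```python
from datetime import time

WINDOW = (time(23, 0), time(4, 45))  # 11 PM to 4:45 AM ET

def in_window(t: time) -> bool:
    """The window wraps past midnight, so it is an OR, not an AND."""
    start, end = WINDOW
    return t >= start or t <= end

def staggered(schedule: dict[str, time]) -> bool:
    """True when all jobs are in-window and no two start at the same minute."""
    times = list(schedule.values())
    return len(times) == len(set(times)) and all(in_window(t) for t in times)
```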

The self-improving loop works like this: nightly audits surface improvements. Those improvements feed a task queue. Execution jobs pick up the queue and do the work. The next night's audit reviews what was done. The conversation distillation pipeline means every correction I make during the day feeds back into the memory layer that night. The system compounds overnight rather than just reporting.

In practice, this loop has a write-only failure mode I have not fully solved. The audit produces excellent proposals, but the execution step does not always pick them up correctly. The proposals accumulate without always turning into action. Reporting awareness is not the same as fixing the underlying reliability issue. I keep this paragraph here because a case study that hides its failures is a sales document.

Trust Calibration: The Real Design Problem

The hardest design problem in this system is not technical. It is trust calibration. How much autonomy does the system get? When does it act without asking, and when does it wait? These are not engineering questions. They are design questions about human-AI collaboration, and they do not have permanent answers.

The March 25-26 changes made this concrete. Moving all cron jobs to free-tier models was not just a cost decision. It was a trust decision. The system initially leaned too hard on cheap routing for cron jobs, but tool-call hallucination failures forced a stricter policy. The lesson: cheaper models save budget only if the output is real. Trust in the overnight loop required a stable primary runtime and bounded support paths that actually execute tools instead of hallucinating what tool calls would look like.

The Acknowledge First rule is another trust calibration. Every task gets immediate acknowledgment with a plan and time estimate before any work begins. This was not always the case. The rule emerged from real friction: tasks would disappear into subagent execution with no signal back to me about whether they were understood correctly. The fix was not more autonomy or less autonomy. It was better communication at the boundary.

The bigger trust mechanism now is explicit continuity rather than magical continuity. Hermes keeps hot memory, session handoff notes, and a durable SQLite knowledge layer live today. The older nightly conversation-distillation idea is still part of the project's path, but it should be read here as a historical design direction and rebuild target, not as a claim that every correction is currently being distilled automatically every night.

Overnight-only autonomy. All autonomous agent work runs between 11 PM and 4:45 AM ET. Nothing fires during the day. All results must be compiled before 5 AM. This is a deliberate trust boundary: the system earns autonomy in a window where errors are recoverable before they affect real-time work.

Infrastructure Decisions and Why

Token Budget as Design Material

Token cost shapes architecture. The system went through an aggressive optimization pass: the self-audit prompt trimmed from 7,582 characters to 1,276. Bootstrap files trimmed from 11 to 7. Memory files cut from 106KB to 40KB. Every session starts with less noise and more signal.

The routing philosophy has evolved. The system originally ran all overnight work on free-tier models to keep cost at zero. The current architecture runs on GPT-5.4 across the board, with cost as a secondary constraint behind capability. That shift reflects a real lesson: cheap models produce output that looks complete but degrades in ways you do not notice until something downstream depends on it. Capability-first routing costs more but compounds better.

Single Source of Truth Architecture

The original OpenClaw system used five symlinked sub-agent workspaces. After the Hermes migration, the architecture simplified: Hermes manages its own session isolation, and clarence.db remains the single authoritative knowledge store. IDENTITY.md was merged into SOUL.md. The pattern is consistent: eliminate duplication, reduce the surface area where state can diverge.

This matters more than it sounds. When multiple agents can write to the same knowledge base but read from different workspace copies, you get silent divergence. In the original system, symlinks solved this at the filesystem level without requiring any coordination protocol; the simplified Hermes architecture removes the duplication entirely.

Remote Access and the MCP Bridge

The system now exposes two MCP (Model Context Protocol) servers through Cloudflare tunnels: a read-only memory server for retrieval and a write-capable ops server for task dispatch. An OAuth 2.1 shim authenticates external clients. This is what allows claude.ai to function as a peer interface alongside Telegram and the command line: it can query memory, dispatch tasks to Hermes, and read results without being on the local network. Building this bridge, and then spending an entire day debugging its reliability with Claude while I was at my day job, was one of the more honest tests of whether the system actually works under real conditions.

Memory Architecture

Long-term memory lives in a single SQLite database with 4,266 memories, 14,887 facts, and full embedding coverage across two vector sets. The schema separates concerns: a profiles table holds identity facts with deterministic lookup (no fuzzy search for things that must be exact), a memories table stores durable knowledge with soft invalidation (when a fact changes, the old record is marked invalid and a new one is written, preserving the audit trail).
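Soft invalidation reduces to a few lines of SQL in one transaction: mark the old row invalid, insert the replacement, keep both. The schema here is illustrative, not the actual clarence.db layout:

```python
import sqlite3

def supersede_memory(conn, memory_id, new_content):
    """Replace a memory without deleting it, preserving the audit trail."""
    with conn:  # one transaction: invalidate old, insert new, or neither
        conn.execute("UPDATE memories SET valid = 0 WHERE id = ?", (memory_id,))
        cur = conn.execute(
            "INSERT INTO memories (content, valid, supersedes) VALUES (?, 1, ?)",
            (new_content, memory_id),
        )
    return cur.lastrowid
```

Because the old record survives with `valid = 0`, "when did this fact change and from what" stays answerable, which is the whole point of soft invalidation over an UPDATE in place.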

The conversation distillation pipeline is what makes the memory system feel alive rather than static. I correct something once in conversation, and it persists. The knowledge base grew from roughly 170 memories to over 4,200 largely because this pipeline captures context that would otherwise evaporate.

An Obsidian vault syncs bidirectionally with the database. Writing in Obsidian feeds the retrieval layer without any additional wiring. The goal is one place where knowledge lives, accessible from every surface.

Tailscale for Device Continuity

Clarence runs on an Ubuntu Linux server. I move between Linux, Mac, iPhone, and iPad. Tailscale creates a private mesh VPN connecting all four, so the Telegram interface and Brain Reader HTTP server are reachable from any device without exposing anything to the public internet. This was a day-one decision and it has been the most friction-free part of the stack.

Multi-Surface Communication

Discord (13 channels in the current directory), Telegram (interactive conversations), HANDOFF.md (session continuity). Each surface serves a different communication need. Discord for async notification and overnight reporting. Telegram for real-time dialogue and morning briefings. HANDOFF.md for session-to-session continuity.

Honest Challenges

A portfolio case study that only shows what worked is a sales document. Here is what is actually hard:

The Plan-Execution Gap

The current system is better at framing work than completing every part of the original vision. Hermes is good at planning, memory stewardship, and deciding when Claude Code is worth invoking. The new handoff stack is good at bounded parallel investigation. But that still leaves a gap between the original tandem plan and the present implementation.

The gap is simple to describe: Claude automation is currently investigation-first, not implementation-first. That is safer and more honest than pretending otherwise, but it means part of the original promise remains a roadmap item rather than a finished capability.

Budget Constraint as Design Constraint

The routing policy exists because compute cost is a real design material, not an abstract one. GPT-5.4 is the daily driver because it is the main conversational and planning layer. Claude Code is protected for the top slice of tasks where repo-level execution or code reasoning genuinely matters.

That creates a permanent calibration problem. If I escalate too early, I waste Claude. If I escalate too late, I waste time and trust. The current tandem model is my answer so far, but it is not a solved problem. It is a policy I expect to keep refining as SensorSynthFM work becomes more serious.

Memory Growth Without Garbage Collection

The knowledge base grew from ~170 to 4,262 memories and the facts table exploded to 14,884 entries after vault fact extraction processed thousands of notes and documents. More data does not automatically mean better recall. As the database scales, the vector search returns increasingly similar results, and the signal-to-noise ratio in retrieved context degrades. Memory needs pruning and consolidation, not just accumulation. Automated garbage collection helps, but the curation problem is fundamentally unsolved.
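One conservative pruning primitive: flag an older memory as redundant only when a newer one is nearly identical. A sketch using stdlib string similarity (a production version would compare the stored embeddings instead; the threshold is illustrative):

```python
from difflib import SequenceMatcher

def near_duplicates(memories, threshold=0.92):
    """memories: [(id, text)] ordered oldest first. Returns the IDs of
    older entries that an almost-identical newer entry makes redundant."""
    redundant = set()
    for i, (old_id, old_text) in enumerate(memories):
        for _new_id, new_text in memories[i + 1:]:
            if SequenceMatcher(None, old_text, new_text).ratio() >= threshold:
                redundant.add(old_id)  # archive the older copy, keep the newer
                break
    return redundant
```

The high threshold is deliberate: archiving is conservative, so only near-verbatim repeats qualify and genuinely distinct memories are never collapsed.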

Platform Fragility

Twitter/X posting was built, tested, and blocked by anti-automation fingerprinting within a single session. The platform changed the rules underneath a working integration. This is a real constraint of building on platforms you do not control. The system adapted by routing public communication through Discord instead.

Visibility of System Status (Nielsen #1)

This is the most persistent unsolved design problem in the system. Nielsen's first usability heuristic states that a system should always keep users informed about what is going on through appropriate feedback within reasonable time. Clarence violates this consistently.

Cron jobs run overnight and deliver to Discord channels: reports to #cron-reports, incidents to #incidents, research to #research-reports. The morning briefings deliver to Telegram. But the gap between “reported” and “observable” remains. Silence is still ambiguous: does it mean everything worked, or that the reporting layer itself failed?

The problem compounds during live sessions. When agents are delegated tasks in parallel, there is no progress indicator, no heartbeat, no way to distinguish “working on it” from “silently failed.” The user is left interpreting silence, which is the opposite of visibility. This is not a logging problem. It is a feedback design problem, and solving it requires treating system status as a first-class UX surface rather than a side effect of Telegram messages or Claude Code tool calls.

What the System Does Today

  • 20 scheduled cron jobs running in a nightly window (11 PM to 4:45 AM ET), plus a bridge health watchdog running every five minutes
  • 4,266 memories and 14,887 indexed facts in a single authoritative database, with full embedding coverage across two vector sets
  • Conversation distillation pipeline writing new memories nightly from real conversations
  • 16 read-only MCP tools plus a write-capable ops server, accessible remotely via Cloudflare tunnels and OAuth 2.1
  • Discord integration across multiple channels for cron reports, exchange logging, alerts, and daily briefings
  • Remote dispatch from claude.ai as a peer interface, enabling collaborative debugging and task coordination from a phone
  • Bootstrap trimmed to 7 files (~18KB), memory files to 40KB (62% reduction), self-audit prompt from 7,582 to 1,276 characters
  • Bidirectional Obsidian vault sync feeding the retrieval layer automatically
  • Telegram interface with brief mode reducing per-message context injection from ~10KB to ~150 bytes
  • Private mesh networking via Tailscale for device continuity across Linux, iPhone, and iPad

Week Three: Engineering the Foundation

The first week was building. The second week was surviving a platform migration. The third week was about fixing the thing that kept breaking underneath everything else: every script was writing its own SQL with its own assumptions about the schema.

That is the kind of problem that does not announce itself. It shows up as orphaned facts, stale vectors, embedding gaps, and silent data corruption. 1,776 orphaned facts. 1 stale vector pointing at a deleted record. Embedding coverage gaps across active records. Each one traceable to a script that had its own idea about how to talk to the database.

The fix was a Data Access Layer: a single Python module (clarence_db.py, 870 lines) that owns all database operations. Every script in the system now goes through this one file for reads, writes, vector search, and schema management. No more raw SQL scattered across 36 scripts. One module, one set of assumptions, one place where the schema is defined.
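The shape of that layer, reduced to a sketch. Method names and schema are illustrative; the real clarence_db.py is 870 lines:

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS memories (
    id INTEGER PRIMARY KEY,
    content TEXT NOT NULL,
    valid INTEGER NOT NULL DEFAULT 1
);
"""

class ClarenceDB:
    """Single owner of all database operations: one set of assumptions."""

    def __init__(self, path: str = ":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.executescript(SCHEMA)  # schema defined in exactly one place

    def add_memory(self, content: str) -> int:
        with self.conn:
            cur = self.conn.execute(
                "INSERT INTO memories (content) VALUES (?)", (content,)
            )
        return cur.lastrowid

    def active_memories(self) -> list[str]:
        rows = self.conn.execute(
            "SELECT content FROM memories WHERE valid = 1 ORDER BY id"
        )
        return [content for (content,) in rows]
```

Scripts import the class instead of opening the database themselves, so a schema change touches one module rather than 36 scripts.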

Test Suite

75 tests covering CRUD operations, vector search, JSON handling, edge cases, and concurrent access patterns. The test suite runs against a fresh database every time, so regressions surface immediately instead of hiding in production data. This was the first time the Clarence codebase had real test coverage, and it caught three bugs in the first run.

Health Check Monitor

A nightly automated health check (db-health-check.py) runs against clarence.db and reports to Discord. It checks for orphaned records, stale vectors, embedding coverage gaps, schema integrity, and table row counts. If anything is wrong, it posts to the incidents channel. If everything is clean, it posts a summary to cron-reports. The system now tells me when the data is degrading instead of waiting for me to notice.
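Two of the checks such a job can run, sketched against an illustrative schema (the real db-health-check.py differs):

```python
import sqlite3

def orphaned_facts(conn) -> int:
    """Count facts whose entity no longer exists."""
    (n,) = conn.execute(
        "SELECT COUNT(*) FROM facts f "
        "LEFT JOIN entities e ON f.entity_id = e.id WHERE e.id IS NULL"
    ).fetchone()
    return n

def embedding_coverage(conn) -> float:
    """Fraction of facts that carry an embedding (1.0 for an empty table)."""
    (total,) = conn.execute("SELECT COUNT(*) FROM facts").fetchone()
    (embedded,) = conn.execute(
        "SELECT COUNT(*) FROM facts WHERE embedding IS NOT NULL"
    ).fetchone()
    return embedded / total if total else 1.0
```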

Portfolio Changelog

105 portfolio changes tracked with timestamps, categories, and RAG embeddings. Every meaningful change to the portfolio site gets logged with what changed, why, and when. The changelog itself is searchable via semantic vector search, so the system can answer questions like “what did I change about the Clarence case study last week?” without scanning git logs.

The cleanup results: 1,776 orphaned facts resolved, 1 stale vector removed, 100% embedding coverage achieved on all active records, and the database went from 26 tables to 15 after removing legacy cruft. The fact count went down (from 16,658 to 14,882) because the system was finally honest about what was real data and what was garbage.

The lesson from week three is the same lesson every production system teaches eventually: the exciting work is building features, but the work that actually matters is making the foundation reliable. A Data Access Layer is not glamorous. Neither is a test suite. But every time the system broke before this week, the root cause was the same: fragmented assumptions about shared state. Now there is one source of truth for how data moves in and out of the knowledge base.

Week Two: The Compute Reckoning

The first week of Clarence was about building. The second week was about what happens when the platform you built on changes the rules underneath you.

On April 4, 2026, I received an email from Anthropic. Starting that afternoon, third-party harnesses, including OpenClaw, the orchestration platform Clarence ran on, would no longer be covered by the Max subscription. Those tools would require “extra usage,” a separate pay-as-you-go billing layer. The subscription still covered Claude Code and Claude's own products, but anything running through a third-party harness was now out of bounds.

The reason Anthropic gave: “these tools put an outsized strain on our systems” and they needed to “prioritize customers using core products.” They offered a one-time credit equal to the monthly subscription and discounts on pre-purchased usage bundles. A follow-up email would offer the option to cancel entirely.

This was not unexpected. The signs had been building. The week before, Anthropic had tightened usage limits during peak hours. By Thursday, April 3, I had burned through my weekly compute allocation entirely. I could not use Opus. Telegram stopped responding. The approval system broke. A hallucination went undetected until I manually checked. The system I had built over two weeks was suddenly fragile in ways I had not anticipated.

Thursday was the hardest day. I wrote in Telegram: “just doesn't seem like things are working well. i have wasted a bunch of time and what little compute I have left for the week.” That frustration was real. But frustration is also data.

The Migration: OpenClaw to Hermes

The response was not to abandon the architecture. It was to migrate it to a platform that Anthropic still supported. Hermes is an agent gateway that connects to Claude via OAuth, the same authentication path Claude Code uses. It runs its own Telegram bot, its own session management, its own memory layer. The migration happened on April 2, before the email arrived. I had already felt the pressure and started moving.

The new Telegram bot was named Franklin. The original OpenClaw bot was decommissioned. Franklin became the Hermes-side agent, same persona, different infrastructure. The memory database, the Discord webhooks, the session management: all of it migrated. The cron jobs were rebuilt for the Hermes scheduler rather than ported. Not seamlessly. There were broken configs, zombie processes, token lock race conditions. But by the end of Wednesday night, Franklin was live on Telegram and Hermes was running.

The Split Architecture, Then and Now

Friday's breakthrough was still real, but it needs to be read as an intermediate architecture rather than the final one. The immediate post-migration answer was to put Claude Code beside Hermes and use an MCP bridge to reach memory and cheaper helper models. That was the right move at the time because it preserved leverage under Anthropic's new subscription rules.

The current design has tightened since then. Hermes on GPT-5.4 is now the primary orchestrator and memory owner. Claude Code is still crucial, but as a bounded specialist lane rather than the always-on front end. Tonight's handoff work pushed that design further: explicit Claude task packets, parallel read-only investigations, and launch tooling that treats Claude as a premium execution surface instead of a place to casually burn compute.

That is the path I want preserved in the case study. The MCP-bridge stage mattered. The current tandem model matters too. The project did not jump from OpenClaw to a perfect final state. It evolved through a sequence of constrained decisions.

What This Revealed About Platform Dependency

The Anthropic email confirmed something the Twitter/X blocking had already demonstrated: building on platforms you do not control means accepting that the rules can change at any time. The system I built in week one assumed OpenClaw would remain a viable harness. That assumption lasted twelve days.

The design lesson is not “do not build on platforms.” There is no alternative. The lesson is to build with migration in mind. The memory database survived because it was SQLite, not a proprietary format. The cron jobs survived because they were defined in configuration, not hard-coded. The persona survived because it was documented in a SOUL.md file, not implicit in the platform. The things that were portable were the things I had designed to be portable. Everything else broke.
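The "defined in configuration, not hard-coded" principle is worth making concrete. The sketch below assumes a hypothetical JSON job format; the actual Hermes scheduler format is not shown in this article. The portable part is that migration means pointing a new scheduler at the same document, not rewriting code.

```python
import json

# Hedged sketch of portability-by-configuration: cron jobs live in a JSON
# document the scheduler reads. The job schema here is illustrative, not
# the actual Hermes format.
JOBS_JSON = """
[
  {"name": "nightly-self-audit", "schedule": "0 2 * * *",
   "command": "python audit.py"},
  {"name": "conversation-distill", "schedule": "30 3 * * *",
   "command": "python distill.py"}
]
"""

def load_jobs(raw: str) -> list:
    jobs = json.loads(raw)
    for job in jobs:
        # A portable definition needs all three fields; fail loudly otherwise.
        missing = {"name", "schedule", "command"} - set(job)
        if missing:
            raise ValueError(f"{job.get('name', '?')} is missing {missing}")
    return jobs

jobs = load_jobs(JOBS_JSON)
```

Rebuilding the jobs for a new platform then reduces to writing one adapter from this document to the new scheduler's registration API, which is roughly what the 48-hour migration amounted to for the cron layer.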

The Designer-User-Researcher Position

Most UX research involves a separation between the person who designs a system and the person who uses it. That separation is methodologically valuable: it forces the designer to account for contexts and needs they did not anticipate.

Building Clarence collapses that separation. I designed it, I use it every day, and I am continuously researching how it behaves. The advantages are real: I notice friction that an external observer would miss, I can iterate in hours rather than weeks, and my mental model of the system is accurate in ways that matter for debugging.

The risks are also real. I adjust my behavior to work around failures rather than fixing them. I habituate to log noise that a fresh observer would flag immediately. I develop preferences about how the system should behave that may not generalize to anyone else.

The conversation distillation pipeline changed this dynamic. Because every correction I make is now captured and persisted automatically, the system is learning from my behavior as a user in real time. The nightly self-audit partially addresses the external observation gap by creating a perspective on the system that I could not produce from memory alone. When Clarence writes "what I noticed about my own behavior" in the audit report, it generates observations that consistently surprise me as the designer.

The design question has become empirical. I started with a hypothesis about what AI collaboration could look like. I now have a running system that confirms some of it and contradicts the rest. That feedback is what makes this a research project, not just a personal productivity setup.

Open Source: The Memory Architecture

The memory system that powers Clarence is open source. The framework repository contains the complete architecture for giving an AI assistant persistent, searchable memory across sessions: the SQLite schema, the RAG embedding pipeline, the conversation distillation scripts, and the MCP server interfaces that let agents read and write knowledge.

No personal data is included. This is the plumbing, not the person. The repository is designed so anyone building a persistent AI agent can study or replicate the architecture without starting from scratch.

  • SQLite knowledge base spanning memories, entities, facts, sessions, work items, profiles, vault notes, and vector indexes
  • Primary all-MiniLM-L6-v2 retrieval index via sqlite-vec, plus a second NVIDIA evaluation lane for hybrid search experiments
  • Conversation distillation: nightly LLM-driven extraction of durable knowledge from raw transcripts
  • MCP interfaces: full internal CRUD, plus a capability-enforced public read-only connector surface
  • Obsidian vault sync: bidirectional between markdown notes and the knowledge database
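The shape of the memory store and retrieval loop can be sketched without any of the real dependencies. The repository uses sqlite-vec for vector storage and all-MiniLM-L6-v2 for embeddings; the sketch below substitutes a plain TEXT column, toy hand-made vectors, and cosine similarity computed in Python, purely to show the data flow. The schema is illustrative, not the repo's actual schema.

```python
import json, math, sqlite3

# Dependency-free sketch of the memories table + semantic retrieval shape.
# The real system uses sqlite-vec vector columns and model-generated
# embeddings; here vectors are toy 3-d lists stored as JSON.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE memories (
    id INTEGER PRIMARY KEY,
    content TEXT NOT NULL,
    embedding TEXT NOT NULL  -- JSON array; sqlite-vec would use a vec column
)""")

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def remember(content, vec):
    db.execute("INSERT INTO memories (content, embedding) VALUES (?, ?)",
               (content, json.dumps(vec)))

def recall(query_vec, k=1):
    rows = db.execute("SELECT content, embedding FROM memories").fetchall()
    scored = [(cosine(query_vec, json.loads(e)), c) for c, e in rows]
    return [c for _, c in sorted(scored, reverse=True)[:k]]

remember("User prefers terse nightly reports", [0.9, 0.1, 0.0])
remember("Tailscale node names for the mesh", [0.0, 0.2, 0.9])
```

Swapping the linear scan for a sqlite-vec index changes the performance profile, not the interface: `remember` and `recall` are essentially what the MCP read/write surface exposes to agents.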

View the repository on GitHub

What Is Next

  • Closing the execution loop: the self-improving cycle (audit surfaces improvement, queue captures it, executor runs it overnight) needs a more reliable handoff at the execution step. The audit is strong. The execution is inconsistent.
  • Memory pruning: 4,266 memories need active consolidation. Duplicate facts, superseded preferences, and stale context degrade retrieval quality as the database scales. The next iteration needs to prune as aggressively as it accumulates.
  • Bridge reliability: the remote MCP bridge works but requires manual re-authentication when services restart. Making the session layer robust enough to survive infrastructure changes without human intervention is the current infrastructure priority.
  • Structural delegation enforcement: a prompt-level delegation rule was added to Clarence's system prompt in April 2026. Early testing shows it does not reliably fire under task momentum. The next step is a planner-level structural hook that forces a delegation decision before the first tool call on multi-step tasks.
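The planner-level structural hook in the last item can be sketched as a gate that blocks the first tool call on a multi-step task until a delegation decision has been recorded, rather than hoping a prompt rule fires. Everything here is a hypothetical illustration of the mechanism, not Clarence internals.

```python
# Hypothetical sketch of a structural delegation hook: unlike a prompt-level
# rule, the first tool call on a multi-step task is refused until an explicit
# delegation decision exists. Names are illustrative.
class DelegationGate:
    def __init__(self, multi_step: bool):
        self.multi_step = multi_step
        self.decision = None  # e.g. "self", "claude-code", "gemini-flash"

    def decide(self, lane: str) -> None:
        self.decision = lane

    def call_tool(self, tool: str, run):
        if self.multi_step and self.decision is None:
            raise RuntimeError(
                f"refusing {tool}: record a delegation decision first")
        return run()

gate = DelegationGate(multi_step=True)
gate.decide("claude-code")
result = gate.call_tool("read_file", lambda: "contents")
```

The difference from the prompt-level rule is enforcement location: task momentum cannot skip a check that sits in the execution path itself.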

Concepts and Skills

Agent orchestration design · Multi-model routing strategy · Cron-based autonomous workflows · Self-improving system architecture · Human-in-the-loop boundary design · Trust calibration · Session lifecycle hooks · Conversation distillation pipelines · Memory persistence across AI sessions · Delegation architecture design · MCP bridge architecture · SQLite knowledge database design · Semantic vector search (sqlite-vec) · RAG pipeline design · MCP server development · MCP server architecture · Budget-aware compute allocation · Token optimization · Platform migration under constraint · Private mesh networking (Tailscale) · Security audit automation · Discord webhook integration · Designer-user-researcher methodology · Knowledge base architecture · Data Access Layer design · Automated health monitoring · Hermes agent gateway · Hermes agent orchestration · Claude Code CLI integration · Claude.ai integration as peer interface · Multi-provider model integration · Telegram Bot API · MCP server configuration · Remote MCP bridge design (Cloudflare + OAuth 2.1) · Cost-aware architectural design · Systems thinking

Current-state numbers and architecture on this page were re-verified against the live system on April 14, 2026. Historical sections remain intentionally dated where the path itself matters.