Clarence:
Designing an Autonomous AI Collaborator
The Design Question
Most AI tools are responsive. You ask, they answer. The interaction ends and nothing persists except a chat log. That model is useful, but it is not collaboration.
The question I set out to explore: what does it mean to design a genuine collaborator rather than a responsive tool? A collaborator has agency. It acts when you are not watching. It accumulates context over time. It knows your priorities well enough to exercise judgment about what matters without being asked every time.
Clarence is my attempt to answer that question in practice. It is not a chatbot. It is an autonomous system built on top of OpenClaw (an agent orchestration platform) that runs 26 cron jobs across four dependency phases between 11pm and 5am ET, manages a named crew of specialized agents, routes tasks across multiple models based on cost and capability, distills every conversation into durable memory, and writes nightly self-improvement reports that feed into what it does while I sleep.
I am both the designer and the primary user of this system. That dual position is unusual, and worth examining explicitly as part of this case study.
System Architecture
The full stack has five layers. Each layer reflects a specific design decision, not just a technical choice.
All device access runs over a private mesh network. Brain Reader serves the workspace as browsable markdown so James can read any file from an iPhone without a terminal. Telegram brief mode cut per-message context injection from ~10KB to ~150 bytes, a roughly 67x reduction in startup overhead.
OpenClaw schedules and dispatches agent sessions. Twenty-six cron jobs run in a tight overnight window across four dependency phases, each isolated with its own model, context scope, and Telegram delivery target. Bootstrap trimmed from 11 files to 7 (~18KB total), with IDENTITY.md merged into SOUL.md to reduce context load. Five sub-agent workspaces symlinked to the parent workspace so every agent reads from a single source of truth. Session lifecycle hooks auto-load database context and HANDOFF.md on start, then write a fresh handoff note on stop, eliminating cold starts.
A custom Rust bridge translates between the orchestrator's API format and the underlying model providers. Every cron job runs on either Gemini Flash (free via model bridge) or MiniMax via Ollama (free, local). Zero cron jobs run on expensive models. Opus and Gemini Pro are reserved for interactive sessions where model quality changes the output in ways that matter. Model switching is immediate via openclaw models set <model>, no restart required.
A single consolidated SQLite database (clarence.db) holds 2,318 memories, 1,877 entities, and 9,376 facts, shared by all agents through a custom MCP server. Legacy databases archived and retired into this one authoritative store. A conversation distillation pipeline (conversation-distill.py) processes Telegram conversations nightly, extracting decisions, corrections, and preferences into the memory DB automatically. A vector search layer (sqlite-vec + all-MiniLM-L6-v2) runs fully locally with 9,376 fact vectors and 2,372 memory vectors. Agents query by meaning, not just key. Syncs bidirectionally with an Obsidian vault: what James writes, agents can read.
Every session starts warm. The SessionStart hook loads context from the database automatically. The Stop hook writes handoff notes for session continuity. Nightly JSONL rotation archives session files older than 24 hours, keeping the workspace clean without losing history. Memory files trimmed from 106KB to 40KB (62% reduction) so every session starts with less noise and more signal.
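The hook pair can be sketched in a few lines. This is a minimal Python illustration, assuming a hypothetical workspace path and file layout; the real hooks run inside OpenClaw's session lifecycle, not as free functions:

```python
from datetime import datetime, timezone
from pathlib import Path

WORKSPACE = Path("workspace")  # hypothetical root; the real path differs

def on_session_start() -> str:
    """SessionStart hook: load the last handoff note so the session starts warm."""
    handoff = WORKSPACE / "HANDOFF.md"
    return handoff.read_text() if handoff.exists() else "(cold start: no handoff)"

def on_session_stop(summary: str) -> None:
    """Stop hook: write a fresh handoff note for the next session to inherit."""
    WORKSPACE.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    (WORKSPACE / "HANDOFF.md").write_text(f"# Handoff ({stamp})\n\n{summary}\n")
```

The point of the pattern is that state lives in the file, not the session: whichever agent starts next inherits the note regardless of which agent wrote it.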
The Agent Crew
Naming agents was a deliberate choice. Names create identity and accountability. When a named agent produces output, I read it differently than I read output from an anonymous system call. The names also make role boundaries explicit across the codebase and the cron job config.
The delegation architecture follows hard rules: Clarence orchestrates, subagents execute. Every task gets immediate acknowledgment with a plan and time estimate. Subagents spawn via sessions_spawn in OpenClaw and the Agent tool in Claude Code. Three agent types handle different work: Explore subagents for research and discovery, Plan subagents for architecture and strategy, and general-purpose subagents for code and execution. Multiple independent tasks run in parallel. Clarence never blocks on a subagent. It stays available to coordinate while agents run in the background.
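The non-blocking delegation pattern can be illustrated in miniature. This is a thread-based stand-in, not the real sessions_spawn mechanism, and the function names are invented for the sketch:

```python
from concurrent.futures import ThreadPoolExecutor

def delegate(tasks, run_subagent):
    """Acknowledge every task immediately, then run them all in parallel.

    The orchestrator never blocks on a single subagent: acknowledgments go
    out first (the Acknowledge First rule), then work proceeds concurrently.
    """
    acks = [f"ack: {t} (queued)" for t in tasks]
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(run_subagent, tasks))
    return acks, results
```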
Chief of Staff: morning coordination. Compiles yesterday's status, writes the daily brain log, posts a Telegram summary. Coordinates across all projects rather than executing tasks directly.
Scrum Master: task checks and blocker tracking. Queries the knowledge database for active work items, tracks blockers, posts status reports. Only alerts if new blockers appeared. No noise.
Bruno: nightly security audit. Reviews sysops.log, checks gateway health, monitors leash alerts, researches CVEs. Escalates to James only if status is RED.
Marketing Scout: market scans. Runs dual-source search on AI tools, UX research, music tech, and indie builder topics. Tags each finding with its source so divergent results are visible.
Research briefing. Four topics, two search sources each: AI model releases, UX/HCI papers, music tech, MCP ecosystem. Synthesizes across sources and notes where they diverge.
Nightly memory consolidation. Reads daily logs, extracts durable facts, writes to the knowledge database. Only posts to Telegram if new durable facts were added. Keeps the knowledge layer honest.
R&D Council: five-agent nightly debate. Each member holds a fixed lens: market analysis, UX research, technical architecture, product strategy, devil's advocate. Two debate rounds, then Opus synthesizes into an executive memo. Designed to surface disagreement, not consensus.
Autonomous Employee: the overnight shift. Reads the quick-wins queue from the nightly audit, picks the top unchecked task, executes it fully, marks it done, logs output, sends James a one-sentence summary. No user present. No approval loop.
The meta-agent. Reviews system performance, researches new developments, writes improvement proposals, updates WORKING.md, writes the memory bridge for Claude Code, and populates the quick-wins queue for the Autonomous Employee. Self-audit prompt trimmed from 7,582 chars to 1,276 chars without losing signal.
The Overnight Loop
The most consequential design element is what happens between 11pm and 5am ET. Twenty-six cron jobs run inside this window, organized into four dependency phases with deliberate sequencing so downstream jobs can build on upstream output. Zero jobs run on expensive models. They split across Gemini Flash (free via model bridge) and MiniMax (free, local via Ollama). The lightContext flag is enabled on every job to minimize token overhead.
Phase 1: Strategy and Coordination (11:00 PM - 12:00 AM)
R&D Council: five agents debate across fixed lenses, then Opus synthesizes an executive memo. Starts the night with strategic context.
Chief of Staff: compiles yesterday's status, logs session summaries and entity updates. Reads R&D Council output.
Scrum Master: queries task status from the database, flags blockers, writes standup report. Reads Chief of Staff output.
Phase 2: Ingest and Sync (12:00 AM - 2:00 AM)
Sergeant-at-Arms Digest: reviews sysops.log, gateway health, cron statuses. Posts a terse digest only if something needs attention.
Google Drive Coursework Sync: pulls new coursework files from Google Drive into the workspace.
Marketing Scout: six-topic scan across AI, music tech, HCI, and indie hacking. Tags findings for Medium article angles.
Session Rotation: archives JSONL session files older than 24 hours. Keeps the workspace clean without losing history.
Conversation Distillation: processes Telegram conversations from the past 48 hours, extracting decisions, corrections, and preferences into clarence.db. Every conversation becomes memory.
Evening Goals Reminder: surfaces only items requiring James's direct action (terminal, logins, decisions). Filters out everything agents can handle autonomously.
Phase 3: Autonomous Work (2:00 AM - 3:30 AM)
Daily Backup: workspace snapshot before anything mutates the brain files.
Autonomous Employee: reads the quick-wins queue from the previous night's audit, picks the top task, executes it fully, marks it done. Portfolio content, case study drafts, research. No approval required.
Daily Build Journal: reads daily notes, produces a journal entry structured for Medium drafting: hook, accomplishments, interesting parts, thread to pull.
Income Freedom Research: daily scan for freelance UX/AI opportunities, case studies, Pittsburgh-specific leads. Tracks progress toward the independence goal.
Memory Consolidation: extracts durable facts from daily logs into clarence.db. Runs on MiniMax to preserve budget.
Phase 4: Knowledge Layer and Audit (4:00 AM - 5:00 AM)
Obsidian Sync: bidirectional sync between the Obsidian vault and clarence.db. What James writes, agents can read. What agents learn, James can browse.
RAG Embedding Refresh: re-embeds new memories and facts into the vector search layer. Must run after memory consolidation and Obsidian sync.
Research Briefing: dual-source research across four domains, saved to dated files. Ready when James wakes up.
Bruno Security Audit: reviews leash alerts, sysops log, CVE feeds. Writes a status report, escalates only if RED.
Changelog Monitor: checks Anthropic, Gemini, and MiniMax changelogs for new releases. Only alerts James if something changed.
Nightly Self-Audit: reviews system performance, researches ecosystem changes, writes improvement proposals, populates the quick-wins queue for tomorrow's Autonomous Employee. The loop closes here.
The self-improving loop: the nightly audit writes a quick-wins queue. The autonomous employee reads that queue and executes items. The next night's audit reviews what was done, updates the queue, and the cycle repeats. The conversation distillation pipeline means every correction James makes during the day feeds back into the memory layer that night. The system is designed to compound work overnight rather than just report on it.
In practice this loop has a write-only failure mode I have not fully solved: the audit produces excellent proposals, but the execution step does not always pick up the queue file correctly. The proposals accumulate without always turning into action. This is documented and real. The system knows about it. Reporting awareness is not the same as fixing the underlying reliability issue.
Trust Calibration: The Real Design Problem
The hardest design problem in this system is not technical. It is trust calibration. How much autonomy does the system get? When does it act without asking, and when does it wait? These are not engineering questions. They are design questions about human-AI collaboration, and they do not have permanent answers.
The March 25-26 changes made this concrete. Moving all cron jobs to free-tier models was not just a cost decision. It was a trust decision. The system proved that Gemini Flash and MiniMax could handle nightly work reliably enough that Opus budget should be reserved for interactive sessions where I am present and the stakes are higher. Trust in the overnight loop went up. Cost went to zero.
The Acknowledge First rule is another trust calibration. Every task gets immediate acknowledgment with a plan and time estimate before any work begins. This was not always the case. The rule emerged from real friction: tasks would disappear into subagent execution with no signal back to me about whether they were understood correctly. The fix was not more autonomy or less autonomy. It was better communication at the boundary.
The conversation distillation pipeline is the most consequential trust mechanism. Every correction I make in a Telegram conversation is automatically extracted and written to the memory DB that night. The system does not just follow instructions in the moment. It learns from corrections persistently. That means trust can actually increase over time because the system remembers what I care about, not just what I asked for today.
Infrastructure Decisions and Why
Token Budget as a Design Material
Token cost shapes every architectural decision. The March 25-26 optimization pass made this explicit: self-audit prompt trimmed from 7,582 chars to 1,276 chars. Bootstrap files trimmed from 11 to 7 (~18KB total). Memory files reduced from 106KB to 40KB. The lightContext flag enabled on every cron job. The total effect: every session starts with less noise and more signal, and the overnight loop runs at zero model cost.
Model switching is now immediate. openclaw models set <model> changes the active model with no restart. This means routing decisions can be made at the task level rather than the configuration level. The system can adapt to what each job actually needs.
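As an illustration of task-level routing, the policy described above might reduce to something like this. The function and its two axes are invented for the sketch, and the model labels are shorthand, not real configuration values:

```python
def route_model(interactive: bool, high_stakes: bool) -> str:
    """Pick a model per task: paid quality only when a human is present."""
    if interactive:
        return "opus" if high_stakes else "gemini-pro"
    # cron jobs never touch paid models, regardless of stakes
    return "gemini-flash" if high_stakes else "minimax"
```

The decision the sketch encodes is the one the system actually made: model quality is spent where a human is present to benefit from it, and the overnight loop runs free.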
Single Source of Truth Architecture
Five sub-agent workspaces are symlinked to the parent workspace. Every agent reads from the same files. IDENTITY.md was merged into SOUL.md. Legacy databases were archived and retired into a single authoritative store: clarence.db. The pattern is consistent: eliminate duplication, reduce the surface area where state can diverge.
This matters more than it sounds. When multiple agents can write to the same knowledge base but read from different workspace copies, you get silent divergence. Symlinks solved this at the filesystem level without requiring any coordination protocol.
Bridging the Orchestrator and Model Providers
OpenClaw speaks an OpenAI-compatible API. The underlying model providers speak their own protocols. A custom Rust bridge translates between them, making Claude and Gemini available to the orchestrator without separate API key configurations per agent.
This adds a dependency layer that can fail independently of either system. The tradeoff was worth it: deeper integration with the model toolchain, including tool access and session context that direct API calls would not provide in the same form.
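To make the shape of the translation concrete, here is a simplified Python rendition of one direction of the bridge: OpenAI-style chat messages into Gemini-style contents. The real Rust bridge handles far more (system prompts, tools, streaming); this shows only the structural change:

```python
def openai_to_gemini(messages: list[dict]) -> dict:
    """Map OpenAI chat-completion messages to Gemini's request shape."""
    role_map = {"user": "user", "assistant": "model"}
    contents = [
        {"role": role_map[m["role"]], "parts": [{"text": m["content"]}]}
        for m in messages
        if m["role"] in role_map  # system messages travel separately in Gemini
    ]
    return {"contents": contents}
```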
SQLite Knowledge Database + RAG
Long-term memory is stored in a single consolidated SQLite database (clarence.db) with 2,318 memories, 1,877 entities, and 9,376 facts, shared by all agents through a custom MCP server. The schema separates concerns: a profiles table holds identity facts (agent names, user preferences, project constants) with deterministic key lookup. No fuzzy search for things that must be exact. A memories table stores durable knowledge with soft invalidation: when a fact changes, the old record is marked invalid and a new one is written, preserving the audit trail.
The RAG layer is live: 2,372 memory vectors and 9,376 fact vectors with sentence-transformer embeddings (all-MiniLM-L6-v2), running fully locally via sqlite-vec. No separate vector database, no network hop. Agents query the knowledge base by meaning: “what does James think about AI agent UX?” returns the five most relevant records across all tables, regardless of how they were originally tagged.
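The query-by-meaning pattern can be shown without the production stack. This dependency-free stand-in replaces sqlite-vec and the sentence-transformer with plain cosine similarity over toy vectors; only the ranking logic carries over:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, records, k=5):
    """records: list of (text, vector). Return the k most similar texts."""
    ranked = sorted(records, key=lambda r: cosine(query_vec, r[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```

In production the embedding model maps both the query and every stored fact into the same vector space, so "most similar vector" approximates "most relevant record" regardless of how the record was tagged.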
The conversation distillation pipeline (conversation-distill.py) processes Telegram conversations nightly, extracting decisions, corrections, and preferences into the memory DB. This is what makes the memory system feel alive rather than static. James corrects something once in conversation, and it persists. The knowledge base grew from ~170 to 2,318 memories in part because this pipeline captures context that would otherwise evaporate.
An Obsidian vault syncs bidirectionally with the database. New vault notes are picked up by the nightly embedding job automatically. Writing in Obsidian feeds the RAG layer without any additional wiring.
Tailscale for Device Continuity
Clarence runs on an Ubuntu Linux server. James moves between Linux, Mac, iPhone, and iPad. Tailscale creates a private mesh VPN connecting all four, so the Telegram interface and Brain Reader HTTP server are reachable from any device without exposing anything to the public internet. This was a day-one decision and it has been the most friction-free part of the stack.
@ClarenceTheOG: Extending into Public Space
The Twitter/X bot extends the output surface beyond private Telegram and brain files into a public space. This raises the design stakes: errors that stay in a log file are recoverable. Errors that get posted publicly are not. Clarence has standing permission to post, but James retains approval for anything commercially sensitive or identity-critical.
Honest Challenges
A portfolio case study that only shows what worked is a sales document. Here is what is actually hard:
The Write-Only Audit Loop
The self-audit produces detailed, specific proposals every night. The quick-wins queue is populated with concrete autonomous tasks. The autonomous employee is configured to execute them. But the execution-to-proposal ratio is lower than it should be.
The problems are at the execution layer: the autonomous employee sometimes fails to find the queue file, sometimes falls back to its own priority logic, and sometimes the queue itself was not written correctly. The audit system reports on this failure and is, in that sense, self-aware of it. But self-awareness does not fix reliability. This is the most honest statement about the current system.
Budget Constraint as Design Constraint
The entire routing policy exists because running expensive models at scale has real cost. A fleet of cron jobs running nightly would be expensive if all of them used Opus. The March 25-26 pass eliminated expensive models from cron entirely, splitting them across Gemini Flash and MiniMax. But this creates a different kind of debt.
The tradeoff is quality degradation at the mechanical tier. When a task gets routed to a cheaper model, the output quality drops and it is not always obvious why. Budget-aware routing requires continuous calibration, and the calibration is never quite finished.
Memory Growth Without Garbage Collection
The knowledge base grew from ~170 to 2,318 memories. The conversation distillation pipeline accelerated that growth. But more memories does not automatically mean better recall. As the database scales, the vector search returns increasingly similar results, and the signal-to-noise ratio in retrieved context degrades. Memory needs pruning and consolidation, not just accumulation. This is the next hard problem.
Log Noise as a Signal Problem
Three tools were in the agent tool profile but unavailable at runtime, generating dozens of WARN log entries daily. Each warning was individually harmless. Together they buried real errors. Bruno's security audits were scanning logs where signal-to-noise had degraded enough that genuine failures could be missed. The fix was explicit: disable the three tools at the config level rather than leaving them as dead references. This is a documented case of how low-stakes configuration drift compounds into an observability gap.
Spontaneous Task Delegation Breaks Down
The overnight cron loop works because every job has a clear prompt, a fixed model, and a predictable execution path. Spontaneous tasks during live sessions are a different problem. When a cheaper model receives an unstructured real-time request, it sometimes hallucinates tool syntax instead of executing actual tool calls, dumping raw markup into Telegram messages where a human expects a coherent response.
This reveals a gap between scheduled autonomy and reactive autonomy. The system is reliable when it knows what to do in advance. It degrades when asked to improvise with models that lack the reasoning depth to handle ambiguity. The current workaround is routing complex spontaneous work to higher-capability models, but this defeats the cost architecture. The real fix is better prompt scaffolding for spontaneous tasks, not just better models.
Visibility of System Status (Nielsen #1)
This is the most persistent unsolved design problem in the system. Nielsen's first usability heuristic states that a system should always keep users informed about what is going on through appropriate feedback within reasonable time. Clarence violates this consistently.
Twenty-six cron jobs run overnight. Many complete successfully but report “not-delivered” on their Telegram notifications. The Sergeant-at-Arms posts a digest, but only if something needs attention, which means silence is ambiguous: does silence mean everything worked, or that the reporting layer itself failed? When James wakes up, the system status is reconstructed from scattered log files and database queries rather than surfaced through a coherent status interface.
The problem compounds during live sessions. When agents are delegated tasks in parallel, there is no progress indicator, no heartbeat, no way to distinguish “working on it” from “silently failed.” The user is left interpreting silence, which is the opposite of visibility. This is not a logging problem. It is a feedback design problem, and solving it requires treating system status as a first-class UX surface rather than a side effect of Telegram messages.
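One possible shape for that first-class status surface, sketched rather than implemented: every job writes a heartbeat row, so silence becomes distinguishable from running and failed with one query. The table and function names here are hypothetical:

```python
import sqlite3
import time

status_db = sqlite3.connect(":memory:")
status_db.execute(
    "CREATE TABLE heartbeats (job TEXT PRIMARY KEY, state TEXT, updated REAL)")

def beat(job: str, state: str) -> None:
    """Upsert the job's current state with a timestamp."""
    now = time.time()
    status_db.execute(
        "INSERT INTO heartbeats VALUES (?, ?, ?) "
        "ON CONFLICT(job) DO UPDATE SET state = ?, updated = ?",
        (job, state, now, state, now))
    status_db.commit()

def dashboard() -> dict:
    """One coherent view of system status, instead of scattered log files."""
    return {job: state for job, state in
            status_db.execute("SELECT job, state FROM heartbeats")}
```

A job that crashes before its final beat leaves a stale "running" row behind, which is itself a signal: staleness, not silence, becomes the thing to alert on.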
What Has Been Accomplished
- 26 cron jobs organized into four dependency phases (Strategy, Ingest, Autonomous Work, Knowledge/Audit) running 11pm-5am ET, all on free-tier models, delivering Telegram notifications across all devices
- 2,318 memories, 1,877 entities, and 9,376 facts in clarence.db, with conversation distillation pipeline writing new memories nightly from Telegram conversations
- RAG layer live: 9,376 fact vectors + 2,372 memory vectors with sentence-transformer embeddings, fully local, no external vector DB
- Bootstrap trimmed from 11 files to 7 (~18KB), memory files from 106KB to 40KB (62% reduction), self-audit prompt from 7,582 to 1,276 chars
- Five sub-agent workspaces symlinked to parent (single source of truth), IDENTITY.md merged into SOUL.md
- Session lifecycle hooks: SessionStart auto-loads context; Stop hook writes handoff notes. Nightly JSONL rotation archives files older than 24 hours
- Legacy databases archived, single authoritative DB: clarence.db with MCP server exposing 13 tools to all 16 agents
- Bidirectional Obsidian sync. What James writes in his vault, agents can read. New vault notes feed the RAG layer automatically
- Delegation architecture: Acknowledge First rule, three subagent types (Explore, Plan, general-purpose), sessions_spawn in OpenClaw and Agent tool in Claude Code
- Model switching immediate via openclaw models set, lightContext enabled on all cron jobs
- Telegram brief mode: per-message context injection reduced from ~10KB to ~150 bytes (~67x faster startup)
- Brain Reader HTTP server making the workspace searchable from any device on the Tailscale network
- Public Twitter presence (@ClarenceTheOG) with autonomous posting capability
- Daily research briefings covering AI model releases, UX research, music tech, and MCP ecosystem
- Bruno security audit infrastructure monitoring the gateway, leash alerts, and CVE feeds nightly
The Designer-User-Researcher Position
Most UX research involves a separation between the person who designs a system and the person who uses it. That separation is methodologically valuable: it forces the designer to account for contexts and needs they did not anticipate.
Building Clarence collapses that separation. I designed it, I use it every day, and I am continuously researching how it behaves. The advantages are real: I notice friction that an external observer would miss, I can iterate in hours rather than weeks, and my mental model of the system is accurate in ways that matter for debugging.
The risks are also real. I adjust my behavior to work around failures rather than fixing them. I habituate to log noise that a fresh observer would flag immediately. I develop preferences about how the system should behave that may not generalize to anyone else.
The conversation distillation pipeline changed this dynamic. Because every correction I make is now captured and persisted automatically, the system is learning from my behavior as a user in real time. The nightly self-audit partially addresses the external observation gap by creating a perspective on the system that I could not produce from memory alone. When Clarence writes "what I noticed about my own behavior" in the audit report, it generates observations that consistently surprise me as the designer.
The design question has become empirical. I started with a hypothesis about what AI collaboration could look like. I now have a running system that confirms some of it and contradicts the rest. That feedback is what makes this a research project, not just a personal productivity setup.
What Is Next
- Closing the execution loop: the quick-wins queue and autonomous employee need a more reliable handoff. The goal is a self-improving loop that actually closes: research identifies an improvement, the queue captures it, the employee executes it overnight.
- Memory pruning and consolidation: 2,318 memories need active garbage collection. Duplicate facts, superseded preferences, and stale context degrade retrieval quality as the database scales. The next iteration needs to prune as aggressively as it accumulates.
- OWASP Agentic Top 10 integration: the new OWASP threat model for AI agents covers Confused Deputy and Skill-Inject attacks that are directly relevant to a system running multiple cron jobs with file and network access. Bruno's security audit needs these checks.
- Refining the human-in-the-loop boundary: the delegation rules now codify hard principles for how Clarence coordinates and subagents execute. The next iteration is calibrating where autonomous action ends and human approval begins for higher-stakes tasks beyond research and writing.
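The memory-pruning item above could start as small as a near-duplicate pass. A hypothetical sketch (no such job exists yet), comparing embedding vectors and keeping only the first record of each near-duplicate cluster:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def prune(memories, threshold=0.95):
    """memories: list of (text, vector). Drop near-duplicates of earlier records."""
    kept = []
    for text, vec in memories:
        if all(cosine(vec, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((text, vec))
    return kept
```

The threshold is the hard part: too low and distinct preferences collapse into one, too high and the duplicates that degrade retrieval survive. Calibrating it against real recall quality is the actual work.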