Telegram as an Agent Control Plane
if you're running agents, you eventually want to talk to them from your phone. the obvious answer is to build an app. but then you're building push notifications, background sync, offline queuing, voice input, image handling, session management - months of work just to get to the starting line. telegram gives you all of that for free. i've been running it as the primary control surface for my AI agents for months. just a bot, a supergroup, and patterns that turned out to be load-bearing.
here's what actually works, why, and how to build it.
why telegram and not a custom app
the honest answer: i tried building a custom interface first. it was a waste of time. telegram already ships everything you'd spec out:
- cross-platform native app with push notifications, background sync, offline message queue
- bot API with full programmatic control: send, edit, delete, react, reply, media, files
- supergroups with forum topics: independent threaded channels inside one chat
- 30+ message edits per minute - enough for real-time streaming
- rich media: images, voice, documents, video, reactions, inline keyboards
- zero deploy pipeline - update your bot code and it's live
the tradeoff is that telegram wasn't designed to be an agent control plane. making it feel like one requires specific architectural patterns. all of these patterns emerged from actual usage - nothing here is speculative.
architecture
supergroup as workspace, topics as sessions
create a telegram supergroup with forum topics enabled. each topic becomes an independent agent session - its own conversation history, model, system prompt, and persistence rules.
my-agents (supergroup)
├── General → primary agent
├── Coach → domain-specific agent (fitness, etc)
├── Research → deep research sessions
├── Project-X → project-scoped agent
└── ...
why this works: forum topics give you session isolation for free. each topic exposes a message_thread_id via the bot API. map that to a session key in your backend.
thread_id = getattr(message, 'message_thread_id', None)
session_key = str(thread_id) if thread_id else "0"
store per-topic config in JSON: display name, model, grounding prompt, whether to reset daily. some topics are persistent (research you return to over weeks), others reset daily (morning briefing).
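a minimal sketch of that per-topic config and lookup - the field names and topic IDs here are illustrative, not a prescribed schema:

```python
import json

# hypothetical per-topic config; field names are illustrative
TOPICS_JSON = """
{
  "0":    {"name": "General",  "model": "claude-sonnet", "grounding": "primary agent", "reset_daily": false},
  "42":   {"name": "Coach",    "model": "claude-sonnet", "grounding": "fitness coach", "reset_daily": true},
  "1337": {"name": "Research", "model": "claude-opus",   "grounding": "",              "reset_daily": false}
}
"""

TOPICS = json.loads(TOPICS_JSON)

def topic_config(thread_id):
    """Resolve a telegram message_thread_id to (session_key, config).

    Messages outside any topic (no thread id) map to the General topic "0",
    which is also the fallback for unknown topics."""
    key = str(thread_id) if thread_id else "0"
    return key, TOPICS.get(key, TOPICS["0"])
```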
the mental model: firing up a new topic is the same as opening a new terminal and running claude. you get a fresh agent session with its own context. when you're done, delete the topic or just run reset in it next time you want a clean session. i keep a few placeholder topics around for throwaway conversations - same as having a few terminal tabs open.
what to do: create a supergroup, enable forum topics, create one topic per agent or domain. wire your bot to extract message_thread_id and use it as a session key. this is step one and it's immediately useful even without everything else below.
respond first, stream second
this is the single most important UX pattern. when a message comes in:
- immediately send a placeholder "..." (< 200ms) and start inference async
- edit the placeholder every ~2.5s with tool call status
- on completion, replace with the final response
User: "what's my sleep trend?"
Bot: ... ← instant
Bot: 📖 Reading garmin data ✓ ← 2.5s
Bot: 🔍 Searching sleep records... ← 5.0s
Bot: [final response] ← completion
the user never wonders if the bot is alive. they see work happening in real time.
implementation details that matter:
- edit throttle: 2.5s. telegram rate-limits to ~30 edits/min/chat. 2.5s is safe.
- heartbeat: 30s. if no tool calls for 30s (model is thinking), re-send current status.
- watchdog: 1 hour. safety net - cancel if nothing happens for an hour.
- 4096 char limit. telegram max per message. truncate or split.
- status rendering: compact list, max ~8 lines. older entries collapse to ... +N more. each line: emoji + tool name + checkmark when done.
placeholder = await bot.send_message(chat_id, "...")
async for event in agent.stream(prompt):
    tracker.update(event)
    if tracker.should_edit():  # changed + 2.5s elapsed
        await bot.edit_message_text(
            chat_id, placeholder.message_id,
            text=tracker.render()[:4096]
        )
await bot.edit_message_text(
    chat_id, placeholder.message_id,
    text=final_response[:4096]
)
what to do: implement the placeholder-then-edit pattern. even without the full status tracker, just sending "..." immediately and replacing on completion is a massive UX improvement over the bot going silent for 10-30 seconds.
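the tracker referenced in the snippet above could look something like this - a sketch assuming an event shape of `{"tool": ..., "done": bool}`, which is my guess at a reasonable interface, not the exact implementation:

```python
import time

class StatusTracker:
    """Sketch of the status tracker: throttled edits, compact status list,
    older entries collapsed to "... +N more"."""

    def __init__(self, throttle=2.5, max_lines=8):
        self.throttle = throttle
        self.max_lines = max_lines
        self.lines = []        # list of (label, done) tuples
        self.dirty = False
        self.last_edit = 0.0

    def update(self, event):
        # hypothetical event shape: {"tool": str, "done": bool}
        label = f"🔧 {event['tool']}"
        if event.get("done") and self.lines and self.lines[-1][0] == label:
            self.lines[-1] = (label, True)   # mark the in-flight tool finished
        else:
            self.lines.append((label, event.get("done", False)))
        self.dirty = True

    def should_edit(self):
        """True only when something changed AND the throttle window elapsed."""
        if not self.dirty or time.monotonic() - self.last_edit < self.throttle:
            return False
        self.last_edit = time.monotonic()
        self.dirty = False
        return True

    def render(self):
        visible = self.lines[-self.max_lines:]
        hidden = len(self.lines) - len(visible)
        out = [f"... +{hidden} more"] if hidden else []
        out += [f"{label} ✓" if done else f"{label}..." for label, done in visible]
        return "\n".join(out)
```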
voice input
telegram sends voice messages as .ogg files. transcribe and route through the same pipeline as text.
- download voice bytes via bot API
- transcribe with whisper (groq's hosted whisper-large-v3 is sub-second and cheap)
- post-process for name correction (ASR misspells proper nouns)
- route transcript through the normal message handler
the agent doesn't know the input was voice. you can talk to your agents from the car, the gym, walking. in production, voice transcription is sub-second - users can't tell it's not native text.
what to do: add a voice message handler that downloads, transcribes, and feeds into your existing text pipeline. groq whisper is the fastest hosted option. this is high leverage for low effort.
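the funnel can be sketched with the transcriber injected as a function, so groq, openai, or local whisper slot in without touching the routing - all names here are illustrative:

```python
import asyncio

# hypothetical proper-noun corrections; ASR reliably misspells the same names,
# so a simple replacement table goes a long way
NAME_FIXES = {"clod": "claude"}

def correct_names(text, fixes=NAME_FIXES):
    for wrong, right in fixes.items():
        text = text.replace(wrong, right)
    return text

async def handle_voice(ogg_bytes, thread_id, transcribe_fn, handle_text):
    """Transcribe voice bytes and reuse the existing text pipeline.

    transcribe_fn: bytes -> str (e.g. a wrapper around hosted whisper).
    handle_text: the same coroutine that handles typed messages - the agent
    never learns the input was voice."""
    transcript = correct_names(transcribe_fn(ogg_bytes))
    return await handle_text(transcript, thread_id)
```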
images in and out
inbound: when a user sends a photo, download the highest-res version, save to temp, pass the file path to the agent with the caption. modern LLMs read images directly - no OCR, no preprocessing.
outbound: the agent writes images to a known directory (e.g., data/generated_images/). the bot snapshots the directory before inference, diffs after, and sends new files as photos. short response + image = photo with caption (native feel). long response + image = text + separate photo.
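the snapshot/diff trick for outbound images is a few lines of stdlib - the directory name mirrors the example above but is otherwise arbitrary:

```python
from pathlib import Path

IMAGE_DIR = Path("data/generated_images")

def snapshot(directory=IMAGE_DIR):
    """Record which files exist before inference starts."""
    return {p.name for p in directory.glob("*")} if directory.exists() else set()

def new_images(before, directory=IMAGE_DIR):
    """Files that appeared during inference - these get sent as photos."""
    return sorted(snapshot(directory) - before)
```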
reactions and replies as control signals
this is where telegram starts to feel like a real control plane.
reactions as acknowledgment:
- 👍 on a reminder = dismiss it
- reaction patterns feed an adaptive frequency system
- consistently ignored message types automatically back off
reply-to as routing:
register handler types on outgoing messages. when the user replies to a specific message, route to the registered handler, not the general agent.
msg = await bot.send_message(chat_id, "Draft email to Will:\n...")
register_reply_handler(msg.id, "draft_approval", metadata={...})
# user replies "approved" → routes to draft_approval handler
# user replies "change the subject" → same handler, different action
handlers i run in production: draft approval, meeting confirmation, executive cycle answers, human-in-the-loop workflow fulfillment.
why this matters: reply-to gives you targeted interaction without custom UI. the user's reply is contextually bound to the message they're replying to. no modals, no buttons (though inline keyboards work too if you want them).
what to do: start with reply handlers for one workflow - email approval is a good first candidate. the pattern generalizes to any approve/reject/modify flow.
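the registry itself is tiny - a sketch where the handler names and metadata shape are illustrative:

```python
# message_id -> (handler_type, metadata); in production you'd persist this
REPLY_HANDLERS = {}

def register_reply_handler(message_id, handler_type, metadata=None):
    REPLY_HANDLERS[message_id] = (handler_type, metadata or {})

def route_reply(reply_to_message_id, text):
    """Route a reply to its registered handler, or fall through to the agent.

    Returns (handler_type, text, metadata); "agent" means no handler matched."""
    entry = REPLY_HANDLERS.get(reply_to_message_id)
    if entry is None:
        return ("agent", text, {})
    handler_type, metadata = entry
    return (handler_type, text, metadata)
```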
concurrent requests via session forking
what happens when the user sends a new message while a previous request is still running?
the problem: two prompts into the same session simultaneously = corrupted state.
the solution: fork.
No session exists → NEW (create fresh)
Session idle → RESUME (continue existing)
Session has pending → FORK (branch from current)
fork behavior:
1. create a new session branched from current (inherits full history)
2. run inference on the fork
3. fork completes → becomes the new canonical session
4. if the original completes after the fork → marked stale
stale reconciliation: archive stale responses (truncated, max 3 kept). on next resume, inject as parallel context. the agent sees what happened in the background and can incorporate it.
why "most recent fork wins": the user's latest message is their current intent. the fork serving it should be canonical.
user experience: the user doesn't know this is happening. they send messages freely. each gets its own placeholder, status stream, and response. nothing queues - concurrent messages execute immediately on forks. in practice, 2-3 simultaneous requests work well. beyond that, stale context accumulates faster than it's useful.
what to do: this is the hardest pattern to implement. skip it initially. come back to it when you notice users (or yourself) getting frustrated by queued messages. a simpler v1: just queue and process sequentially with a "working on your previous request, yours is next" message.
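the NEW/RESUME/FORK decision from the table above can be sketched like this, with a minimal session shape (a dict with history and a pending flag) standing in for whatever your backend uses:

```python
def decide(sessions, key):
    """Map session state to an action: NEW, RESUME, or FORK."""
    session = sessions.get(key)
    if session is None:
        return "NEW"           # no session exists -> create fresh
    if session.get("pending"):
        return "FORK"          # inference in flight -> branch from current
    return "RESUME"            # idle -> continue existing

def fork(sessions, key):
    """Branch a new session that inherits the parent's full history."""
    parent = sessions[key]
    return {"history": list(parent["history"]), "pending": True, "parent": key}
```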
batch accumulation
people type in bursts. three messages in 5 seconds should be one prompt, not three agent calls.
- debounce: 2s. after each message, wait 2s for more.
- hard cap: 15s. flush after 15s regardless.
- "wait" command: holds the batch open indefinitely for multi-part prompts with photos.
- "stop" command: cancel accumulating batch, or cancel in-flight inference.
different message types merge naturally - text concatenates, photos become file reads, voice inlines as text, album photos with the same media_group_id deduplicate.
what to do: implement the 2s debounce. it's simple and immediately reduces wasted inference calls from burst typing.
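a sketch of the accumulator with timestamps passed in explicitly (easier to reason about than wall-clock calls; the constants match the numbers above):

```python
DEBOUNCE, HARD_CAP = 2.0, 15.0  # seconds

class Batch:
    """Accumulate burst messages into one prompt."""

    def __init__(self, now):
        self.parts = []
        self.opened = now    # for the 15s hard cap
        self.last = now      # for the 2s debounce
        self.held = False    # "wait" command holds the batch open

    def add(self, text, now):
        self.parts.append(text)
        self.last = now

    def should_flush(self, now):
        if self.held:
            return False
        return (now - self.last >= DEBOUNCE) or (now - self.opened >= HARD_CAP)

    def flush(self):
        prompt = "\n".join(self.parts)
        self.parts = []
        return prompt
```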
background agents
for tasks that take minutes or hours - research, report generation, podcast production:
- user: "research X and write a report"
- bot: "on it - running in background"
- separate session spawned with its own key (bg-{topic_id}-{msg_id})
- background agent runs with full tool access
- on completion, results delivered back to the topic
the primary agent can also spawn background agents programmatically via a shell command in its system prompt. delegation without blocking. if you've used backgrounded bash commands in claude code, it's the same concept - kick off heavy work, keep the main session responsive, get notified when it's done.
what to do: implement a basic background spawn - a separate process/task with its own session that posts results back to a topic when done. even a crude version (subprocess + result message) is immediately useful.
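the crude version is a few lines with asyncio - the key scheme matches the bg-{topic_id}-{msg_id} convention above, and send_fn stands in for your bot's send call:

```python
import asyncio

def spawn_background(topic_id, msg_id, work_coro, send_fn):
    """Run heavy work as a background task under its own session key,
    then post the result back to the originating topic."""
    key = f"bg-{topic_id}-{msg_id}"

    async def run():
        result = await work_coro
        await send_fn(topic_id, f"[{key}] done:\n{result}")

    # main session stays responsive; the task reports back when finished
    return key, asyncio.create_task(run())
```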
the session as decision-maker (this is the important one)
everything above is plumbing. this is the pattern that makes it all actually work.
each telegram topic has one persistent agent session. user messages go into it. but so does everything else - background job completions, executive loop reflections, workflow engine callbacks, scheduled events. they all feed into the same session. and the agent decides what to do with them.
this is fundamentally different from a notification relay. the agent isn't forwarding messages to you. it's receiving context from multiple sources, maintaining conversational state across all of them, and making judgment calls about what's worth surfacing.
how it works: external systems (your exec loop, your workflow engine, your background jobs) call an enqueue function with a trigger type, a topic ID, and a message payload. these land in a sqlite queue. an event funnel polls the queue every 5 seconds, batches events per topic with a 5-second debounce, and formats them as structured XML:
<system-events topic="0" time="2026-04-11T10:00:00">
  <event trigger="background_complete" time="..." agent_id="...">
    research on X completed. here are the findings...
  </event>
  <event trigger="stepwise_complete" time="..." job_id="...">
    flow "council review" finished. job abc123.
  </event>
</system-events>
the batched events get appended with "process the above events. communicate relevant ones naturally." and fed into the topic's claude session via the same inference pipeline as user messages - same session key, same conversation history, same fork-and-reconcile if a user message is in flight.
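the enqueue-and-drain funnel can be sketched over sqlite like this - the schema and function names are my assumptions, and the real version polls on a timer with a per-topic debounce rather than draining on demand:

```python
import sqlite3
import time

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE events (
    id INTEGER PRIMARY KEY, topic_id TEXT, trigger TEXT,
    message TEXT, created REAL)""")

def submit_event(trigger, topic_id, message):
    """External systems (exec loop, workflow engine, bg jobs) call this."""
    db.execute(
        "INSERT INTO events (topic_id, trigger, message, created) VALUES (?, ?, ?, ?)",
        (topic_id, trigger, message, time.time()))
    db.commit()

def drain_batches():
    """Group pending events per topic and render them as <system-events> XML."""
    rows = db.execute(
        "SELECT topic_id, trigger, message FROM events ORDER BY id").fetchall()
    batches = {}
    for topic, trigger, message in rows:
        batches.setdefault(topic, []).append(
            f'<event trigger="{trigger}">{message}</event>')
    db.execute("DELETE FROM events")
    db.commit()
    return {topic: f'<system-events topic="{topic}">' + "".join(evs) + "</system-events>"
            for topic, evs in batches.items()}
```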
the delivery convention: the system prompt tells the agent to wrap user-facing output in <telegram-message> tags. text outside the tags is internal reasoning. for user messages, bare text is used as a fallback - the user always gets a response. but for system events, no tags = silence. the agent chose not to surface it.
# in the system prompt:
Wrap content for Telegram in <telegram-message> tags.
Only tagged content gets delivered. Text outside tags is internal.
For system events, only message if actionable. Silence is fine.
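extracting the deliverable text is a small parsing step - a sketch where the user-message fallback and system-event silence follow the convention described above:

```python
import re

TAG = re.compile(r"<telegram-message>(.*?)</telegram-message>", re.DOTALL)

def deliverable(output, is_user_message):
    """Return the text to send to the user, or None for deliberate silence.

    User messages fall back to bare text (the user always gets a response);
    system events with no tags mean the agent chose not to surface them."""
    parts = [m.strip() for m in TAG.findall(output)]
    if parts:
        return "\n\n".join(parts)
    return output.strip() if is_user_message else None
```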
this means a background research job can complete, the agent can read the results, decide "this isn't interesting enough to interrupt him," and stay quiet. or it can read the results, decide they're relevant to something discussed earlier in the conversation, and bring them up naturally with context the user didn't have to provide. the agent has the full conversational history - it knows what you care about right now.
concrete example - executive loop: i run an autonomous executive loop that reflects on my schedule, open tasks, and incoming signals every ~90 minutes. when a cycle completes, it calls submit_event(trigger="exec_reflection_complete", topic_id="0", message=summary). the event hits the funnel, gets batched, and arrives in my primary topic's claude session. the agent reads the reflection, and maybe it tells me "heads up - your 2pm got moved to 3pm and you have a conflict with the dentist." or maybe nothing changed and it stays silent. i don't get spammed with "executive cycle 47 complete, no action items." the agent filters.
concrete example - workflow completions: i use a workflow engine (stepwise) for multi-step jobs - research, podcast production, code analysis. when a job completes, a webhook fires, and the completion payload lands in the telegram session. the agent reads the output and decides how to present it. if i asked for the job, it summarizes the results. if it was a background job i forgot about, it might just note "that research you kicked off yesterday finished - want the summary?" if the job failed, it tells me why without dumping a stack trace.
concrete example - human-in-the-loop: when a workflow step needs human input (approval, clarification, a judgment call), the system sends a stepwise_suspend event to the topic. the agent presents the question conversationally. when i reply, the reply routes back through the agent session - not directly to the workflow engine - so the agent can interpret, add context, or ask a clarifying question before fulfilling the step. the agent mediates between me and the automation.
why this matters: most agent-to-human interfaces are either fully synchronous (you ask, it answers) or fully asynchronous (it sends notifications). this pattern is neither. the agent session is a persistent context window that multiple systems can write into, and the agent acts as an intelligent filter between those systems and you. it's closer to having a chief of staff who reads all your incoming mail and decides what to bring to your attention.
the practical effect: i check telegram and there's a message from my agent that says "the competitive analysis you asked about yesterday is done - the key finding is X, and it's relevant because of that conversation you had with Y last week. full report is at Z." no notification fatigue. no robotic "job complete" messages. the agent speaks in its own voice with full context.
what to do: this is the highest-leverage pattern in the whole system but also the one that requires the most infrastructure. start simple: when a background job completes, instead of sending a canned notification, feed the result into the topic's agent session and let the agent decide how to present it. add the delivery-tag convention so the agent can choose silence. then gradually add more event sources. the more context flows through the session, the better the agent's judgment gets about what matters.
dynamic system prompts per topic
each topic invocation builds its system prompt dynamically:
- topic identity and purpose
- domain-specific grounding (coach topic gets fitness context, etc)
- telegram-specific formatting rules: lowercase, different bold syntax, 4096 limit
- the <telegram-message> delivery convention (described above)
- current timestamp
the grounding prompt is where topics get their personality. my coach topic gets fitness history, training load, and physiological context. my primary topic gets a full identity profile, working memory, and tool access. a research topic might get minimal grounding and just run with a high tool-use budget. same bot, same infrastructure, completely different agents.
direct commands
not everything needs the LLM. certain keywords are intercepted by the bot before they ever hit the agent:
- stop - cancel the in-flight request or accumulating batch. edits the placeholder to "stopped."
- wait - hold the debounce batch open indefinitely. for composing multi-part prompts with photos without the 2s timer flushing early.
- reset - clear the topic's session and start fresh. re-sends the topic's grounding prompt if one is configured.
- model - list available models with current selection marked. model 2 switches and resets the session. this lets you swap between claude code, frontier models, or fast/cheap models on the fly based on task complexity. i use aloop as the backend so any model or runtime is just a number.
- claude - list configured API accounts with usage stats (5-hour and 7-day rate limits, reset times). claude 2 switches accounts. claude save captures current CLI credentials to the active slot. useful when you're hitting rate limits and want to rotate.
no slash prefix, no bot menu - just plain words the bot recognizes and acts on immediately. zero inference cost. the key insight: these are session-level controls, not agent features. the user manages the control plane directly without asking the agent to manage itself.
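the interception itself is a one-liner check before inference - the command names match the list above, the wiring is illustrative:

```python
COMMANDS = {"stop", "wait", "reset", "model", "claude"}

def intercept(text):
    """Return (command, args) for a session-level control word,
    or None so the message flows to the agent as normal."""
    parts = text.strip().lower().split()
    if parts and parts[0] in COMMANDS:
        return parts[0], parts[1:]
    return None
```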
what you need
- telegram bot (messages, edits, media): python-telegram-bot, aiogram, or telethon
- agent backend (LLM inference + tools): claude API, openai, any agent framework
- session store (per-topic persistence): sqlite, redis, or json files
- whisper API (voice transcription): groq (fastest), openai, or local whisper
- task registry (track in-flight requests): in-memory dict with timeout reaper
start with: bot + agent backend + session store. add voice, images, forking, background agents incrementally.
implementation order
if you're starting from zero, this is the order i'd build in:
- bot + supergroup + topics - basic message routing by topic
- placeholder-then-edit - instant feedback loop
- debounce - stop wasting inference on burst typing
- voice - high leverage, low effort
- reply handlers - one workflow (email approval, etc)
- background agents - offload heavy tasks
- event injection + delivery tags - feed background completions into the session, let the agent decide what to surface. this is where the system starts feeling like a control plane instead of a chatbot.
- images - in and out
- session forking - only when queuing becomes painful
- reaction-based frequency adaptation - polish
- multi-source event funnel - exec loops, workflow engines, scheduled events all flowing through the session
each step is independently useful. you don't need all eleven to get value. but step 7 is where the architecture fundamentally changes character - before that you have a chatbot, after that you have a control plane.
production numbers
after months of daily use as my primary agent interface:
- placeholder: < 200ms
- voice transcription: sub-second (groq)
- most responses: 5-30s
- concurrent via forking: 2-3 requests cleanly
- persistent topics: weeks of history across compactions
- failure mode: error replaces placeholder, user replies to retry, no silent failures
the key insight
telegram's constraints - 4096 chars, edit rate limits, no custom UI - actually force good agent UX. concise responses. real-time status. voice-first input. background processing with async delivery.
the phone in your pocket is already a control plane for agents on your server. no app to build. no app store. no react native.
just a bot, a supergroup, and some thoughtful architecture.