Agents Are Not Just Model Calls

Kirk Marple

Dossium Agents Launch Week: Part 5 of 5

A five-part launch series on the Agent Experience layer for B2B work: context and methodology, channels and persona, governed action, research orchestration, and the runtime that makes agents dependable.

A diligence run can take two hours.

A portfolio monitor fires at 7am whether anyone is watching or not.

A worker spawned from a Slack thread has to deliver back to that same Slack thread an hour later, even if the deployment that started it is long gone.

Some runs touch fifty companies in dependent waves. Some decompose into ten workers. Some write to Slack, Notion, Linear, email, and calendar. Some should do nothing because nothing changed.

You cannot build that with setTimeout and hope.

You cannot build it with one cron endpoint and a database row called status.

You cannot build it by wrapping a model call in a queue and calling it an agent.

This is day five of launch week. The first four pieces were about product: context, presence, action, and intelligence. Today is the technical closer.

If you are building agents for real work, the question is not only "which model?"

The question is:

What runtime makes the model safe, durable, interruptible, budgeted, observable, and able to come back where the work started?

Understanding makes an agent accurate.

Runtime makes it dependable.


What Breaks Without Runtime

The first failure mode is obvious: the run times out.

The agent gets halfway through the data room, the function dies, and the useful part never returns.

The second failure mode is worse: duplicate side effects.

A retry fires, the agent has lost track of what already happened, and now it posts the same Slack message twice, files two tickets, or sends a second email that looks just different enough to be embarrassing.

Then come the subtler failures:

  • A worker finishes but has nowhere to deliver because the channel context expired.
  • A scheduled agent spends tokens every morning on "nothing changed."
  • The model loops on the same tool call for thirty minutes.
  • The context window fills with stale tool results.
  • A child worker inherits too much authority from the parent.
  • A run that should have drafted an email sends it.
  • A trigger storm turns into a thousand writes.

These are not model-intelligence problems.

They are runtime problems.


Durable Backbone: Vercel Workflows

Vercel Workflows is the durable backbone for long-running Dossium agent runs.

The important property is checkpointing. A workflow step persists its input, output, and side effects before the next step runs. If the process crashes, deploys, or gets evicted, the workflow resumes from the last completed step.

That is the primitive you need for agent work you do not want to babysit.

Our agentRun workflow is roughly:

  1. Load the agent specification from Graphlit and operational state from Redis.
  2. Increment concurrency counters and enforce org-level run limits.
  3. Resolve user, project, channels, accounts, and execution context.
  4. Run pre-check probes for scheduled agents.
  5. Execute the first segment through the harness.
  6. Decide whether the run is complete, needs continuation, or has emitted a worker/DAG branch.
  7. Persist completion state, metrics, output, and side effects.

Only one of those steps calls the model. The rest is deterministic runtime scaffolding.
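
Here is a minimal sketch of that shape, assuming a hypothetical `step()` helper that persists each result before the next step runs; the real Vercel Workflows API differs in detail, and all helper names here are illustrative.

```typescript
// Hypothetical checkpointing helper: persists the step's result before
// returning, and replays the stored result instead of re-executing on resume.
declare function step<T>(name: string, fn: () => Promise<T>): Promise<T>;

// Illustrative stand-ins for the real platform calls.
declare function loadAgentSpec(runId: string): Promise<{ id: string; orgId: string }>;
declare function acquireRunSlot(orgId: string): Promise<void>;
declare function resolveContext(spec: { id: string }): Promise<unknown>;
declare function runPrecheckProbes(spec: { id: string }): Promise<boolean>;
declare function runHarnessSegment(ctx: unknown): Promise<"complete" | "continue" | "spawned">;
declare function enqueueContinuation(runId: string): Promise<void>;
declare function finalizeRun(runId: string, orgId: string): Promise<void>;

export async function agentRun(runId: string): Promise<void> {
  // 1. Load the agent spec (Graphlit) and operational state (Redis).
  const spec = await step("load-spec", () => loadAgentSpec(runId));

  // 2. Enforce org-level concurrency before doing any real work.
  await step("acquire-slot", () => acquireRunSlot(spec.orgId));

  try {
    // 3-4. Resolve execution context and run pre-check probes.
    const ctx = await step("resolve-context", () => resolveContext(spec));
    const changed = await step("precheck", () => runPrecheckProbes(spec));
    if (!changed) return; // quiet day: short-circuit to complete

    // 5-6. Execute one harness segment, then decide what happens next.
    const outcome = await step("segment", () => runHarnessSegment(ctx));
    if (outcome === "continue") {
      await step("enqueue-continuation", () => enqueueContinuation(runId));
    }
  } finally {
    // 7. Persist completion state, metrics, and output; release the slot.
    await step("finalize", () => finalizeRun(runId, spec.orgId));
  }
}
```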

That distinction matters. A long-running agent is not "one big model call." It is a sequence of checkpointed decisions, tool calls, side effects, summaries, continuations, and delivery steps.

The user-facing version is simple: if we deploy at 7:03 while your 7am Portfolio Monitor is running, the run should still finish.


Scheduling And Fan-Out: Upstash QStash

Upstash QStash handles the HTTP-native scheduling and fan-out layer.

We use it for three jobs.

Cron triggers. Scheduled agents have QStash schedules. QStash fires a signed POST to the run endpoint, handles retries, and keeps wall-clock scheduling out of our application code.

Signed delivery. Inbound trigger and webhook paths verify signatures before doing real work. Anything unsigned gets dropped before it becomes an agent run.

Deferred channel work. When a fast channel handler needs to spawn long-running work, it queues that work through QStash instead of trying to hold the original request open.

QStash fits this workload because the interface is HTTP. A schedule is a POST. A delayed job is a POST. A retry is a POST with a signature. The runtime already speaks that language.
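
Concretely, all three jobs are a few calls against the `@upstash/qstash` SDK. The endpoints and cron expression below are placeholders.

```typescript
import { Client, Receiver } from "@upstash/qstash";

const qstash = new Client({ token: process.env.QSTASH_TOKEN! });

// Cron trigger: QStash fires a signed POST to the run endpoint on schedule.
await qstash.schedules.create({
  destination: "https://example.com/api/agents/run", // placeholder endpoint
  cron: "0 7 * * 1-5", // weekdays at 7am
});

// Deferred channel work: queue the long-running job instead of holding the
// original request open.
await qstash.publishJSON({
  url: "https://example.com/api/agents/worker", // placeholder endpoint
  body: { parentRunId: "run_123" },
});

// Signed delivery: verify the signature before anything becomes an agent run.
const receiver = new Receiver({
  currentSigningKey: process.env.QSTASH_CURRENT_SIGNING_KEY!,
  nextSigningKey: process.env.QSTASH_NEXT_SIGNING_KEY!,
});

export async function handler(req: Request): Promise<Response> {
  const body = await req.text();
  const isValid = await receiver.verify({
    signature: req.headers.get("Upstash-Signature") ?? "",
    body,
  });
  if (!isValid) return new Response("unsigned", { status: 401 }); // dropped
  // ...enqueue or execute the run
  return new Response("ok");
}
```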

Upstash Redis stores operational state: schedule IDs, concurrency counters, rate limits, run metrics, dedupe keys, and cached agent routing. It is not the customer knowledge store.

Graphlit is the source of truth for agents, conversations, content, entities, facts, and specifications.

Redis is operational state.

Customer knowledge lives in the durable context platform. Runtime counters live in the runtime store.

That separation is boring in the best way.


Two Execution Paths

Not every activation should pay the same runtime cost.

A Slack reply should feel fast. A voice answer should come back in seconds. A scheduled deep research run should survive deploys, continue across budgets, and deliver later.

So Dossium agents have two execution paths.

Inline path. Channel handlers use the inline path for fast, single-pass interaction: Slack, Teams, Discord, Telegram, WhatsApp, Google Chat, email, voice, iMessage/SMS, and web chat. Same persona, same context model, same tools, but optimized for latency.

Workflow path. Scheduled, triggered, webhook, manual, continuation, and worker runs use the durable workflow path. This path supports checkpoints, continuations, dependent research plans, worker branches, and long-running output delivery.
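
Illustratively, the routing decision is deterministic and happens before any model call; the names in this sketch are ours for illustration, not a public API.

```typescript
type Activation =
  | { kind: "channel"; channel: "slack" | "teams" | "email" | "voice" } // etc.
  | { kind: "scheduled" | "triggered" | "webhook" | "manual" | "continuation" | "worker" };

// Channel activations take the low-latency inline path; everything else
// takes the durable workflow path.
function choosePath(activation: Activation): "inline" | "workflow" {
  return activation.kind === "channel" ? "inline" : "workflow";
}
```

An inline run can still escalate: it emits a worker request onto the durable path, which is where channel subagent delivery comes in.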

The user should not have to know which path ran.

They should only feel that quick answers are quick, and long jobs come back when they are done.


Channel Subagent Delivery

This is one of the details that determines whether an agent feels like a toy or a teammate.

A user asks a Slack agent:

Run a deeper pass on these five companies and bring back the contradictions before the partner meeting.

The channel handler cannot wait an hour. Slack will time out. The browser tab may close. The function instance will disappear.

But the result still needs to land in the same Slack thread.

So inline execution can emit worker requests with delivery context attached:

  • Slack team, channel, and thread timestamp
  • Email inbox and message IDs
  • SMS or iMessage conversation identity
  • Voice call metadata
  • Telegram or chat thread identifiers

The runtime queues a durable channelSubagentRun workflow. That workflow runs the workers, synthesizes their outputs, extracts the deliverable, then calls the channel-specific delivery adapter with the original context.
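
A sketch of what that delivery context might look like; the field names are illustrative.

```typescript
// Enough identity to deliver back to exactly where the work started.
type DeliveryContext =
  | { channel: "slack"; teamId: string; channelId: string; threadTs: string }
  | { channel: "email"; inboxId: string; messageId: string }
  | { channel: "sms"; conversationId: string }
  | { channel: "voice"; callId: string };

interface ChannelSubagentRequest {
  parentRunId: string;
  workers: { task: string }[]; // the decomposed research tasks
  delivery: DeliveryContext;   // captured before the inline handler exits
}
```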

The result appears where the work started.

Same Slack thread. Same email thread. Same text conversation.

Retries are deduped with a Redis NX key, so the same parent run does not spawn duplicate worker deliveries.
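
That dedupe is one atomic write, sketched here with `@upstash/redis`; the key name and TTL are illustrative.

```typescript
import { Redis } from "@upstash/redis";

const redis = Redis.fromEnv();

// SET ... NX succeeds only for the first caller; a retry sees null and skips.
async function claimDelivery(parentRunId: string): Promise<boolean> {
  const claimed = await redis.set(`delivery:${parentRunId}`, "1", {
    nx: true,
    ex: 60 * 60, // let the claim expire after an hour
  });
  return claimed === "OK";
}
```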

If the agent asks for an hour, it has to know where to come back.


The Harness

The harness is the loop that drives the model through turns:

assistant response -> tool calls -> tool results -> next assistant response -> more tools -> completion

That loop is where long-running agents go wrong if you do not constrain them.

Dossium's harness adds five controls.

1. Stuck detection. We track repeated tool calls, repeated assistant text, consecutive tool errors, and empty assistant turns. First strike, the harness nudges the model to change approach or finish with what it has. Second strike, the run exits with a stuck status and persists what it gathered.

2. Wind-down protocol. Near the end of a budget, the harness reserves final turns for synthesis and task_complete. The goal is no silent failure. Even if the agent runs out of room, it should return the best answer it can.

3. Context-window management. Tool results are the usual culprit. A single retrieval can dump tens of thousands of tokens into the conversation. We use graduated responses: truncate oversized results, ask the model to summarize when the context is getting heavy, then drop old tool rounds when preserving the latest reasoning matters more.

4. Scratchpads. There are two layers. A conversation scratchpad for the current run, and an agent scratchpad for cross-run working memory. The first helps the agent keep a plan. The second helps it learn how it should operate over time.

5. Post-run evaluation. Optional LLM-as-judge scoring can evaluate completeness, quality, efficiency, and issues after a run. Low scores become a signal that the agent spec or skills need work.
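
As a sketch, the stuck detector is bookkeeping over recent turns plus a strike counter; the names and thresholds here are illustrative.

```typescript
interface TurnRecord {
  toolCallSignature: string | null; // tool name + normalized arguments
  assistantText: string;
  toolErrored: boolean;
}

type StuckVerdict = "ok" | "nudge" | "exit";

// Two strikes: the first repeat gets a nudge to change approach; the
// second exits the run with a stuck status.
function detectStuck(history: TurnRecord[], priorStrikes: number): StuckVerdict {
  const last = history.at(-1);
  const prev = history.at(-2);
  if (!last || !prev) return "ok";

  const repeatedTool =
    last.toolCallSignature !== null &&
    last.toolCallSignature === prev.toolCallSignature;
  const repeatedText =
    last.assistantText !== "" && last.assistantText === prev.assistantText;
  const consecutiveErrors = last.toolErrored && prev.toolErrored;
  const emptyTurn = last.assistantText === "" && last.toolCallSignature === null;

  if (!(repeatedTool || repeatedText || consecutiveErrors || emptyTurn)) return "ok";
  return priorStrikes >= 1 ? "exit" : "nudge";
}
```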

The goal is not to make the model obedient by vibes.

The goal is to give the runtime enough control surfaces to notice drift and recover or stop.


Pre-Check Probes

Scheduled agents have a cost problem.

Most days, nothing important changed.

A portfolio monitor that wakes up every weekday should not burn a full run just to say "no significant updates." A customer-health agent should not spend tokens proving the account is quiet.

So scheduled runs start with deterministic pre-check probes before the model wakes up.

The probes check things like:

  • New content since last run
  • New entities
  • Fact deltas
  • Content spikes
  • Entity changes

These are fast Redis or Graphlit queries. The whole pre-check stage is designed to complete in well under a second.

If everything is quiet, the workflow short-circuits to complete.
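
A sketch of that stage, with hypothetical probe helpers standing in for the real Redis and Graphlit queries:

```typescript
// Hypothetical probes; each is a fast Redis or Graphlit lookup.
declare function countContentSince(agentId: string, since: Date): Promise<number>;
declare function countNewEntities(agentId: string, since: Date): Promise<number>;
declare function countFactDeltas(agentId: string, since: Date): Promise<number>;

// True only if something changed. The probes run concurrently so the whole
// stage completes in well under a second.
async function precheck(agentId: string, lastRunAt: Date): Promise<boolean> {
  const [content, entities, facts] = await Promise.all([
    countContentSince(agentId, lastRunAt),
    countNewEntities(agentId, lastRunAt),
    countFactDeltas(agentId, lastRunAt),
  ]);
  return content + entities + facts > 0;
}
```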

Quiet days should be cheap.

Signal days should get the full run.

This matters for economics as much as reliability. Agents that run on schedules need to know when not to run.


Progressive Tool Disclosure

The full tool registry is large.

Putting every tool definition into every prompt is expensive and makes tool selection worse.

So the default runtime exposes a small set of meta-tools:

  • analyze_prompt to route the work
  • search_tools to discover relevant capabilities
  • describe_tools to pull full schemas only when needed
  • execute_tool to invoke discovered tools
  • Harness-level tools like update_scratchpad and task_complete

The rest of the registry sits behind retrieval.

The agent does not need to stare at every possible tool in the company to answer a QBR question. It can discover the small set of tools relevant to the task, inspect them, and execute them.
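
A sketch of that surface; the meta-tool names match the list above, and everything else (descriptions, retrieval) is illustrative.

```typescript
// The only tool definitions present in every prompt (schemas abbreviated).
const metaTools = [
  { name: "analyze_prompt", description: "Route the work and pick a skill." },
  { name: "search_tools", description: "Find tools relevant to a task." },
  { name: "describe_tools", description: "Fetch full schemas for named tools." },
  { name: "execute_tool", description: "Invoke a discovered tool by name." },
  { name: "update_scratchpad", description: "Persist working notes." },
  { name: "task_complete", description: "Finish and return the deliverable." },
];

// The rest of the registry sits behind retrieval, e.g. embedding search
// over tool descriptions (hypothetical helper).
declare function searchRegistry(query: string, limit: number): Promise<string[]>;

async function searchTools(query: string): Promise<string[]> {
  return searchRegistry(query, 5); // names only; schemas come via describe_tools
}
```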

This gives us three benefits:

  • Smaller prompt surface
  • Fewer wrong tool choices
  • A registry that can grow without making every run heavier

For developers, this is one of the most important runtime design choices. Tool bloat is real. Tool retrieval is not optional once the system gets large enough.


Prompt Caching

Anthropic prompt caching is enabled by default.

The architecture is naturally cache-friendly because the prefix is stable:

system prompt -> SOUL.md persona -> matched SKILL.md methodology -> meta-tool definitions

That prefix is mostly identical across turns within a run and often stable across runs for the same agent. On long multi-turn runs, cached input tokens matter.
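
With the Anthropic Messages API, that stable prefix maps onto cacheable blocks, roughly as below. The model name and variables are placeholders; note that the API places tools before system in the cached prefix.

```typescript
import Anthropic from "@anthropic-ai/sdk";

declare const systemPrompt: string; // placeholders for the real prefix pieces
declare const soulMd: string;
declare const skillMd: string;
declare const metaTools: Anthropic.Tool[];
declare const messages: Anthropic.MessageParam[];

const anthropic = new Anthropic();

const response = await anthropic.messages.create({
  model: "claude-sonnet-4-5", // placeholder model
  max_tokens: 4096,
  tools: metaTools, // stable meta-tool definitions
  system: [
    { type: "text", text: systemPrompt },
    { type: "text", text: soulMd }, // SOUL.md persona
    {
      type: "text",
      text: skillMd, // matched SKILL.md methodology
      // Everything up to and including this block (tools included, since
      // tools precede system in the prefix) is cached across turns.
      cache_control: { type: "ephemeral" },
    },
  ],
  messages,
});
```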

This is one reason we care so much about persona and skills as structured context. They are not only product concepts. They create a stable high-attention prefix the runtime can reuse efficiently.

The buyer version: long runs get cheaper and faster.

The builder version: prompt architecture is runtime architecture.


Side-Effect Gating

If an agent can email people, post to Slack, mutate CRM records, file tickets, and spawn workers, containment cannot be a paragraph in the system prompt.

It needs hard layers.

Layer 1: Context-type blocks. Runs execute as interactive, triggered, channel, or subagent contexts. Each context type has tools it simply cannot load. A channel worker should not create new persistent agents. A triggered run should not spawn autonomous loops without bounds.

Layer 2: Per-run budgets. Each run has caps on notifications, mutations, agent creation, and worker spawns. Child workers inherit budgets from the parent, bounded by what the parent has left.

Layer 3: Org-level rate limits. Runs, subagent executions, concurrent work, and write actions are bounded at the org level. A misconfigured trigger or runaway loop hits a wall.

Layer 4: Draft paths. Some actions should produce drafts, not sends. Email is the obvious case. The agent can prepare the work product while the human keeps the final click.
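
A sketch of layer 2, with illustrative defaults; the real caps are configuration.

```typescript
interface RunBudget {
  notifications: number;
  mutations: number;
  agentCreations: number;
  workerSpawns: number;
}

// Illustrative defaults for a spawned worker.
const WORKER_DEFAULTS: RunBudget = {
  notifications: 5,
  mutations: 10,
  agentCreations: 0, // a channel worker cannot create persistent agents
  workerSpawns: 2,
};

// A child never gets more than the parent has left.
function childBudget(parentRemaining: RunBudget): RunBudget {
  return {
    notifications: Math.min(WORKER_DEFAULTS.notifications, parentRemaining.notifications),
    mutations: Math.min(WORKER_DEFAULTS.mutations, parentRemaining.mutations),
    agentCreations: Math.min(WORKER_DEFAULTS.agentCreations, parentRemaining.agentCreations),
    workerSpawns: Math.min(WORKER_DEFAULTS.workerSpawns, parentRemaining.workerSpawns),
  };
}
```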

This is how you let agents act without letting them run wild.

Safety is not one mechanism.

It is a stack.


Why This Is Part Of Graphlit

Dossium is the first product built on this runtime, but the runtime is part of the Graphlit Platform.

That matters because agents and context should not be separate systems.

The runtime needs the context graph to know what changed, what entities exist, what facts were extracted, which conversations matter, and what the agent has already done. The context platform needs the runtime because retrieval alone does not perform work.

Context without runtime is memory.

Runtime without context is automation.

Dossium is where they meet first.

We are not opening this stack as a public API today.

But the architecture is built so Dossium is one application of the runtime, not the only possible one.

Watch this space.


The Series

That is launch week.

Monday: context and methodology.

Tuesday: presence, persona, and skills.

Wednesday: governed action.

Thursday: research intelligence.

Friday: runtime and trust.

None of those pieces are enough alone.

Together, they are what make an agent feel like it actually did the work.

If you want to see what this looks like in production, sign up at dossium.ai and give the agent a job that should take longer than a chat response: a portfolio sweep, diligence pass, QBR prep, or escalation review.

That is where runtime shows up.

Not in the demo.

In the fact that it comes back.

Ready to Build with Graphlit?

Start building AI-powered applications with our API-first platform. The free tier includes 100 credits/month.

No credit card required • 5 minutes to first API call