Temporal Observability for AI Workflows: What to Instrument Beyond Workflow Status

2026-06-29 · 8 min read · Igor Bobriakov

Temporal gives you workflow status by default. Completed, failed, timed out, cancelled — the Temporal UI shows execution state for every workflow run, with the full event history and retry counts per activity. For infrastructure health, that is useful. For AI workflows, it is not enough.

Workflow status tells you whether the workflow completed. It does not tell you whether it produced correct results, stayed within cost budget, or met latency SLAs at the activity level. A Temporal workflow that calls an LLM repeatedly can complete successfully while consuming more token budget than expected, returning outputs that failed downstream quality checks and triggered silent fallbacks, or producing responses that were technically valid but semantically wrong in ways no infrastructure metric captures.

The gap between “workflow completed” and “workflow performed correctly” is where AI-specific instrumentation lives. This post covers what to build in that gap: custom search attributes for fleet-level visibility, activity-level tracing for LLM calls, token tracking across multi-step executions, and alerting patterns that surface quality degradation before users report it.

What Temporal Gives You Out of the Box

The Temporal event history is genuinely useful. For each workflow execution, you get a complete log of every event: WorkflowExecutionStarted, ActivityTaskScheduled, ActivityTaskStarted, ActivityTaskCompleted or ActivityTaskFailed, and the final WorkflowExecutionCompleted or WorkflowExecutionFailed. Each event has a timestamp, so you can reconstruct exact timing for any step.

The Temporal UI exposes this history per execution. Aggregate views such as workflow count by status, workflow duration distributions, schedule-to-start latency, and activity failure rates by activity type come from the metrics that the Temporal server and SDK workers emit, which are typically scraped into Prometheus and charted in a dashboard. These metrics are useful for capacity planning and worker fleet health.

Custom search attributes extend this further. You register them at the namespace level and set them from inside the workflow function. They become queryable via the visibility API — which means you can run fleet-level queries against your workflow executions by any attribute you define.

What Temporal does not provide is content-level visibility: what the LLM returned, how many tokens each call consumed, whether the output passed your quality criteria, or what the model decided at each reasoning step. Those signals require explicit instrumentation at the activity level and structured accumulation in the workflow.

Principle: Temporal's built-in observability is infrastructure observability — it answers "did the execution proceed correctly?" AI observability answers "did the execution produce correct results?" The two questions require different instrumentation, and neither substitutes for the other.

The Four Observability Layers for AI Workflows

| Layer | What It Covers | What It Catches | What It Misses |
| --- | --- | --- | --- |
| Temporal built-in | Workflow status, event history, retry counts, duration, queue latency | Worker failures, task queue saturation, stuck workflows, retry storms | Output quality, token cost, LLM response content, semantic correctness |
| Custom search attributes | Indexed metadata on workflow executions (model name, token totals, quality score, cost) | Fleet-level cost overruns, quality distribution shifts, model version anomalies | Per-activity breakdown, tool call arguments, intermediate reasoning steps |
| Activity-level tracing | Per-LLM-call metrics: tokens in/out, latency, model, prompt fingerprint, quality signal | Expensive individual calls, per-step quality failures, provider latency degradation | Cross-workflow patterns, fleet-level trends, correlated failures across executions |
| AI-specific instrumentation | Tool call arguments and responses, reasoning chain reconstruction, quality scoring, hallucination signals | Wrong tool invocations, argument errors, output correctness failures, silent degradation | Infrastructure health (covered by layers 1–2) |

The four layers are complementary. A production AI workflow needs all four. Teams that instrument only layer 1 know their workflows are completing but cannot diagnose why outputs are wrong. Teams that instrument layers 3 and 4 without layer 2 lose the ability to run fleet-level queries — they can diagnose individual executions but cannot find the pattern across hundreds of runs.

Custom Search Attributes for AI Workflows

Custom search attributes are the bridge between per-execution detail and fleet-level visibility. Register them at the namespace level, then set and update them from inside the workflow function as the execution proceeds.

For AI workflows, the following attributes provide the most diagnostic value:

  • model_name (Keyword) — the primary LLM model used in this workflow execution
  • total_input_tokens (Int) — cumulative input tokens across all LLM activity calls
  • total_output_tokens (Int) — cumulative output tokens across all LLM activity calls
  • workflow_cost_usd (Double) — accumulated cost in USD, computed from token counts and model pricing
  • quality_score (Double) — the final quality evaluation score for the workflow output (0.0–1.0)
  • has_quality_violation (Bool) — true if any activity’s output failed the quality threshold
  • tool_call_count (Int) — total number of tool calls made across all activities
  • fallback_triggered (Bool) — true if the workflow invoked a fallback model or strategy

With these attributes registered, the Temporal visibility API supports queries like:

WorkflowType = "DocumentProcessingWorkflow"
AND quality_score < 0.7
AND ExecutionStatus = "Completed"
AND StartTime > "2026-06-28T00:00:00Z"

This query returns workflows that completed successfully but produced below-threshold quality in the chosen time window — a fleet-level quality degradation signal that infrastructure metrics alone do not produce.
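
A filter like the one above can be assembled in code so that dashboards and alert jobs share one definition. The following is a minimal sketch, assuming the search attributes listed earlier are registered; build_quality_query and its parameters are hypothetical names, not part of the Temporal SDK.

```python
from datetime import datetime, timezone


def build_quality_query(workflow_type: str, max_quality: float, since: datetime) -> str:
    """Build a visibility list filter for completed runs below a quality score."""
    # Render the cutoff as a UTC timestamp string, matching the filter shown above.
    since_utc = since.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    clauses = [
        f'WorkflowType = "{workflow_type}"',
        f"quality_score < {max_quality}",
        'ExecutionStatus = "Completed"',
        f'StartTime > "{since_utc}"',
    ]
    return " AND ".join(clauses)


query = build_quality_query(
    "DocumentProcessingWorkflow",
    0.7,
    datetime(2026, 6, 28, tzinfo=timezone.utc),
)
print(query)
```

With the Python SDK, a string like this would typically be passed as the query argument when listing workflows through the client; confirm the exact call against the SDK version you run.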

Warning: Custom search attributes cannot be set from inside activity functions — only from the workflow function itself. The workflow must receive quality and cost data as return values from activities, then update the search attributes after each activity completes. Attempting to call workflow.upsert_search_attributes() from inside an activity raises a runtime error. Design your activity return types to carry all the data the workflow needs for attribute updates.

Activity-Level Instrumentation for LLM Calls

Each activity that calls an LLM is an instrumentation boundary. The activity receives a prompt and configuration, executes the call, and returns a result. The result must carry more than just the model’s text output — it must carry the execution metadata the workflow needs to accumulate cost, update quality signals, and reconstruct what happened.

The following Python implementation shows a structured approach using Pydantic models for the trace records and an accumulator held in workflow state:

from dataclasses import dataclass, field
from datetime import timedelta
from typing import Optional
import hashlib
import time

from temporalio import activity, workflow

# Third-party imports are passed through the workflow sandbox so they are
# not re-imported on every workflow run.
with workflow.unsafe.imports_passed_through():
    from pydantic import BaseModel


class AIActivityTrace(BaseModel):
    """Structured trace record for a single LLM activity execution."""
    activity_name: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float
    quality_score: Optional[float] = None
    quality_passed: bool = True
    tool_calls_made: int = 0
    prompt_fingerprint: str = ""  # sha256[:8] of the prompt
    fallback_used: bool = False
    error: Optional[str] = None


class WorkflowMetrics(BaseModel):
    """Accumulated metrics across all activities in a workflow execution."""
    total_input_tokens: int = 0
    total_output_tokens: int = 0
    total_cost_usd: float = 0.0
    total_latency_ms: float = 0.0
    activity_count: int = 0
    quality_violations: int = 0
    tool_call_count: int = 0
    fallback_count: int = 0
    min_quality_score: float = 1.0
    model_name: str = ""

    def accumulate(self, trace: AIActivityTrace) -> None:
        """Update accumulated metrics with a new activity trace."""
        self.total_input_tokens += trace.input_tokens
        self.total_output_tokens += trace.output_tokens
        self.total_cost_usd += trace.cost_usd
        self.total_latency_ms += trace.latency_ms
        self.activity_count += 1
        self.tool_call_count += trace.tool_calls_made
        if trace.fallback_used:
            self.fallback_count += 1
        if not trace.quality_passed:
            self.quality_violations += 1
        if trace.quality_score is not None:
            self.min_quality_score = min(self.min_quality_score, trace.quality_score)
        # model_name records the first model seen in this execution
        if not self.model_name:
            self.model_name = trace.model


# Pricing table controlled by your application. Keep this snapshot versioned
# and update it when provider prices or model IDs change.
MODEL_PRICING_USD_PER_1K = {
    "your-fast-model": {"input": 0.0, "output": 0.0},
    "your-strong-model": {"input": 0.0, "output": 0.0},
}


def compute_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    # Unknown models fall back to zero pricing, which silently reports zero cost.
    pricing = MODEL_PRICING_USD_PER_1K.get(model, {"input": 0.0, "output": 0.0})
    return (input_tokens / 1000 * pricing["input"]) + (output_tokens / 1000 * pricing["output"])


# Note: passing Pydantic models through Temporal may require configuring the
# SDK's Pydantic data converter on the client and worker, depending on version.
@dataclass
class LLMActivityResult:
    """Return type for all LLM activities: carries both content and trace."""
    content: str
    trace: AIActivityTrace
    raw_tool_calls: list = field(default_factory=list)


@activity.defn
async def call_llm_with_tracing(
    prompt: str,
    model: str,
    activity_name: str,
    quality_threshold: float = 0.7,
) -> LLMActivityResult:
    """
    LLM activity that emits a full AIActivityTrace alongside its output.
    Quality scoring is a placeholder; replace it with your evaluation logic.
    """
    import openai  # or your preferred client

    prompt_fingerprint = hashlib.sha256(prompt.encode()).hexdigest()[:8]
    start = time.monotonic()
    client = openai.AsyncOpenAI()
    response = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    latency_ms = (time.monotonic() - start) * 1000

    usage = response.usage
    input_tokens = usage.prompt_tokens
    output_tokens = usage.completion_tokens
    content = response.choices[0].message.content or ""

    # Compute quality score; replace with a domain-specific eval
    quality_score = _score_output(content)
    cost_usd = compute_cost(model, input_tokens, output_tokens)

    trace = AIActivityTrace(
        activity_name=activity_name,
        model=model,
        input_tokens=input_tokens,
        output_tokens=output_tokens,
        latency_ms=latency_ms,
        cost_usd=cost_usd,
        quality_score=quality_score,
        quality_passed=quality_score >= quality_threshold,
        prompt_fingerprint=prompt_fingerprint,
    )
    return LLMActivityResult(content=content, trace=trace)


def _score_output(content: str) -> float:
    """
    Placeholder quality scorer. In production, replace with:
    - LLM-as-judge evaluation
    - Regex/heuristic checks for required fields
    - Embedding similarity against reference outputs
    - Task-specific validation (JSON schema, citation presence, etc.)
    """
    if not content or len(content) < 10:
        return 0.0
    return 0.85  # replace with real scoring logic


@workflow.defn
class InstrumentedAIWorkflow:
    """
    Example workflow that accumulates metrics and updates search attributes
    after each LLM activity. Demonstrates the correct instrumentation pattern.
    """

    @workflow.run
    async def run(self, document: str) -> dict:
        metrics = WorkflowMetrics()

        # Step 1: Extract key entities
        result1 = await workflow.execute_activity(
            call_llm_with_tracing,
            args=[
                f"Extract named entities from:\n{document}",
                "your-fast-model",
                "entity_extraction",
                0.8,
            ],
            start_to_close_timeout=timedelta(seconds=30),
        )
        metrics.accumulate(result1.trace)
        await _update_search_attributes(metrics)

        # Step 2: Summarize with full model
        result2 = await workflow.execute_activity(
            call_llm_with_tracing,
            args=[
                f"Summarize the following, incorporating entities {result1.content}:\n{document}",
                "your-strong-model",
                "summarization",
                0.75,
            ],
            start_to_close_timeout=timedelta(seconds=60),
        )
        metrics.accumulate(result2.trace)
        await _update_search_attributes(metrics)

        # Final quality check: the lowest score seen across all activities
        final_quality = metrics.min_quality_score if metrics.activity_count > 0 else 0.0
        return {
            "summary": result2.content,
            "entities": result1.content,
            "metrics": metrics.model_dump(),
            "quality_score": final_quality,
            "quality_passed": metrics.quality_violations == 0,
        }


async def _update_search_attributes(metrics: WorkflowMetrics) -> None:
    """Update Temporal search attributes from accumulated workflow metrics."""
    # Dict-of-lists form; recent SDK versions may prefer typed attribute updates.
    workflow.upsert_search_attributes({
        "total_input_tokens": [metrics.total_input_tokens],
        "total_output_tokens": [metrics.total_output_tokens],
        "workflow_cost_usd": [round(metrics.total_cost_usd, 6)],
        "model_name": [metrics.model_name],
        "tool_call_count": [metrics.tool_call_count],
        "has_quality_violation": [metrics.quality_violations > 0],
        "fallback_triggered": [metrics.fallback_count > 0],
        "quality_score": [round(metrics.min_quality_score, 4)],
    })

The critical pattern here is that _update_search_attributes is called from the workflow function after each activity, not from inside the activity. The WorkflowMetrics accumulator is a plain Pydantic model with no clock reads, randomness, or external calls, so it is deterministic and Temporal’s replay mechanism handles it correctly. Each replay produces the same accumulated state from the recorded activity return values, without re-executing any LLM calls.
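
The replay property can be illustrated without a Temporal server. The sketch below uses simplified stand-ins (RecordedTrace, Totals, and rebuild_totals are hypothetical names) for the trace and accumulator above, and folds the same recorded activity results twice to show that the totals depend only on those results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordedTrace:
    """Stand-in for a trace an activity returned and Temporal recorded in history."""
    input_tokens: int
    output_tokens: int
    cost_usd: float


@dataclass
class Totals:
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0

    def accumulate(self, trace: RecordedTrace) -> None:
        self.input_tokens += trace.input_tokens
        self.output_tokens += trace.output_tokens
        self.cost_usd += trace.cost_usd


def rebuild_totals(history: list[RecordedTrace]) -> Totals:
    """Fold recorded activity results into totals with no clock, I/O, or randomness."""
    totals = Totals()
    for trace in history:
        totals.accumulate(trace)
    return totals


# Made-up recorded results for two LLM activities
history = [RecordedTrace(1200, 300, 0.004), RecordedTrace(5400, 900, 0.031)]
first_run = rebuild_totals(history)
replayed = rebuild_totals(history)  # a replay feeds the same recorded results again
print(first_run == replayed)
```

Anything that breaks this property, such as reading the wall clock or a live pricing API inside the accumulator, would make a replay diverge from the original run.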

Token Tracking Across Workflow Executions

Token tracking has two levels: per-execution (handled by the accumulator above) and fleet-level (handled by querying custom search attributes across all executions).

Per-execution tracking catches cost overruns in individual workflows. Fleet-level tracking catches drift: a gradual increase in average tokens per workflow that indicates prompt bloat, context window expansion, or model behavior change.

To detect fleet-level drift, query the total_input_tokens and total_output_tokens custom attributes across recent executions of a given workflow type and compute a rolling average. A sustained increase in average input tokens without a corresponding change in document length or workflow configuration is a signal that something upstream changed — often a prompt template update that expanded the context, or a retrieval component returning larger chunks.
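
As a minimal sketch of that rolling comparison, the function below takes token totals already read back from search attributes, oldest first, and compares the latest window against the one before it. token_drift_ratio, the sample numbers, and the 1.15 threshold are illustrative assumptions.

```python
from statistics import mean


def token_drift_ratio(token_counts: list[int], window: int) -> float:
    """Ratio of the latest window's mean to the preceding window's mean."""
    if len(token_counts) < 2 * window:
        raise ValueError("need at least two full windows of executions")
    baseline = mean(token_counts[-2 * window:-window])
    recent = mean(token_counts[-window:])
    if baseline == 0:
        raise ValueError("baseline window has no token usage")
    return recent / baseline


# total_input_tokens for recent executions, oldest first (made-up numbers)
input_tokens = [4000, 4100, 3900, 4000, 4800, 5000, 5200, 5000]
ratio = token_drift_ratio(input_tokens, window=4)
drifting = ratio > 1.15  # illustrative threshold; calibrate against your baseline
print(ratio, drifting)
```

A ratio well above 1.0 that persists across several windows is the sustained increase described above; a single spike usually is not.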

Cost alerting follows the same pattern. Set a per-execution cost budget as a workflow-level constant, check the accumulated workflow_cost_usd before scheduling each activity, and return a budget-exhausted result rather than continuing if the budget is exceeded. This is the correct pattern — the check lives in the workflow function and uses deterministic state from activity return values, not from external reads.

from datetime import timedelta

# Replace with your maximum spend per workflow execution. With the 0.0
# placeholder, the guard trips before the first step is scheduled.
COST_BUDGET_USD = 0.0


# Method of the workflow class; _build_step_prompts is a helper you define.
@workflow.run
async def run_with_cost_guard(self, document: str) -> dict:
    metrics = WorkflowMetrics()
    for step_prompt in self._build_step_prompts(document):
        # Check the accumulated cost before scheduling the next LLM call
        if metrics.total_cost_usd >= COST_BUDGET_USD:
            return {
                "status": "budget_exhausted",
                "steps_completed": metrics.activity_count,
                "cost_usd": metrics.total_cost_usd,
                "metrics": metrics.model_dump(),
            }
        result = await workflow.execute_activity(
            call_llm_with_tracing,
            args=[step_prompt, "your-fast-model", f"step_{metrics.activity_count}", 0.7],
            start_to_close_timeout=timedelta(seconds=30),
        )
        metrics.accumulate(result.trace)
        await _update_search_attributes(metrics)
    return {"status": "completed", "metrics": metrics.model_dump()}

What Rising Retry Counts Actually Signal

Temporal reports retry counts per activity in the event history and in aggregate metrics. For standard service calls, high retry counts mean the downstream service is unavailable or slow. For LLM activities, the interpretation is more nuanced.

If retry counts rise without a corresponding rise in hard errors (HTTP 5xx, network timeouts), the most likely cause is not provider availability — it is output validation failure. The LLM is returning outputs that pass the HTTP layer (200 OK) but fail downstream checks that trigger retry logic in the activity or workflow.

This pattern indicates one of several things: a prompt template change that shifted model behavior, a model version update from the provider, or upstream data quality degradation causing the model to produce outputs that fail structured validation.

Principle: Rising retry counts on LLM activities without rising hard error rates can be a leading indicator of output quality degradation, not a trailing indicator of infrastructure failure. Alert on this pattern before user reports become the first signal.
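
One way to turn this principle into a check is to compare retry and hard-error rates between a baseline window and the current window. The sketch below assumes you already collect those counts per activity type; ActivityWindow, classify_retry_signal, and the 1.5x factor are hypothetical choices, not Temporal features.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActivityWindow:
    """Counts for one LLM activity type over one time window."""
    attempts: int
    retries: int
    hard_errors: int  # HTTP 5xx, network timeouts


def classify_retry_signal(baseline: ActivityWindow, current: ActivityWindow,
                          min_increase: float = 1.5) -> str:
    """Label a window as 'normal', 'infrastructure', or 'output_validation'."""
    def rate(count: int, attempts: int) -> float:
        return count / attempts if attempts else 0.0

    retries_up = (rate(current.retries, current.attempts)
                  > rate(baseline.retries, baseline.attempts) * min_increase)
    errors_up = (rate(current.hard_errors, current.attempts)
                 > rate(baseline.hard_errors, baseline.attempts) * min_increase)
    if not retries_up:
        return "normal"
    # Retries rose: hard errors rising too points at infrastructure,
    # otherwise the retries are likely driven by output validation failures.
    return "infrastructure" if errors_up else "output_validation"


baseline = ActivityWindow(attempts=1000, retries=50, hard_errors=10)
current = ActivityWindow(attempts=1000, retries=140, hard_errors=11)
signal = classify_retry_signal(baseline, current)
print(signal)
```

The classification is only a routing hint: an "output_validation" label says where to look first (prompt changes, model version, upstream data), not what the cause is.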

Alerting on Workflow Patterns That Indicate Quality Degradation

The following alerting thresholds are illustrative starting points for Temporal AI workflows. Calibrate them against your own baseline before treating them as production gates.

  • Quality score distribution drops below baseline — query quality_score search attribute across completed executions of the affected workflow type and alert when the rolling median moves below the release threshold.
  • Retry count per activity rises without rising hard error rate — Temporal's default metrics expose retry counts per activity type. A material increase in retries on an LLM activity with no corresponding increase in failed executions is an output validation signal, not just an infrastructure signal.
  • Average tokens per execution increases above baseline — query total_input_tokens and total_output_tokens across executions and alert on sustained increases. This catches prompt bloat and retrieval expansion silently degrading cost efficiency.
  • has_quality_violation rate exceeds the release threshold — if too many workflows set has_quality_violation = true, the quality threshold is being crossed at a rate that warrants investigation, not just monitoring.
  • Workflow duration rises at constant token count — increased duration without increased token count can mean more activity calls, tool iterations, or self-correction loops to reach the same output.
  • workflow_cost_usd exceeds the budget guard — if high-percentile workflow cost exceeds the configured budget, budget guards are not triggering early enough, or the budget constant needs recalibration.
  • fallback_triggered rate rises above baseline — if fallback usage is rising, the primary model or strategy may be degrading and the fallback may be absorbing the load silently.

Calibrate against a representative baseline before setting production alert rules.
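
Three of the rules above (quality median, violation rate, fallback rate) can be evaluated over a window of completed executions in a few lines. This is a sketch under assumptions: ExecutionSummary and evaluate_alerts are hypothetical names, the runs are made-up data, and the thresholds are placeholders to calibrate.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class ExecutionSummary:
    """Search attribute values read back for one completed execution."""
    quality_score: float
    has_quality_violation: bool
    fallback_triggered: bool


def evaluate_alerts(runs: list[ExecutionSummary], min_median_quality: float,
                    max_violation_rate: float, max_fallback_rate: float) -> list[str]:
    """Return the names of the alert rules that fire for this window of runs."""
    if not runs:
        return []
    fired = []
    if median(r.quality_score for r in runs) < min_median_quality:
        fired.append("quality_median_below_threshold")
    violation_rate = sum(r.has_quality_violation for r in runs) / len(runs)
    if violation_rate > max_violation_rate:
        fired.append("quality_violation_rate_high")
    fallback_rate = sum(r.fallback_triggered for r in runs) / len(runs)
    if fallback_rate > max_fallback_rate:
        fired.append("fallback_rate_high")
    return fired


runs = [
    ExecutionSummary(0.90, False, False),
    ExecutionSummary(0.60, True, False),
    ExecutionSummary(0.65, True, True),
    ExecutionSummary(0.85, False, False),
]
fired = evaluate_alerts(runs, min_median_quality=0.8,
                        max_violation_rate=0.25, max_fallback_rate=0.3)
print(fired)
```

Returning rule names rather than booleans keeps the output easy to forward to whatever paging or chat integration the team already uses.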

Connecting Temporal Traces to Distributed Tracing

Temporal’s event history is a closed system: it captures everything within a workflow execution, but it is not itself an OpenTelemetry trace and does not automatically correlate with your application’s distributed trace. Some SDKs ship optional OpenTelemetry interceptors (the Python SDK provides one in temporalio.contrib.opentelemetry), but unless one is configured, connecting a Temporal workflow execution to the HTTP request that triggered it, or to the downstream service calls it makes, requires explicit trace context propagation.

The pattern: when the HTTP request arrives and enqueues a Temporal workflow, extract the current trace context (W3C traceparent header) and pass it as a workflow input field. Inside the workflow, restore the trace context before executing activities. Inside activities, use the restored context to emit spans that are children of the originating HTTP request’s trace.

This connects the Temporal event history to your distributed trace in whatever system you use — Jaeger, Honeycomb, Datadog, or LangSmith for LLM-specific observability. The result is a full trace from HTTP request through workflow scheduling, through each LLM activity call, through downstream tool calls — all correlated by a single trace ID.

Warning: Do not keep live span objects in workflow variables or pass them through workflow inputs and activity arguments. Temporal rebuilds workflow state by replaying the event history, and everything that crosses the workflow boundary is serialized by the data converter (JSON by default), so an in-memory span survives neither path. Propagate trace context as a plain string (the W3C traceparent header value) and reconstruct the span from it at activity execution time.
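
To make the plain-string approach concrete, the sketch below parses a version-00 W3C traceparent value and derives a child value that keeps the trace ID but carries a fresh span ID. child_traceparent is a hypothetical helper; because it generates random bytes, it belongs inside an activity, never in workflow code.

```python
import re
import secrets

# W3C Trace Context, version 00: 32-hex trace id, 16-hex parent span id, 2-hex flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def child_traceparent(parent: str) -> str:
    """Derive a traceparent for a child span from the string carried in workflow input."""
    match = _TRACEPARENT.match(parent)
    if match is None:
        raise ValueError(f"not a version-00 traceparent: {parent!r}")
    trace_id, parent_span_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_span_id == "0" * 16:
        raise ValueError("all-zero trace or span id is invalid")
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return f"00-{trace_id}-{new_span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
child = child_traceparent(incoming)
print(child)
```

In a real system an OpenTelemetry SDK propagator would normally do this parsing and span creation; the point here is only that a short string is all the workflow needs to carry.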

Building the Monitoring Surface

Once activity traces are being emitted and search attributes are updated per execution, the monitoring surface builds out in three directions.

Per-execution debugging. The Temporal UI event history plus the structured AIActivityTrace records per activity let you reconstruct exactly what happened in any failing execution: which prompts were used, what the model returned at each step, where quality thresholds were missed, and what the cumulative cost was at each decision point. This is the level of evidence a production audit of agent behavior needs.

Fleet-level dashboards. Custom search attribute queries support the fleet view: quality score distribution, cost distribution, model name breakdown, fallback rate, quality violation rate — all filterable by workflow type, time range, and model name. This dashboard answers “is the fleet healthy?” as a complement to Temporal’s infrastructure dashboard answering “are workers healthy?”

Alerting pipeline. Temporal exposes workflow execution metrics via Prometheus. You can alert on workflow counts by status, retry counts per activity, and schedule-to-start latency from the built-in metrics endpoint. Quality and cost alerts require querying the custom search attributes via the visibility API on a schedule — a simple cron job or scheduled Temporal workflow that runs the fleet-level queries and fires alerts when thresholds are crossed.

The combination of these three layers is what distinguishes AI monitoring from AI observability in a Temporal-based system. Monitoring tells you the infrastructure is healthy. Observability tells you whether the AI is performing correctly inside that infrastructure — and gives you the evidence to diagnose why when it is not.

The Instrumentation Checklist

  • Register custom search attributes at the Temporal namespace level before deploying instrumented workflows: total_input_tokens, total_output_tokens, workflow_cost_usd, quality_score, has_quality_violation, model_name, tool_call_count, fallback_triggered
  • Return a structured trace record (AIActivityTrace or equivalent) from every LLM activity alongside the content output — never return raw strings from activities that call external models
  • Accumulate trace records in a WorkflowMetrics object in the workflow function; call upsert_search_attributes after each activity completes, not only at workflow end
  • Implement cost guards inside the workflow function by checking accumulated total_cost_usd against a budget constant before scheduling each subsequent activity
  • Propagate distributed trace context as a plain string field in workflow input; reconstruct spans inside activities from that string, not from live span objects in workflow state
  • Alert on rising retry counts without rising hard error rates — this pattern can precede user-visible quality failures
  • Establish baselines for quality score distribution, tokens per execution, and cost per execution before setting production alert thresholds

Frequently Asked Questions

What observability does Temporal provide out of the box for AI workflows?

Temporal's built-in observability covers execution state (running, completed, failed, timed out, cancelled), retry counts and retry history per activity, workflow duration and queue latency, and workflow search attributes that you define. The Temporal UI shows the event history for any workflow execution — every activity schedule, start, and completion event, with timestamps. What Temporal does not provide is anything about the content of those activities: what the LLM returned, how many tokens were consumed, whether the output met quality thresholds, or what the model decided at each reasoning step. That instrumentation must be built explicitly.

How do custom search attributes work in Temporal for AI workflows?

Custom search attributes are indexed metadata fields attached to workflow executions and queryable via the Temporal visibility API. You register them at the namespace level with a type such as Keyword, Text, Int, Double, Bool, or Datetime. Workflows set them through the SDK search-attribute API. For AI workflows, useful search attributes include model_name, total_input_tokens, total_output_tokens, quality_score, workflow_cost_usd, and has_quality_violation. These are queryable via the list workflow API, enabling fleet-wide queries such as completed workflows below a quality threshold for a given model.

What is the correct way to track LLM token costs across a multi-step Temporal workflow?

Track token costs at the activity level and accumulate them in workflow state. Each activity that calls an LLM returns a structured result including input_tokens, output_tokens, and a computed cost_usd based on a pricing snapshot controlled by the application. The workflow accumulates these across all activity executions in a running total and updates search attributes after each activity. This approach works correctly with Temporal's replay semantics — the cost calculation is deterministic from activity return values, so replays produce the same accumulated cost without re-executing LLM calls. Do not compute cost inside the workflow function using external pricing lookups; pricing data must come from activity return values or a deterministic mapping baked into workflow code.
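
A short worked example of the arithmetic, assuming a pricing snapshot expressed in USD per 1,000 tokens. The prices, the version label, and step_cost_usd are made up for illustration and do not reflect any provider's rates.

```python
# Hypothetical pricing snapshot in USD per 1,000 tokens; the figures are made up.
PRICING_SNAPSHOT = {
    "version": "2026-06-example",
    "models": {
        "your-fast-model": {"input": 0.0005, "output": 0.0015},
        "your-strong-model": {"input": 0.0100, "output": 0.0300},
    },
}


def step_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one LLM call under the snapshot; unknown models raise KeyError."""
    price = PRICING_SNAPSHOT["models"][model]
    return input_tokens / 1000 * price["input"] + output_tokens / 1000 * price["output"]


steps = [
    ("your-fast-model", 2000, 400),     # entity extraction
    ("your-strong-model", 3000, 1000),  # summarization
]
# 0.0016 for the fast step plus 0.06 for the strong step
total_cost = sum(step_cost_usd(*step) for step in steps)
print(round(total_cost, 6))
```

Raising on an unknown model is a deliberate choice in this sketch: a lookup that defaults to zero pricing would report a cost of zero and hide the gap from every cost alert downstream.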

What patterns in Temporal workflow history indicate quality degradation before users report it?

Three patterns are useful leading indicators: rising retry counts on LLM activities without a corresponding rise in hard errors, increasing workflow duration at stable token volume, and quality_score search attribute distribution shifting downward across the fleet without a deployment event. Those patterns can indicate prompt drift, model behavior change, upstream data shift, or validation failures that are not visible from workflow status alone.


The next instrumentation problem after this one is deciding what to do when the fleet signals arrive — which degradation patterns warrant immediate rollback, which warrant a shadow evaluation, and which are noise. That decision architecture is where most teams get stuck once the observability layer is working.

The decision rule

If you are running Temporal workflows in production with LLM activities, the gap between workflow status and workflow correctness is where most quality failures hide. Review activity trace design, search attribute coverage, cost accounting, and alerting threshold calibration before the fleet signals arrive. The Enterprise Agentic Assessment Kit can scope the instrumentation gaps in the current system.

About the author

Igor Bobriakov

AI Architect. Author of Production-Ready AI Agents. 15 years deploying production AI platforms and agentic systems for enterprise clients and deep-tech startups.