Temporal gives you workflow status by default. Completed, failed, timed out, cancelled — the Temporal UI shows execution state for every workflow run, with the full event history and retry counts per activity. For infrastructure health, that is useful. For AI workflows, it is not enough.
Workflow status tells you whether the workflow completed. It does not tell you whether it produced correct results, stayed within cost budget, or met latency SLAs at the activity level. A Temporal workflow that calls an LLM repeatedly can complete successfully while consuming more token budget than expected, returning outputs that failed downstream quality checks and triggered silent fallbacks, or producing responses that were technically valid but semantically wrong in ways no infrastructure metric captures.
The gap between “workflow completed” and “workflow performed correctly” is where AI-specific instrumentation lives. This post covers what to build in that gap: custom search attributes for fleet-level visibility, activity-level tracing for LLM calls, token tracking across multi-step executions, and alerting patterns that surface quality degradation before users report it.
What Temporal Gives You Out of the Box
The Temporal event history is genuinely useful. For each workflow execution, you get a complete log of every event: WorkflowExecutionStarted, ActivityTaskScheduled, ActivityTaskStarted, ActivityTaskCompleted or ActivityTaskFailed, and the final WorkflowExecutionCompleted or WorkflowExecutionFailed. Each event has a timestamp, so you can reconstruct exact timing for any step.
The Temporal UI exposes this history per execution and adds aggregate views: workflow count by status, workflow duration distributions, schedule-to-start latency, and activity failure rates by activity type. These are useful metrics for capacity planning and worker fleet health.
Custom search attributes extend this further. You register them at the namespace level and set them from inside the workflow function. They become queryable via the visibility API — which means you can run fleet-level queries against your workflow executions by any attribute you define.
What Temporal does not provide is content-level visibility: what the LLM returned, how many tokens each call consumed, whether the output passed your quality criteria, or what the model decided at each reasoning step. Those signals require explicit instrumentation at the activity level and structured accumulation in the workflow.
The Four Observability Layers for AI Workflows
| Layer | What It Covers | What It Catches | What It Misses |
|---|---|---|---|
| Temporal built-in | Workflow status, event history, retry counts, duration, queue latency | Worker failures, task queue saturation, stuck workflows, retry storms | Output quality, token cost, LLM response content, semantic correctness |
| Custom search attributes | Indexed metadata on workflow executions (model name, token totals, quality score, cost) | Fleet-level cost overruns, quality distribution shifts, model version anomalies | Per-activity breakdown, tool call arguments, intermediate reasoning steps |
| Activity-level tracing | Per-LLM-call metrics: tokens in/out, latency, model, prompt fingerprint, quality signal | Expensive individual calls, per-step quality failures, provider latency degradation | Cross-workflow patterns, fleet-level trends, correlated failures across executions |
| AI-specific instrumentation | Tool call arguments and responses, reasoning chain reconstruction, quality scoring, hallucination signals | Wrong tool invocations, argument errors, output correctness failures, silent degradation | Infrastructure health (covered by layers 1–2) |
The four layers are complementary. A production AI workflow needs all four. Teams that instrument only layer 1 know their workflows are completing but cannot diagnose why outputs are wrong. Teams that instrument layers 3 and 4 without layer 2 lose the ability to run fleet-level queries — they can diagnose individual executions but cannot find the pattern across hundreds of runs.
Custom Search Attributes for AI Workflows
Custom search attributes are the bridge between per-execution detail and fleet-level visibility. Register them at the namespace level, then set and update them from inside the workflow function as the execution proceeds.
For AI workflows, the following attributes provide the most diagnostic value:
- model_name (Keyword): the primary LLM model used in this workflow execution
- total_input_tokens (Int): cumulative input tokens across all LLM activity calls
- total_output_tokens (Int): cumulative output tokens across all LLM activity calls
- workflow_cost_usd (Double): accumulated cost in USD, computed from token counts and model pricing
- quality_score (Double): the final quality evaluation score for the workflow output (0.0–1.0)
- has_quality_violation (Bool): true if any activity’s output failed the quality threshold
- tool_call_count (Int): total number of tool calls made across all activities
- fallback_triggered (Bool): true if the workflow invoked a fallback model or strategy
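As a rough sketch of the registration step, the script below prints one Temporal CLI command per attribute listed above. The `temporal operator search-attribute create` subcommand and its `--namespace`, `--name`, and `--type` flags are assumptions based on the current CLI; check them against your installed version (Temporal Cloud manages search attributes through its own tooling). The `registration_commands` helper and the `default` namespace are illustrative names, not part of any Temporal API.

```python
import shlex

# Attribute names and types copied from the list above.
AI_SEARCH_ATTRIBUTES = {
    "model_name": "Keyword",
    "total_input_tokens": "Int",
    "total_output_tokens": "Int",
    "workflow_cost_usd": "Double",
    "quality_score": "Double",
    "has_quality_violation": "Bool",
    "tool_call_count": "Int",
    "fallback_triggered": "Bool",
}


def registration_commands(namespace: str) -> list[str]:
    """Build one CLI command string per search attribute (hypothetical helper)."""
    return [
        shlex.join([
            "temporal", "operator", "search-attribute", "create",
            "--namespace", namespace,
            "--name", name,
            "--type", attr_type,
        ])
        for name, attr_type in AI_SEARCH_ATTRIBUTES.items()
    ]


if __name__ == "__main__":
    # Print the commands for review instead of executing them directly.
    for command in registration_commands("default"):
        print(command)
```

Printing the commands rather than running them keeps the registration step reviewable, which matters because attribute types are awkward to change after workflows start writing to them.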
With these attributes registered, the Temporal visibility API supports queries like:
```
WorkflowType = "DocumentProcessingWorkflow"
AND quality_score < 0.7
AND ExecutionStatus = "Completed"
AND StartTime > "2026-06-28T00:00:00Z"
```

This query returns workflows that completed successfully but produced below-threshold quality in the chosen time window. That is a fleet-level quality degradation signal that infrastructure metrics alone do not produce.
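One way to run a query like the one above from Python is sketched here. `low_quality_query` is a hypothetical helper that only builds the filter string; `list_low_quality` assumes the `temporalio` package is installed, that a Temporal frontend is reachable at the placeholder address `localhost:7233`, and that the `quality_score` attribute is already registered.

```python
from datetime import datetime, timezone


def low_quality_query(workflow_type: str, max_score: float, since: datetime) -> str:
    """Build a visibility filter for completed runs below a quality threshold."""
    since_utc = since.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return (
        f'WorkflowType = "{workflow_type}" '
        f"AND quality_score < {max_score} "
        f'AND ExecutionStatus = "Completed" '
        f'AND StartTime > "{since_utc}"'
    )


async def list_low_quality(query: str, target: str = "localhost:7233") -> list[str]:
    """Return the IDs of workflow executions matching the filter."""
    # Imported lazily so the query builder stays usable without the SDK.
    from temporalio.client import Client

    client = await Client.connect(target)
    return [execution.id async for execution in client.list_workflows(query)]
```

Keeping the filter construction separate from the client call makes the query easy to unit test and to reuse from a scheduled alerting job.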
Calling workflow.upsert_search_attributes() from inside an activity raises a runtime error. Design your activity return types to carry all the data the workflow needs for attribute updates.

Activity-Level Instrumentation for LLM Calls
Each activity that calls an LLM is an instrumentation boundary. The activity receives a prompt and configuration, executes the call, and returns a result. The result must carry more than just the model’s text output — it must carry the execution metadata the workflow needs to accumulate cost, update quality signals, and reconstruct what happened.
The following Python implementation shows a structured approach that uses Pydantic models for the trace records and updates search attributes from the workflow function:
```python
import time
from dataclasses import dataclass, field
from datetime import timedelta
from typing import Optional

from pydantic import BaseModel
from temporalio import activity, workflow

# Note: passing Pydantic models between activities and workflows assumes the
# client and worker use a Pydantic-aware data converter (recent Python SDK
# versions provide one in temporalio.contrib.pydantic).


class AIActivityTrace(BaseModel):
    """Structured trace record for a single LLM activity execution."""
    activity_name: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float
    quality_score: Optional[float] = None
    quality_passed: bool = True
    tool_calls_made: int = 0
    prompt_fingerprint: str = ""  # sha256[:8] of the rendered prompt
    fallback_used: bool = False
    error: Optional[str] = None


class WorkflowMetrics(BaseModel):
    """Accumulated metrics across all activities in a workflow execution."""
    total_input_tokens: int = 0
    total_output_tokens: int = 0
    total_cost_usd: float = 0.0
    total_latency_ms: float = 0.0
    activity_count: int = 0
    quality_violations: int = 0
    tool_call_count: int = 0
    fallback_count: int = 0
    min_quality_score: float = 1.0
    model_name: str = ""

    def accumulate(self, trace: AIActivityTrace) -> None:
        """Update accumulated metrics with a new activity trace."""
        self.total_input_tokens += trace.input_tokens
        self.total_output_tokens += trace.output_tokens
        self.total_cost_usd += trace.cost_usd
        self.total_latency_ms += trace.latency_ms
        self.activity_count += 1
        self.tool_call_count += trace.tool_calls_made
        if trace.fallback_used:
            self.fallback_count += 1
        if not trace.quality_passed:
            self.quality_violations += 1
        if trace.quality_score is not None:
            self.min_quality_score = min(self.min_quality_score, trace.quality_score)
        if not self.model_name:
            # The first model seen is recorded as the primary model.
            self.model_name = trace.model


# Pricing table controlled by your application. Keep this snapshot versioned
# and update it when provider prices or model IDs change.
MODEL_PRICING_USD_PER_1K = {
    "your-fast-model": {"input": 0.0, "output": 0.0},
    "your-strong-model": {"input": 0.0, "output": 0.0},
}


def compute_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    # Unknown models fall back to zero cost; consider raising instead so a
    # missing pricing entry cannot silently under-report spend.
    pricing = MODEL_PRICING_USD_PER_1K.get(model, {"input": 0.0, "output": 0.0})
    return (input_tokens / 1000 * pricing["input"]) + (output_tokens / 1000 * pricing["output"])


@dataclass
class LLMActivityResult:
    """Return type for all LLM activities: carries both content and trace."""
    content: str
    trace: AIActivityTrace
    raw_tool_calls: list = field(default_factory=list)


@activity.defn
async def call_llm_with_tracing(
    prompt: str,
    model: str,
    activity_name: str,
    quality_threshold: float = 0.7,
) -> LLMActivityResult:
    """
    LLM activity that emits a full AIActivityTrace alongside its output.
    Quality scoring is a placeholder; replace with your evaluation logic.
    """
    import hashlib
    import openai  # or your preferred client

    prompt_fingerprint = hashlib.sha256(prompt.encode()).hexdigest()[:8]
    start = time.monotonic()

    client = openai.AsyncOpenAI()
    response = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )

    latency_ms = (time.monotonic() - start) * 1000
    usage = response.usage
    input_tokens = usage.prompt_tokens
    output_tokens = usage.completion_tokens
    content = response.choices[0].message.content or ""

    # Compute quality score; replace with domain-specific eval
    quality_score = _score_output(content)
    cost_usd = compute_cost(model, input_tokens, output_tokens)

    trace = AIActivityTrace(
        activity_name=activity_name,
        model=model,
        input_tokens=input_tokens,
        output_tokens=output_tokens,
        latency_ms=latency_ms,
        cost_usd=cost_usd,
        quality_score=quality_score,
        quality_passed=quality_score >= quality_threshold,
        prompt_fingerprint=prompt_fingerprint,
    )

    return LLMActivityResult(content=content, trace=trace)


def _score_output(content: str) -> float:
    """
    Placeholder quality scorer. In production, replace with:
    - LLM-as-judge evaluation
    - Regex/heuristic checks for required fields
    - Embedding similarity against reference outputs
    - Task-specific validation (JSON schema, citation presence, etc.)
    """
    if not content or len(content) < 10:
        return 0.0
    return 0.85  # replace with real scoring logic


@workflow.defn
class InstrumentedAIWorkflow:
    """
    Example workflow that accumulates metrics and updates search attributes
    after each LLM activity. Demonstrates the correct instrumentation pattern.
    """

    @workflow.run
    async def run(self, document: str) -> dict:
        metrics = WorkflowMetrics()

        # Step 1: Extract key entities
        result1 = await workflow.execute_activity(
            call_llm_with_tracing,
            args=[
                f"Extract named entities from:\n{document}",
                "your-fast-model",
                "entity_extraction",
                0.8,
            ],
            start_to_close_timeout=timedelta(seconds=30),
        )
        metrics.accumulate(result1.trace)
        await _update_search_attributes(metrics)

        # Step 2: Summarize with full model
        result2 = await workflow.execute_activity(
            call_llm_with_tracing,
            args=[
                f"Summarize the following, incorporating entities {result1.content}:\n{document}",
                "your-strong-model",
                "summarization",
                0.75,
            ],
            start_to_close_timeout=timedelta(seconds=60),
        )
        metrics.accumulate(result2.trace)
        await _update_search_attributes(metrics)

        # Final quality check: the lowest score across all activities
        final_quality = metrics.min_quality_score if metrics.activity_count > 0 else 0.0

        return {
            "summary": result2.content,
            "entities": result1.content,
            "metrics": metrics.model_dump(),
            "final_quality_score": final_quality,
            "quality_passed": metrics.quality_violations == 0,
        }


async def _update_search_attributes(metrics: WorkflowMetrics) -> None:
    """Update Temporal search attributes from accumulated workflow metrics."""
    workflow.upsert_search_attributes({
        "total_input_tokens": [metrics.total_input_tokens],
        "total_output_tokens": [metrics.total_output_tokens],
        "workflow_cost_usd": [round(metrics.total_cost_usd, 6)],
        "model_name": [metrics.model_name],
        "tool_call_count": [metrics.tool_call_count],
        "has_quality_violation": [metrics.quality_violations > 0],
        "fallback_triggered": [metrics.fallback_count > 0],
        "quality_score": [round(metrics.min_quality_score, 4)],
    })
```

The critical pattern here is that _update_search_attributes is called from the workflow function after each activity, not from inside the activity. The WorkflowMetrics accumulator is a plain Pydantic model that makes no external calls, so it is deterministic and Temporal’s replay mechanism handles it correctly. Each replay produces the same accumulated state from the activity return values, without re-executing any LLM calls.
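To make the replay argument concrete, here is a minimal standalone sketch that folds a recorded list of activity results twice and gets identical totals. It uses plain dataclasses rather than the Pydantic models so it runs without extra dependencies. The `Trace`, `Totals`, and `fold` names and the per-1K prices are illustrative assumptions, not real provider pricing.

```python
from dataclasses import dataclass

# Illustrative prices in USD per 1K tokens; not real provider pricing.
PRICE_PER_1K = {"input": 0.5, "output": 1.5}


@dataclass(frozen=True)
class Trace:
    input_tokens: int
    output_tokens: int
    cost_usd: float


@dataclass
class Totals:
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0


def cost(input_tokens: int, output_tokens: int) -> float:
    """Same arithmetic as a per-1K pricing table: tokens / 1000 * price."""
    return (
        input_tokens / 1000 * PRICE_PER_1K["input"]
        + output_tokens / 1000 * PRICE_PER_1K["output"]
    )


def fold(traces: list[Trace]) -> Totals:
    """Accumulate totals purely from recorded activity results."""
    totals = Totals()
    for trace in traces:
        totals.input_tokens += trace.input_tokens
        totals.output_tokens += trace.output_tokens
        totals.cost_usd += trace.cost_usd
    return totals


# Recorded results stand in for the activity completions in event history.
history = [
    Trace(1200, 300, cost(1200, 300)),
    Trace(4000, 800, cost(4000, 800)),
]
first_run = fold(history)
replayed = fold(history)  # a replay re-reads results; it does not call the LLM
```

Because `fold` depends only on its inputs, `first_run` and `replayed` are equal, which is the property the workflow-side accumulator relies on.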
Token Tracking Across Workflow Executions
Token tracking has two levels: per-execution (handled by the accumulator above) and fleet-level (handled by querying custom search attributes across all executions).
Per-execution tracking catches cost overruns in individual workflows. Fleet-level tracking catches drift: a gradual increase in average tokens per workflow that indicates prompt bloat, context window expansion, or model behavior change.
To detect fleet-level drift, query the total_input_tokens and total_output_tokens custom attributes across recent executions of a given workflow type and compute a rolling average. A sustained increase in average input tokens without a corresponding change in document length or workflow configuration is a signal that something upstream changed — often a prompt template update that expanded the context, or a retrieval component returning larger chunks.
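The drift check described above can be sketched as a rolling average compared against a baseline. This version assumes you have already pulled per-execution token totals, ordered by start time, from the visibility API. The function names, the 15% tolerance, and the three-window streak are illustrative defaults to calibrate, not recommendations.

```python
from statistics import mean


def rolling_means(values: list[float], window: int) -> list[float]:
    """Mean of each consecutive window of `window` executions."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [mean(values[i - window:i]) for i in range(window, len(values) + 1)]


def has_sustained_increase(
    values: list[float],
    window: int,
    baseline: float,
    tolerance: float = 0.15,
    min_consecutive: int = 3,
) -> bool:
    """True if the rolling mean stays above baseline * (1 + tolerance)
    for at least `min_consecutive` windows in a row."""
    threshold = baseline * (1 + tolerance)
    streak = 0
    for average in rolling_means(values, window):
        streak = streak + 1 if average > threshold else 0
        if streak >= min_consecutive:
            return True
    return False
```

Requiring several consecutive windows above the threshold filters out single large documents, which would otherwise look like drift.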
Cost alerting follows the same pattern. Set a per-execution cost budget as a workflow-level constant, check the accumulated workflow_cost_usd before scheduling each activity, and return a budget-exhausted result rather than continuing if the budget is exceeded. This is the correct pattern — the check lives in the workflow function and uses deterministic state from activity return values, not from external reads.
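```python
from datetime import timedelta

from temporalio import workflow

# WorkflowMetrics, call_llm_with_tracing, and _update_search_attributes come
# from the instrumentation listing earlier in this post.

COST_BUDGET_USD = 0.0  # replace with your maximum spend per workflow execution
# With the 0.0 placeholder the guard trips before the first step runs.


@workflow.defn
class CostGuardedAIWorkflow:  # illustrative class name
    @workflow.run
    async def run_with_cost_guard(self, document: str) -> dict:
        metrics = WorkflowMetrics()

        # _build_step_prompts is an assumed helper that returns the ordered
        # step prompts; it is not shown here.
        for step_prompt in self._build_step_prompts(document):
            # Check the budget before scheduling each activity.
            if metrics.total_cost_usd >= COST_BUDGET_USD:
                return {
                    "status": "budget_exhausted",
                    "steps_completed": metrics.activity_count,
                    "cost_usd": metrics.total_cost_usd,
                    "metrics": metrics.model_dump(),
                }

            result = await workflow.execute_activity(
                call_llm_with_tracing,
                args=[step_prompt, "your-fast-model", f"step_{metrics.activity_count}", 0.7],
                start_to_close_timeout=timedelta(seconds=30),
            )
            metrics.accumulate(result.trace)
            await _update_search_attributes(metrics)

        return {"status": "completed", "metrics": metrics.model_dump()}
```

What Rising Retry Counts Actually Signal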
Temporal reports retry counts per activity in the event history and in aggregate metrics. For standard service calls, high retry counts mean the downstream service is unavailable or slow. For LLM activities, the interpretation is more nuanced.
If retry counts rise without a corresponding rise in hard errors (HTTP 5xx, network timeouts), the most likely cause is not provider availability — it is output validation failure. The LLM is returning outputs that pass the HTTP layer (200 OK) but fail downstream checks that trigger retry logic in the activity or workflow.
This pattern indicates one of several things: a prompt template change that shifted model behavior, a model version update from the provider, or upstream data quality degradation causing the model to produce outputs that fail structured validation.
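A small sketch of that interpretation rule: compare retry rate and hard-error rate for one LLM activity type across a baseline window and a current window. The `ActivityWindow` type, the label strings, and the 1.5x rise factor are illustrative assumptions; the counts would come from whatever metrics store you already use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActivityWindow:
    attempts: int     # total activity attempts in the window, retries included
    retries: int      # attempts that were retries
    hard_errors: int  # HTTP 5xx responses and network timeouts


def retry_rate(window: ActivityWindow) -> float:
    return window.retries / window.attempts if window.attempts else 0.0


def hard_error_rate(window: ActivityWindow) -> float:
    return window.hard_errors / window.attempts if window.attempts else 0.0


def classify_retry_signal(
    baseline: ActivityWindow,
    current: ActivityWindow,
    rise_factor: float = 1.5,
) -> str:
    """Label a window by whether retries rose with or without hard errors."""
    retries_up = retry_rate(current) > retry_rate(baseline) * rise_factor
    errors_up = hard_error_rate(current) > hard_error_rate(baseline) * rise_factor
    if retries_up and errors_up:
        return "provider_or_infrastructure"
    if retries_up:
        # Retries rose while hard errors stayed flat: likely validation failures.
        return "output_validation"
    return "normal"
```

The "output_validation" label is the case worth routing to whoever owns prompts and evaluation rather than to the infrastructure on-call.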
Alerting on Workflow Patterns That Indicate Quality Degradation
The following alerting thresholds are illustrative starting points for Temporal AI workflows. Calibrate them against your own baseline before treating them as production gates.
- Quality score distribution drops below baseline: query the quality_score search attribute across completed executions of the affected workflow type and alert when the rolling median moves below the release threshold.
- Retry count per activity rises without rising hard error rate: Temporal's default metrics expose retry counts per activity type. A material increase in retries on an LLM activity with no corresponding increase in failed executions is an output validation signal, not just an infrastructure signal.
- Average tokens per execution increases above baseline: query total_input_tokens and total_output_tokens across executions and alert on sustained increases. This catches prompt bloat and retrieval expansion silently degrading cost efficiency.
- has_quality_violation rate exceeds the release threshold: if too many workflows set has_quality_violation = true, the quality threshold is being crossed at a rate that warrants investigation, not just monitoring.
- Workflow duration rises at constant token count: increased duration without increased token count can mean more activity calls, tool iterations, or self-correction loops to reach the same output.
- workflow_cost_usd exceeds the budget guard: if high-percentile workflow cost exceeds the configured budget, budget guards are not triggering early enough, or the budget constant needs recalibration.
- fallback_triggered rate rises above baseline: if fallback usage is rising, the primary model or strategy may be degrading and the fallback may be absorbing the load silently.
Calibrate against a representative baseline before setting production alert rules.
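Three of the checks above (quality median, violation rate, fallback rate) can be sketched as one evaluation over a batch of execution summaries. The `ExecutionSummary` type and the alert names are hypothetical, and the thresholds are parameters because they should come from your own baseline.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class ExecutionSummary:
    """Search-attribute values read back for one completed execution."""
    quality_score: float
    has_quality_violation: bool
    fallback_triggered: bool


def evaluate_alerts(
    executions: list[ExecutionSummary],
    min_median_quality: float,
    max_violation_rate: float,
    max_fallback_rate: float,
) -> list[str]:
    """Return the names of the alert conditions that currently hold."""
    if not executions:
        return []
    count = len(executions)
    alerts = []
    if median(e.quality_score for e in executions) < min_median_quality:
        alerts.append("quality_median_below_threshold")
    if sum(e.has_quality_violation for e in executions) / count > max_violation_rate:
        alerts.append("violation_rate_above_threshold")
    if sum(e.fallback_triggered for e in executions) / count > max_fallback_rate:
        alerts.append("fallback_rate_above_threshold")
    return alerts
```

Returning alert names instead of firing notifications keeps the thresholds testable against historical data before they gate anything.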
Connecting Temporal Traces to Distributed Tracing
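Temporal’s event history is a closed system: it captures everything within a workflow execution, but it is not itself an OpenTelemetry trace and does not automatically correlate with your application’s distributed trace. The SDKs offer optional tracing interceptors (in Python, temporalio.contrib.opentelemetry.TracingInterceptor) that propagate context through Temporal headers. Without one, or when you want explicit control over what is propagated, connecting a workflow execution to the HTTP request that triggered it, or to the downstream service calls it makes, requires explicit trace context propagation.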
The pattern: when the HTTP request arrives and enqueues a Temporal workflow, extract the current trace context (W3C traceparent header) and pass it as a workflow input field. Inside the workflow, restore the trace context before executing activities. Inside activities, use the restored context to emit spans that are children of the originating HTTP request’s trace.
This connects the Temporal event history to your distributed trace in whatever system you use — Jaeger, Honeycomb, Datadog, or LangSmith for LLM-specific observability. The result is a full trace from HTTP request through workflow scheduling, through each LLM activity call, through downstream tool calls — all correlated by a single trace ID.
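The string that travels through the workflow input in this pattern is a W3C traceparent value. The sketch below parses one and derives a child value for an activity span, using only the standard library. `TraceContext`, `parse_traceparent`, and `child_traceparent` are hypothetical helpers; in practice an OpenTelemetry propagator would normally do this work.

```python
import re
import secrets
from dataclasses import dataclass

# Version 00 traceparent: version, 32-hex trace ID, 16-hex parent ID, 2-hex flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


@dataclass(frozen=True)
class TraceContext:
    trace_id: str
    parent_span_id: str
    flags: str


def parse_traceparent(header: str) -> TraceContext:
    """Parse a traceparent string carried in the workflow input."""
    match = _TRACEPARENT.match(header.strip().lower())
    if not match:
        raise ValueError(f"invalid traceparent: {header!r}")
    trace_id, span_id, flags = match.groups()
    if trace_id == "0" * 32 or span_id == "0" * 16:
        raise ValueError("all-zero trace or span ID is not valid")
    return TraceContext(trace_id, span_id, flags)


def child_traceparent(context: TraceContext) -> str:
    """Build a traceparent for a child span: same trace ID, new span ID.

    Call this inside activities only; random IDs are not deterministic
    and must not be generated in workflow code.
    """
    return f"00-{context.trace_id}-{secrets.token_hex(8)}-{context.flags}"
```

Passing the value as a plain string keeps the workflow input serializable and leaves span creation to the activities.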
Building the Monitoring Surface
Once activity traces are being emitted and search attributes are updated per execution, the monitoring surface builds out in three directions.
Per-execution debugging. The Temporal UI event history plus the structured AIActivityTrace records per activity let you reconstruct exactly what happened in any failing execution: which prompts were used, what the model returned at each step, where quality thresholds were missed, and what the cumulative cost was at each decision point. This is what agent observability that triggers a production audit looks like in practice.
Fleet-level dashboards. Custom search attribute queries support the fleet view: quality score distribution, cost distribution, model name breakdown, fallback rate, quality violation rate — all filterable by workflow type, time range, and model name. This dashboard answers “is the fleet healthy?” as a complement to Temporal’s infrastructure dashboard answering “are workers healthy?”
Alerting pipeline. Temporal exposes workflow execution metrics via Prometheus. You can alert on workflow counts by status, retry counts per activity, and schedule-to-start latency from the built-in metrics endpoint. Quality and cost alerts require querying the custom search attributes via the visibility API on a schedule — a simple cron job or scheduled Temporal workflow that runs the fleet-level queries and fires alerts when thresholds are crossed.
The combination of these three layers is what distinguishes AI monitoring from AI observability in a Temporal-based system. Monitoring tells you the infrastructure is healthy. Observability tells you whether the AI is performing correctly inside that infrastructure — and gives you the evidence to diagnose why when it is not.
The Instrumentation Checklist
- Register custom search attributes at the Temporal namespace level before deploying instrumented workflows: total_input_tokens, total_output_tokens, workflow_cost_usd, quality_score, has_quality_violation, model_name, tool_call_count, fallback_triggered
- Return a structured trace record (AIActivityTrace or equivalent) from every LLM activity alongside the content output; never return raw strings from activities that call external models
- Accumulate trace records in a WorkflowMetrics object in the workflow function; call upsert_search_attributes after each activity completes, not only at workflow end
- Implement cost guards inside the workflow function by checking accumulated total_cost_usd against a budget constant before scheduling each subsequent activity
- Propagate distributed trace context as a plain string field in workflow input; reconstruct spans inside activities from that string, not from live span objects in workflow state
- Alert on rising retry counts without rising hard error rates — this pattern can precede user-visible quality failures
- Establish baselines for quality score distribution, tokens per execution, and cost per execution before setting production alert thresholds
Frequently Asked Questions
What observability does Temporal provide out of the box for AI workflows?
Temporal's built-in observability covers execution state (running, completed, failed, timed out, cancelled), retry counts and retry history per activity, workflow duration and queue latency, and workflow search attributes that you define. The Temporal UI shows the event history for any workflow execution — every activity schedule, start, and completion event, with timestamps. What Temporal does not provide is anything about the content of those activities: what the LLM returned, how many tokens were consumed, whether the output met quality thresholds, or what the model decided at each reasoning step. That instrumentation must be built explicitly.
How do custom search attributes work in Temporal for AI workflows?
Custom search attributes are indexed metadata fields attached to workflow executions and queryable via the Temporal visibility API. You register them at the namespace level with a type such as Keyword, Text, Int, Double, Bool, or Datetime. Workflows set them through the SDK search-attribute API. For AI workflows, useful search attributes include model_name, total_input_tokens, total_output_tokens, quality_score, workflow_cost_usd, and has_quality_violation. These are queryable via the list workflow API, enabling fleet-wide queries such as completed workflows below a quality threshold for a given model.
What is the correct way to track LLM token costs across a multi-step Temporal workflow?
Track token costs at the activity level and accumulate them in workflow state. Each activity that calls an LLM returns a structured result including input_tokens, output_tokens, and a computed cost_usd based on a pricing snapshot controlled by the application. The workflow accumulates these across all activity executions in a running total and updates search attributes after each activity. This approach works correctly with Temporal's replay semantics — the cost calculation is deterministic from activity return values, so replays produce the same accumulated cost without re-executing LLM calls. Do not compute cost inside the workflow function using external pricing lookups; pricing data must come from activity return values or a deterministic mapping baked into workflow code.
What patterns in Temporal workflow history indicate quality degradation before users report it?
Three patterns are useful leading indicators: rising retry counts on LLM activities without a corresponding rise in hard errors, increasing workflow duration at stable token volume, and quality_score search attribute distribution shifting downward across the fleet without a deployment event. Those patterns can indicate prompt drift, model behavior change, upstream data shift, or validation failures that are not visible from workflow status alone.
The next instrumentation problem after this one is deciding what to do when the fleet signals arrive — which degradation patterns warrant immediate rollback, which warrant a shadow evaluation, and which are noise. That decision architecture is where most teams get stuck once the observability layer is working.
The decision rule
If you are running Temporal workflows in production with LLM activities, the gap between workflow status and workflow correctness is where most quality failures hide. Review activity trace design, search attribute coverage, cost accounting, and alerting threshold calibration before the fleet signals arrive. The Enterprise Agentic Assessment Kit can scope the instrumentation gaps in the current system.