Temporal Retry Patterns for LLM API Calls

Q: What non-retryable errors should be excluded from Temporal LLM activity retry policies?

Exclude errors that retry cannot fix: 400 Bad Request (malformed prompt or exceeding context limits), 401 Unauthorized (invalid API key), 422 Unprocessable Entity (content policy rejection), and any application-level exception that indicates a logic error in the workflow itself. Retry only on 429 rate limit, 500 internal server error, 503 service unavailable, and network-level timeouts. Retrying on 400 or 422 errors wastes budget and masks bugs that need code fixes, not retry attempts.

Q: How does heartbeat timeout differ from start-to-close timeout for LLM activities?

Start-to-close timeout is the wall-clock budget for a single activity attempt: if the attempt has not completed by this deadline, Temporal times it out and, if the retry policy allows, schedules another attempt. Heartbeat timeout is a liveness check: if the activity does not call activity.heartbeat() within the heartbeat timeout, Temporal assumes the worker is dead or stuck, fails the attempt, and retries it on whichever worker picks the task up next. For long-running LLM calls (streaming responses, large context windows), set a short heartbeat timeout and call heartbeat() after each chunk or processing step. If you only set start-to-close timeout without heartbeats, a dead worker goes unnoticed until the full start-to-close window has elapsed.

Q: What is the dead letter pattern in Temporal, and when should LLM workflows use it?

The dead letter pattern routes activities that exhaust all retry attempts to a separate workflow or queue for inspection rather than silently discarding them. For LLM workflows, this means failed activities are captured with their input, the full retry history, and the final exception — then written to a dead letter queue (a separate Temporal task queue, a database table, or a message queue). Use it when the business cost of a silently failed LLM call is non-trivial: customer-facing content generation, document processing pipelines where every item must be accounted for, and audit-sensitive workflows where dropped requests create compliance gaps.

Q: How should cost caps be implemented without blocking the Temporal workflow thread?

Cost cap state must live in the workflow, not in external shared state or inside activities. Workflow code in Temporal must be deterministic: it cannot make external calls or read mutable external state directly. The correct pattern is to keep cumulative token spend in a workflow variable (exposed through a query if you need to inspect it), update it from each activity's return value, then check the accumulated spend at the top of each workflow loop iteration before scheduling the next activity. If the cap is exceeded, the workflow returns a budget-exhausted result rather than scheduling more activities. Do not implement the cap check inside the activity itself: the activity runs outside the workflow's deterministic context and cannot safely update shared workflow state.

Temporal gives you retry policies per activity. Most teams wire them once and move on. For database calls or HTTP requests, generic retry settings may be close enough. For LLM API calls, inherited defaults can burn budget, miss failures, and leave you with no circuit protection when a provider degrades.

LLM activities have a different failure profile from standard service calls. They are expensive per attempt. They have rate and token quotas that interact with retry logic in non-obvious ways. Their latency can range from fast responses to long waits depending on context size, streaming mode, and provider health. And at the HTTP level, the errors that should not be retried (400 Bad Request, 422 content policy rejection) look superficially similar to the errors that should.

The retry policy you inherit from generic examples is not calibrated for any of that. This post covers the specific configuration decisions that make Temporal activity retry policies work correctly for LLM API calls: retry policy structure, timeout hierarchy, circuit breaking via heartbeat, cost cap implementation, provider failover inside activities, and the dead letter pattern for exhausted requests.

Failure Scenario	Correct Temporal Mechanism	Configuration Point	Common Mistake
Provider rate limit (429)	Retry with exponential backoff	initial_interval, backoff_coefficient, maximum_interval in RetryPolicy	Using a fixed 1s initial interval — floods the provider immediately on retry
Malformed request / context limit exceeded (400)	Non-retryable — fail immediately	non_retryable_error_types in RetryPolicy	Retrying 400s, wasting budget on requests that will never succeed
Worker process dies mid-streaming response	Heartbeat timeout + reschedule	heartbeat_timeout in execute_activity call	No heartbeat configured — dead worker holds task slot for full start-to-close window
Repeated failures driving token spend past budget	Workflow-level cost cap check	Accumulated spend tracked in workflow state, checked before each activity dispatch	Cost cap implemented inside activity — not visible to workflow orchestration logic
Primary provider degraded	Provider failover inside activity	Try primary, catch on failure, attempt secondary before raising to Temporal	Failover at workflow level — burns one full retry attempt per provider before switching
Activity exhausts all retry attempts	Dead letter capture in workflow exception handler	ActivityError catch in workflow run(), write to dead letter sink	No handler — failed activity silently terminates the workflow with no audit trail

Retry Policy Structure for LLM Activities

The Temporal RetryPolicy has five fields that matter for LLM calls: initial_interval, backoff_coefficient, maximum_interval, maximum_attempts, and non_retryable_error_types. Each deserves explicit configuration for LLM workloads.

If initial_interval is too short, a rate-limited activity simply hits the provider again before capacity has recovered. For LLM activities, start with a conservative non-zero interval and tune it against the provider limits your account actually has.

backoff_coefficient controls exponential growth. Exponential backoff is the right shape for rate limits and transient provider failures. Avoid flattening it into near-immediate retry; that pattern produces retry storms under sustained provider load.

maximum_interval caps the backoff ceiling. Set this according to the workflow’s latency budget. Beyond that budget, the workflow should fail visibly rather than wait indefinitely.

maximum_attempts should be a small bounded number for most LLM activities. Too many attempts usually means you are either in a provider outage (handled by circuit breaking, not retry) or have a request that will never succeed (handled by non-retryable error classification).
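
To make the interaction between these four fields concrete, here is a back-of-the-envelope sketch of the wait schedule they produce. The helper name retry_delays is hypothetical and not part of the Temporal SDK; it assumes the usual rule that each retry waits initial_interval multiplied by backoff_coefficient raised to the retry number, capped at maximum_interval, and that maximum_attempts counts the first try.

```python
from datetime import timedelta


def retry_delays(
    initial_interval: timedelta,
    backoff_coefficient: float,
    maximum_interval: timedelta,
    maximum_attempts: int,
) -> list[float]:
    """Return the wait in seconds before each retry, capped at maximum_interval."""
    delays = []
    # maximum_attempts includes the first try, so there are maximum_attempts - 1 retries.
    for retry_number in range(maximum_attempts - 1):
        raw = initial_interval.total_seconds() * (backoff_coefficient ** retry_number)
        delays.append(min(raw, maximum_interval.total_seconds()))
    return delays


# Same values as the LLM retry policy used in this post.
delays = retry_delays(
    initial_interval=timedelta(seconds=5),
    backoff_coefficient=2.0,
    maximum_interval=timedelta(seconds=120),
    maximum_attempts=4,
)
print(delays)       # [5.0, 10.0, 20.0]
print(sum(delays))  # 35.0
```

With these settings the activity spends at most 35 seconds waiting between attempts, so the time a single request can consume is dominated by the per-attempt timeout rather than by backoff. The maximum_interval cap only starts to matter at higher attempt counts.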

The most consequential field is non_retryable_error_types. This is where most teams leave money on the table. LLM providers return 400 for invalid requests, 401 for bad credentials, and 422 for content policy rejections. None of these will resolve on retry. Classifying them as non-retryable fails the activity on the first attempt instead of exhausting the retry budget.

import asyncio
from datetime import timedelta
from dataclasses import dataclass
from typing import Optional

from anthropic import AsyncAnthropic, APIStatusError, APIConnectionError, APITimeoutError
from temporalio import activity, workflow
from temporalio.common import RetryPolicy
from temporalio.exceptions import ActivityError, ApplicationError


# ── Custom exceptions for non-retryable classification ──────────────────────

class LLMBadRequestError(Exception):
    """Raised on 400 — prompt too long, malformed JSON, or invalid parameters."""

class LLMAuthError(Exception):
    """Raised on 401 — bad credentials. Retry cannot recover this."""

class LLMContentPolicyError(Exception):
    """Raised on 422 — content rejected by provider safety filters."""

class LLMBudgetExhaustedError(Exception):
    """Raised when workflow-level token budget is exceeded."""


# ── Activity input/output types ──────────────────────────────────────────────

@dataclass
class LLMCallInput:
    system_prompt: str
    user_message: str
    model: str = "your-production-model-id"
    max_tokens: int = 2048


@dataclass
class LLMCallResult:
    content: str
    input_tokens: int
    output_tokens: int
    model: str
    provider: str


# ── Retry policy for LLM activities ─────────────────────────────────────────

LLM_RETRY_POLICY = RetryPolicy(
    initial_interval=timedelta(seconds=5),
    backoff_coefficient=2.0,
    maximum_interval=timedelta(seconds=120),
    maximum_attempts=4,
    non_retryable_error_types=[
        "LLMBadRequestError",
        "LLMAuthError",
        "LLMContentPolicyError",
        "LLMBudgetExhaustedError",
    ],
)


# ── Activity definition ──────────────────────────────────────────────────────

@activity.defn
async def call_llm_with_failover(inputs: LLMCallInput) -> LLMCallResult:
    """
    Calls Anthropic Claude with automatic failover to a secondary model on
    transient provider failures. Heartbeats during long responses.
    Raises non-retryable exceptions for errors that retry cannot fix.
    """
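    # The Anthropic SDK retries some failures on its own by default; turn that
    # off so Temporal's retry policy is the only retry layer.
    client = AsyncAnthropic(max_retries=0)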

    # Heartbeat immediately so Temporal knows the activity is running
    activity.heartbeat({"status": "starting", "model": inputs.model})

    try:
        # Primary call
        response = await _call_anthropic(client, inputs, activity_heartbeat=True)
        return LLMCallResult(
            content=response.content[0].text,
            input_tokens=response.usage.input_tokens,
            output_tokens=response.usage.output_tokens,
            model=response.model,
            provider="anthropic-primary",
        )

    except APIStatusError as e:
        if e.status_code == 400:
            raise LLMBadRequestError(
                f"Bad request to {inputs.model}: {e.message}"
            ) from e
        if e.status_code == 401:
            raise LLMAuthError("Invalid API key — check credentials") from e
        if e.status_code == 422:
            raise LLMContentPolicyError(
                f"Content policy rejection on model {inputs.model}"
            ) from e
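        if e.status_code == 429:
            # Rate limited: let Temporal retry with backoff
            raise
        if e.status_code < 500:
            # Any other 4xx (403, 404, 413, ...) is treated here as a request-side
            # error that neither a retry nor a secondary model will fix
            raise LLMBadRequestError(
                f"Non-retryable status {e.status_code} from {inputs.model}"
            ) from e

        # 5xx: attempt secondary model before surfacing to Temporal
        activity.heartbeat({"status": "failover", "reason": str(e.status_code)})
        return await _failover_to_secondary(client, inputs)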

    except (APIConnectionError, APITimeoutError):
        # Network-level failures — let Temporal retry
        raise


async def _call_anthropic(
    client: AsyncAnthropic,
    inputs: LLMCallInput,
    activity_heartbeat: bool = False,
) -> object:
    """Wrapper that emits heartbeats during long completions."""

    async def heartbeat_loop() -> None:
        while True:
            await asyncio.sleep(10)
            activity.heartbeat({"status": "in_progress"})

    if activity_heartbeat:
        heartbeat_task = asyncio.create_task(heartbeat_loop())
        try:
            response = await client.messages.create(
                model=inputs.model,
                max_tokens=inputs.max_tokens,
                system=inputs.system_prompt,
                messages=[{"role": "user", "content": inputs.user_message}],
            )
        finally:
            heartbeat_task.cancel()
    else:
        response = await client.messages.create(
            model=inputs.model,
            max_tokens=inputs.max_tokens,
            system=inputs.system_prompt,
            messages=[{"role": "user", "content": inputs.user_message}],
        )

    return response


async def _failover_to_secondary(
    client: AsyncAnthropic, inputs: LLMCallInput
) -> LLMCallResult:
    """Falls back to a secondary model for provider-level 5xx failures."""
    fallback_inputs = LLMCallInput(
        system_prompt=inputs.system_prompt,
        user_message=inputs.user_message,
        model="your-secondary-model-id",
        max_tokens=inputs.max_tokens,
    )
    activity.heartbeat({"status": "secondary_attempt"})
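    # Keep heartbeating during the secondary call so heartbeat_timeout does not fire
    response = await _call_anthropic(client, fallback_inputs, activity_heartbeat=True)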
    return LLMCallResult(
        content=response.content[0].text,
        input_tokens=response.usage.input_tokens,
        output_tokens=response.usage.output_tokens,
        model=response.model,
        provider="anthropic-secondary",
    )

Timeout Hierarchy: Schedule-to-Start, Start-to-Close, Heartbeat

Temporal has three timeout types for activities, and they interact with LLM workloads differently.

schedule_to_start_timeout measures time from when the workflow schedules the activity to when a worker picks it up. If your worker fleet is undersized relative to workflow volume, activities queue. For LLM workflows where each activity has a real cost ceiling, a backed-up queue means you are committing to costs you cannot currently execute. Set schedule_to_start_timeout to a bounded queueing window and treat breaches as a worker capacity signal.

start_to_close_timeout is the full execution budget for the activity. For LLM calls, this must account for realistic worst-case latency: a large context window on a degraded provider can take much longer than a small request. Setting this too low — the most common mistake, inherited from HTTP defaults — causes Temporal to cancel and retry activities that would have succeeded, increasing token spend on the retry attempt.

heartbeat_timeout is the maximum allowed gap between heartbeats. If the activity does not call activity.heartbeat() within this window, Temporal treats the worker as dead or stuck, fails the attempt with a timeout, and retries it (subject to the retry policy) on whichever worker picks the task up next. Set a heartbeat timeout that is shorter than the start-to-close timeout, and heartbeat at an interval comfortably below it during long completions. The code above does this in the heartbeat_loop coroutine, which heartbeats every 10 seconds.

Principle: the correct relationship between these timeouts is: heartbeat_timeout < start_to_close_timeout < workflow execution timeout. Heartbeat catches dead workers before start_to_close fires. Start_to_close caps individual activity duration. Workflow timeout caps total execution including all retries. Most teams configure only start_to_close and miss the liveness guarantee that heartbeat provides.
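
The ordering above is easy to get wrong when timeouts are set in different files. A minimal sketch of a startup check follows; check_timeout_hierarchy is a hypothetical helper, not a Temporal API, and the 2-hour workflow timeout in the usage line is an assumed value for illustration only.

```python
from datetime import timedelta
from typing import Optional


def check_timeout_hierarchy(
    heartbeat_timeout: timedelta,
    start_to_close_timeout: timedelta,
    workflow_execution_timeout: Optional[timedelta] = None,
) -> list[str]:
    """Return a list of problems; an empty list means the ordering holds."""
    problems = []
    if heartbeat_timeout >= start_to_close_timeout:
        problems.append(
            "heartbeat_timeout must be shorter than start_to_close_timeout"
        )
    if (
        workflow_execution_timeout is not None
        and start_to_close_timeout >= workflow_execution_timeout
    ):
        problems.append(
            "start_to_close_timeout must be shorter than the workflow execution timeout"
        )
    return problems


# Activity timeouts used in this post, plus an assumed workflow timeout.
issues = check_timeout_hierarchy(
    heartbeat_timeout=timedelta(seconds=45),
    start_to_close_timeout=timedelta(minutes=8),
    workflow_execution_timeout=timedelta(hours=2),
)
print(issues)  # []
```

Running a check like this in a unit test or at worker startup catches an inverted configuration before it shows up as activities that time out on heartbeat while still making progress.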

Circuit Breaking via Heartbeat and Workflow Signals

Temporal does not have a native circuit breaker primitive. The pattern that works is a workflow-level failure counter updated by activity results, checked before each dispatch.

The circuit opens when consecutive failures on a specific tool or provider cross a threshold. When the circuit is open, the workflow skips scheduling new activities for that provider and either routes to a fallback or returns a degraded result. After a cooldown period, the workflow closes the circuit and allows one probe attempt.
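
The open, cooldown, and probe transitions described above can be isolated as a small state machine that is easy to unit test on its own. This is a sketch under assumptions: the Circuit class and its method names are hypothetical, and the current time is passed in explicitly so that workflow code can supply a deterministic clock value instead of reading the system clock.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Circuit:
    failure_threshold: int = 3
    cooldown_seconds: float = 90.0
    consecutive_failures: int = 0
    open_until: Optional[float] = None  # epoch seconds; None means closed

    def allow(self, now: float) -> bool:
        """True if a call may be dispatched at time `now`."""
        return self.open_until is None or now >= self.open_until

    def record_success(self) -> None:
        """A success (including a successful probe) fully closes the circuit."""
        self.consecutive_failures = 0
        self.open_until = None

    def record_failure(self, now: float) -> None:
        """Count the failure; at or past the threshold, (re)open the circuit."""
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            # A failed probe lands here too, because the counter was never
            # reset, so the circuit re-opens for another cooldown.
            self.open_until = now + self.cooldown_seconds


circuit = Circuit()
for t in (0.0, 1.0, 2.0):
    circuit.record_failure(t)
print(circuit.allow(50.0))  # False: still inside the cooldown
print(circuit.allow(92.0))  # True: cooldown elapsed, probe allowed
```

Unlike a version that resets the failure counter when the cooldown ends, this sketch keeps the counter until a call actually succeeds, so one failed probe is enough to re-open the circuit. In a sequential loop, allow() returning True after the cooldown amounts to a single probe; with parallel dispatch it would need an explicit half-open flag.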

This is different from the retry policy, which handles individual activity retry behavior. The circuit breaker handles the broader question of whether to attempt more calls at all when a pattern of failures indicates the provider is structurally unavailable.

This pattern appears when teams scale document processing workflows past a few hundred concurrent executions. At low volume, individual retry policies absorb provider instability. At higher volume, a degraded provider means hundreds of parallel workflows all retry simultaneously — amplifying the load on a provider that is already struggling. The circuit breaker is what prevents a partial outage from becoming a full one.

@dataclass
class CircuitState:
    consecutive_failures: int = 0
    is_open: bool = False
    open_until: Optional[float] = None  # epoch seconds


@workflow.defn
class DocumentProcessingWorkflow:
    """Processes a list of documents with LLM analysis and cost cap enforcement."""

    @workflow.run
    async def run(
        self,
        document_ids: list[str],
        token_budget: int = 500_000,
    ) -> dict:
        results = {}
        failed_ids = []
        total_tokens_used = 0
        circuit = CircuitState()
        CIRCUIT_FAILURE_THRESHOLD = 3
        CIRCUIT_COOLDOWN_SECONDS = 90

        for doc_id in document_ids:
            # ── Cost cap check ─────────────────────────────────────────────
            if total_tokens_used >= token_budget:
                workflow.logger.warning(
                    f"Token budget exhausted at {total_tokens_used} tokens. "
                    f"Remaining {len(document_ids) - len(results) - len(failed_ids)} "
                    f"documents skipped."
                )
                break

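            # ── Circuit breaker check ──────────────────────────────────────
            # Note: workflow.now() does not advance while this loop runs without
            # awaiting, so an open circuit marks the remaining documents as
            # failed right away. To wait out the cooldown instead, await a
            # workflow timer (asyncio.sleep inside workflow code) before probing.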
            if circuit.is_open:
                now = workflow.now().timestamp()
                if circuit.open_until and now < circuit.open_until:
                    failed_ids.append(doc_id)
                    continue
                # Cooldown elapsed — close circuit and probe
                circuit.is_open = False
                circuit.consecutive_failures = 0

            try:
                result: LLMCallResult = await workflow.execute_activity(
                    call_llm_with_failover,
                    LLMCallInput(
                        system_prompt="Summarize the key findings in this document.",
                        user_message=f"Document ID: {doc_id}",
                    ),
                    start_to_close_timeout=timedelta(minutes=8),
                    heartbeat_timeout=timedelta(seconds=45),
                    retry_policy=LLM_RETRY_POLICY,
                )

                total_tokens_used += result.input_tokens + result.output_tokens
                results[doc_id] = result.content
                circuit.consecutive_failures = 0  # Reset on success

            except ActivityError as e:
                circuit.consecutive_failures += 1
                failed_ids.append(doc_id)

                if circuit.consecutive_failures >= CIRCUIT_FAILURE_THRESHOLD:
                    circuit.is_open = True
                    circuit.open_until = (
                        workflow.now().timestamp() + CIRCUIT_COOLDOWN_SECONDS
                    )
                    workflow.logger.error(
                        f"Circuit opened after {circuit.consecutive_failures} "
                        f"consecutive failures. Cooldown: {CIRCUIT_COOLDOWN_SECONDS}s"
                    )

                # Dead letter: capture failed request with full context
                await workflow.execute_activity(
                    write_to_dead_letter_queue,
                    {
                        "doc_id": doc_id,
                        # ActivityError is a wrapper; the underlying failure is on .cause
                        "error": str(e.cause or e),
                        "tokens_used": total_tokens_used,
                    },
                    start_to_close_timeout=timedelta(seconds=30),
                    retry_policy=RetryPolicy(maximum_attempts=2),
                )

        return {
            "processed": len(results),
            "failed": len(failed_ids),
            "total_tokens_used": total_tokens_used,
            "budget_remaining": max(0, token_budget - total_tokens_used),
            "circuit_state": "open" if circuit.is_open else "closed",
            "results": results,
            "failed_ids": failed_ids,
        }

Cost Cap Implementation

The cost cap pattern shown above tracks total_tokens_used as a workflow-level variable updated by each activity’s return value. This is the correct placement. The cap check runs in the workflow function, which is deterministic and has access to the full execution state.

Two mistakes are common here. The first is implementing the cost cap inside the activity. The activity runs in the worker process, outside the workflow’s deterministic context, so it cannot read or update the workflow’s running total, and parallel activities have no shared view of it. The second mistake is reading token spend from an external database or cache directly in workflow code in order to share it across workflow instances. That read happens outside Temporal’s event log, so a replay can see a different value and take a different branch. If a cross-workflow budget is a hard requirement, fetch the shared total through an activity so the value is recorded in history.

Token spend can be estimated before the call to cover cases where no usage metadata comes back, such as a failed or timed-out request. On success, Anthropic’s Python SDK exposes token counts on the response object (response.usage.input_tokens, response.usage.output_tokens). Return these from the activity and accumulate them in the workflow.

Warning: cost caps based on token counts can underestimate actual spend when activities are retried. The workflow only sees the token counts returned by the attempt that finally succeeds. Any earlier attempt that the provider billed (for example, one that timed out on the Temporal side after the provider had already generated output) never reports its usage to workflow state. Track activity dispatch count as a supplementary signal alongside successful token counts.
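
One way to combine a pre-call estimate, confirmed usage, and a dispatch counter is sketched below. BudgetTracker and estimate_tokens are hypothetical names, and the 4-characters-per-token ratio is a rough assumption rather than a property of any particular model; substitute a real tokenizer count where one is available.

```python
from dataclasses import dataclass


@dataclass
class BudgetTracker:
    token_budget: int
    confirmed_tokens: int = 0  # usage reported by successful activities
    dispatches: int = 0        # every activity scheduled, successful or not

    def can_dispatch(self, estimated_tokens: int) -> bool:
        """Check the cap before scheduling, using a pre-call estimate."""
        return self.confirmed_tokens + estimated_tokens <= self.token_budget

    def record_dispatch(self) -> None:
        self.dispatches += 1

    def record_success(self, input_tokens: int, output_tokens: int) -> None:
        self.confirmed_tokens += input_tokens + output_tokens


def estimate_tokens(prompt: str, max_tokens: int) -> int:
    # Crude assumption: about 4 characters per token, plus the full output allowance.
    return len(prompt) // 4 + max_tokens


tracker = BudgetTracker(token_budget=10_000)
estimate = estimate_tokens("x" * 4000, max_tokens=2048)
if tracker.can_dispatch(estimate):
    tracker.record_dispatch()
    # ...the activity would run here and return its usage...
    tracker.record_success(input_tokens=1000, output_tokens=500)
print(estimate, tracker.confirmed_tokens, tracker.dispatches)  # 3048 1500 1
```

Because the tracker is plain in-memory state updated only from activity results, it can live in workflow code without breaking determinism. If per-attempt visibility matters, the activity could also return its attempt number (available from activity.info().attempt in the Python SDK) so the workflow can count retries as well as dispatches.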

The observability data model for production AI covers how to surface this spend data as an alertable metric — cost-per-workflow-run trending above baseline is an early signal that retry rates are increasing, which often indicates provider instability before it becomes an outage.

Provider Failover Inside Activities

Provider failover belongs inside the activity, not at the workflow level. This is a sequencing decision with real cost implications.

If you implement failover at the workflow level (catching ActivityError and dispatching a second activity against the secondary provider), the workflow only sees ActivityError after the primary activity has used up its retry policy, and the switch then costs a second activity dispatch. For a degraded primary provider, this means every request pays for the full primary backoff sequence before it reaches the secondary.

Implementing failover inside the activity means the switch happens within a single activity execution, before Temporal sees any failure. The workflow remains unaware that a switch occurred; it receives a successful LLMCallResult regardless of which provider answered.

The tradeoff is visibility. Because the workflow does not see the failover, it does not appear in the workflow event history as a distinct event. Include the provider field in the activity’s return value (as in the code above) and log it. The observability data model covers trace-level logging patterns for capturing these events in a queryable format.

The model provider risk assessment framework covers which provider characteristics warrant having a failover path at all — not every provider pair is worth the engineering overhead, and the decision depends on the SLA scope, behavioral drift risk, and exit cost of the primary provider.

The Dead Letter Pattern

When an activity exhausts all retry attempts, Temporal raises ActivityError in the workflow. The default behavior for most teams is to let that exception propagate, which terminates the workflow with a failed status.

For LLM workflows where each item has business value — document processing queues, customer-facing generation pipelines, audit-sensitive workflows — silent failure is unacceptable. The dead letter pattern captures the failed request before the workflow exits.

The capture includes: the original input, the number of retry attempts consumed, the final exception message, and the workflow execution context (workflow ID, run ID, task queue). This is the minimum that makes the dead letter item actionable — a human or recovery process can identify what failed, reproduce the input, and re-enqueue it when the root cause is resolved.

Dead letter storage can be a Temporal task queue (dispatching a separate workflow to handle the failed item), a database table, or a message queue. The choice depends on your existing infrastructure. The key architectural constraint is that the dead letter write must be a separate Temporal activity with its own retry policy — not a bare Python function call in the exception handler, which will not survive worker restarts.
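
As an illustration of the capture described above, the sketch below defines one possible record shape and serializes it as a JSON line. DeadLetterRecord, to_jsonl_line, and all field values are hypothetical; in a real worker this logic would sit inside the dead letter activity, and the sink (file, table, or queue) is left open.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DeadLetterRecord:
    workflow_id: str
    run_id: str
    task_queue: str
    activity_name: str
    original_input: dict
    attempts: int
    final_error: str


def to_jsonl_line(record: DeadLetterRecord) -> str:
    """Serialize one record as a single JSON line for an append-only sink."""
    return json.dumps(asdict(record), sort_keys=True)


# Illustrative values only; none of these identifiers come from a real system.
record = DeadLetterRecord(
    workflow_id="doc-processing-example",
    run_id="example-run-id",
    task_queue="llm-activities",
    activity_name="call_llm_with_failover",
    original_input={"doc_id": "doc-123"},
    attempts=4,
    final_error="LLMBadRequestError: prompt too long",
)
line = to_jsonl_line(record)
print(json.loads(line)["final_error"])  # LLMBadRequestError: prompt too long
```

Inside workflow code, the workflow ID, run ID, and task queue are available from workflow.info(), so the workflow can pass them to the dead letter activity along with the original input. Storing the input itself, not just an ID, is what makes the item replayable without a second lookup.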

This connects to the recovery patterns described in recovery patterns for production AI agent failures — the dead letter sink is the terminus of the retry path, not a separate concern. Design the recovery path from the dead letter queue before the first item lands in it.

Non-Retryable Error Classification

The non_retryable_error_types field in RetryPolicy takes a list of error type names as strings. Temporal matches these against the type recorded on the failure that the activity reports; in the Python SDK, an ordinary exception raised from an activity is reported with its class name as the type. If the activity raises LLMBadRequestError and that string appears in non_retryable_error_types, Temporal fails the activity immediately, with no further attempts.

This requires two things to work correctly. First, the activity must raise typed exceptions rather than generic Exception objects. A requests.HTTPError or anthropic.APIStatusError is not the same as LLMBadRequestError: you have to catch the provider exception and re-raise your domain exception, as the code above demonstrates. Second, the string must match the exception class name exactly. The Python SDK uses the bare class name (LLMBadRequestError), not the module-qualified path, so two exception classes with the same name in different modules are indistinguishable to the retry policy.

The practical implication: your exception hierarchy for LLM activities is part of your retry policy design. Define it before wiring the retry policy. Adding non-retryable error types after a production incident, when you have already burned retry budget on 400 errors, is the expensive way to learn this.
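
Keeping the status-to-exception mapping in one table makes the classification reviewable in a single place. The sketch below is one possible layout, not the only one: classify_status and both constants are hypothetical names, and treating every unlisted 4xx as a bad request is an assumption you may want to refine per provider.

```python
from typing import Optional

# Status codes that a retry cannot fix, mapped to the domain exception name.
NON_RETRYABLE_BY_STATUS = {
    400: "LLMBadRequestError",
    401: "LLMAuthError",
    422: "LLMContentPolicyError",
}

# Status codes that are worth retrying with backoff.
RETRYABLE_STATUSES = {429, 500, 503}


def classify_status(status_code: int) -> Optional[str]:
    """Return the non-retryable exception name, or None if Temporal should retry."""
    if status_code in NON_RETRYABLE_BY_STATUS:
        return NON_RETRYABLE_BY_STATUS[status_code]
    if status_code in RETRYABLE_STATUSES or status_code >= 500:
        return None
    # Assumption: any other 4xx is a caller error, so fail fast.
    return "LLMBadRequestError"


# The same table can generate the list passed to the retry policy.
non_retryable_error_types = sorted(set(NON_RETRYABLE_BY_STATUS.values()))
print(classify_status(422))  # LLMContentPolicyError
print(classify_status(429))  # None
```

Deriving non_retryable_error_types from the same table keeps the retry policy and the exception translation from drifting apart when a new status code is added.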

For the indestructible AI agent architecture that Temporal enables, non-retryable error classification is what prevents the retry mechanism from becoming a source of unnecessary spend. The goal is that every retry attempt has a meaningful chance of succeeding — which means exhausting retries only on failures where the recovery is outside the workflow’s control.

Deployment Checklist

Set a conservative initial_interval for LLM activities — immediate retry floods providers on rate-limit recovery.
Classify all provider HTTP errors as retryable or non-retryable before wiring the retry policy — 400, 401, and 422 must be in non_retryable_error_types.
Configure heartbeat_timeout on every LLM activity and call activity.heartbeat() regularly during long completions.
Set start_to_close_timeout to reflect realistic worst-case provider latency — large context windows on degraded providers can take much longer than small requests.
Implement cost cap checks in the workflow function using token counts returned by activity results — not inside activities or in external shared state.
Implement provider failover inside the activity, not at the workflow level, to avoid burning retry attempts on each failover step.
Wire a dead letter activity in every ActivityError exception handler for workflows where silent failure creates audit or business gaps.

FAQ

What non-retryable errors should be excluded from Temporal LLM activity retry policies?

Exclude errors that retry cannot fix: 400 Bad Request (malformed prompt or exceeding context limits), 401 Unauthorized (invalid API key), 422 Unprocessable Entity (content policy rejection), and any application-level exception that indicates a logic error in the workflow itself. Retry only on 429 rate limit, 500 internal server error, 503 service unavailable, and network-level timeouts. Retrying on 400 or 422 errors wastes budget and masks bugs that need code fixes, not retry attempts.

How does heartbeat timeout differ from start-to-close timeout for LLM activities?

Start-to-close timeout is the wall-clock budget for a single activity attempt: if the attempt has not completed by this deadline, Temporal times it out and, if the retry policy allows, schedules another attempt. Heartbeat timeout is a liveness check: if the activity does not call activity.heartbeat() within the heartbeat timeout, Temporal assumes the worker is dead or stuck, fails the attempt, and retries it on whichever worker picks the task up next. For long-running LLM calls, set a short heartbeat timeout and call heartbeat() after each chunk or processing step.

What is the dead letter pattern in Temporal, and when should LLM workflows use it?

The dead letter pattern routes activities that exhaust all retry attempts to a separate workflow or queue for inspection rather than silently discarding them. For LLM workflows, this means failed activities are captured with their input, the full retry history, and the final exception. Use it when the business cost of a silently failed LLM call is non-trivial: customer-facing content generation, document processing pipelines where every item must be accounted for, and audit-sensitive workflows where dropped requests create compliance gaps.

How should cost caps be implemented without blocking the Temporal workflow thread?

Cost cap state must live in the workflow function, not in external shared state or inside activities. Track cumulative token spend using the token counts returned by each activity's result, and check the accumulated spend at the top of each workflow loop iteration before scheduling the next activity. If the cap is exceeded, the workflow returns a budget-exhausted result rather than scheduling more activities. Do not implement the cap check inside the activity itself — the activity runs outside the workflow's deterministic context and cannot safely update shared workflow state.

The decision rule

If your LLM workflows are burning budget on retried 400 errors, running without heartbeats, or failing silently when activities exhaust their retry budget, adjust the architecture before scaling volume. Start with retry classification, heartbeat coverage, budget caps, and terminal-failure visibility. The Enterprise AI Assessment Kit can frame the first workflow architecture assessment.

Igor Bobriakov
