Temporal Workflow Versioning for AI Pipelines

During a routine release window, a team pushed a new version of their document classification model. The model was faster, scored higher on their internal benchmark, and had been tested thoroughly in staging. The deployment completed cleanly. Shortly afterward, the support queue started filling with workflow failures.

The failure mode was not the model itself. The new model returned structured output with a confidence_scores field as a flat array of floats. The previous model returned it as a dictionary keyed by class name. Downstream activities in in-flight Temporal workflows were unpacking result["confidence_scores"]["invoice"]. They now received a list. The workflows crashed with TypeError. The model upgrade that was supposed to improve classification accuracy had stopped the pipeline.

This is exactly the failure mode Temporal workflow versioning exists to prevent — and most teams encounter it for the first time in production, not in a design review.

Why AI Pipelines Have a Different Upgrade Problem

Traditional service upgrades often have a cleaner boundary: deploy new code, old requests complete on old instances, new requests go to new instances. The state lives in the request. When the request is done, it is gone.

Temporal workflows are different. A workflow can run across long-lived business processes. Its state — every activity result, every intermediate decision — is persisted in the event history. When a worker processes a workflow step, it replays the full history to reconstruct the workflow’s current state. If the worker’s code has changed in a way that produces different decisions from the same history, Temporal detects the divergence and fails the workflow.

This means you cannot simply deploy new code and expect in-flight workflows to pick it up cleanly. The code that started a workflow must be compatible with any code that resumes it, for the entire duration of the workflow’s execution. For AI pipelines where model output schemas evolve, this constraint requires deliberate versioning strategy.

There are five mechanisms for managing this. They are not mutually exclusive — production systems typically combine two or three.

This problem often surfaces after the pipeline has grown beyond prototype traffic: the team has been iterating on the model, the pipeline has started to handle real load, and the first time a version bump breaks an in-flight workflow is when the team realizes the versioning problem existed from the start. The code was always going to need branching — it was just not written that way yet.

Strategy	How It Works	Best For	Operational Cost
Workflow Patching API	workflow.patched() creates a branch: old history takes the old path, new workflows take the new path. Both coexist in the same binary.	Minor model bumps where output schema is backward-compatible or can be normalized at the activity boundary.	Low — one binary, one worker pool. Patch markers accumulate over time and must be cleaned up.
Activity Versioning	New activity function registered alongside old one. Workflow code conditionally calls one or the other based on version flag or patching.	When the activity implementation changes substantially but the workflow control flow is unchanged.	Low to medium — requires careful naming to avoid registration conflicts.
Task Queue Routing	New model version runs on a separate task queue with separate workers. Workflow code routes activities to the appropriate queue based on a configuration flag.	Significant model version changes, A/B testing, canary deployments. Full isolation between versions.	High — two worker pools during transition. Operationally intensive but maximally safe.
Shadow Deployment	New model runs in parallel with old model. Results are compared but only old model's output is used downstream. New model is promoted when validated.	Validating new model output quality and schema compatibility before committing to the upgrade.	Medium — extra inference cost during shadow period. Requires comparison logic in the workflow.
Blue-Green Workflow Drain	Stop new workflow starts on old version, let in-flight workflows drain, then cut over to new version entirely.	Breaking schema changes where no backward-compatible migration path exists.	High — requires coordinated cutover, monitoring drain completion, and accepting temporary reduced throughput.

Temporal’s Patching API: The Mechanics

The patching API is the foundation of in-place workflow versioning. It works by writing a “patch marker” into the workflow’s event history the first time a patched branch executes. On replay, Temporal checks whether the patch marker exists in the history. If it does, the workflow takes the new branch. If it does not, it takes the old branch.

import asyncio
from datetime import timedelta
from dataclasses import dataclass
from typing import Optional
from pydantic import BaseModel
from temporalio import workflow, activity
from temporalio.client import Client
from temporalio.worker import Worker


class ModelVersionConfig(BaseModel):
    model_id: str
    version: str
    endpoint: str
    output_schema_version: int
    max_tokens: int = 4096
    temperature: float = 0.0


class DeploymentStrategy(BaseModel):
    primary_version: ModelVersionConfig
    shadow_version: Optional[ModelVersionConfig] = None
    task_queue_override: Optional[str] = None
    canary_fraction: float = 0.0


# Activity result schemas — versioned Pydantic models
class ClassificationResultV1(BaseModel):
    label: str
    confidence: float
    raw_scores: dict[str, float]  # {"invoice": 0.92, "receipt": 0.05, ...}


class ClassificationResultV2(BaseModel):
    label: str
    confidence: float
    raw_scores: list[float]       # [0.92, 0.05, ...] — positional, class order in metadata
    class_order: list[str]        # ["invoice", "receipt", ...] — added in v2


@activity.defn
async def classify_document_v1(
    document_text: str,
    config: ModelVersionConfig,
) -> ClassificationResultV1:
    """Original classification activity — returns dict-keyed confidence scores."""
    # Call model endpoint, parse response
    raw_response = await call_model_endpoint(config.endpoint, document_text, config)
    return ClassificationResultV1(
        label=raw_response["label"],
        confidence=raw_response["confidence"],
        raw_scores=raw_response["scores"],
    )


@activity.defn
async def classify_document_v2(
    document_text: str,
    config: ModelVersionConfig,
) -> ClassificationResultV2:
    """New classification activity — returns positional confidence scores with class_order."""
    raw_response = await call_model_endpoint(config.endpoint, document_text, config)
    return ClassificationResultV2(
        label=raw_response["label"],
        confidence=raw_response["confidence"],
        raw_scores=raw_response["scores"],
        class_order=raw_response["class_order"],
    )


@activity.defn
async def shadow_classify_document(
    document_text: str,
    primary_result: ClassificationResultV1,
    shadow_config: ModelVersionConfig,
) -> dict:
    """Shadow activity — runs new model, compares schema, logs divergence. Result is discarded."""
    try:
        shadow_response = await call_model_endpoint(
            shadow_config.endpoint, document_text, shadow_config
        )
        # Log schema compatibility check — does not affect primary path
        schema_valid = validate_v2_schema(shadow_response)
        label_match = shadow_response.get("label") == primary_result.label
        await log_shadow_comparison(
            primary_label=primary_result.label,
            shadow_label=shadow_response.get("label"),
            schema_valid=schema_valid,
            label_match=label_match,
        )
    except Exception as e:
        # Shadow failures never propagate to the primary workflow path
        await log_shadow_failure(error=str(e))
    return {}


@workflow.defn
class DocumentClassificationWorkflow:
    """
    Workflow with explicit version branching via Temporal's patching API.

    The PATCH_CLASSIFY_V2 marker determines which activity runs:
    - Workflows started before the patch: take the v1 branch (no marker in history)
    - Workflows started after the patch: take the v2 branch (marker written on first run)

    This allows a single worker binary to serve both in-flight v1 workflows
    and new v2 workflows simultaneously.
    """

    @workflow.run
    async def run(
        self,
        document_text: str,
        deployment_strategy: DeploymentStrategy,
    ) -> dict:
        # Branch on patch marker — old in-flight workflows take v1 path
        if workflow.patched("PATCH_CLASSIFY_V2"):
            # New path: v2 model with positional confidence scores
            result_v2 = await workflow.execute_activity(
                classify_document_v2,
                args=[document_text, deployment_strategy.primary_version],
                start_to_close_timeout=timedelta(seconds=60),
            )
            classification_output = {
                "label": result_v2.label,
                "confidence": result_v2.confidence,
                # Normalize to dict for downstream consumers during transition
                "scores": dict(zip(result_v2.class_order, result_v2.raw_scores)),
            }
        else:
            # Old path: v1 model with dict-keyed confidence scores
            result_v1 = await workflow.execute_activity(
                classify_document_v1,
                args=[document_text, deployment_strategy.primary_version],
                start_to_close_timeout=timedelta(seconds=60),
            )
            classification_output = {
                "label": result_v1.label,
                "confidence": result_v1.confidence,
                "scores": result_v1.raw_scores,
            }

            # Shadow the new model if configured — fire-and-forget
            if deployment_strategy.shadow_version:
                await workflow.execute_activity(
                    shadow_classify_document,
                    args=[document_text, result_v1, deployment_strategy.shadow_version],
                    start_to_close_timeout=timedelta(seconds=90),
                )

        # Downstream activities receive normalized output regardless of version path
        return await process_classification_result(classification_output)

Principle: The patching API does not version the model — it versions the code path that calls the model. Normalize the output at the activity boundary so downstream activities in the same workflow receive a consistent contract regardless of which model version ran. This insulates the rest of the workflow from the upgrade.

Activity Versioning vs. Workflow Versioning

There is a distinction worth making explicit: you can version at the activity level or the workflow level, and the right choice depends on what changed.

Version at the activity level when the model’s output schema changed but the workflow’s control flow is identical. The workflow still calls the same sequence of activities in the same order — only the activity’s implementation and return type changed. You register both the old and new activity functions, use the patching API to branch at the activity call site, and normalize the output before it reaches the next activity.

Version at the workflow level when the upgrade changes what the workflow does — which activities it calls, in what order, with what branching logic. A new model that adds a secondary verification step, or one that sometimes routes to a human review queue, requires changes to the workflow’s control flow, not just its activities.

The common mistake is treating every model upgrade as a workflow-level change. Most model upgrades — same task, different weights, evolved output schema — are activity-level changes. Keeping the versioning at the activity level means the workflow’s event history remains structurally compatible, which simplifies replay and rollback.

The deeper principle: Temporal’s event history is a contract. Every change to the code that executes it is a potential contract violation. Teams that treat model upgrades as infrastructure swaps — same interface, new backend — avoid most of this complexity. Teams that let models bleed their internal format directly into workflow state pay for that decision every time the format evolves.

Task Queue-Based Routing for Model Version Isolation

When you need hard isolation between model versions — different compute resources, different retry budgets, or a canary rollout to a fraction of traffic — task queue routing gives you that without the binary-level complexity of the patching API.

The pattern: register old and new model activities on separate task queues. Workers subscribed to model-v1-queue run the old model; workers subscribed to model-v2-queue run the new one. The workflow selects the task queue at activity scheduling time based on a deployment configuration.

@workflow.defn
class RoutedClassificationWorkflow:
    @workflow.run
    async def run(
        self,
        document_text: str,
        deployment_strategy: DeploymentStrategy,
    ) -> dict:
        # Route to appropriate task queue based on deployment config
        target_queue = (
            deployment_strategy.task_queue_override
            or workflow.info().task_queue  # default: same queue as workflow
        )

        result = await workflow.execute_activity(
            classify_document_v2,
            args=[document_text, deployment_strategy.primary_version],
            task_queue=target_queue,
            start_to_close_timeout=timedelta(seconds=60),
        )
        return {"label": result.label, "confidence": result.confidence}

During a canary, you run two sets of workers. New workflows are split by a configured canary fraction: a small share goes to the v2 queue while the rest stays on the v1 queue. This splitting logic lives in the workflow’s calling code — the scheduler or the API endpoint that triggers the workflow — not in the workflow itself.

Warning: When using task queue routing, the activities registered on each queue must be compatible with each other if a workflow could ever need to call activities from both queues within a single execution. An activity started on one task queue and replayed on another will fail if the activity function is not registered on the second queue's workers. Keep task queue boundaries aligned with workflow boundaries to avoid this class of failure.

Handling In-Flight Workflows During Upgrades

The hardest part of a model upgrade is not the new workflows — it is the workflows that are already running.

In-flight workflows have already written part of their history using the old code. They will replay using whatever worker is available when they resume. If that worker is running the new code and the new code’s execution path diverges from the history, the workflow fails.

The patching API handles gradual upgrades cleanly: old workflows replay on the old branch, new workflows start on the new branch, both coexist. But it requires that the old code path remain in the binary for as long as any workflow that started before the patch might still be running.

For long-running AI pipelines — document processing queues that might keep workflows open beyond the deploy window — this means the old code path may need to stay in the binary after the upgrade. That is acceptable: the patch marker cleanup is a future, separate operation.

The procedure for a safe upgrade:

Deploy new worker binary with both old and new code paths, gated by workflow.patched().
Confirm that new workflow starts are taking the new code path (check Temporal UI or workflow output schemas).
Monitor in-flight workflows on the old path — they continue executing on the old branch.
Once all pre-patch workflows have completed (or been terminated for workflows that are stuck), remove the old code path and the patch marker in a subsequent release.

The Temporal activity retry patterns post covers how retry policies interact with this: if the old-path activity fails due to an unrelated error, Temporal will retry it on whatever worker is available. As long as the old code path is still registered, the retry will succeed. If you remove the old code path before all in-flight workflows complete, retries will fail.

Rollback Strategies

When a model upgrade fails in production, the rollback strategy depends on how far the deployment has progressed.

Scenario A: New binary deployed, no workflows started on new path yet. Revert the worker binary. Redeploy. No workflows are affected because the patching marker has not been written to any workflow’s history.

Scenario B: Some workflows have started on the new path, and they are failing. You have two sub-cases:

If the new-path workflows are failing due to a bug in the activity, fix the bug and redeploy the new code. Temporal will retry the activity on the fixed code.
If the new-path workflows are failing because the model itself is producing bad output, you need to terminate the affected workflows. Use the Temporal CLI or SDK to terminate specific workflow IDs. Revert the worker binary to remove the new code path. Start replacement workflows using the old model configuration.

Scenario C: Full rollback after all workflows have migrated. This is the case where you removed the old code path and the new code is now the only path. A full rollback means reintroducing the old code path, deploying it as a new “patch” over the current code. The patch API can be used in both directions: the new-as-of-now code becomes the old branch, and the reverted code becomes the new branch.

The rollback patterns post covers the broader class of AI agent rollbacks. For Temporal-specific cases, the key insight is that rollback complexity is proportional to how much history has been written on the new path. Keep deployment windows short and monitor aggressively immediately after a model version change.

Testing Version Transitions

Testing a Temporal workflow version upgrade requires more than running the new model in a test environment. You need to test the transition itself: what happens when a workflow started on the old code is resumed by a worker running the new code.

The test setup:

Start a workflow using the old worker binary. Pause it mid-execution using a Temporal signal or by designing the workflow to wait at a checkpoint.
Deploy the new worker binary (with the patch applied) to the test environment.
Resume the paused workflow. Confirm it takes the old branch (no patch marker in its history).
Start a new workflow. Confirm it takes the new branch (patch marker written).
Let both workflows complete. Compare outputs.

import pytest
from temporalio.testing import WorkflowEnvironment
from temporalio.worker import Worker

@pytest.mark.asyncio
async def test_version_transition():
    async with await WorkflowEnvironment.start_local() as env:
        async with Worker(
            env.client,
            task_queue="test-queue",
            workflows=[DocumentClassificationWorkflow],
            activities=[
                classify_document_v1,
                classify_document_v2,
                shadow_classify_document,
            ],
        ):
            # Start workflow — will take v1 branch (first-ever run, no patch marker yet)
            # Note: In a real transition test, you would use Temporal's history replay testing
            # to simulate a workflow that started before the patch
            handle = await env.client.start_workflow(
                DocumentClassificationWorkflow.run,
                args=["test document text", deployment_strategy_v1],
                id="test-transition-v1",
                task_queue="test-queue",
            )
            result_v1 = await handle.result()
            assert "scores" in result_v1
            assert isinstance(result_v1["scores"], dict)

            # Start new workflow — will take v2 branch
            handle_v2 = await env.client.start_workflow(
                DocumentClassificationWorkflow.run,
                args=["test document text", deployment_strategy_v2],
                id="test-transition-v2",
                task_queue="test-queue",
            )
            result_v2 = await handle_v2.result()
            # Normalized output should be identical structure regardless of version
            assert "scores" in result_v2
            assert isinstance(result_v2["scores"], dict)

            # Downstream schema compatibility — same keys, same types
            assert set(result_v1.keys()) == set(result_v2.keys())

For schema compatibility testing, use the shadow deployment pattern before promoting a new model version to production. The shadow activity runs the new model against production traffic, logs schema diffs, and alerts if the output contract changes in ways the normalization layer does not handle.

Schema Evolution at the Activity Boundary

The root cause of the incident described at the opening was not the model upgrade — it was a missing normalization layer at the activity boundary. The downstream activities were consuming raw model output instead of a defined, versioned schema.

The fix is not just to version workflows correctly. It is to define explicit Pydantic models for activity outputs and to normalize those outputs before they leave the activity function. Downstream activities receive a contract, not raw model output.

class NormalizedClassificationResult(BaseModel):
    """Canonical schema — version-independent. Always what downstream activities receive."""
    label: str
    confidence: float
    scores_by_class: dict[str, float]  # Always dict — normalized from v1 or v2 format
    model_version: str
    schema_version: int = 2


def normalize_v1_result(v1: ClassificationResultV1, config: ModelVersionConfig) -> NormalizedClassificationResult:
    return NormalizedClassificationResult(
        label=v1.label,
        confidence=v1.confidence,
        scores_by_class=v1.raw_scores,
        model_version=config.version,
    )


def normalize_v2_result(v2: ClassificationResultV2, config: ModelVersionConfig) -> NormalizedClassificationResult:
    return NormalizedClassificationResult(
        label=v2.label,
        confidence=v2.confidence,
        scores_by_class=dict(zip(v2.class_order, v2.raw_scores)),
        model_version=config.version,
    )

This pattern — normalize at the boundary, expose a versioned canonical schema — is the same principle covered in the AI system versioning post. The model version changes; the downstream contract does not.

For Temporal specifically, this matters because activity return values are serialized into the workflow’s event history. If the return type changes between deployments, replaying the workflow with the new code will fail to deserialize the old activity result. Keeping the canonical schema stable — or making it backward-compatible — prevents deserialization failures during replay.

Pre-Upgrade Checklist

Define explicit Pydantic models for all activity outputs — do not pass raw model API responses to downstream activities.
Identify the output schema differences between old and new model versions before writing any Temporal code.
Implement a normalization layer that maps both old and new model output schemas to the canonical downstream contract.
Run shadow deployment before promoting the new model version — compare outputs and log schema divergences over a representative window.
Deploy the new worker binary with both old and new code paths active (patching API) — confirm in-flight workflows continue on the old branch before proceeding.
Monitor the first tranche of workflows on the new path: check output schema validity, downstream activity success rates, and replay success rate in Temporal UI.
Set a drain deadline for old-path workflows — schedule the old code path removal only after all pre-patch workflows have completed or been explicitly terminated.

Frequently Asked Questions

What happens to in-flight Temporal workflows when you deploy new code that changes activity behavior?

In-flight workflows continue replaying from their event history. Temporal's determinism guarantee means that when a worker replays a workflow, it must produce the same decisions from the same history. If you change code in a way that alters the execution path — different activities called, different output schemas expected — the replay will diverge from the history and the workflow will fail with a non-deterministic error. The patching API (workflow.patched()) is the correct mechanism to branch execution paths: workflows that started before the patch take the old branch; workflows that started after take the new one. Both branches can coexist in the same worker binary.

When should you use task queue routing instead of Temporal's patching API for model version upgrades?

Use task queue routing when the model change is significant enough that you want new and old workflows completely isolated: different worker pools, different resource quotas, different retry policies. The patching API is appropriate for minor model version bumps within the same output schema. Task queue routing is appropriate when the new model version has a different output contract, requires different compute resources, or when you want to canary a small fraction of traffic before full cutover. The tradeoff is operational overhead: task queue routing requires running two worker pools during the transition period.

How do you test a Temporal workflow version transition before deploying to production?

A reliable approach is a shadow deployment: run the new model version in parallel with the existing one on representative traffic, comparing outputs without routing the new version's results to downstream systems. In Temporal, this means defining a shadow activity that calls the new model, running it alongside the existing activity, logging both outputs, and asserting schema compatibility. The shadow activity's result is discarded at the workflow level — downstream activities only see the production path's output. Once you confirm the new model's output schema matches expectations and quality metrics meet thresholds, promote the shadow to the primary path using the patching API.

What is the correct rollback strategy for a Temporal workflow version upgrade that fails in production?

The rollback strategy depends on how far the deployment progressed. If the new version is deployed but no workflows have started on it yet, revert the worker binary and redeploy. If some workflows have started on the new version but not completed, you have two options: let them complete on the new code path if the failure is isolated, or terminate them using the Temporal workflow termination API and replay from the last checkpoint using the old code path. For a full rollback, ensure the old worker binary is running, the task queue is pointing at old workers, and any workflows that started on the new version are either completed or terminated. The patching API prevents this complexity by making old and new paths coexist — the cleanest rollback is keeping both paths in the binary until all in-flight workflows on the old path have drained.

The decision rule

Model version upgrades in Temporal require deliberate architecture decisions before the first deployment. If your team is operating Temporal-based AI pipelines or planning to introduce Temporal into an existing model-serving architecture, versioning strategy belongs in the initial design: activity boundary design, schema evolution planning, version transition testing, and operational runbooks for model upgrades. The Enterprise Agentic Assessment Kit covers the workflow orchestration patterns and model-versioning tradeoffs that separate production-ready AI pipelines from prototypes.

Temporal Workflow Versioning for AI Pipelines: Deploying New Model Versions Without Downtime

Why AI Pipelines Have a Different Upgrade Problem

Temporal’s Patching API: The Mechanics

Activity Versioning vs. Workflow Versioning

Task Queue-Based Routing for Model Version Isolation

Handling In-Flight Workflows During Upgrades

Rollback Strategies

Testing Version Transitions

Schema Evolution at the Activity Boundary

Pre-Upgrade Checklist

Frequently Asked Questions

What happens to in-flight Temporal workflows when you deploy new code that changes activity behavior?

When should you use task queue routing instead of Temporal's patching API for model version upgrades?

How do you test a Temporal workflow version transition before deploying to production?

What is the correct rollback strategy for a Temporal workflow version upgrade that fails in production?

The decision rule

Bring the system under review

Igor Bobriakov

ML & Data Science

Telos: Deterministic AI Video Infrastructure

Related Articles

When Your AI Pipeline Needs Temporal and When It Does Not: The Complexity Threshold

Temporal Observability for AI Workflows: What to Instrument Beyond Workflow Status

Building Durable RAG Pipelines with Temporal: Ingestion, Embedding, and Index Management