<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Netflix TechBlog - Medium]]></title>
        <description><![CDATA[Learn about Netflix’s world class engineering efforts, company culture, product developments and more. - Medium]]></description>
        <link>https://netflixtechblog.com?source=rss----2615bd06b42e---4</link>
        <image>
            <url>https://cdn-images-1.medium.com/proxy/1*TGH72Nnw24QL3iV9IOm4VA.png</url>
            <title>Netflix TechBlog - Medium</title>
            <link>https://netflixtechblog.com?source=rss----2615bd06b42e---4</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sat, 16 May 2026 22:46:33 GMT</lastBuildDate>
        <atom:link href="https://netflixtechblog.com/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Scaling ArchUnit with Nebula ArchRules]]></title>
            <link>https://netflixtechblog.com/scaling-archunit-with-nebula-archrules-b4642c464c5a?source=rss----2615bd06b42e---4</link>
            <guid isPermaLink="false">https://medium.com/p/b4642c464c5a</guid>
            <category><![CDATA[gradle]]></category>
            <category><![CDATA[nebula]]></category>
            <category><![CDATA[java]]></category>
            <category><![CDATA[archunit]]></category>
            <dc:creator><![CDATA[Netflix Technology Blog]]></dc:creator>
            <pubDate>Fri, 08 May 2026 15:55:59 GMT</pubDate>
            <atom:updated>2026-05-08T15:56:00.919Z</atom:updated>
            <content:encoded><![CDATA[<p>By <a href="https://github.com/wakingrufus">John Burns</a> and <a href="https://www.linkedin.com/in/emilyyuan03/">Emily Yuan</a></p><h3>Introduction</h3><p>At Netflix, we operate using a <a href="https://netflixtechblog.com/towards-true-continuous-integration-distributed-repositories-and-dependencies-2a2e3108c051">polyrepo</a> strategy with tens of thousands of Java repositories. This means that we need to have ways of sharing common build logic across these repositories. On the <a href="https://sites.google.com/netflix.com/javaplatformnetflix/jvm">JVM Ecosystem team</a> within Java Platform, we build tooling such as the <a href="https://github.com/nebula-plugins">Nebula suite of Gradle plugins</a> to provide standard ways to build projects, keep dependencies up-to-date, and publish artifacts reliably across the Java ecosystem. Our mission also entails providing build-time feedback to developers when they deviate from the <a href="https://netflixtechblog.com/how-we-build-code-at-netflix-c5d9bd727f15">paved road</a>, or when their code base contains technical debt.</p><h3>Case Study</h3><p>After a Netflix incident relating to a library releasing a backwards-incompatible change, our team was asked to provide tooling and practices to improve Java library lifecycle management. This was not a simple case of a library making a reckless breaking change. The code removed had been deprecated for years. Library authors often struggle to know when it is safe to remove deprecated code, or to refactor code that is not meant to be used by downstream applications. Fleet-wide migrations, such as upgrading major Spring Boot versions, also involve deprecated code removal. To help with this, we established a suite of API lifecycle annotations:</p><ul><li>@Deprecated: from the Java standard library</li><li>@Public: a custom annotation for APIs meant to be used downstream</li><li>@Experimental: a custom annotation for new APIs which may not yet be stable</li><li>All other APIs are assumed to be “internal”</li></ul><p>Library authors can annotate their APIs with these annotations. However, how will they know which downstream projects are using their APIs in ways these annotations discourage?</p><p>As we sought to improve the paved road for JVM-based libraries at Netflix, we needed a good way of identifying this kind of technical debt, not only for the benefit of the Java Platform-provided libraries, but for any team delivering shared libraries to the organization. For this, we looked at ArchUnit.</p><p><a href="https://www.archunit.org/">ArchUnit</a> is a popular OSS library (3.5k stars, 84 contributors) used to enforce “architectural” code rules as part of a JUnit suite. It is used internally by Gradle and Spring, and is provided as part of the <a href="https://spring.io/projects/spring-modulith">Spring Modulith</a> platform. The rules engine, which is built directly on top of <a href="https://asm.ow2.io/">ASM</a>, can be used for a wide variety of use cases. It is powerful enough to be a general-purpose static analysis tool with the following distinctive features:</p><p>1. Works cross-language (JVM), because it uses ASM/bytecode, not AST parsing.</p><p>2. Exposes a builder API pattern that makes it easy to write rules.</p><p>3. Has a lower-level API ideal for writing more complex custom rules.</p><p>The limitation of ArchUnit is that it is designed to be used as part of a JUnit suite in a single repository.
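In that conventional setup, a rule lives in a repository’s own test sources; a minimal sketch of what that looks like (assuming the archunit-junit5 integration, with an illustrative package name, and reusing the Guava Optional rule shown later in this post) is:</p><pre>import com.tngtech.archunit.junit.AnalyzeClasses;<br>import com.tngtech.archunit.junit.ArchTest;<br>import com.tngtech.archunit.lang.ArchRule;<br>import com.tngtech.archunit.lang.syntax.ArchRuleDefinition;<br><br>// Analyzes only the classes of this one repository.<br>@AnalyzeClasses(packages = &quot;com.example.myservice&quot;)<br>public class ArchitectureTest {<br>    @ArchTest<br>    static final ArchRule NO_GUAVA_OPTIONAL = ArchRuleDefinition.noClasses()<br>        .should()<br>        .dependOnClassesThat()<br>        .haveFullyQualifiedName(&quot;com.google.common.base.Optional&quot;)<br>        .because(&quot;Java Optional is preferred over Guava Optional&quot;);<br>}</pre><p>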
The Nebula ArchRules plugins give organizations the ability to share and apply rules across any number of repositories. Rules can be sourced from OSS libraries or private internal libraries. This makes the plugin generally useful for any JVM+Gradle engineering organization.</p><h3>Why ArchUnit?</h3><p>Before we go into how ArchRules works, it is good to understand why we would want to use ArchUnit in this way instead of other static analysis tools.</p><h4>AST vs Bytecode</h4><p>Some tools, such as PMD, process rules against an AST (abstract syntax tree). An AST is a structured representation of source code. This kind of tool will have rules that are syntax-dependent. Rules that need to support multiple JVM languages, such as Kotlin or Scala, often need to be rewritten for each language. It also allows code that should be flagged to hide behind syntactic sugar the rule author did not anticipate. ArchUnit uses <a href="https://asm.ow2.io/">ASM</a> to analyze actual compiled bytecode, which means it doesn’t matter how that code was produced. What is analyzed is the actual code that will be run.</p><h4>Rule Authorship</h4><p>Tools like PMD and Spotbugs are not optimized for custom rule authorship. Most usage of these tools runs the built-in rules or adds pre-made third-party plugins. Take a look at what a custom rule for PMD might look like:</p><pre>&lt;![CDATA[<br> //AllocationExpression/ClassOrInterfaceType[<br>   @Image=&#39;DateTime&#39; and (<br>       (count(..//Name[@Image=&#39;DateTimeZone.UTC&#39;])&lt;=0)<br>       and<br>       (count(..//Name[@Image=&#39;DateTimeZone.forID&#39;])&lt;=0)<br>    ) or (<br>       (<br>           (count(..//Name[@Image=&#39;DateTimeZone.UTC&#39;])&gt;0)<br>             or<br>           (count(..//Name[@Image=&#39;DateTimeZone.forID&#39;])&gt;0)<br>       ) and (../Arguments/ArgumentList and count(../Arguments/ArgumentList/Expression) = 1)<br>   )<br> ]<br>]]&gt;</pre><p>This rule ensures that DateTimes are not instantiated without an explicit zone. This is a raw string meant to be used within PMD’s XPath parser. There is no IDE guidance on crafting it. To test it, a whole separate PMD process needs to be wired up to interpret the rule and evaluate it against a source file. Let’s see how a similar rule would look with ArchUnit:</p><pre>ArchRuleDefinition.priority(Priority.MEDIUM)<br>.noClasses()<br>.should()<br>.callConstructorWhere(<br>    // constructor does not have a zone argument<br>    target(doesNot(have(rawParameterTypes(DateTimeZone.class))))<br>    // constructor is for DateTime<br>        .and(targetOwner(assignableTo(DateTime.class)))<br>)</pre><p>This is type-safe Java code with a fluent API. It is also simple to unit test, as ArchUnit has a method to pass a rule object and class references to evaluate the rule against those classes.</p><h4>Class Relations</h4><p>Because ArchUnit processes the entire classpath with ASM, it retains a graph of the class data, allowing rules to easily traverse class relationships and call sites. This gives rules much more context about the code they are evaluating.</p><h3>Rule Libraries</h3><p>The first step was to build the ability to write ArchUnit rules which can be shared and published. In order to do this, we have the <a href="https://github.com/nebula-plugins/nebula-archrules-plugin?tab=readme-ov-file#authoring-rules">ArchRules Library Plugin</a>. This plugin adds an additional source set to your Gradle project called archRules.
In this source set, you can create a class which implements the ArchRulesService interface. This interface has a single abstract method which returns a Map&lt;String, ArchRule&gt;. The keys of this map are the names of your rules, and each ArchRule is a rule you define using the standard ArchUnit API. Here is an example:</p><pre>public class GuavaRules implements ArchRulesService {<br>    static final ArchRule OPTIONAL = ArchRuleDefinition.priority(Priority.MEDIUM)<br>        .noClasses()<br>        .should()<br>        .dependOnClassesThat()<br>        .haveFullyQualifiedName(&quot;com.google.common.base.Optional&quot;)<br>        .because(&quot;Java Optional is preferred over Guava Optional&quot;);<br><br>    @Override<br>    public Map&lt;String, ArchRule&gt; getRules() {<br>        Map&lt;String, ArchRule&gt; rules = new HashMap&lt;&gt;();<br>        rules.put(&quot;guava optional&quot;, OPTIONAL);<br>        return rules;<br>    }<br>}</pre><p>This code and its dependencies will not be bundled with your main code. It is bundled into a separate jar with the arch-rules classifier. When publishing, your library will publish this jar as a separate variant with the usage attribute set to arch-rules. This means that in order for downstream projects to use these rules, they must use <a href="https://docs.gradle.org/current/userguide/publishing_gradle_module_metadata.html">Gradle Module Metadata</a> for dependency resolution. There are two flavors of rule libraries: standalone rule libraries and bundled rule libraries.</p><h4>Standalone Rule Libraries</h4><p>A standalone rule library contains no main code, only archRules. These are useful for defining rules for code you don’t own, such as core Java APIs or OSS libraries. They are also useful for generic rules that can apply to any code, such as “don’t use code marked as @Deprecated”. We maintain a <a href="https://github.com/nebula-plugins/nebula-archrules">collection</a> of OSS standalone rule libraries which anyone is free to use, and which serve as examples of the types of rules you may want to write yourself. However, the real power of ArchRules is in “bundled rule libraries”.</p><h4>Bundled Rule Libraries</h4><p>A bundled rule library is a library with both main and archRules sources. The main source set will contain useful library code, whatever it may be. The archRules source set will contain rules specific to the usage of that library: for example, rules scoped to that library’s package, or referencing that library’s specific API. Whenever possible, we recommend writing rules in this bundled way. That is because the <a href="https://github.com/nebula-plugins/nebula-archrules-plugin?tab=readme-ov-file#running-rules">ArchRules Runner Plugin</a> will be able to automatically detect these rules and run them only in the source sets that use this library as a dependency. An example of this can be seen in our <a href="https://github.com/nebula-plugins/nebula-test/blob/main/src/archRules/java/com/netflix/nebula/test/archrules/NebulaTestArchRules.java">Nebula Test</a> library.</p><p>In any case, the library plugin will automatically generate a service loader registration entry for your ArchRulesService so that the runner can discover your rules.</p><h3>Running Rules</h3><p>The <a href="https://github.com/nebula-plugins/nebula-archrules-plugin?tab=readme-ov-file#running-rules">ArchRules Runner Plugin</a> allows rules to be evaluated against your code.
Standalone rule libraries can be evaluated against all source sets by adding them to the archRules configuration in your build. For example:</p><pre>dependencies {<br>    archRules(&quot;your:rules:1.0.0&quot;)<br>}</pre><p>As mentioned before, bundled rules will be evaluated automatically. To do this, the runner plugin creates a separate configuration for each of your source sets. In each of these configurations, the archRules classpath is combined with the runtimeClasspath with the arch-rules variant selected. This configuration is the classpath used when the ServiceLoader discovers implementations of ArchRulesService. In the following example, we have a Project which uses a test helper library as a testImplementation dependency, and also adds a standalone rules library to the archRules configuration. The test runtime classpath will only contain the implementation jar for the helper library, but the arch rules runtime will contain the archrules jar for the bundled rules and standalone rules. This all happens automatically.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*pWrmMaPCSm3XRIsTyMEpgQ.png" /><figcaption>Gradle configurations used by ArchRules</figcaption></figure><p>Once the rules classpath is determined, the runner plugin will create a Gradle work action to evaluate rules against that specific source set. This action runs with classpath isolation using the *archRuleRuntime configuration. Within this action, a ServiceLoader is used to discover rule definitions. The action ends by writing a binary serialization of rule violations to a file for reporting.</p><p>In a project running rules, you also have the ability to customize rule configurations using the archRules extension. For example, you can override a rule’s priority level:</p><pre>archRules {<br>    ruleClass(&quot;com.netflix.nebula.archrules.deprecation&quot;) {<br>        priority(&quot;HIGH&quot;)<br>    }<br>}</pre><p>Other <a href="https://github.com/nebula-plugins/nebula-archrules-plugin?tab=readme-ov-file#running-rules">customizations</a> include disabling running rules on certain source sets and configuring the failure threshold (i.e., high priority failures will cause the build to fail).</p><h3>Reporting</h3><p>The ArchRules runner plugin has two built-in reports: JSON and console. The json report will collect the output from all source sets within a project and create a single json file with all of the data. The console report also collects the output from all source sets within a project, but it prints to the console an easy to read report, for example:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*BJ2ONrMNHEFsBEks3WMPaQ.png" /><figcaption>Console Report output</figcaption></figure><p>Note that failure details feature a detailed plain English description, along with a pointer to the exact line of code in violation.</p><p>For custom reporting, you can either use the JSON file, or create your own task that reads the binary files. Take a look at the source code for the ArchRules runner plugin’s report tasks for an example of how to do this.</p><h3>Case Study Solution</h3><p>Going back to our original problem, using ArchRules, we were able to deliver a platform for library authors to track the usage of their APIs. 
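The lifecycle annotations from the case study (@Public and @Experimental) are plain marker annotations; one possible shape for them, assuming class-file retention so ArchUnit’s bytecode analysis can still see them (the retention and target choices here are our assumptions, not something the post specifies), is:</p><pre>import java.lang.annotation.ElementType;<br>import java.lang.annotation.Retention;<br>import java.lang.annotation.RetentionPolicy;<br>import java.lang.annotation.Target;<br><br>// Marks an API that downstream projects are meant to use.<br>// (Each annotation would live in its own file.)<br>@Retention(RetentionPolicy.CLASS)<br>@Target({ElementType.TYPE, ElementType.METHOD, ElementType.CONSTRUCTOR})<br>public @interface Public {}<br><br>// Marks a new API that may not yet be stable.<br>@Retention(RetentionPolicy.CLASS)<br>@Target({ElementType.TYPE, ElementType.METHOD, ElementType.CONSTRUCTOR})<br>@interface Experimental {}</pre><p>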
They write ArchRules to detect usage of the annotations, scoped to their library’s package, such as:</p><pre>ArchRuleDefinition.priority(Priority.MEDIUM)<br>    .noClasses().that(resideOutsideOfPackage(packageName + &quot;..&quot;))<br>    .should()<br>    .dependOnClassesThat(resideInAPackage(packageName + &quot;..&quot;).and(are(deprecated())))<br>    .orShould().accessTargetWhere(targetOwner(resideInAPackage(packageName + &quot;..&quot;))<br>        .and(target(is(deprecated())).or(targetOwner(is(deprecated())))))<br>    .allowEmptyShould(true)<br>    .because(&quot;Deprecated APIs are subject to removal&quot;);</pre><p>NB: the deprecated() predicate comes from <a href="https://github.com/nebula-plugins/nebula-archrules/blob/main/archrules-common/src/main/java/com/netflix/nebula/archrules/common/CanBeAnnotated.java">nebula-archrules</a>.</p><p>Our internal Nebula standard Gradle wrapper and plugin suite automatically enables the ArchRules runner on every project and provides a custom reporter which sends the report data to our Internal Developer Portal on every main-branch CI build. This way, library authors can easily see a report of all downstream consumers using their experimental, deprecated, or non-public APIs, giving them confidence to make “breaking” changes, knowing that they will not actually break downstream consumers. If their changes are currently blocked by downstream usage, they can easily see exactly which projects are reporting those usages.</p><h3>OSS Rule Libraries</h3><p>While the most powerful way to use ArchRules is to write your own rules, we have built some <a href="https://github.com/nebula-plugins/nebula-archrules">OSS rule libraries</a> that anyone is free to use, or reference as examples.</p><h4>Nullability</h4><p>These rules enforce proper nullability annotations in Java, for example, that every public class is marked with <a href="https://jspecify.dev/">JSpecify</a>’s @NullMarked. They are smart enough to exclude Kotlin code, as Kotlin has built-in nullability.</p><h4>Gradle Plugin Best Practices</h4><p><a href="https://docs.gradle.org/current/userguide/writing_plugins.html">Writing Gradle plugins</a> can be hard, especially since there are many APIs and patterns that should not be used anymore. These rules help enforce current best practices when writing Gradle plugins.</p><h4>Joda / Guava Rules</h4><p>These rule libraries discourage the use of Joda Time and Guava classes (respectively), as these have been superseded by java.time and standard library enhancements.</p><h4>Security Rules</h4><p>These rules help mitigate CVEs by detecting usage of known vulnerable APIs. Ideally, we keep dependencies up to date to mitigate CVEs. But sometimes that is not immediately feasible, and in those cases, a compile-time check to ensure the specific vulnerable API is not used is often good enough.</p><h3>Conclusion</h3><p>We are now running 358 (and counting) rules across over 5,000 repositories, detecting nearly 1 million issues. About 1,000 of these issues are for “High” priority rules. Being able to run these rules at this scale allows us to quickly gain insight into our large fleet of microservices and identify the areas carrying the most critical technical debt. This makes it easier to focus and prioritize our efforts.</p><p>Going forward, we will be exploring how to tie auto-remediation solutions into the ArchRules findings.
ArchUnit currently provides very specific and detailed information about failures in reports, which makes a very strong input signal to an auto-remediation tool. We will explore deterministic solutions such as <a href="https://docs.openrewrite.org/">OpenRewrite</a> and non-deterministic solutions such as LLMs. Pairing the easy rule authorship and deterministic results of ArchUnit with an auto-remediation tool that can correctly interpret the results to solve the issue at hand will be a very powerful combination.</p><p>We will also investigate how to get ArchRule failure information surfaced in the IDE as inspections.</p><p>If you have questions or feedback about Nebula ArchRules, reach out to us by posting in the #nebula channel on the <a href="http://gradle-community.slack.com">Gradle Community</a> Slack.</p><hr><p><a href="https://netflixtechblog.com/scaling-archunit-with-nebula-archrules-b4642c464c5a">Scaling ArchUnit with Nebula ArchRules</a> was originally published in <a href="https://netflixtechblog.com">Netflix TechBlog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Democratizing Machine Learning at Netflix: Building the Model Lifecycle Graph]]></title>
            <link>https://netflixtechblog.com/democratizing-machine-learning-at-netflix-building-the-model-lifecycle-graph-5cc6d5828bb1?source=rss----2615bd06b42e---4</link>
            <guid isPermaLink="false">https://medium.com/p/5cc6d5828bb1</guid>
            <category><![CDATA[mlops]]></category>
            <category><![CDATA[event-driven-architecture]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[distributed-systems]]></category>
            <category><![CDATA[knowledge-graph]]></category>
            <dc:creator><![CDATA[Netflix Technology Blog]]></dc:creator>
            <pubDate>Mon, 04 May 2026 16:01:02 GMT</pubDate>
            <atom:updated>2026-05-15T18:44:41.436Z</atom:updated>
            <content:encoded><![CDATA[<p><a href="https://www.linkedin.com/in/saishsali/">Saish Sali</a>, <a href="https://www.linkedin.com/in/nipunk/">Nipun Kumar</a>, <a href="https://www.linkedin.com/in/suraelamurugu/">Sura Elamurugu</a></p><h3>Introduction</h3><p>As Netflix has grown, machine learning continues to support our ability to deliver value to members and drive excellence across multiple areas of our business. When Netflix began investing in machine learning over a decade ago, it was primarily focused on a single domain: personalization. Scala was the industry standard, our ML teams were relatively small, and optimizing member engagement was our primary use case. Fast forward to today, and machine learning has become the backbone of Netflix’s business transformation. We now apply ML across various business domains, including:</p><ul><li><strong>Personalization</strong>: Optimizing engagement and helping members discover content they’ll love</li><li><strong>Studio</strong>: Pre and post-production workflows</li><li><strong>Payments</strong><em>: </em>Fraud detection, payment routing, and recurring billing optimization</li><li><strong>Ads</strong>: Our newest domain, requiring real-time decisioning and targeting</li></ul><p>… and a growing number of additional use cases across the company</p><p>Each domain operates with a different tech stack, different business metrics, and a distinct organizational structure. While this diversity is a testament to how machine learning has evolved to drive value across many verticals at Netflix, this growth introduces a new challenge: <strong>enabling cross-pollination of models and data across domains.</strong></p><h3>The Challenge: A Fragmented ML Landscape</h3><p>As our ML investments scaled across these domains, a critical problem emerged: the models produced largely became black boxes. Without any discovery infrastructure, ML practitioners couldn’t easily collaborate or share work across business verticals.</p><p>Consider a concrete example: <a href="https://netflixtechblog.com/mediafm-the-multimodal-ai-foundation-for-media-understanding-at-netflix-e8c28df82e2d">content embeddings</a>. Our Studio teams create sophisticated embeddings that identify scene boundaries, detect visual transitions, and understand content structure. These embeddings were originally built for production workflows.</p><p>But those same embeddings could be incredibly valuable elsewhere. Ads could hypothetically use content embeddings for context matching (ensuring advertisements align with the tone and content of what’s currently playing). Personalization could leverage them for episodic merchandising and recommendations (matching the topic or mood of an episode with a user’s preferred viewing preferences). Yet making this cross-pollination happen is extraordinarily difficult.</p><p>Why? Our ML tools exist in silos, each with its own backend services and user interface. The model registry is unaware of which A/B tests were using its models, and the pipeline orchestrator is unaware of downstream model dependencies. ML practitioners have to traverse multiple systems to answer basic questions about their work. Finding a model requires opening the model registry, understanding its lineage means switching to the pipeline orchestrator, and tracking which A/B tests use that model requires navigating to the experimentation platform. This fragmentation prevents practitioners from answering critical questions:</p><ul><li><strong>Discovery: </strong>What features exist? 
What data sources are available for generating features for a model?</li><li><strong>Lineage:</strong> Which pipeline is generating data for a specific model? What data sources feed those features?</li><li><strong>Impact:</strong> Which A/B tests are running this model? Which models will break if I change this feature? Who owns each piece of this chain?</li></ul><h3>The Hard Problem: Connecting Everything</h3><p>The real challenge wasn’t just building a consolidated UI. We needed to connect the different pieces of infrastructure our ML practitioners were using to perform different parts of the ML lifecycle.</p><p>Our ML ecosystem generates metadata from dozens of sources:</p><ul><li>Pipeline orchestration systems emit execution details, stage dependencies, and data transformations</li><li>Deployed model registry tracks model versions, artifacts, staleness, and deployment history</li><li>Experimentation platform manages A/B tests and their configurations</li><li>Feature store catalogs feature definitions and usage</li><li>AI Dataset platform tracks the creation, management, discovery, and loading of datasets</li><li>Identity platform maintains user, team, and organization metadata</li></ul><p>Each system employs different formats, identifiers, and mental models. The hard technical problem we had to solve was: <strong>How do we collect this heterogeneous metadata, transform it into a unified entity model, and build a connected graph that enables true exploration and collaboration across business domains?</strong></p><h4>The Solution: Metadata Service and the Model Lifecycle Graph</h4><p>Our answer was the Metadata Service (MDS), which builds a Model Lifecycle Graph that indexes and connects ML-related entities across Netflix. MDS is optimized for real-time ingestion of ML metadata (e.g., models, features, pipelines, experiments, datasets) and for answering cross-domain questions such as “Which experiments are running this model?” or “Which models share these features?” It is the foundation that enables discovery: it ingests events from diverse sources, enriches them with context, and materializes relationships across entities.</p><p>Our vision: to make every ML asset at Netflix discoverable, understandable, and reusable by every ML practitioner, regardless of their team or domain.</p><h3>Core Abstractions: The Vocabulary of the System</h3><p>Before diving into the technical implementation, it’s helpful to understand the conceptual model that underpins MDS. This vocabulary enables consistent communication across teams and systems:</p><p><strong>Component:</strong> Any object that is uniquely addressable using an AI Platform (AIP) Uniform Resource Identifier (URI). An AIP URI follows the format aip://&lt;componentType&gt;/&lt;platformId&gt;/&lt;resourceId&gt;, ensuring global uniqueness. For example:</p><ul><li>Models: aip://model/registry/ranking-v5</li><li>Users: aip://user/identity/alice</li><li>Pipelines: aip://pipeline/orchestrator/weekly-training</li></ul><p><strong>Entity:</strong> A component within the ML ecosystem, characterized by additional properties such as name, description, creation date, and owners. Entities represent ML-specific assets, such as models, features, and pipelines.</p><p><strong>Entity Type:</strong> A group of entities that share the same data shape.
A data shape is a set of property constraints that specify the attributes and relationships an entity must have.</p><p><strong>Domain:</strong> A functional grouping of related entity types that defines the abstract interface for a category of ML assets. For example, the Models domain defines what a Model and Model Instance look like, while the Pipelines domain defines Schedules, Requests, and Executions.</p><p><strong>Provider:</strong> A concrete implementation of a domain, backed by a specific source system. For example, the Models domain is currently backed by our internal model registry. This separation allows MDS to support multiple providers for the same domain. If a new model registry were introduced, it could be added as an additional provider without changing the domain interface.</p><p>We can summarize these concepts with a concrete example:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*RQuXyzwooTZcUZ5rCejOug.png" /></figure><p>This URI-based addressing scheme is crucial as it allows any service to reference any ML asset with a single string, and MDS can resolve that reference back to rich, connected metadata.</p><h3><strong>From Events to Entities to Graph</strong></h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*dXe2XJMjTpZ6o5wwvVxGUA.png" /></figure><p>The journey from raw system events to a queryable graph happens in stages. Let’s walk through each with a concrete example: connecting a model to its A/B tests through relationship inference.</p><h4>1 Event Ingestion</h4><p>MDS integrates with various source systems via Kafka and AWS SNS/SQS, consuming events in real-time. Source systems emit thin events that include an identifier and an event type.</p><p>Example event:</p><pre>{<br>  &quot;event_type&quot;: &quot;model_instance_created&quot;,<br>  &quot;instance_id&quot;: &quot;ranking-model-v5-20XX0101&quot;,<br>  ...<br>}</pre><p>This design keeps producers simple. Source systems only need to announce that a change occurred, without building complete payloads or understanding downstream requirements.</p><p>Each source system has dedicated event handlers in MDS:</p><ul><li><strong>Pipeline Orchestration</strong>: Ingests pipeline execution events, including node definitions, schedules, requests, and job attempts</li><li><strong>Model Registry</strong>: Captures model deployments, configurations, and version updates</li><li><strong>Feature Store</strong>: Tracks feature definitions and their versions</li><li><strong>Experimentation Platform</strong>: Monitors A/B test configurations and allocations</li><li><strong>Datasets:</strong> Tracks ML datasets and their versions</li><li><strong>Identity Platform</strong>: Maintains ownership and team membership information</li></ul><h4>2 Entity Enrichment</h4><p>MDS implements a hydration contract for each event type. When an event arrives, MDS:</p><ol><li>Validates the event schema</li><li>Calls the source system’s API to fetch the complete, current state</li><li>Transforms the response into a normalized entity</li></ol><p>This design has a crucial property: the order of events doesn’t matter. MDS always fetches the latest facts from the source of truth. This pattern decouples the event stream from state consistency. If the event bus drops a message or delivers it out of order, the next event corrects the state. The event stream becomes a notification of change rather than a log of changes.</p><p>This notification of change pattern has a few important tradeoffs. 
On the plus side, it keeps producers simple, makes us robust to out-of-order or dropped events, and ensures that MDS can always reconcile to the latest state by reading from the source of truth. The tradeoff is that we place additional read load on source systems during hydration and need to be deliberate about rate limiting, caching, and backoff in our enrichment workers so that we don’t overload them.</p><p>For our ranking model example, when the model_instance_created event arrives, MDS calls the Model Registry API: GET /api/v1/instances/ranking-model-v5-20XX0101</p><p>The registry responds with a full descriptor. Example response (key fields only):</p><pre>{<br>  &quot;id&quot;: &quot;ranking-model-v5-20XX0101&quot;,<br>  &quot;pipeline_run_id&quot;: &quot;train-weekly-ranking-20XX0101&quot;,<br>  &quot;owner_emails&quot;: [&quot;alice@netflix.com&quot;],<br>  &quot;labels&quot;: [{&quot;key&quot;: &quot;team&quot;, &quot;value&quot;: &quot;personalization&quot;}],<br>  ...<br>}</pre><h4>3 Data Transformation and Normalization</h4><p>Raw events are heterogeneous and each source system has its own schema and semantics. MDS workers transform these events into a unified entity model with standardized fields.</p><p>Without normalization, downstream consumers would need to understand every source system’s schema. Normalization creates a consistent interface, allowing queries and relationships to work across all entity types. Here is an example.</p><p>Normalized MDS entity:</p><pre>{<br>  &quot;id&quot;: &quot;aip://model/registry/ranking-model-v5-20XX0101&quot;,<br>  &quot;pipeline_run&quot;: &quot;aip://pipeline-run/orchestrator/train-weekly-ranking-20XX0101&quot;,<br>  &quot;entity_type&quot;: &quot;ModelInstance&quot;,<br>  &quot;owners&quot;: [&quot;aip://user/identity/alice&quot;],<br>  &quot;tags&quot;: [{&quot;tag&quot;: &quot;team&quot;, &quot;value&quot;: &quot;personalization&quot;}],<br>  ...<br>}</pre><p>The normalization process standardizes field names and formats. For example, platform-specific IDs become global AIP URIs, owner_emails becomes owners with resolved user URIs, and labels become tags. Foreign keys like pipeline_run_id are transformed into entity references. However, there’s still no reference to which A/B tests are using this model. The Model Registry doesn’t track experiments, and the Experimentation Platform doesn’t track which pipeline produced a given model. This is where knowledge enrichment becomes critical.</p><h4>4 Storage and Indexing</h4><p>Once normalized, entities are persisted to Datomic and immediately indexed in Elasticsearch. This happens synchronously within the event processing flow.</p><p><strong>Datomic for Caching and Relationships</strong><br>Normalized entities are first written to Datomic, which serves as both a local cache and a graph database.</p><p>Why Datomic? Datomic serves as both the system of record for MDS and the working dataset for enrichment processes. 
Its immutable fact model means we can continuously add relationships without losing the original entity state.</p><p><strong>What we store:</strong></p><ul><li>All entity attributes as facts</li><li>Entity references (foreign keys that may point to entities not yet fully resolved)</li><li>All relationships as reified edges (added by enrichment processes)</li><li>Entity lifecycle state (tracking which entities are fully enriched vs awaiting hydration)</li></ul><p><strong>This enables:</strong></p><ul><li><strong>Complex graph traversals:</strong> Navigate from a model to its features to their data sources in a single query</li><li><strong>Entity relationships:</strong> Join across multiple domains without N+1 query problems</li><li><strong>Flexible schema evolution:</strong> Easy to add new entity types and attributes as the catalog grows</li><li><strong>Progressive enrichment</strong>: Background jobs efficiently identify and process entities requiring additional hydration, enabling gradual graph completion without reprocessing fully enriched entities</li></ul><p>In practice, we use Datomic for relationship-heavy, navigational queries such as:</p><ul><li>Starting from this model instance, show me all upstream datasets and downstream experiments.</li><li>Given this feature, list all consuming models and their owning teams.</li></ul><p>These queries often span multiple hops in the graph and benefit from Datomic’s immutable fact model and efficient joins across entity relationships.</p><p><strong>Elasticsearch for Discovery</strong><br>Immediately after writing to Datomic, entities are indexed in Elasticsearch to power fast, full-text search across the catalog.</p><p><strong>What we index:</strong></p><ul><li>Primary fields: Entity name, description, entity type, owner names</li><li>Relationship metadata: Names of related entities (e.g., a model’s features, pipelines, A/B tests) stored in the related field</li><li>Tags: Domain-specific metadata stored as key-value pairs (e.g., <em>team::personalization, env::production, model.state::released</em>)</li></ul><p><strong>Index structure:</strong></p><ul><li>Single entities index: All entity types (models, features, pipelines, etc.) are indexed in one unified index, differentiated by the entityType field</li><li>Separate owners index: Dedicated index for users and groups to enable cross-entity owner searches</li><li>Relevance boosting: Exact name matches score higher than other relevant matches</li></ul><p><strong>This enables:</strong></p><ul><li>Multi-field text search across entity names, descriptions, tags, and related metadata</li><li>Relevance ranking with boosting (exact name matches score significantly higher)</li><li>Complex filtering by entity type, ownership, tags, and domain-specific attributes (stored as tags)</li><li>Fuzzy matching to handle typos and partial queries</li></ul><p>Elasticsearch powers the entry point into the system: users typically start with a free-text search in the AIP Portal (for a model name, a team, or a domain term), and then switch to graph navigation once they land on an entity page. Indexing happens in near real-time as part of the ingestion and enrichment workflows, so changes are usually visible in the Portal with a short delay that is acceptable for interactive use.</p><h4>5 Knowledge Enrichment and Graph Formation</h4><p>Once entity metadata is persisted in Datomic, scheduled background processes take over to discover and materialize relationships. 
These enrichment jobs run periodically, scanning for uncached or partially resolved entities (entities that exist only as references without full metadata).</p><p>The enrichment workflow:</p><ul><li><strong>Identify candidates:</strong> Find entities marked as uncached or with unresolved references</li><li><strong>Hydrate relationships:</strong> Query source-of-truth systems to fetch related entity details</li><li><strong>Materialize edges:</strong> Write discovered relationships back to Datomic</li><li><strong>Re-index:</strong> Trigger Elasticsearch indexing for updated entities</li><li><strong>Mark as enriched:</strong> Update entity status to prevent redundant processing</li></ul><p>This asynchronous approach allows MDS to handle the computational cost of graph formation without blocking real-time event ingestion. It also enables retry logic and gradual enrichment as new entities become available.</p><p>Because enrichment is asynchronous, newly discovered relationships may appear with a short delay after the underlying entities are created (typically minutes rather than seconds). We track when each entity was last enriched and surface this timestamp in the AIP Portal, so practitioners can reason about staleness and know when it’s safe to rely on a particular relationship for debugging or impact analysis.</p><p><strong>Why enrich?</strong> Source systems are purpose-built and don’t know about entities in other domains. Enrichment discovers and materializes cross-system relationships that enable powerful lineage and impact queries.</p><h4>Example: Connecting Models to A/B Tests</h4><p>When MDS processes a new model instance, background enrichment jobs discover relationships through multi-hop inference:</p><p><strong>Step 1: Direct link to pipeline</strong></p><p>The model references a pipeline_run_id. An enrichment job hydrates the pipeline and discovers its A/B test associations: GET /api/v1/pipeline-runs/train-weekly-ranking-20XX0101</p><p>Response:</p><pre>{<br>&quot;run_id&quot;: &quot;train-weekly-ranking-20XX0101&quot;, &quot;pipeline&quot;:  &quot;weekly-ranking-trainer&quot;,<br>&quot;ab_test_cells&quot;: [<br>   {&quot;test_id&quot;: &quot;12345&quot;,&quot;cell_number&quot;: 2,&quot;cell_name&quot;: &quot;treatment_ranking_v5&quot;}<br> ]<br> ...<br>}</pre><p><strong>Step 2: Discover A/B test context</strong><br>The enrichment job discovers the pipeline ran for A/B test cell #2 and queries the Experimentation Platform for test details: GET /api/v1/tests/12345</p><pre>{<br> &quot;test_id&quot;: &quot;12345&quot;,<br> &quot;name&quot;: &quot;Ranking Model v5 vs v4&quot;,<br> &quot;status&quot;: &quot;ACTIVE&quot;,<br> &quot;cells&quot;: [{&quot;cell_number&quot;: 1, &quot;name&quot;: &quot;control_ranking_v4&quot;}],<br> ...<br>}</pre><p><strong>Step 3: Infer transitive relationships</strong><br>The enrichment job now has the complete chain:</p><ul><li>Model Instance was produced by Pipeline Run</li><li>Pipeline Run was executed for A/B Test Cell #2</li><li>The A/B Test Cell #2 belongs to A/B Test “Ranking Model v5 vs v4”</li><li>Model Instance now gets associated with this A/B Test</li></ul><p>The job writes the inferred relationship back to Datomic and triggers re-indexing, and materializes these edges in the graph. 
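Condensed, the enrichment job’s logic looks roughly like the following sketch (the client and store interfaces are invented here for illustration; per the earlier note, the real workers also need rate limiting, caching, and backoff):</p><pre>// Hedged sketch of the multi-hop inference described above; all types are illustrative.<br>class ModelToAbTestEnricher {<br>    PipelineClient pipelineClient;       // wraps the Pipeline Orchestrator API<br>    ExperimentationClient abClient;      // wraps the Experimentation Platform API<br>    GraphStore graphStore;               // Datomic-backed edge storage<br>    SearchIndexer searchIndexer;         // Elasticsearch indexing<br><br>    void enrich(ModelInstance model) {<br>        // Step 1: follow the direct link from the model to its pipeline run.<br>        PipelineRun run = pipelineClient.getRun(model.pipelineRunId());<br><br>        // Step 2: the run records which A/B test cells it was executed for.<br>        for (AbTestCell cell : run.abTestCells()) {<br>            AbTest test = abClient.getTest(cell.testId());<br><br>            // Step 3: materialize the transitive edge from model to A/B test.<br>            graphStore.addEdge(model.uri(), &quot;ASSOCIATED_AB_TEST&quot;, test.uri());<br>        }<br><br>        // Re-index so the new relationship is searchable in the portal.<br>        searchIndexer.reindex(model.uri());<br>    }<br>}</pre><p>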
MDS doesn’t just store what it’s told; it derives new knowledge by <em>walking</em> the graph in the background.</p><p><strong>Why this matters:</strong> Without MDS, answering “Which A/B tests are using this model?” requires:</p><ol><li>Looking up the model in the Model Registry</li><li>Finding which pipeline produced it</li><li>Checking the Pipeline Orchestrator for A/B test tags</li><li>Querying the Experimentation Platform for test details</li></ol><p>With the model lifecycle graph, it’s a single query:</p><pre>query {<br>  model(id: &quot;aip://model/registry/ranking-model-v5-20XX0101&quot;) {<br>    name<br>    owners { name }<br>    currentInstance {<br>      version<br>      pipeline {<br>        name<br>        owners { name }<br>      }<br>      features {<br>        edges {<br>          node {<br>            name<br>            data { edges { node { name } } }<br>          }<br>        }<br>      }<br>      associatedAbTests {<br>        name<br>        cells { number name }<br>      }<br>    }<br>  }<br>}</pre><p>The reverse query also works: “What models are being tested in experiment 12345?”</p><h3>Enabling Exploration, Not Just Search</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/682/1*j8moOl19CHOfIDRvfnPk5A.png" /></figure><p>With the Model Lifecycle Graph in place, we shift from entity search to entity exploration. Discovery isn’t just about finding a model; It’s about traversing relationships:</p><ul><li>Start with a model, explore its features</li><li>From features, navigate to the core data driving them</li><li>From the data, trace back to the pipelines generating it</li><li>From pipelines, see which teams own and depend on them</li><li>From experiments, understand which models are being tested</li></ul><p>For example, imagine an engineer investigating a degraded engagement metric for a personalization model. They might:</p><ol><li>Start with the model instance powering the affected recommendations in the AIP Portal.</li><li>Inspect the model’s features and follow a suspicious feature to its upstream dataset.</li><li>From the dataset page, see that its pipeline recently had failed runs and identify the owning team.</li><li>Confirm which A/B tests are currently running this model instance to understand which members and surfaces are impacted.</li></ol><p>Before MDS and the Model Lifecycle Graph, this required manual checks across multiple tools (model registry, pipeline orchestrator, experiment platform). 
Now it’s a contiguous journey in a single interface.</p><p>This graph-based exploration answers questions that were previously impossible:</p><ul><li>Lineage queries: What is the complete lineage of this model, from training data to production experiments?</li><li>Impact analysis: Which models will be affected if I change this feature?</li><li>Usage discovery: Which A/B tests are using this model?</li><li>Dependency mapping: What data sources does my pipeline transitively depend on?</li><li>Deprecation planning: Which entities are no longer being used and can be retired?</li></ul><p>Every entity has deep context: its creation time, ownership, update history, and most importantly, its relationships to other entities.</p><p>The Model Lifecycle Graph is surfaced to practitioners through the AIP Portal, a unified interface that provides full-text search across all entity types, detailed entity pages with navigable relationships, and personalized views for teams and individuals.</p><p>A typical interaction in the AIP Portal looks like:</p><ul><li><strong>Search:</strong> Type a model, feature, dataset, or team name into the single search box backed by Elasticsearch.</li><li><strong>Inspect:</strong> Land on an entity page that shows key metadata (description, owners, domains, tags) alongside a relationships panel.</li><li><strong>Explore:</strong> Click through to related entities (upstream datasets, downstream experiments, and sibling model versions) to navigate the Model Lifecycle Graph without leaving the portal.</li></ul><p>When new entity types are introduced into MDS, the portal automatically provides baseline search, entity pages, and relationship navigation, and we can then layer on domain-specific visualizations (such as model deployment history or dataset version timelines) over time.</p><h3>The Road Ahead: Open Challenges</h3><p>Building the ML lifecycle graph is an ongoing journey. Significant challenges remain, and these represent the future opportunities for us:</p><ul><li><strong>Tool Proliferation:</strong> As new ML tools emerge, we need robust integration patterns that scale. How do we design plugin architectures that make adding new sources seamless? If we don’t keep up with new tools, practitioners will be forced back into fragmented views, and the Model Lifecycle Graph will lose coverage and trust.</li><li><strong>Domain-Specific Visualizations:</strong> Different entity types require distinct visualization experiences. Model pages should display deployment history, A/B test associations, and performance metrics. Feature pages should highlight data lineage and consuming models. Pipeline pages must show execution history, dependencies, and schedules. Dataset pages require versioning timelines and downstream consumers. How do we design a flexible UI framework that allows each entity type to have its own tailored experience while maintaining consistent navigation and interaction patterns across the portal? Without rich, domain-specific experiences, the portal risks becoming a generic catalog rather than a tool that ML practitioners rely on in their daily workflows.</li><li><strong>Metadata Quality:</strong> Today, MDS ensures data consistency through source-of-truth hydration and schema validation at ingestion. Background enrichment jobs continuously infer relationships and materialize entities from source systems. However, challenges remain in ensuring completeness and timeliness at scale. 
When source systems fail to emit events, when ownership information becomes stale, or when entities lack descriptions and contextual metadata, the graph’s utility degrades. How do we build automated validation and enrichment systems to detect metadata anomalies, suggest missing relationships, and maintain quality benchmarks across millions of entities? Poor or stale metadata erodes practitioner trust: if the graph is incomplete or incorrect, teams will revert to ad hoc knowledge and one-off integrations rather than using MDS as their source of truth.</li><li><strong>Advanced Relationship Inference:</strong> Beyond explicit relationships declared in source systems, how do we infer implicit connections? Can we detect that two models serve similar purposes based on shared features? Can we recommend features based on usage patterns from similar pipelines? We are in the early stages of exploring these ideas. Done well, they would turn MDS from a passive catalog into an active recommendation engine for ML assets, accelerating reuse and reducing duplicate work across domains.</li></ul><h3>Acknowledgments</h3><p>This work represents the collective effort of stunning colleagues across the AI Platform organization: <a href="https://www.linkedin.com/in/emma-carney-6a700b17a/">Emma Carney</a>, <a href="https://www.linkedin.com/in/megan-ren-7b78a81a8/">Megan Ren</a>, <a href="https://www.linkedin.com/in/nadeem-ahmad-80000983/">Nadeem Ahmad</a>, <a href="https://www.linkedin.com/in/poleniuk/">Pat Oleniuk</a>, <a href="https://www.linkedin.com/in/prateekagarwal17/">Prateek Agarwal</a>, <a href="https://www.linkedin.com/in/tikhakobyan/">Tigran Hakobyan</a>, <a href="https://www.linkedin.com/in/yinglao-liu-6b48b6126/">Yinglao Liu</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=5cc6d5828bb1" width="1" height="1" alt=""><hr><p><a href="https://netflixtechblog.com/democratizing-machine-learning-at-netflix-building-the-model-lifecycle-graph-5cc6d5828bb1">Democratizing Machine Learning at Netflix: Building the Model Lifecycle Graph</a> was originally published in <a href="https://netflixtechblog.com">Netflix TechBlog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[State of Routing in Model Serving]]></title>
            <link>https://netflixtechblog.com/state-of-routing-in-model-serving-16e22fe18741?source=rss----2615bd06b42e---4</link>
            <guid isPermaLink="false">https://medium.com/p/16e22fe18741</guid>
            <category><![CDATA[ai-platform]]></category>
            <category><![CDATA[distributed-systems]]></category>
            <category><![CDATA[infrastructure]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Netflix Technology Blog]]></dc:creator>
            <pubDate>Fri, 01 May 2026 21:03:13 GMT</pubDate>
            <atom:updated>2026-05-01T21:22:25.368Z</atom:updated>
            <content:encoded><![CDATA[<p>By <a href="https://www.linkedin.com/in/nipunk/">Nipun Kumar</a>, <a href="https://www.linkedin.com/in/rajatsshah/">Rajat Shah</a>, <a href="https://www.linkedin.com/in/peterchng/">Peter Chng</a></p><h3>Introduction</h3><p><em>This is the first blog post in a multi-part series that shares technical insights into how our ML model serving infrastructure powers several personalized experiences at scale across various domains (e.g., title recommendations, commerce). In this introductory blog post, we will dive into our domain-independent API abstraction and its traffic routing capabilities that the central ML model serving platform exposes to several domain-specific microservices for model inference. This singular API, or entry point, into the ML model serving platform has significantly increased the speed of innovation for iterating on newer versions of existing ML experiences, as well as enabling completely new product experiences with ML.</em></p><p>Machine Learning use cases powering member experiences on Netflix require rapid iteration and evolution in response to new learnings. The success of our ML model serving infrastructure largely depends on enabling researchers to rapidly experiment with new hypotheses and safely, at scale, release their models into production. Equally important is enabling multiple microservices at Netflix to seamlessly get model inference without exposing the complexities of ML model inference. To achieve this in a uniform and scalable manner, we created a centralized ML serving platform. As of 2025, the platform serves hundreds of model types and versions, netting 1 million requests per second. In this post, we’ll zoom in on a core challenge of any large-scale ML serving system: How to route traffic to the right model instance, on the right cluster shard, for the right user and use case, while preserving a simple abstraction for both client services and model researchers.</p><h3>Background</h3><h3>Models at Netflix</h3><p>To properly frame our discussion, let’s first clarify the distinction between model <em>serving</em> and model <em>inference</em>. At Netflix, the definition of an ML model has historically been somewhat unique. While model <em>inference</em> typically focuses only on an infer(features) -&gt; score capability, models at Netflix act as self-contained workflows that transform inputs to outputs. A “model” encapsulates pre- and post-processing, feature computation logic, and an optional ML-trained component, all packaged in a standard format suitable for use across multiple contexts. We refer to the end-to-end execution of this workflow as model <em>serving</em>. 
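To make the contrast concrete, here is a rough Java sketch (all interface and type names are invented for illustration):</p><pre>// Narrow model inference: features in, score out.<br>interface ModelInference {<br>    double infer(Features features);<br>}<br><br>// Netflix-style model serving: a self-contained workflow that also owns<br>// pre-processing, feature computation from facts, and post-processing.<br>interface ModelServingWorkflow {<br>    Output serve(RequestContext requestContext, DomainContext domainContext);<br>}</pre><p>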
This distinction matters because our routing and API abstractions operate at the level of workflows, not just individual scoring functions.</p><p>A few <em>simplified</em> examples of model serving use cases:</p><p><strong>Use case</strong>: Personalized Continue Watching row on Netflix Homepage</p><ul><li>Input: UserId, Country, Device ID</li><li>Output: Ranked List of movies and shows (aka title): [titleId1, titleId2, titleId3,…]</li></ul><p><strong>Use case</strong>: Payment Fraud Detection</p><ul><li>Input: UserId, Country, Payment Transaction details</li><li>Output: Probability of the transaction being fraudulent</li></ul><p>A typical flow of this serving workflow is depicted below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/804/1*V0IaBLQSyADbjdnlhRDJdA.png" /></figure><p>To achieve this higher level of abstraction, the model definition contains a list of facts (raw, unprocessed data or observations built as states in different business workflows) that it needs to compute features, and it relies on the model serving platform to supply these facts at serving time by calling several other microservices. Likewise, during offline training, <a href="https://netflixtechblog.com/evolution-of-ml-fact-store-5941d3231762">Netflix’s ML fact store</a> provides snapshots for bulk access to facilitate feature computation.</p><p>The important takeaway from this model definition is that the calling services only need to provide standard request context (such as userId, country, device), and the relevant domain context (such as titles to rank, or payment transaction for fraud detection), and the model can itself compute features and perform inference as part of the execution flow. This common set of request contexts across domains enables them to share a standard API abstraction and standardizes how various client microservices can uniformly integrate with the serving app. Furthermore, clients are shielded from the model selection and execution, allowing the model architecture and data inputs to evolve with minimal client coordination.</p><p>This post focuses on showcasing the technical details to support this design paradigm. We’ll first describe how we implemented this abstraction with Switchboard, a centralized routing service, and then discuss the operational challenges we encountered at scale and how they led us to the Lightbulb architecture.</p><h3>ML Model Serving Platform Principles</h3><p>We envisioned a central model serving platform for all of Netflix’s member-facing ML Model serving needs. This ambitious effort required principled thinking to provide the right level of abstraction for both the researchers and client applications. The following ideas, which are relevant to the topic of this blog post, ensured that the platform acts as an enabler of rapid ML innovation and limits the exposure of ML model iterations to the client apps:</p><ul><li><strong>Model innovation independent of client apps: </strong>There should be only a one-time integration effort by the calling app with the ML serving platform for a new use case. After that, almost all model iterations, including intermediate model A/B experiments, should be mostly opaque to the calling apps. This implies that the platform should handle tasks such as model selection based on a user’s A/B allocation, fetching additional data needed by experimental models, logging for further training or observability, and more. 
This also benefits the ML researcher, as they only need to coordinate with one platform for model innovation.</li><li><strong>Decouple clients from model sharding: </strong>Models are distributed across multiple serving compute cluster shards, each with its own Virtual IP (VIP) Address. Various factors, such as traffic patterns, SLAs, model architecture, and CPU/Memory availability, affect model-to-cluster mapping, and changes to this mapping result in changes to the VIP address at which a model is reachable. The serving platform should make clients agnostic to such frequent VIP address changes while ensuring high availability.</li><li><strong>Flexible traffic routing rules: </strong>Support flexible mechanisms to introduce new traffic routing rules. This includes supporting traffic routing based on A/B experiments, providing a knob to slowly shift traffic to new models and VIP addresses, and allowing client overrides.</li></ul><h3>Introducing Switchboard</h3><p>Standard out-of-the-box API Gateway solutions (such as AWS API Gateway, a standalone Service Mesh proxy) did not meet all our requirements. In particular, we needed first-class integration with Netflix’s experimentation platform, the ability to expose gRPC endpoints to clients, and the ability to use rich domain-specific context for routing customizations, which generic proxies were not designed to handle. Furthermore, the platform required customizations to model-specific lifecycle stages (shadow mode, canaries, rollbacks) to enable safe rollouts and migrations.</p><p>Hence, we embarked on building a custom service that serves as a flexible proxy layer for all traffic, handling over 1 million requests per second while maintaining high availability and reliability. We named it Switchboard.</p><p>Switchboard serves as the central entry point for the system, <strong>acting as a mandatory interface </strong>for all clients to access the appropriate model based on their context. Its role is to perform context-aware routing and to apply any configured context enrichment to the model inputs.</p><p>Here is a visual representation of the request flow from different clients to different serving clusters:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/935/1*HEV6B_6F5ci3dyoKXyq55A.png" /></figure><h3>Objective Abstraction</h3><p>To support this system design, we introduce the concept of an “Objective”. It’s an Enumeration defined by the serving platform that every request into the system must provide. It has three key purposes:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/703/1*V6bhyHhYQ4W5baVzvygW9g.png" /></figure><p>In short, an <strong>Objective</strong> is the serving platform’s name for a specific business use case (e.g., ContinueWatchingRanking), which decouples clients from concrete models and guides the platform’s routing and model selection decisions.</p><h3>Key Capabilities of Switchboard</h3><p>To summarize, these are the key capabilities of Switchboard:</p><ol><li><strong>Common Client Abstraction: </strong>Switchboard provides a single point of contact for all our clients’ model needs. When clients wish to consume additional models for new ML applications addressing the same business need, there is no new service dependency to introduce or new clients to manage to make requests to the models. 
From an ML Ops perspective, this also gives us knobs to control client rate limits across model versions and manage central concurrency limits to deal with bad clients.</li><li><strong>Context-Aware Routing:</strong> Switchboard can route a request based on a rich set of contextual features, such as the user’s current device, locale, ranking surface type (e.g., home page vs. search results), or the current A/B test a user is in.</li><li><strong>Dynamic Traffic Splitting:</strong> It enables real-time traffic splitting for canary deployments and experimentation. This allows engineers to safely roll out a new model version to a small, controlled percentage of users before a full launch.</li><li><strong>Model Versioning and Lifecycle Management:</strong> Switchboard inherently manages concurrent request traffic to multiple versions of the same model. This is crucial for:</li></ol><ul><li><strong>Shadow Mode Testing:</strong> Routing production traffic to a new model version without affecting the user experience, enabling performance comparisons.</li><li><strong>Instant Rollback:</strong> Immediate switching of traffic away from a problematic new model version back to a stable one.</li></ul><p>But is this the whole story? Not quite. Introducing this routing layer adds complexity to our model deployment cycles. In addition, we need a mechanism to collect the context-based routing information from the researchers when they choose to deploy model variants.</p><h3>The Glue — Switchboard Rules</h3><p>Given that Objectives serve as the contract between clients and the serving platform, we needed a way for researchers to attach model variants, experiments, and traffic splits to those Objectives without changing client code. This is where Switchboard Rules comes in.</p><p>The primary UX for model researchers to define models associated with an objective in a flexible manner is a JavaScript configuration, which we call <em>Switchboard Rules</em>. It’s used to produce a set of rules (typically a JSON file) that primarily dictate the following things to the serving platform:</p><ol><li>The default model to use for a given Objective</li><li>A/B experiments to configure for a set of Objectives and the corresponding models to load for those experiments</li><li>Customizations to gradually shift traffic to a new model</li></ol><p>Here is an example of an A/B test rule in the context of the Continue Watching row:</p><pre>/**<br>Configuration rule written by a Model Researcher to add an A/B experiment in the Model Serving system.<br>Cell 1: Uses the default, currently productized model<br>Cell 2 and Cell 3: Use different experimental (candidate) models<br>**/<br><br>function defineAB12345Rule() {<br>    const abTestId = 12345;<br><br>    const objectives = Objectives.ContinueWatchingRanking;<br>    const abTestCellToModel = {<br>        1: {name: &quot;netflix-continue-watching-model-default&quot;},<br>        2: {name: &quot;netflix-continue-watching-model-cell-2&quot;},<br>        3: {name: &quot;netflix-continue-watching-model-cell-3&quot;}<br>    };<br><br>    return {<br>        cellToModel: abTestCellToModel,<br>        abTestId: abTestId,<br>        targetObjectives: [objectives],<br>        modelInputType: constants.TITLE_INPUT_TYPE,<br>        modelType: &#39;SCORER&#39;<br>    };<br>}</pre><p>These rules are consumed by both the Switchboard and the Model Serving clusters. 
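To make the request-time behavior concrete, here is a minimal, hypothetical sketch of how a serving component might resolve a model from such a rule (the function and field names are illustrative, not the actual platform code):</p><pre># Hypothetical sketch: resolving a model from a Switchboard Rule at request time.<br># Function and rule field names are illustrative, not the actual platform code.<br>def select_model(objective, user_id, rule, default_models, get_ab_allocation):<br>    # Rules only apply to the Objectives they target.<br>    if objective not in rule[&quot;targetObjectives&quot;]:<br>        return default_models[objective]<br>    # Ask the experimentation platform which cell this member is allocated to.<br>    cell = get_ab_allocation(user_id, rule[&quot;abTestId&quot;])<br>    chosen = rule[&quot;cellToModel&quot;].get(cell)<br>    # Fall back to the default productized model if the member is not in the test.<br>    return chosen[&quot;name&quot;] if chosen else default_models[objective]</pre><p>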
Given these rules, the serving platform components can take various actions, some detailed below:</p><p><strong>Control Plane Flow</strong>:</p><ol><li><strong>Assignment:</strong> Produce model-to-cluster shard assignment.</li><li><strong>Validation:</strong> Load all specified models into the Serving Cluster Shard and validate model dependencies to ensure successful execution.</li><li><strong>Mapping:</strong> Provide the model-to-shard VIP address mapping to Switchboard.</li></ol><p><strong>Data Plane Flow</strong>:</p><ol><li><strong>Allocation:</strong> If the request is for Objective=ContinueWatchingRanking, query the <a href="https://netflixtechblog.com/its-all-a-bout-testing-the-netflix-experimentation-platform-4e1ca458c15">Experimentation Platform</a> for the userId’s cell allocation.</li><li><strong>Model Selection:</strong> Use the allocation and A/B test rule to select the appropriate model.</li><li><strong>Request Routing:</strong> Route the request to the serving cluster shard with the selected model and context.</li><li><strong>Model Execution (on the serving host):</strong> Run the model workflow steps and return the response.</li></ol><p>A key highlight of this setup is the decoupling of the experimentation config from the serving platform code. This includes having an independent release cycle for the rules, separate from the code deployments. <a href="https://netflixtechblog.com/how-netflix-microservices-tackle-dataset-pub-sub-4a068adcc9a">Netflix’s Gutenberg</a> system provides an excellent ecosystem that enables a flexible pub-sub architecture, facilitating proper versioning, dynamic loading, easy rollbacks, and more. Both Switchboard and the Serving Cluster Host subscribe to the same Switchboard Rules configuration.</p><p>To prevent race conditions and ensure proper sync of the dynamic Switchboard Rules configuration, the following flow is considered:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/944/1*iSNuD8ZuSC9E2zN-wZdH-A.png" /></figure><h3>Evolving Challenges</h3><p>Switchboard solved the primary problem of improving model iteration and innovation velocity, and provided an excellent ML serving abstraction to over 30 service clients. However, as the system scale increased, a few challenges and problems with this design became apparent:</p><ul><li><strong>Single point of failure: </strong>The presence of Switchboard in the critical request path clearly highlights the risks of shutting down access to all serving hosts in extreme cases, such as unintentional bugs or noisy neighbors sending excessive traffic.</li><li><em>Why this matters: Switchboard became a shared dependency whose failure would degrade or disable multiple ML-powered experiences at Netflix.</em></li><li><strong>Added latency due to additional network hop:</strong> Switchboard in the request path adds between 10–20ms of latency due to serialization-deserialization operations, depending on payload size. Additionally, it further exposes a request to tail latency amplification.</li><li><em>Why this matters: The added latency is unacceptable for some latency-sensitive clients, resulting in end-user impact due to service timeouts.</em></li><li><strong>Reduced Client flexibility</strong>: Switchboard obscures visibility into client request origins from the serving clusters. 
Consequently, distinguishing data logged for real vs. artificial traffic, which is essential for model training, is difficult and requires ongoing customization and increased MLOps overhead.</li><li><em>Why this matters: It makes it harder to do tenant separation and test traffic isolation.</em></li></ul><h3>What Next? — Lightbulb</h3><p>The aforementioned challenges of operating Switchboard at scale forced us to rethink the core implementation while retaining its key features. Our goal was not to throw away Switchboard’s design, but to refactor where and how its responsibilities were executed, keeping the benefits while reducing risk and latency. In particular, we wanted to retain:</p><ul><li><em>Common Client Abstraction</em></li><li><em>Decouple clients from model sharding</em></li><li><em>Flexible traffic routing rules</em></li><li><em>Lightweight system client</em></li><li><em>Single place to define model and experimentation config</em></li><li><em>Fast experimentation config propagation</em></li><li><em>Fallback and client-side caching in case of failures</em></li></ul><p>However, there were some earlier design choices we did want to revisit as we moved forward:</p><ul><li><strong>Remove the routing service from the direct request path: </strong>Having a single service in the active request path introduces another failure mode and limits fallback flexibility. While routing rules change infrequently, maintaining consistency comes at the cost of increased availability risks.</li><li><strong>Separate model inputs from the request metadata</strong>: In certain cases, the request payload could be quite large. Needing to deserialize and then re-serialize the payload as it flowed through Switchboard to make a routing decision was a significant contributor to latency and increased serving costs.</li><li><strong>Provide better isolation for the routing layer: </strong>Consolidating multiple use cases (tenants) into a single routing cluster posed two main challenges. First, error propagation was a risk, as a surge of problematic requests from one tenant could cascade errors back to Switchboard, potentially impacting other users.
Second, the cluster had to accommodate diverse latency requirements because the requests from different use cases varied significantly in complexity.</li></ul><p>This required some changes to our setup flow: while the flow largely remained the same, we created separate components for Routing and Model Selection (Lightbulb):</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/897/1*wtVpe5xJMNEENitvkd1FPQ.png" /></figure><p>We now take the rules for an Objective and break them into distinct sets of configuration:</p><ul><li><strong>Model Serving Configuration</strong>: This allows us to determine which model should be used at request time, along with the required metadata.</li><li><strong>Routing Rules</strong>: Given a model we want to serve at request time, this tells us which VIP the request should be routed to.</li></ul><p>The Data Plane changes also reflect this separation, as we now rely on <a href="https://github.com/envoyproxy/envoy">Envoy</a> to take care of the routing details:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/912/1*jbW4NlcnKucGKBi_vNjXbw.png" /></figure><p>Envoy is <a href="https://netflixtechblog.com/zero-configuration-service-mesh-with-on-demand-cluster-discovery-ac6483b52a51">already used</a> for all egress communication between apps at Netflix, and it can route requests to different clusters (VIPs) based on the configurable Routing Rules published from our control plane. However, it lacks the information needed to make routing decisions and the ability to enrich the request body with additional serving parameters required for A/B testing model variants. We introduced Lightbulb to cover this gap:</p><ul><li>Lightbulb consumes the minimal request context, which contains use-case information, and provides the metadata mapping required for routing at the Envoy layer.</li><li>Lightbulb resolves the request context to determine a routingKey configuration along with the <strong>ObjectiveConfig</strong> — this is where we place the model id along with other request-specific configurations required for model execution. This is done to separate the config resolution associated with the request from the placement and routing information needed to reach the model on the inference cluster.</li><li>While the routingKey is added to the headers for the Envoy proxy to consume, the client adds the ObjectiveConfig parameters to the request itself. This is done to avoid bloating the request headers while passing additional parameters for the model to process the request appropriately.</li><li>The routing of the actual request is performed by the Envoy proxy, which has the metadata to map the routingKey to the actual cluster VIP running the model. Because the routingKey is in a header, this determination can be made with minimal overhead.</li></ul><p>These changes retain the advantages of Switchboard, such as a single integration point, abstraction of model id from use case, and context-aware routing, while addressing the challenges we observed over time.</p><h3>Conclusion</h3><p>The evolution from Switchboard to Lightbulb marks a significant architectural refinement in our ML model serving infrastructure. While Switchboard provided the initial abstraction layer critical for rapid innovation, its latency and single-point-of-failure risk posed scaling hurdles. The subsequent adoption of Lightbulb, a decoupled service focused solely on routing metadata, and its integration with Envoy successfully resolved these challenges.
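As a rough end-to-end illustration, the sketch below shows how a client host might resolve a routingKey and ObjectiveConfig via Lightbulb and attach them to an outgoing request, leaving the VIP lookup to the Envoy sidecar (all names here are hypothetical, not the actual Lightbulb or Envoy interfaces):</p><pre># Hypothetical sketch of the Lightbulb flow on a client host.<br># Function, method, and header names are illustrative, not the real interfaces.<br>def prepare_request(objective, request_context, request_body, lightbulb):<br>    # Lightbulb maps the use case and its context to routing and model metadata.<br>    resolution = lightbulb.resolve(objective, request_context)<br>    # The routingKey travels as a header, so the Envoy proxy can pick the target<br>    # cluster VIP without deserializing the (possibly large) request payload.<br>    headers = {&quot;x-routing-key&quot;: resolution.routing_key}<br>    # The ObjectiveConfig (model id and other per-request parameters) rides in the<br>    # body, where the serving host uses it to execute the right model workflow.<br>    request_body[&quot;objectiveConfig&quot;] = resolution.objective_config<br>    return headers, request_body</pre><p>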
This sophisticated new architecture preserves the key benefits — seamless client integration and flexible experimentation — while ensuring reliable, efficient, and scalable delivery of personalized member experiences, positioning us well for future ML growth.</p><p>In future posts in this series, we’ll dive deeper into other aspects of our ML serving platform, including inference and feature fetching, and how they interact with the routing architecture described here.</p><p>Special thanks to <strong>Sura Elamurugu</strong>, <strong>Sri Krishna Vempati</strong>, <strong>Ed Maddox</strong>, and <strong>Sreepathi Prasanna</strong> for their invaluable feedback and partnership in iterating on this idea and bringing this blog post to life.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=16e22fe18741" width="1" height="1" alt=""><hr><p><a href="https://netflixtechblog.com/state-of-routing-in-model-serving-16e22fe18741">State of Routing in Model Serving</a> was originally published in <a href="https://netflixtechblog.com">Netflix TechBlog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Scaling Camera File Processing at Netflix]]></title>
            <link>https://netflixtechblog.com/scaling-camera-file-processing-at-netflix-6dab2b1e80be?source=rss----2615bd06b42e---4</link>
            <guid isPermaLink="false">https://medium.com/p/6dab2b1e80be</guid>
            <dc:creator><![CDATA[Netflix Technology Blog]]></dc:creator>
            <pubDate>Fri, 24 Apr 2026 15:06:01 GMT</pubDate>
            <atom:updated>2026-04-24T15:06:01.452Z</atom:updated>
            <content:encoded><![CDATA[<p><em>Orchestrating Media Workflows Through Strategic Collaboration</em></p><p>Authors: <a href="https://www.linkedin.com/in/ericreinecke/">Eric Reinecke</a>, <a href="https://www.linkedin.com/in/bhanusrikanth/">Bhanu Srikanth</a></p><h3>Introduction to Content Hub’s Media Production Suite</h3><p>At Netflix, we want to provide filmmakers with the tools they need to produce content at a global scale, with quick turnaround and choice from an extraordinary variety of cameras, formats, workflows, and collaborators. Every series or film arrives with its own creative ambitions and technical requirements. To reduce friction and keep productions moving smoothly, we built <a href="https://netflixtechblog.com/globalizing-productions-with-netflixs-media-production-suite-fc3c108c0a22">Netflix’s Media Production Suite (MPS)</a> with the goal of automating repeatable tasks, standardizing key workflows, and giving productions more time to focus on creative collaboration and craftsmanship.</p><p>A critical part of this effort is how we handle image processing and camera metadata across the hundreds of hours and terabytes of camera footage that Netflix productions ingest on a daily basis. Rather than build every component from scratch, we chose to partner where it made sense–especially in areas where the industry already had trusted, battle-tested solutions.</p><p>This article explores how Netflix’s Media Production Suite integrates with FilmLight’s API (FLAPI) as the core studio media processing engine in Netflix’s cloud compute infrastructure, and how that collaboration helps us deliver smarter, more reliable workflows at scale.</p><h3>Why We Built MPS</h3><p>As Netflix’s production slate grew, so did the complexity of file-based workflows. We saw recurring challenges across productions:</p><ul><li>File wrangling sapping time from creative decision-making</li><li>Inconsistent media handling across shows, regions, or vendors</li><li>Difficult to audit manual processes that are prone to human error</li><li>Duplication of effort as teams reinvented similar workflows for each production</li></ul><p>Content Hub Media Production Suite was created to address these pain points. 
MPS is designed to:</p><ul><li>Bring efficiency, consistency, and quality control to global productions</li><li>Streamline media management and movement from production through post-production</li><li>Reduce time spent on non-creative file management</li><li>Minimize human error while maximizing creative time</li></ul><p>To achieve this, MPS needed a robust, flexible, and trusted way to handle camera-original media and metadata at scale.</p><h3>The Right Tool for the Job</h3><p>From the start, we knew that building a world-class image processing engine in-house is a significant, long-term commitment: one that would require deep, continuous collaboration with camera manufacturers and the wider industry.</p><p>When designing the system, we set out some core requirements:</p><ul><li><strong>Inspect, trim, and transcode original camera files and metadata</strong> for any Netflix production with trusted color science</li><li><strong>Support a wide variety of cameras and recording formats</strong> used worldwide while staying current as new ones are released</li><li><strong>Run well in our paved-path encoding infrastructure,</strong> enabling us to take advantage of proven compute and storage scalability with robust observability</li></ul><p>FilmLight develops Baselight and Daylight, which are commonly used in the industry for color grading, dailies, and transcoding. Their FilmLight API (FLAPI) allows us to use that same media processing engine as a backend API.</p><p>Rather than duplicating that work, we chose to integrate. FilmLight became a trusted technology partner, and FLAPI is now a foundational part of how MPS processes media.</p><h3>The Media Processing Engine</h3><p>MPS is not a single application; it’s an ecosystem of tools and services that support Netflix productions globally. Within that ecosystem, the FilmLight API plays the following key roles.</p><ol><li>Parsing camera metadata on ingest</li></ol><p>Productions upload media to Netflix’s <strong>Content Hub</strong> with <a href="https://theasc.com/society/ascmitc/asc-media-hash-list">ASC MHL</a> (Media Hash List) files to ensure completeness and integrity of initial ingest, but soon after, it’s important to understand the technical characteristics of each piece of media. We call this workflow phase “inspection.”</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*OYBDXSUJ6D0tVXVGO3w5TQ.jpeg" /><figcaption>Footage ingested with MPS is inspected using FLAPI and all metadata is indexed and stored</figcaption></figure><p>At this stage, we:</p><ul><li>Use FLAPI to gather <strong>camera metadata</strong> from the original camera files</li><li>Conform the workflow critical fields to <strong>Netflix’s normalized schema</strong></li><li>Make it <strong>searchable and reusable</strong> for downstream processes</li></ul><p>This metadata is integral to:</p><ul><li>Matching footage based on timing and reel name for automated retrieval</li><li>Debugging (e.g., why a shot looks a certain way after processing)</li><li>Validations and checks across the pipeline</li></ul><p>FLAPI provides consistent, camera-aware insight into footage that may have originated anywhere in the world. Additionally, since we’re able to package FLAPI in a Docker image, we can deploy almost identical code to both cloud and our production compute and storage centers around the world, ensuring a consistent assessment of footage wherever it may exist.</p><p>2. 
Generating VFX plates and other deliverables</p><p>Visual effects workflows constantly push image processing pipelines to their absolute limits. For MPS to succeed, it must generate images with <strong>accurate</strong> framing, <strong>consistent</strong> color management, and <strong>correct</strong> debayering/decoding parameters — all while maintaining rapid turnaround times.</p><p>To achieve this, we leverage Netflix’s <a href="https://netflixtechblog.com/the-netflix-cosmos-platform-35c14d9351ad">Cosmos</a> compute and storage platform and use open standards to provide predictable and consistent creative control.</p><p>At this phase, we use the FilmLight API to:</p><ul><li><strong>Debayer</strong> original camera files with the correct format-specific decoding parameters</li><li>Crop and de-squeeze images using <strong>Framing Decision Lists (ASC FDL)</strong> to ensure spatial creative decisions are preserved</li><li><strong>Apply ACES Metadata Files (AMF), </strong>providing repeatable color pipelines from dailies through finishing</li><li>Generate <strong>an array of media deliverables</strong> in varied formats</li></ul><p>These processes are automated, repeatable, and auditable. We deliver AMFs alongside the OpenEXRs to ensure recipients know exactly what color transforms are already applied, and which need to be applied to match dailies.</p><p>Because we use FilmLight’s tools on the backend, our workflow specialists can use Baselight on their workstations to manually validate pipeline decisions for productions before the first day of principal photography.</p><h3>The Media Processing Factory in the Cloud</h3><p>Finding an engine that competently processes media in line with open standards is an important part of the equation. To maximize impact, we want to make these tools available to all of the filmmakers we work with. Luckily, we’re no strangers to scaled processing at Netflix, and our <a href="https://netflixtechblog.com/the-netflix-cosmos-platform-35c14d9351ad">Cosmos compute platform</a> was ready for the job!</p><h4>Cloud-first integration</h4><p>The traditional model for this kind of processing in filmmaking has been to invest in beefy computers with large GPUs and high-performance storage arrays to rip through debayering and encoding at breakneck speed. However, constraints in the cloud environment are different.</p><p>Factors that are essential for tools in our runtime environment include that they:</p><ul><li>Are <strong>packageable as Serverless Functions in Linux Docker images</strong> that can be quickly invoked to run a single unit of work and shut down on completion</li><li>Can <strong>run on CPU-only instances</strong> to allow us to take advantage of a wide array of available compute</li><li>Support <strong>headless invocation </strong>via Java, Python, or CLI</li><li><strong>Operate statelessly,</strong> so when things do go wrong, we can simply terminate and re-launch the worker</li></ul><p>Operating within these constraints lets us focus on increasing throughput via parallel encoding rather than focusing on single-instance processing power. We can then target the sweet spot of the cost/performance efficiency curve while still hitting our target turnaround times.</p><p>When tools are API-driven, easily packaged in Linux containers, and don’t require a lot of external state management, Netflix can quickly integrate and deploy them with operational reliability. FilmLight API fit the bill for us. 
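As a rough illustration of the shape such a unit of work takes, a stateless worker entry point might look something like the sketch below (the names and structure are hypothetical, not the actual Cosmos Stratum Function or FilmLight API interfaces):</p><pre># Hypothetical sketch of a stateless render worker (illustrative only; not the<br># actual Cosmos Stratum Function or FilmLight API interfaces).<br>from dataclasses import dataclass<br>from typing import Callable, Optional, Tuple<br><br>@dataclass<br>class RenderWork:<br>    input_clip_uri: str                # original camera file to read<br>    output_uri: str                    # where the rendered output should land<br>    frame_range: Tuple[int, int]       # sub-segment of the clip to process<br>    amf_uri: Optional[str] = None      # ACES Metadata File for color<br>    fdl_uri: Optional[str] = None      # ASC Framing Decision List for framing<br><br>def run_once(work: RenderWork, process_clip: Callable[[RenderWork], None]):<br>    # Execute exactly one unit of work; the worker keeps no state between runs,<br>    # so a failed invocation can simply be terminated and re-launched.<br>    process_clip(work)</pre><p>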
At Netflix, we leverage:</p><ul><li><strong>Java</strong> and <strong>Python</strong> as the primary integration languages</li><li><strong>Ubuntu-based Docker images</strong> with Java and Python code to expose functionality to our workflows</li><li><strong>CPU instances in the cloud and local compute centers</strong> for running inspection, rendering, and trimming jobs</li></ul><p>While FLAPI also supports GPU rendering, CPU instances give us access to a much wider segment of Netflix’s vast encoding compute pool and free up GPU instances for other workloads.</p><p>To use FilmLight API, we bundle it in a package that can be easily installed via a Dockerfile. Then, we built Cosmos Stratum Functions that accept an input clip, output location, and varying parameters such as frame ranges and AMF or FDL files when debayering footage. These functions can be quickly invoked to process a single clip or sub-segment of a clip and shut down again to free up resources.</p><h4>Elastic scaling for production workloads</h4><p>Production workloads are inherently spiky:</p><ul><li>A quiet day on set may mean minimal new footage to inspect.</li><li>A full VFX turnover or pulling trimmed OCF for finishing might require <strong>thousands of parallel renders</strong> in a short time window.</li></ul><p>By deploying FLAPI in the cloud as functions, MPS can:</p><ul><li>Allocate compute on demand and release it when our work queue dies down</li><li>Avoid tying capacity to a fixed pool of local hardware</li><li>Smooth demand across many types of encoding workload in a shared resource pool</li></ul><p>This elasticity lets us swarm pull requests to get them through quickly, then immediately yield resources back to lower priority workloads. Even in peak production periods, we avoid the pain of manually managing render queues and prioritization by avoiding fixed resource allocation. All this means <strong>lightning-fast</strong> turnaround times and <strong>less anxiety</strong> around deadlines for our filmmakers.</p><h3>Designed for Seasoned Pros and Emerging Filmmakers</h3><p>Netflix productions range from highly experienced teams with very specific workflows to newer teams who may be less familiar with potential pitfalls in complex file-based pipelines.</p><p>MPS is designed to support both:</p><ul><li>Industry veterans who need to configure precise, bespoke workflows and trust that underlying image processing will respect those decisions.</li><li>Productions without a color scientist on staff — those who benefit from guardrails and sane defaults that help them avoid common workflow issues (e.g., mismatched color transforms, inconsistent debayering, or incomplete metadata handling).</li></ul><p>The partnership with FilmLight lets Netflix focus on workflow design, orchestration, and production support, while FilmLight focuses on providing competent handling of a wide variety of camera formats with world-class image science!</p><h3>Collaboration and Co-Evolution</h3><p>Netflix aimed to integrate MPS into a wider tool ecosystem by developing a comprehensive solution based on emerging open standards, rather than making MPS a self-contained system. Integrating FLAPI into our system requires more than an API reference–it requires ongoing partnership. 
FilmLight worked closely with Netflix teams to:</p><ul><li>Align on <strong>feature roadmaps</strong>, particularly around new camera formats and open standards</li><li>Validate the <strong>accuracy and performance</strong> of key operations</li><li>Debug <strong>edge cases</strong> discovered in large-scale, real-world workloads</li><li><strong>Evolve the API</strong> in ways that serve both Netflix and the wider industry</li><li>Create <strong>a positive feedback cycle with open standards</strong> like ACES and ASC FDL to solve for gaps when the rubber hits the road</li></ul><p>One example of this has been with the implementation of <a href="https://draftdocs.acescentral.com/background/about-aces-2/">ACES 2</a>. FilmLight’s developers quickly provided a roadmap for support. As our engineering teams collaborated on integration, we also provided feedback to the ACES technical leadership to quickly address integration challenges and test drive updates in our pipeline.</p><p>This collaborative relationship–built on open communication, joint validation, and feedback to the greater industry–is how we routinely work with FilmLight to ensure we’re not just building something that works for our shows, but also driving a healthy tooling and standards ecosystem.</p><h3>Impact</h3><p>While much of this work takes place behind the scenes, its impact is felt directly by our productions. Our goal with building MPS is for producers, post supervisors, and vendors to experience:</p><ul><li>Fewer delays caused by missing, incomplete, or incorrect media</li><li>Faster turnaround on VFX plates and other technical deliverables</li><li>More predictable, consistent handoffs between editorial, color, and VFX</li><li>Less time spent troubleshooting technical issues, and more time focused on creative review</li></ul><p>In practice, this often shows up as the absence of crisis: the time a VFX vendor doesn’t have to request a re-delivery, or the time editorial doesn’t have to wait for corrected plates, or the time the color facility doesn’t have to reinvent a tone-mapping path because the AMF and ACES pipeline are already in place.</p><h3>Looking Ahead</h3><p>As camera technology, codecs, open standards, and production workflows continue to evolve, so will MPS. The guiding principles remain:</p><ul><li>Automate what’s repeatable</li><li>Centralize what benefits from standardization</li><li>Partner where deep domain expertise already exists</li></ul><p>The integration with FilmLight API is one example of this philosophy in action. By treating image processing as a specialized discipline and collaborating with a trusted industry partner, Netflix is delivering smarter, more reliable workflows to productions worldwide.</p><p>At its core, this partnership supports a simple goal: reduce manual workflow and tool management, giving filmmakers more time to tell stories.</p><h3>Acknowledgements</h3><p>This project is the result of collaboration and iteration over many years. 
In addition to the authors, the following people have contributed to this work:</p><ul><li>Matthew Donato</li><li>Prabh Nallani</li><li>Andy Schuler</li><li>Jesse Korosi</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=6dab2b1e80be" width="1" height="1" alt=""><hr><p><a href="https://netflixtechblog.com/scaling-camera-file-processing-at-netflix-6dab2b1e80be">Scaling Camera File Processing at Netflix</a> was originally published in <a href="https://netflixtechblog.com">Netflix TechBlog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The Human Infrastructure: How Netflix Built the Operations Layer Behind Live at Scale]]></title>
            <link>https://netflixtechblog.com/the-human-infrastructure-how-netflix-built-the-operations-layer-behind-live-at-scale-33e2a311c597?source=rss----2615bd06b42e---4</link>
            <guid isPermaLink="false">https://medium.com/p/33e2a311c597</guid>
            <category><![CDATA[netflix]]></category>
            <category><![CDATA[incident-response]]></category>
            <category><![CDATA[live-operations]]></category>
            <category><![CDATA[live-broadcast-technology]]></category>
            <category><![CDATA[live-streaming]]></category>
            <dc:creator><![CDATA[Netflix Technology Blog]]></dc:creator>
            <pubDate>Fri, 17 Apr 2026 15:01:02 GMT</pubDate>
            <atom:updated>2026-04-18T14:20:23.499Z</atom:updated>
            <content:encoded><![CDATA[<p>By: <a href="https://www.linkedin.com/in/brett-axler-11577142/">Brett Axler</a>, <a href="https://www.linkedin.com/in/casper-choffat-005833b/">Casper Choffat</a>, and <a href="https://www.linkedin.com/in/alexlowry355/">Alo Lowry</a></p><p>In the three years since our first Live show, <a href="https://www.netflix.com/title/80167499"><em>Chris Rock: Selective Outrage</em></a>, we have witnessed an incredible expansion of our live content slate and the live operations that support it. From modest beginnings of streaming just one show per month, we are now capable of streaming over nine shows in a single day, reaching tens of millions of concurrent members. This post pulls back the curtain on the Live Operations teams that enable this rapid scale.</p><h3>Humble Beginnings</h3><p>In March 2023, the engineers who built Netflix’s first live streaming pipeline also operated it. There was no dedicated operations team or formal command center. All of our incident response playbooks were written for SVOD, and SLAs were not designed for the speed of live. For the first live shows on the platform, the engineers who designed what is described in <a href="https://netflixtechblog.com/behind-the-streams-live-at-netflix-part-1-d23f917c2f40">earlier parts of this series</a> monitored dashboards on laptops, coordinated over Slack, and troubleshot in real time while millions of members watched.</p><p>The physical setup matched the operational workflows: improvised. Temporary control rooms were put together in conference rooms. For larger events, Netflix rented third-party broadcast facilities, hardware control panels, multiviewers, and communication panels — the kind of infrastructure that established broadcast networks had built over decades. Every show was a team effort. Engineers and leadership at all levels were involved in every event. Each live show, regardless of size, was a massive effort to launch.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tibjLPNWGd0PP43Y8Idjag.jpeg" /><figcaption>Netflix’s Early Live Operations</figcaption></figure><p>Last month, in March 2026, Netflix streamed the World Baseball Classic live to members in Japan. 47 matches over two weeks, with peak concurrent viewership exceeding 17.9 million for a single game, operations running 24/7 from permanent facilities in Los Gatos and Los Angeles, with international coverage extending to Tokyo. In March alone, Netflix launched approximately 70 live events. That is three events shy of the total number Netflix streamed live in all of 2024. The technical systems that make this possible have been covered in detail across this series. What hasn’t been told is the operational story: the people, procedures, and facilities Netflix built to run those systems in real time, under pressure, with no ability to pause or roll back.</p><h3>The Architecture of Live Operations</h3><p><strong>The Architecture of Live Operations: Evolving the Broadcast Operations Center</strong></p><p>When a technology company transitions into live broadcasting, it faces a unique challenge: blending traditional broadcast television practices with massive-scale live-streaming engineering. 
At the heart of this intersection is the <strong>Broadcast Operations Center (BOC)</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tG3zfmPX5ciRafvtnj9phQ.jpeg" /><figcaption>The Transmission Operations Center in Los Angeles</figcaption></figure><p>The BOC serves as the critical “cockpit” for live events. It is the physical command center where a fully produced video feed is received directly from a stadium or venue and then handed off to the live streaming infrastructure. Everything from signal ingest, inspection, and conditioning to closed-captioning, graphics insertion, and ad management happens within these walls. By utilizing a hub-and-spoke model with highly redundant architectures, such as dual internet circuits and SMPTE 2022–7 seamless switching technologies, the BOC replaces direct, vulnerable paths from the venue to the live streaming pipeline, making each live event highly repeatable and far less dependent on the quirks of individual event locations.</p><p><strong>Securing the Signal: Reliability from the Venue</strong> Before the BOC can work its magic, we have to guarantee the video and audio feeds actually survive the journey from the production site to our facility. To ensure absolute reliability from the venue, Netflix enforces strict specifications for live signal contribution.</p><p>For any show-critical feed, meaning the primary feed our members will watch live, we require three completely discrete transmission paths. We utilize a strict hierarchy of approved transmission methods, prioritizing dedicated video fiber and single-feed satellite links, followed by dedicated enterprise-grade internet and robust SRT contribution systems.</p><p>We don’t just rely on redundant transport lines; we require full hardware redundancy out of the production truck itself. This includes using separate router line cards and discrete transmission hardware to prevent any single point of failure. Furthermore, every single piece of transmission hardware at the venue must be powered by two discrete power sources, protected by uninterruptible power supply (UPS) batteries, and surge-conditioned.</p><p>Finally, before we ever go live to millions of viewers, our operators execute exhaustive “FACS/FAX” (facilities checks) testing during rehearsals and before every show. This involves running specialized Audio/Video sync tests, latency tests, and quality tests to guarantee perfect audio and video synchronization, validating closed captions, and touring the backup switcher inputs.</p><p><strong>Building the Human Infrastructure:</strong> Building the human operational model to run a facility like the BOC didn’t happen overnight. For a platform scaling from its very first live comedy special to streaming over 400 global events a year, the operational strategy had to undergo a massive, multi-year evolution.</p><p><strong>Phase 1: The “All-Hands” Engineering Era.</strong> In the earliest days of live streaming, there was no dedicated operations team or formal broadcast operations center. The software engineers who wrote the code and built the live-streaming infrastructure were the same people manually operating the events on launch night. Every show was an “all-hands-on-deck” scenario. 
While this raw, startup-style approach worked for initial milestones, having core developers manually set up and tear down software configurations for every single broadcast was fundamentally incapable of scaling.</p><p><strong>Phase 2: The Shift to Specialized Engineering (SOEs and BOEs).</strong> To separate event execution from core software development, the operational model matured to introduce specialized engineering teams. First, the <strong>Streaming Operations Engineering (SOE)</strong> team was established. These are highly skilled streaming engineers whose sole focus is to configure the full event on the live pipeline and support it during the broadcast. By having SOEs act as the first line of escalation, the core software developers were freed up to focus on building new live-streaming pipeline features.</p><p>However, as the physical broadcast facilities grew, it became clear that supporting the streaming pipeline wasn’t enough; the physical broadcast hardware and facility workflows needed dedicated oversight too. To solve this, <strong>Broadcast Operations Engineers (BOEs)</strong> were introduced to work alongside the SOEs. The BOE acts as the primary escalation point for all physical broadcast facility and hardware issues, overseeing the operation of all shows during a given shift.</p><p><strong>Phase 3: The “Co-Pilot” Control Room Model.</strong> With specialized engineers in place to handle the deep technical infrastructure, the day-to-day operation of the actual video and audio feeds was handed over to dedicated operators. Initially, the Broadcast Control Rooms were structured much like an airplane cockpit.</p><p>This approach utilized a <strong>“first and second captain” workflow</strong>, pairing two Broadcast Control Operators (BCOs) together to run a single event, functioning exactly like a pilot and co-pilot. This collaborative model allowed for intense focus and high-quality execution, making it the ideal setup for running just one or two live events per day. However, as the ambition grew to stream up to 10 concurrent events a day for massive global tournaments, a 1:1 scale of pairing operators simply required too much space and manpower. A new model had to be adopted.</p><p><strong>Phase 4: The Transmission Operations Center (TOC) Fleet Model.</strong> To manage high-density event days and continuous tournament coverage, the workflow was completely reimagined with the launch of the <strong>Transmission Operations Center (TOC) model</strong>. Rather than treating every live broadcast as an isolated launch in its own room, the TOC treats live events like a fleet. It centralizes operations and distinctly separates the traditional broadcast functions from the streaming functions to maximize human efficiency.</p><p>The TOC model divides the labor across three highly specialized, tiered roles:</p><ul><li><strong>Transmission Control Operator (TCO):</strong> The TCO is responsible for managing all inbound signals arriving from the event venues, such as fiber optic, SRT, and satellite feeds. They ensure these incoming feeds meet strict quality, latency, and operational thresholds. Thanks to centralized dashboarding, a single TCO can manage up to <strong>five events concurrently</strong>.</li><li><strong>Streaming Control Operator (SCO):</strong> While the TCO handles what comes <em>in</em>, the SCO manages what goes <em>out</em>. 
They oversee all outbound feeds, including the streams heading to the live streaming pipeline and any syndication feeds sent to third parties for commercial distribution. Like the TCOs, SCOs can manage up to <strong>five events concurrently</strong>.</li><li><strong>Broadcast Control Operator (BCO):</strong> With the inbound and outbound transmission mechanics handled by the broader TOC, the BCO is able to focus entirely on the creative and qualitative execution of the event. Operating on a <strong>strict 1:1 ratio</strong> (one operator per event), the BCO seamlessly switches between backup inbound feeds if an issue arises, ensures audio and video remain in perfect synchronization, and performs rigorous quality control. They also monitor critical metadata, such as closed captions and digital ad-insertion messages (SCTE), right before the final polished feed is handed into the live streaming pipeline.</li></ul><p><strong>The Big Bet Exception.</strong> While the fleet-style TOC model enables immense concurrency for daily programming, the most critical, high-visibility events, like major holiday football games, utilize a specialized <strong>Big Bet Model</strong>. For these flagship broadcasts, an entire Broadcast Operations Center is dedicated exclusively to a single event. This hyper-focused environment strips away the multi-event ratios, providing operators with advanced instrumentation and dedicated facility engineers to ensure the absolute highest level of reliability for events where failure is simply not an option.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*n3WWOX4zg06kGyTMWE7mcg.png" /><figcaption>Operational Workflow at a Glance (Courtesy of Melissa “Mouse” Merencillo)</figcaption></figure><h3>The Live Command Center (LCC)</h3><p>The Live Command Center (LCC) is not an MCR (Master Control Room). Nor is it a traditional Network Operations Center (NOC). The LCC holds the end-to-end view of quality, health metrics, and reliability for every live stream — from signal ingest at the production venue through cloud encoding, CDN delivery, and playback on member devices — and coordinates the human response when any part of that chain breaks.</p><p>What makes this hard is the data and speed requirements. Standard monitoring tools incur propagation delays of minutes. However, during a live stream, a signal degradation that goes undetected for three minutes can affect millions of members before any mitigation begins. The LCC runs a purpose-built observability stack, the Live Control Center, that aggregates telemetry from across the entire pipeline in near real time: concurrent viewer counts, start failure rates, rebuffer ratios, CDN health, encoder status, and signal path health from the contribution feed forward.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*LN1ncCZL_HLrVfgLiBNK0g.png" /><figcaption>Live Control Center (Courtesy of Chris Carey)</figcaption></figure><p>During live events, the system ingests up to 38 million events per second. The LCC’s job is to make that volume of data meaningful and actionable for the small team of operators watching it live.</p><p>Two roles staff the LCC leading up to and during live events. LCC Operations Leads are the shift supervisors and incident commanders. 
They triage anomalies, make escalation decisions, and own the incident response process from detection through resolution.</p><p>Live Technical Launch Managers (TLMs) function as air traffic controllers: they maintain cross-functional context across more than 45 technical, product, and services teams from encoding, CDN, and playback to social media, customer service, and security teams. TLMs start coordinating with these teams months and sometimes years ahead of a live event to ensure escalation paths and playbooks are in place when the LCC needs to translate a CDN engineer’s concern into a product decision at 2am while a game is still in progress. Together, these roles form the operational leadership layer that keeps engineers focused on building rather than watching dashboards.</p><p>The live operations teams rank shows by three categories:</p><ul><li><strong>Low-Profile Events:</strong> These are lightweight, often lack new features, and anticipate low viewership. They are typically managed with a small team of 1–2 operators and automated alerting.</li><li><strong>High-Profile Events:</strong> These are mid-tier events that warrant more attention due to their size, unique features, or anticipated viewership.</li><li><strong>Big Bet Events:</strong> These represent the highest operational weight, such as an NFL game, with massive viewership expectations and special features. They require the full support of the LCC: a fully staffed physical operations room for the entire duration, active incident command structures, and key engineering teams on standby to support their specific product areas.</li></ul><p>In addition to a show’s event category, the TLMs deployed a Live Operational Level (LOL) model that helps engineers determine whether they need to be on standby, live online, or even in the LCC for any given show.</p><p>Based on the show’s event category, special features, expected viewership, and overall risk, non-operational teams are put into one of four categories:</p><p><strong>Red:</strong> Non-operational teams must remain online for the duration of the event. This is most often seen in large boxing matches and sporting events, such as the NFL Christmas Day games.</p><p><strong>Orange:</strong> Non-operational teams are required to check in online ~30 minutes prior to show and are asked to monitor the health of their systems through the first commercial breaks until the LCC releases them to LOL Yellow.</p><p><strong>Yellow:</strong> Non-operational teams are not required to be online, but should be reachable by page in 2 minutes. Special PagerDuty rotations and verifications are in place to ensure these teams are reachable.</p><p><strong>Grey:</strong> Business as usual. 
Teams will be reached out to by their normal pager rotation if their help is needed during the show.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*sgk7Y5OYAvbfxL7xRxTQbA.png" /><figcaption>Visual Representation of LOL Levels (Courtesy of Gemini Nano Banana Pro)</figcaption></figure><p>By tiering events, Netflix ensures that resource allocation is proportionate to operational needs, preventing a continuous “crisis” mentality and allowing our non-operational partners to focus on their day jobs.</p><p>As of April 2026, most engineering teams are Yellow or Grey, with Ops and Site Reliability Engineers making up most of the teams online to support shows, in addition to engineers performing feature tests.</p><h3>Building the Model</h3><p>The first lesson from 2023 was straightforward: what worked for one show a month would not work for ten shows a week. The engineers who built the pipeline were also the ones operating it, which meant the people best positioned to fix problems were also the ones most likely to be paged at 2am. There was no operational layer to absorb that load.</p><p>In 2024, Netflix streamed 72 live events and began building the team that would eventually run them. The first version of the LCC looked nothing like it does today: a cluster of desks, monitors on stands, and laptops running dashboards, set up in the middle of the office. The TLM team was stood up to own cross-functional coordination for live launches and began formalizing the runbooks, event tiering structure, and incident management protocols that would later enable Netflix to scale operations to support hundreds of shows per year.</p><p>By the time Jake Paul vs. Mike Tyson and the first NFL Christmas Games arrived, the LCC had moved into a dedicated conference room, and partnerships with device and labs teams were producing more effective monitoring tools. But the biggest operational lesson of that period came from communications.</p><p>For Tyson/Paul, Netflix had over 300 people online across engineering, product, and business functions. Some people were online because their support was needed, while many others were just excited to be part of it. Coordinating that many people over Slack and Zoom during an active event with 64 million concurrent streams was unmanageable.</p><p>That experience drove the implementation of a <strong>squad model</strong>: defined teams with clear roles, scoped communication channels, and a single escalation path into the LCC. Around the same time, the LCC began integrating with IP-based communications systems, finally bridging the gap between the command center and the Broadcast Operations Center that had been operating largely in a fractured parallel until then.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*07i_xA6sf2_m05Giu3Dqsw.png" /><figcaption>Visual Representation of Squad Operations Model (Courtesy of Gemini Nano Banana Pro)</figcaption></figure><p>2025 brought 220 live events and a permanent LCC facility, along with a dedicated operations team, the Live Command Center Operations Leads. With the growing number of shows, TLMs were getting spread thin, spending more than half their week operating shows late into the evening and over weekends, then getting called back into the office at 9 am to lead critical launch meetings. 
The addition of the LCC Ops Leads resolved the bandwidth issue by separating planning and operations into distinct roles within a single centralized team.</p><p>As the slate continued to grow and large series like the World Baseball Classic and FIFA Women’s World Cup were announced, the vendor-operator model was introduced, creating an elastic workforce that could scale up for large series events without carrying full-time headcount year-round to support peak capacity. <strong>The key enabler was documentation</strong>: standardized runbooks and onboarding materials detailed enough that a trained operator could reach full effectiveness within their first week. WWE RAW became a weekly operation, normalizing what had previously felt exceptional. By early 2026, multi-event days were no longer a test of capacity but had become the expected operating condition.</p><p>The next chapter is international. Netflix has begun standing up regional Live Operations Center coverage to support live events outside North America, with EMEA operations soon running out of London. The model draws on the same runbooks, tooling, and escalation structures developed in Los Gatos, with follow-the-sun shift handoffs connecting EMEA and US teams across time zones. Looking further ahead, Netflix is planning to bring the LCC and BOC under one roof — a single integrated facility that combines broadcast operations and cloud monitoring into a unified space. The physical separation between those two functions has always introduced friction at the seams. Closing it is the logical next step.</p><h3>Operational Principles for Live at Scale</h3><p>Building a live operations discipline means accepting one constraint above all others: you cannot optimize for efficiency before you have built for reliability.</p><p>Netflix designed for quality first: Standardized runbooks, tiered event structures, pre-documented failure modes, so the 50th show runs as smoothly as the fifth. Off-the-shelf monitoring tools with propagation delays don’t meet that bar. The Netflix Live Control Center and Live Control Room platforms exist because observability at live scale is a product decision that demands the same design rigor as the pipeline it monitors, turning millions of telemetry events per second into something a small team can act on in real time. Technical systems and human systems have to scale together, and the most reliable incident response plan is always the one written before anyone needs it.</p><p>The operational model is also a cultural one. Bringing contingent operators into a proprietary tech stack requires deliberate onboarding design. The vendor model only works when documentation is built to be followed confidently by someone new within their first week. <strong>Beyond process, the most durable parts of how Netflix runs live operations reflect something the </strong><a href="https://jobs.netflix.com/culture"><strong>Netflix culture memo</strong></a><strong> makes explicit: the best ideas come from anywhere.</strong> In practice, that means frontline operators catching issues that engineers miss, vendor staff surfacing workflow friction that improves the system for everyone who follows, and a team that treats candid feedback as standard practice rather than an exception. The technology, the slate, and the scale keep changing. 
The discipline stays current by staying curious and iterating on the tools, the runbooks, and the team.</p><h3>Conclusion: What’s Next</h3><p>With 2026 already off to a successful start in operational scaling, we’re excited to shift our focus to the upcoming launch of our new Live Broadcast Operations Center in Los Angeles and our new Live Operations Center (LOC) in West London. The LOC will initiate Netflix’s follow-the-sun coverage as live content continues to grow with over 400 live events in 2026, including the launch of 24/7 linear free-to-air broadcast channels with TF1 this summer. On the technical front, further development of automated alerting tools and monitoring by exception will continue to reduce operations’ manual workload.</p><p>In 2023, the engineers led the operations. By 2026, they had developed systems that mostly ran themselves, with a dedicated operational team ensuring they operated smoothly for millions of members. The technology behind Netflix’s Live content has been documented throughout this series, but what runs alongside the tech stack is a set of operational principles, rehearsed incident management processes, and monitoring infrastructure that had to be created from scratch and continues to develop.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*7KV_D2VSRlja_fWmHmuOBw.jpeg" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*sc4doCe3-h3mi7V9trxBbQ.jpeg" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*dZKV2OIroXQRYA6UZBZAEw.jpeg" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*w6eGAv10BZWqNBVXNCvqRA.jpeg" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*y-iRMBW0Ae2YDiDjNzwq-Q.jpeg" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*dPoOkp75Rjy5CRskFQBoQQ.jpeg" /></figure><p>A special thanks to Te-Yuan Huang, Rob Saltiel, Tara Kozuback, Chris Carey, Di Li, Patrick Li, Anne Aaron, and Melissa “Mouse” Merencillo for their support on this article.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=33e2a311c597" width="1" height="1" alt=""><hr><p><a href="https://netflixtechblog.com/the-human-infrastructure-how-netflix-built-the-operations-layer-behind-live-at-scale-33e2a311c597">The Human Infrastructure: How Netflix Built the Operations Layer Behind Live at Scale</a> was originally published in <a href="https://netflixtechblog.com">Netflix TechBlog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Evaluating Netflix Show Synopses with LLM-as-a-Judge]]></title>
            <link>https://netflixtechblog.com/evaluating-netflix-show-synopses-with-llm-as-a-judge-6269251e6f28?source=rss----2615bd06b42e---4</link>
            <guid isPermaLink="false">https://medium.com/p/6269251e6f28</guid>
            <dc:creator><![CDATA[Netflix Technology Blog]]></dc:creator>
            <pubDate>Fri, 10 Apr 2026 16:26:01 GMT</pubDate>
            <atom:updated>2026-04-13T14:26:33.614Z</atom:updated>
            <content:encoded><![CDATA[<p>by <a href="https://www.linkedin.com/in/gabrielaalessio/">Gabriela Alessio</a>, <a href="https://www.linkedin.com/in/cameronntaylor/">Cameron Taylor</a>, and <a href="https://www.linkedin.com/in/cwolferesearch/">Cameron R. Wolfe</a></p><h3>Introduction</h3><p>When members log into Netflix, one of the hardest choices is what to watch. The challenge isn’t a lack of options — <em>there are thousands of titles</em> — but finding the most intriguing one is complex and deeply personal. To help, we surface <a href="https://netflixtechblog.com/artwork-personalization-c589f074ad76">personalized promotional assets</a>, especially the show synopsis — <em>a brief description highlighting key plot elements, with cues like genre or talent</em>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*A2j9Xni86SVnF_VyjFNwfw.png" /></figure><p>Strong synopses help members scan, understand, and choose. Poor synopses frustrate, mislead, and drive abandonment. Ensuring high-quality synopses is essential, but scaling quality validation is hard. We host hundreds of thousands of synopses, usually with multiple variants per show. We need to ensure quality at scale so every member gets a consistently great experience every time they read a synopsis. This approach helps us scale high‑quality synopsis coverage for our rapidly expanding catalog, enabling greater speed and coverage without sacrificing quality.</p><p>This report outlines our LLM-based approach for evaluating synopsis quality. Using recent advances in agents, reasoning, and LLM-as-a-Judge, we score four key synopsis quality dimensions, achieving 85%+ agreement with creative writers. Additionally, we show that higher LLM judge quality is correlated with key streaming metrics, <em>allowing us to proactively identify and fix impactful issues weeks or months before a show debuts on Netflix</em>.</p><h3>The Making of a “Good” Synopsis</h3><p>Writing high-quality synopses requires creative expertise. Our expert creative leads are best positioned to craft the creative approaches and define quality standards. However, AI can help us consistently evaluate these expert-driven quality criteria at scale. Synopsis quality at Netflix, which our system aims to predict, is viewed along two dimensions:</p><ol><li><em>Creative Quality</em>: members of our creative writing team assess synopsis quality according to our internal writing guidelines and rubrics.</li><li><em>Member Implicit Feedback</em>: we measure the relative impact of a particular show synopsis on core streaming metrics.</li></ol><p>These two definitions of quality capture distinct and important aspects of quality, one focused upon creative excellence and the other upon utility to members.</p><h4>Creative Quality</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*KzbYGovn903y_ZGysqdnYw.png" /></figure><p>For this project, we evaluate synopses against a subset of our creative writing quality rubric — <em>the same criteria to which human writers would adhere</em>. These quality rubrics change over time as quality standards evolve. Given Netflix’s distinctive voice and elevated editorial standards, the quality bar is high. Each criterion has extensive guidelines with examples across regions, genres, and synopsis types.</p><p><strong>Human evaluation.</strong> We began by partnering with a group of creative writing experts to iteratively refine our definition of creative quality. 
We initially labeled ~1,000 diverse synopses, where three expert writers scored each against the criteria and explained their ratings. Due to the subjectivity of the task, early instance-level agreement was low. To reach a better consensus, we conducted calibration rounds (~50 synopses per round), surfaced disagreements, and evolved our quality scoring guidelines. Key interventions that were found to improve agreement include:</p><ul><li>Using binary scores (instead of 1–4 Likert scores).</li><li>Allowing writers to reference past examples.</li><li>Maintaining a searchable taxonomy of common errors.</li></ul><p><strong>Golden evaluation data. </strong>After eight calibration rounds, writer agreement reached ~80%. To further stabilize labels, we used a model-in-the-loop consensus where:</p><ul><li>Multiple writers score each synopsis.</li><li>An LLM, guided by the rubric, aggregates to a final label.</li><li>Writers review cases with substantial disagreement.</li></ul><p>The result is a golden set of ~600 synopses with binary, criteria-level scores and explanations — <em>our North Star for aligning an LLM judge with expert opinion</em>.</p><h4>Member Implicit Feedback</h4><p>Netflix gauges implicit member feedback on a synopsis with two metrics:</p><ol><li><em>Take Fraction</em>: how often members who see a title’s synopsis choose to start watching it.</li><li><em>Abandonment Rate</em>: how often members start a title but stop watching soon after.</li></ol><p>A higher take fraction indicates that more members who see a synopsis choose to start watching, while a lower abandonment rate suggests an authentic, non-misleading presentation. Both of these metrics have been validated via A/B testing to serve as short-term behavioral proxies for long-term member retention. As part of evaluating our system, we also study the ability of LLM-derived quality scores to predict short-term engagement metrics. This step confirms that our scores capture behaviorally meaningful signals and assesses our ability to forecast member response to a given synopsis.</p><h3>Scaling Quality Scoring with LLM-as-a-Judge</h3><p>We begin our experiments by creating simple, per-criterion prompts that:</p><ol><li>Supply criterion-specific show metadata.</li><li>Summarize the relevant quality guidelines.</li><li>Use <a href="https://arxiv.org/abs/2205.11916">zero-shot chain-of-thought prompting</a> to elicit an explanation.</li><li>Request a binary decision for the synopsis.</li></ol><p>We found that using a single prompt to evaluate all quality criteria overloads the LLM and yields poor performance — <em>dedicated judges for each criterion perform better</em>. Because criteria are unique, each task has its own setup, but there are some shared components:</p><ul><li>We use the same LLM for all criteria.</li><li>The judge always outputs an explanation before its final score.</li><li>Final scores are binary.</li></ul><p>Due to our use of binary scoring, judges can be evaluated with simple accuracy metrics over the golden dataset. Next, we summarize the experiments that led to our final system.</p><p><strong>Prompt optimization.</strong> Because LLMs are sensitive to prompt phrasing, we apply <a href="https://arxiv.org/abs/2305.03495">Automatic Prompt Optimization (APO)</a> over a ~300-sample dev set. Scoring guidelines are provided as additional context to the prompt optimizer. After APO, we manually refine candidate prompts with the help of an LLM, yielding initial prompts with accuracies shown below. 
These prompts work well for some criteria (e.g., precision) but poorly for others (e.g., clarity), highlighting criterion-specific nuances.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*voqhaot2_6G67DSzH0rFzQ.png" /></figure><p><strong>Improved reasoning.</strong> Many failures of our initial system arise due to a lack of accurate reasoning through highly-subjective evaluation examples. To improve reasoning accuracy, we leverage two forms of inference-time scaling:</p><ul><li><em>Longer rationales</em>: increase the length of the rationale or explanation generated by the LLM prior to producing a final score.</li><li><em>Consensus scoring</em>: sample several outputs from the LLM and aggregate their scores to produce the final result.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*aEMhMA1OHAGkrvQ0Ljhesg.png" /></figure><p><strong>Tiered rationales.</strong> Using tone as an example, we tested whether longer rationales are helpful by defining three rationale length tiers (shown above) and comparing their accuracies. Accuracy rises with longer rationales but returns are diminishing. Medium rationales noticeably outperform short ones, while long rationales offer only a slight additional gain; see below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ER-gjiA0IhaYFLr1EchOuA.png" /></figure><p>Longer rationales improve performance but degrade human-readability, which is problematic given that explanations are key pieces of evidence for creative experts. As a solution, we adopt tiered rationales: <em>the judge reasons at any length but concisely summarizes its reasoning process prior to the final score. </em>Tiered rationales preserve the benefits of extended reasoning, make outputs easier to inspect, and even benefit scoring accuracy. For example, our tone evaluator improves from 86.55% to 87.85% binary accuracy when using tiered rationales.</p><p><strong>Consensus scoring.</strong> We can also allocate more inference-time compute by sampling multiple outputs per synopsis and aggregating their scores. We aggregate via a rounded average to ensure that the final score remains binary. For tone and clarity criteria with tiered rationales, 5× consensus scoring yields a clear accuracy boost as shown below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*U7YSyivmXC_M_kw51-tl-g.png" /></figure><p>Consensus scoring on the precision evaluator, which uses a vanilla (short) chain-of-thought, yields no benefit. As an explanation, we notice that longer rationales increase variance in scores across multiple outputs, while short rationales yield consistent scores. Consensus may be most useful for evaluators with longer rationales, where it helps to stabilize score variance. When shorter rationales are used, all scores tend to be the same, making consensus less meaningful.</p><p><strong>What about reasoning models?</strong> While our setup elicits reasoning from a standard LLM, we also explored quality scoring with true reasoning models (i.e., models that generate long reasoning trajectories prior to final output). For tone, using a reasoning model with 5× consensus yields improving accuracy with increasing reasoning effort, even outperforming tiered rationales at the highest reasoning effort; see below. 
However, we skip reasoning models in our final system, as they significantly increase inference costs for only a marginal performance gain.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*D9Ph4ASfjx5pzdfx5EnODQ.png" /></figure><p><strong>Agents-as-a-Judge for factuality.</strong> Synopses have four common types of factuality errors:</p><ol><li>Incorrect plot information.</li><li>Incorrect metadata (e.g., genre, location, release date).</li><li>Incorrect on- or off-screen talent.</li><li>Incorrect award information.</li></ol><p>Detecting these factuality errors requires comparing the synopsis to ground-truth context, where the necessary context varies per criterion. For example, plot information requires a plot summary or script, while award information needs a list of awards. As we have learned, simplicity drives reliability: <em>too much context or too many criteria harm accuracy</em>. Motivated by this idea, we adopt factuality agents, where each agent evaluates one narrow aspect of factuality.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*oXnk59qsQPASZAgTnPi0HA.png" /></figure><p>An agent receives context tailored to one facet of factuality and produces both a rationale and a binary factuality score. The final score of the Agents-as-a-Judge system is the minimum factuality score across agents — <em>any failed aspect yields an overall fail</em>. All rationales are fed to an LLM aggregator to produce a combined rationale to accompany the final score. As shown below, leveraging factuality agents significantly benefits scoring accuracy. Further benefits are achieved by using tiered rationales and consensus scoring within each agent.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_aZe60Sa1dwWvPKIH7YRgQ.png" /></figure><p><strong>Final system. </strong>In summary, our automatic evaluation system uses a combination of standard LLM-as-a-Judge, tiered rationales, consensus scoring, and Agents-as-a-Judge to maximize binary scoring accuracy for each criterion. A summary of the techniques used for each criterion and the associated binary scoring accuracy is provided below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*qZmJgN46QRLYpEhtlJutEQ.png" /></figure><h3>Member Validation of LLM-as-a-Judge</h3><p>Beyond expert agreement, we also study how LLM-as-a-Judge scores relate to member behavior. This analysis serves two goals:</p><ul><li>Further validating LLM-judge accuracy.</li><li>Linking creative quality to member-perceived quality.</li></ul><p>Framed as predictors of member outcomes, LLM judges help us assess how promotional assets affect viewing and determine which creative attributes matter most to members discovering content they enjoy. To perform this analysis, we take advantage of the fact that most shows have multiple, personalized synopses (i.e., a synopsis “suite”). Using this suite, we can measure the causal effect of synopsis selection on metrics like take fraction and abandonment rate.</p><p><strong>Our methodology. </strong>We correlate synopsis performance (take fraction or abandonment) with LLM quality scores. 
Specifically, within each show s, we relate changes in a synopsis’s LLM score to changes in its performance, normalizing by the show-level standard deviation and clustering standard errors by show; see below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*4-wIexhYKeqcMyf-a0TqIg.png" /></figure><p>β captures the average association between within-show changes in LLM score and changes in performance. While we don’t have clean, experimental variation in LLM scores, this analysis still validates predictive value and practical utility.</p><p><strong>Member-focused results.</strong> We report correlations for individual LLM criteria and a “Weighted Score” that combines all criteria to reduce noise and maximize signal from behavioral data. As shown below, results show promising prediction of take fraction and abandonment. Precision and clarity are especially predictive, and the weighted score provides a statistically useful signal of higher take and lower abandonment. In short, LLM evaluators capture factors that matter to members, making them a valuable tool for monitoring synopsis quality and engagement.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*HnRJo-DOrH9ifgekC5xq6A.png" /></figure><h3>Closing Remarks</h3><p>The LLM-as-a-Judge system used to evaluate show synopses at Netflix is the result of extensive experimentation grounded in both creative expertise and member outcomes. Building an automatic evaluation system that works reliably in practice is hard, and the approach we have described reflects countless lessons learned through iteration to improve accuracy and scalability. We have validated the system extensively with human evaluation at both the system and component levels, and we have shown that its outputs correlate with key streaming metrics. As a result, we are confident that it captures the dimensions of synopsis quality that matter most — both creatively and from the member perspective — which has driven its widespread adoption in the Netflix synopsis authoring workflow.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=6269251e6f28" width="1" height="1" alt=""><hr><p><a href="https://netflixtechblog.com/evaluating-netflix-show-synopses-with-llm-as-a-judge-6269251e6f28">Evaluating Netflix Show Synopses with LLM-as-a-Judge</a> was originally published in <a href="https://netflixtechblog.com">Netflix TechBlog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Stop Answering the Same Question Twice: Interval-Aware Caching for Druid at Netflix Scale]]></title>
            <link>https://netflixtechblog.com/stop-answering-the-same-question-twice-interval-aware-caching-for-druid-at-netflix-scale-22fadc9b840e?source=rss----2615bd06b42e---4</link>
            <guid isPermaLink="false">https://medium.com/p/22fadc9b840e</guid>
            <category><![CDATA[apache-druid]]></category>
            <category><![CDATA[cache]]></category>
            <dc:creator><![CDATA[Netflix Technology Blog]]></dc:creator>
            <pubDate>Mon, 06 Apr 2026 22:15:14 GMT</pubDate>
            <atom:updated>2026-04-06T22:15:13.989Z</atom:updated>
            <content:encoded><![CDATA[<p><em>By </em><a href="https://www.linkedin.com/in/sykesb/"><em>Ben Sykes</em></a></p><p>In a <a href="https://netflixtechblog.com/how-netflix-uses-druid-for-real-time-insights-to-ensure-a-high-quality-experience-19e1e8568d06">previous post</a>, we described how Netflix uses Apache Druid to ingest millions of events per second and query trillions of rows, providing the real-time insights needed to ensure a high-quality experience for our members. Since that post, our scale has grown considerably.</p><p>With our database holding over 10 trillion rows and regularly ingesting up to 15 million events per second, the value of our real-time data is undeniable. But this massive scale introduced a new challenge: queries. The live show monitoring, dashboards, automated alerting, canary analysis, and A/B test monitoring that are built on top of Druid became so heavily relied upon that the repetitive query load started to become a scaling concern in itself.</p><p>This post describes an experimental caching layer we built to address this problem, and the trade-offs we chose to accept.</p><h3><strong>The Problem</strong></h3><p>Our internal dashboards are heavily used for real-time monitoring, especially during high-profile live shows or global launches. A typical dashboard has 10+ charts, each triggering one or more Druid queries; one popular dashboard with 26 charts and stats generates 64 queries per load. When dozens of engineers view the same dashboards and metrics for the same event, the query volume quickly becomes unmanageable.</p><p>Take the popular dashboard above: 64 queries per load, refreshing every 10 seconds, viewed by 30 people. That’s 192 queries per second from one dashboard, mostly for nearly identical data. We still need Druid capacity for automated alerting, canary analysis, and ad-hoc queries. And because these dashboards request a rolling last-few-hours window, each refresh changes slightly as the time range advances.</p><p>Druid’s two built-in caches, the full-result cache and the per-segment cache, are effective. But neither is designed to handle the continuous, overlapping time-window shifts inherent to rolling-window dashboards. The full-result cache misses for two reasons.</p><ul><li>If the time window shifts even slightly, the query is different, so it’s a cache miss.</li><li>Druid deliberately refuses to cache results that involve realtime segments (those still being indexed), because it values deterministic, stable cache results and query correctness over a higher cache hit rate.</li></ul><p>The per-segment cache does help avoid redundant scans on historical nodes, but we still need to collect those cached segment results from each data node and merge them in the brokers with data from the realtime nodes for every query.</p><p>During major shows, rolling-window dashboards can generate a flood of near-duplicate queries that Druid’s caches mostly miss, creating heavy redundant load. At our scale, solving this by simply adding more hardware is prohibitively expensive.</p><p>We needed a smarter approach.</p><h3>The Insight</h3><p>When a dashboard requests the last 3 hours of data, the vast majority of that data, everything except the most recent few minutes, is already settled. 
The data from 2 hours ago won’t change.</p><p>What if we could remember the older portions of the result and only ask Druid for the part that’s actually new?</p><p>This is the core idea behind a new caching service that understands the structure of Druid queries and serves previously-seen results from cache while fetching only the freshest portion from Druid.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8bFAtFl8Z5pwEyoOgK54ag.png" /></figure><h3><strong>A Deliberate Trade-Off</strong></h3><p>Before diving into the implementation, it’s worth being explicit about the trade-off we’re making. Caching query results introduces some staleness, specifically, up to 5 seconds for the newest data. This is acceptable for most of our operational dashboards, which refresh every 10 to 30 seconds. In practice, many of our queries already set an end time of now-1m or now-5s to avoid the “flappy tail” that can occur with currently-arriving data.</p><p>Since our end-to-end data pipeline latency is typically under 5 seconds at P90, a 5-second cache TTL on the freshest data introduces negligible additional staleness on top of what’s already inherent in the system. We decided it was better to accept this small amount of staleness in exchange for significantly lower query load on Druid. But a 5s cache on its own is not very useful.</p><h3>Exponential TTLs</h3><p>Not all data points are equally trustworthy. In real-time analytics, there’s a well-known late-arriving data problem. Events can arrive out of order or be delayed in the ingestion pipeline. A data point from 30 seconds ago might still change as late-arriving events trickle in. A data point from 30 minutes ago is almost certainly final.</p><p>We use this observation to set cache TTLs that increase exponentially with the age of the data. Data less than 2 minutes old gets a minimum TTL of 5 seconds. After that, the TTL doubles for each additional minute of age: 10 seconds at 2 minutes old, 20 seconds at 3 minutes, 40 seconds at 4 minutes, and so on, up to a maximum TTL of 1 hour.</p><p>The effect is that fresh data cycles through the cache rapidly, so any corrections from late-arriving events in the most recent couple of minutes are picked up quickly. Older data lingers much longer, because our confidence in its accuracy grows with time.</p><p>For a 3-hour rolling window, the exponential TTL ensures the vast majority of the query is served from the cache, leaving Druid to only scan the most recent, unsettled data.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*oyLRzcOO7otVniQM6khD6g.png" /></figure><h3><strong>Bucketing</strong></h3><p>If we were to use a single-level cache key for the query and interval, similar to Druid’s existing result-level cache, we wouldn’t be able to extract only the relevant time range from cached results. A shifted window means a different key, which means a cache miss.</p><p>Instead, we use a map-of-maps. The top-level key is the query hash without the time interval; the inner keys are timestamps bucketed to the query granularity (or 1 minute, whichever is larger) and encoded as big-endian bytes so lexicographic order matches time. This enables efficient range scans; fetching all cached buckets between times A and B for a query hash. 
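To make this concrete, here is a minimal Python sketch of the age-based TTL schedule and the two-level key layout; the names are ours, not the production service’s, and the real implementation also strips certain query-context properties before hashing, as described later.</p><pre>import hashlib<br>import json<br>import struct<br><br>MIN_TTL_S, MAX_TTL_S = 5, 3600<br><br>def bucket_ttl_seconds(bucket_age_minutes):<br>    # Age-based TTL: 5s for buckets under 2 minutes old, doubling per extra minute, capped at 1 hour.<br>    if bucket_age_minutes &lt; 2:<br>        return MIN_TTL_S<br>    return min(MAX_TTL_S, MIN_TTL_S * 2 ** int(bucket_age_minutes - 1))<br><br>def outer_key(native_query):<br>    # Top-level key: hash of the query with the time interval stripped, so the same logical<br>    # query over different (overlapping) windows maps to the same cache entry.<br>    stripped = {k: v for k, v in native_query.items() if k != 'intervals'}<br>    return hashlib.sha256(json.dumps(stripped, sort_keys=True).encode()).hexdigest()<br><br>def inner_key(bucket_start_epoch_s):<br>    # Inner key: bucket timestamp as big-endian bytes, so lexicographic order matches time<br>    # order and a range scan can fetch all buckets between times A and B.<br>    return struct.pack('&gt;q', bucket_start_epoch_s)<br><br># 10s at 2 minutes old, 20s at 3 minutes, and the 1-hour cap for much older buckets.<br>assert bucket_ttl_seconds(2) == 10 and bucket_ttl_seconds(3) == 20<br>assert bucket_ttl_seconds(240) == MAX_TTL_S</pre><p>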
A 3-hour query at 1-minute granularity becomes 180 independent cached buckets, each with its own TTL; when the window shifts (e.g., 30 seconds later), we reuse most buckets from cache and only query Druid for the new data.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*6DL-4Lpu_CqfZBVmF2rNvg.png" /></figure><h3><strong>How It Works</strong></h3><p>Today, the cache runs as an external service integrated transparently by intercepting requests at the Druid Router and redirecting them to the cache. If the cache fully satisfies a request, it returns the result; otherwise it shrinks the time interval to the uncached portion and calls back into the Router, bypassing the redirect to query Druid normally. Non-cached requests (e.g., metadata queries or queries without time group-bys) pass straight through to Druid unchanged.</p><p>This intercepting proxy design allows us to enable or disable caching without any client changes, which has been key to its adoption. We see this setup as temporary while we work out a way to integrate this capability into Druid more natively.</p><p>When a cacheable query arrives (one that groups by time, such as a timeseries or groupBy query), the cache performs the following steps.</p><p><strong>Parsing and Hashing.</strong> We parse each incoming query to extract the time interval, granularity, and structure, then compute a SHA-256 hash of the query with the time interval and parts of the context removed. That hash is the cache key: it encodes <em>what</em> is being asked (datasource, filters, aggregations, granularity) but not <em>when</em>, so the same logical query over different overlapping time windows maps to the same cache entry. There are some context properties that can alter the response structure or contents, so these are included in the cache key.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*QaeamWjRYGEQkW6t-XePJw.png" /></figure><p><strong>Cache Lookup.</strong> Using the cache key, we fetch cached points within the requested range, but only if they’re contiguous from the start. Because bucket TTLs can expire unevenly, gaps can appear; when we hit a gap, we stop and fetch all newer data from Druid. This guarantees a complete, unbroken result set while sending at most one Druid query, rather than “filling gaps” with multiple small, fragmented queries that would increase Druid load.</p><p><strong>Fetching the Missing Tail.</strong> On a partial cache hit (e.g., 2h 50m of a 3h window), we rebuild the query with a narrowed interval for the missing 10 minutes and send only that to Druid. Since Druid then scans just the recent segments for a small time range, the query is usually faster and cheaper than the original.</p><p><strong>Combining.</strong> The cached data and fresh data are concatenated, sorted by timestamp, and returned to the client. From the client’s perspective, the response looks identical to what Druid would have returned: same JSON format, same fields.</p><p><strong>Asynchronous Caching.</strong> The fresh data from Druid is parsed into individual time-granularity buckets and written back to the cache asynchronously, so we don’t add latency to the response path.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*k3uoqCFgzbRZEuiNo_Rlzw.png" /></figure><h3>Negative Caching</h3><p>Some metrics are sparse. Certain time buckets may genuinely have no data. 
Without special handling, the cache would treat these empty buckets as gaps and re-query Druid for them every time.</p><p>We handle this by caching empty sentinel values for time buckets where Druid returned no data. Our gap-detection logic recognizes these empty entries as valid cached data rather than missing data, preventing needless re-queries for naturally sparse metrics.</p><p>However, we’re careful not to negative-cache trailing empty buckets. If a query returns data up to minute 45 and nothing after, we only cache empty entries for gaps <em>between</em> data points, not after the last one. This avoids incorrectly caching “no data” for time periods where events simply haven’t arrived yet, which would exacerbate the chart delays of late arriving data.</p><h3>The Storage Layer</h3><p>For the backing store, we use Netflix’s <a href="https://netflixtechblog.com/introducing-netflixs-key-value-data-abstraction-layer-1ea8a0a11b30">Key-Value Data Abstraction Layer (KVDAL)</a>, backed by Cassandra. KVDAL provides a two-level map abstraction, a natural fit for our needs. The outer key is the query hash, and the inner keys are timestamps. Crucially, KVDAL supports independent TTLs on each inner key-value pair, eliminating the need for us to manage cache eviction manually.</p><p>This two-level structure gives us efficient range queries over the inner keys, which is exactly what we need for partial cache lookups: “give me all cached buckets between time A and time B for query hash X.”</p><h3><strong>Results</strong></h3><p>The biggest win is during high-volume events (e.g., live shows): when many users view the same dashboards, the cache serves most identical queries as full hits, so the query rate reaching Druid is essentially the same with 1 viewer or 100. The scaling bottleneck moves from Druid’s query capacity to the much cheaper-to-scale cache, and with ~5.5 ms P90 cache responses, dashboards load faster for everyone.</p><p>On a typical day, 82% of real user queries get at least a partial cache hit, and 84% of result data is served from cache. As a result, the queries that reach Druid scan much narrower time ranges, touching fewer segments and processing less data, freeing Druid to focus on aggregating the newest data instead of repeatedly re-querying historical segments.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*S2tG3E2KlGIujSaEg4oetA.png" /></figure><p>An experiment validated this, showing about a 33% drop in queries to Druid and a 66% improvement in overall P90 query times. It also cut result bytes and segments queried, and in some cases, enabling the cache reduced result bytes by more than 14x. Caveat: the size of these gains depends heavily on how similar and repetitive the query workload is.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/592/1*B-j0CnKKr5caFQRI8zVZ1A.png" /></figure><h3>Looking Ahead</h3><p>This caching layer is still experimental, but results are promising and we’re exploring next steps. We’ve added partial support for templated SQL so dashboard tools can benefit without writing native Druid queries.</p><p>Longer term, we’d like interval-aware caching to be built into Druid: an external proxy adds infrastructure to manage, extra network hops, and workarounds (like SQL templating) to extract intervals. Implemented inside Druid, it could be more efficient, with direct access to the query planner and segment metadata, and benefit the broader community without custom infrastructure. 
We’d likely ship it as an opt-in, configurable, result-level cache in the Brokers, with metrics to tune TTLs and measure effectiveness. Please leave a comment if you have a use-case that could benefit from this feature.</p><p>More broadly, this strategy, splitting time-series results into independently cached, granularity-aligned buckets with age-based exponential TTLs, isn’t Druid-specific and could apply to any time-series database with frequent overlapping-window queries.</p><h3>Summary</h3><p>As more Netflix teams rely on real-time analytics, query volume grows too. Dashboards are essential at our scale, but their popularity can become a scaling bottleneck. By inserting an intelligent cache between dashboards and Druid, one that understands query structure, breaks results into granularity-aligned buckets, and trades a small amount of staleness for much lower Druid load, we’ve increased query capacity without scaling infrastructure proportionally, and hope to deliver these benefits to the Druid community soon as a built-in Druid feature.</p><p>Sometimes the best way to handle a flood of queries is to stop answering the same question twice.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=22fadc9b840e" width="1" height="1" alt=""><hr><p><a href="https://netflixtechblog.com/stop-answering-the-same-question-twice-interval-aware-caching-for-druid-at-netflix-scale-22fadc9b840e">Stop Answering the Same Question Twice: Interval-Aware Caching for Druid at Netflix Scale</a> was originally published in <a href="https://netflixtechblog.com">Netflix TechBlog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Powering Multimodal Intelligence for Video Search]]></title>
            <link>https://netflixtechblog.com/powering-multimodal-intelligence-for-video-search-3e0020cf1202?source=rss----2615bd06b42e---4</link>
            <guid isPermaLink="false">https://medium.com/p/3e0020cf1202</guid>
            <category><![CDATA[distributed-systems]]></category>
            <category><![CDATA[semantic-search]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[annotations]]></category>
            <category><![CDATA[asset-management-software]]></category>
            <dc:creator><![CDATA[Netflix Technology Blog]]></dc:creator>
            <pubDate>Sat, 04 Apr 2026 00:44:32 GMT</pubDate>
            <atom:updated>2026-04-06T01:29:35.470Z</atom:updated>
            <content:encoded><![CDATA[<h3>Synchronizing the Senses: Powering Multimodal Intelligence for Video Search</h3><p>By <a href="https://www.linkedin.com/in/meenakshijindal/">Meenakshi Jindal</a> and <a href="https://www.linkedin.com/in/~munya/">Munya Marazanye</a></p><p>Today’s filmmakers capture more footage than ever to maximize their creative options, often generating hundreds, if not thousands, of hours of raw material per season or franchise. Extracting the vital moments needed to craft compelling storylines from this sheer volume of media is a notoriously slow and punishing process. When editorial teams cannot surface these key moments quickly, creative momentum stalls and severe fatigue sets in.</p><p>Meanwhile, the broader search landscape is undergoing a profound transformation. We are moving beyond simple keyword matching toward AI-driven systems capable of understanding deep context and intent. Yet, while these advances have revolutionized text and image retrieval, searching through video, the richest medium for storytelling, remains a daunting “needle in a haystack” challenge.</p><p>The solution to this bottleneck cannot rely on a single algorithm. Instead, it demands orchestrating an expansive ensemble of specialized models: tools that identify specific characters, map visual environments, and parse nuanced dialogue. The ultimate challenge lies in unifying these heterogeneous signals, textual labels, and high-dimensional vectors into a cohesive, real-time intelligence. One that cuts through the noise and responds to complex queries at the speed of thought, truly empowering the creative process.</p><h3>Why Video Search is Deceptively Complex</h3><p>Since video is a multi-layered medium, building an effective search engine required us to overcome significant technical bottlenecks. Multi-modal search is exponentially more complex than traditional indexing: it demands the unification of outputs from multiple specialized models, each analyzing a different facet of the content to generate its own distinct metadata. The ultimate challenge lies in harmonizing these heterogeneous data streams to support rich, multi-dimensional queries in real time.</p><ol><li><strong>Unifying the Timeline<br></strong>To ensure critical moments aren’t lost across scene boundaries, each model segments the video into overlapping intervals. The resulting metadata varies wildly, ranging from discrete text-based object labels to dense vector embeddings. Synchronizing these disjointed, multi-modal timelines into a unified chronological map presents a massive computational hurdle.</li><li><strong>Processing at Scale<br></strong>A standard 2,000-hour production archive can contain over 216 million frames. When processed through an ensemble of specialized models, this baseline explodes into billions of multi-layered data points. Storing, aligning, and intersecting this staggering volume of records while maintaining sub-second query latency far exceeds the capabilities of traditional database architectures.</li><li><strong>Surfacing the Best Moments<br></strong>Surface-level mathematical similarity is not enough to identify the most relevant clip. Because continuous shots naturally generate thousands of visually redundant candidates, the system must dynamically cluster and deduplicate results to surface the singular best match for a given scene. 
To achieve this, effective ranking relies on a sophisticated hybrid scoring engine that weighs symbolic text matches against semantic vector embeddings, ensuring both precision and interpretability.</li><li><strong>Zero-Friction Search<br></strong>For filmmakers, search is a stream-of-consciousness process, and a ten-second delay can disrupt the creative flow. Because sequential scanning of raw footage is fundamentally unscalable, our architecture is built to navigate and correlate billions of vectors and metadata records efficiently, operating at the speed of thought.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wVRT7XY4C9bzNHE-lk-ViA.png" /><figcaption><em>Figure 1: Unified Multimodal Result Processing</em></figcaption></figure><h3>The Ingestion and Fusion Pipeline</h3><p>To ensure system resilience and scalability, the transition from raw model output to searchable intelligence follows a decoupled, three-stage process:</p><h4>1. Transactional Persistence</h4><p>Raw annotations are ingested via high-availability <a href="https://netflixtechblog.com/data-ingestion-pipeline-with-operation-management-3c5c638740a8">pipelines</a> and stored in our <a href="https://netflixtechblog.com/scalable-annotation-service-marken-f5ba9266d428">annotation service</a>, which leverages <a href="https://cassandra.apache.org/_/index.html">Apache Cassandra</a> for distributed storage. This stage strictly prioritizes data integrity and high-speed write throughput, guaranteeing that every piece of model output is safely captured.</p><pre>{<br>  &quot;type&quot;: &quot;SCENE_SEARCH&quot;,<br>  &quot;time_range&quot;: {<br>    &quot;start_time_ns&quot;: 4000000000,<br>    &quot;end_time_ns&quot;: 9000000000<br>  },<br>  &quot;embedding_vector&quot;: [<br>    -0.036, -0.33, -0.29 ...<br>  ],<br>  &quot;label&quot;: &quot;kitchen&quot;,<br>  &quot;confidence_score&quot;: 0.72<br>}</pre><p><em>Figure 2: Sample Scene Search Model Annotation Output</em></p><h4>2. Offline Data Fusion</h4><p>Once the annotation service securely persists the raw data, the system publishes an event via <a href="https://kafka.apache.org/">Apache Kafka</a> to trigger an asynchronous processing job. Serving as the architecture’s central logic layer, this offline pipeline handles the heavy computational lifting out-of-band. It performs precise temporal intersections, fusing overlapping annotations from disparate models into cohesive, unified records that empower complex, multi-dimensional queries.</p><p>Cleanly decoupling these intensive processing tasks from the ingestion pipeline guarantees that complex data intersections never bottleneck real-time intake. As a result, the system maintains maximum uptime and peak responsiveness, even when processing the massive scale of the Netflix media catalog.</p><p>To achieve this intersection at scale, the offline pipeline normalizes disparate model outputs by mapping them into fixed-size temporal buckets. This discretization process unfolds in three steps:</p><ul><li><strong>Bucket Mapping:</strong> Continuous detections are segmented into discrete intervals. 
For example, if a model detects a character <em>“Joey”</em> from seconds 2 through 8, the pipeline maps this continuous span of frames into seven distinct one-second buckets.</li><li><strong>Annotation Intersection:</strong> When multiple models generate annotations for the same temporal bucket, such as character recognition <em>“Joey”</em> and scene detection <em>“kitchen” </em>overlapping in second 4, the system fuses them into a single, comprehensive record.</li><li><strong>Optimized Persistence:</strong> These newly enriched records are written back to Cassandra as distinct entities. This creates a highly optimized, second-by-second index of multi-modal intersections, perfectly associating every fused annotation with its source asset.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ft36HZpYbqUBfhP8p0iShw.png" /><figcaption><em>Figure 3: Temporal Data Fusion with Fixed-Size Time Buckets</em></figcaption></figure><p>The following record shows the overlap of the character<em> “Joey”</em> and scene <em>“kitchen” </em>annotations during a 4 to 5 second window in a video asset:</p><pre>{<br>  &quot;associated_ids&quot;: {<br>    &quot;MOVIE_ID&quot;: &quot;81686010&quot;,<br>    &quot;ASSET_ID&quot;: &quot;01325120–7482–11ef-b66f-0eb58bc8a0ad&quot;<br>  },<br>  &quot;time_bucket_start_ns&quot;: 4000000000,<br>  &quot;time_bucket_end_ns&quot;: 5000000000,<br>  &quot;source_annotations&quot;: [<br>    {<br>      &quot;annotation_id&quot;: &quot;7f5959b4–5ec7–11f0-b475–122953903c43&quot;,<br>      &quot;annotation_type&quot;: &quot;CHARACTER_SEARCH&quot;,<br>      &quot;label&quot;: &quot;Joey&quot;,<br>      &quot;time_range&quot;: {<br>        &quot;start_time_ns&quot;: 2000000000,<br>        &quot;end_time_ns&quot;: 8000000000<br>      }<br>    },<br>    {<br>      &quot;annotation_id&quot;: &quot;c9d59338–842c-11f0–91de-12433798cf4d&quot;,<br>      &quot;annotation_type&quot;: &quot;SCENE_SEARCH&quot;,<br>      &quot;time_range&quot;: {<br>        &quot;start_time_ns&quot;: 4000000000,<br>        &quot;end_time_ns&quot;: 9000000000<br>      },<br>      &quot;label&quot;: &quot;kitchen&quot;,<br>      &quot;embedding_vector&quot;: [<br>        0.9001, 0.00123 ....<br>      ]<br>    }<br>  ]<br>}</pre><p><em>Figure 4: Sample Intersection Record for Character + Scene Search</em></p><h4>3. Indexing for Real Time Search</h4><p>Once the enriched temporal buckets are securely persisted in Cassandra, a subsequent event triggers their ingestion into Elasticsearch.</p><p>To guarantee absolute data consistency, the pipeline executes upsert operations using a composite key <em>(asset ID + time bucket)</em> as the unique document identifier. If a temporal bucket already exists for a specific second of video, perhaps populated by an earlier model run, the system intelligently updates the existing record rather than generating a duplicate. This mechanism establishes a single, unified source of truth for every second of footage.</p><p>Architecturally, the pipeline structures each temporal bucket as a nested document. The root level captures the overarching asset context, while associated child documents house the specific, multi-modal annotation data. 
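A sketch of the kind of cross-annotation lookup this nesting enables is shown below.</p><p>For illustration only, the following Python dict renders an Elasticsearch bool query that looks for temporal buckets where a character annotation for <em>“Joey”</em> and a scene annotation for <em>“kitchen”</em> co-occur; the field names follow the sample records above rather than the exact production mapping.</p><pre># Illustration only (not the production mapping): find temporal buckets where a<br># CHARACTER_SEARCH annotation for 'Joey' and a SCENE_SEARCH annotation for 'kitchen'<br># co-occur, assuming 'source_annotations' is indexed as a nested field.<br>query = {<br>    'query': {<br>        'bool': {<br>            'filter': [{'term': {'associated_ids.MOVIE_ID': '81686010'}}],<br>            'must': [<br>                {'nested': {'path': 'source_annotations', 'query': {'bool': {'must': [<br>                    {'term': {'source_annotations.annotation_type': 'CHARACTER_SEARCH'}},<br>                    {'match': {'source_annotations.label': 'Joey'}},<br>                ]}}}},<br>                {'nested': {'path': 'source_annotations', 'query': {'bool': {'must': [<br>                    {'term': {'source_annotations.annotation_type': 'SCENE_SEARCH'}},<br>                    {'match': {'source_annotations.label': 'kitchen'}},<br>                ]}}}},<br>            ],<br>        }<br>    },<br>    'sort': [{'time_bucket_start_ns': 'asc'}],<br>}<br># Each nested clause is evaluated against individual child annotations, so both<br># conditions must be satisfied within the same temporal-bucket document.</pre><p>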
This hierarchical data model is precisely what empowers users to execute highly efficient, cross-annotation queries at scale.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Mt1fbK8AvwgNk-LY55wFqg.png" /><figcaption><em>Figure 5: Simplified Elasticsearch Document Structure</em></figcaption></figure><h3>Multimodal Discovery and Result Ranking</h3><p>The search service provides a high-performance interface for real-time discovery across the global Netflix catalog. Upon receiving a user request, the system immediately initiates a query preprocessing phase, generating a structured execution plan through three core steps:</p><ul><li><strong>Query Type Detection:</strong> Dynamically categorizes the incoming request to route it down the most efficient retrieval path.</li><li><strong>Filter Extraction:</strong> Isolates specific semantic constraints such as character names, physical objects, or environmental contexts to rapidly narrow the candidate pool.</li><li><strong>Vector Transformation:</strong> Converts raw text into high-dimensional, model-specific embeddings to enable deep, context-aware semantic matching.</li></ul><p>Once generated, the system compiles this structured plan into a highly optimized Elasticsearch query, executing it directly against the pre-fused temporal buckets to deliver instantaneous, frame-accurate results.</p><h4>Fine-Tuning Semantic Search</h4><p>To support the diverse workflows of different production teams, the system provides fine-grained control over search behavior through configurable parameters:</p><ul><li><strong>Exact vs. Approximate Search:</strong> Users can toggle between exact k-Nearest Neighbors (k-NN) for uncompromising precision, and Approximate Nearest Neighbor (ANN) algorithms (such as <a href="https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world">HNSW</a>) to maintain blazing speed when querying massive datasets.</li><li><strong>Dynamic Similarity Metrics:</strong> The system supports multiple distance calculations, including <a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine similarity</a> and <a href="https://en.wikipedia.org/wiki/Euclidean_distance">Euclidean distance</a>. Because different models shape their high-dimensional vector spaces distinctly based on their underlying training architectures, the flexibility to swap metrics ensures that mathematical closeness perfectly translates to true semantic relevance.</li><li><strong>Confidence Thresholding:</strong> By establishing strict minimum score boundaries for results, users can actively prune the long tail of low-probability matches. This aggressively filters out visual noise, guaranteeing that creative teams are not distracted and only review results that meet a rigorous standard of mathematical similarity.</li></ul><h4>Textual Analysis &amp; Linguistic Precision</h4><p>To handle the deep nuances of dialogue-heavy searches, such as isolating a character’s exact catchphrase amidst thousands of hours of speech, we implement a sophisticated text analysis strategy within Elasticsearch. This ensures that conversational context is captured and indexed accurately.</p><ul><li><strong>Phrase &amp; Proximity Matching:</strong> To respect the narrative weight of specific lines (e.g., <em>“Friends don’t lie”</em> in Stranger Things), we leverage <a href="https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-match-query-phrase">match-phrase</a> queries with a configurable <em>slop</em> parameter. 
This guarantees the system retrieves the correct scene even if the user’s memory slightly deviates from the exact transcription.</li><li><strong>N-Gram Analysis for Partial Discovery:</strong> Because video search is inherently exploratory, we utilize edge <a href="https://en.wikipedia.org/wiki/N-gram">N-gram</a> tokenizers to support search-as-you-type functionality. By actively indexing dialogue and metadata substrings, the system surfaces frame-accurate results the moment an editor begins typing, drastically reducing cognitive load.</li><li><strong>Tokenization and Linguistic Stemming:</strong> To seamlessly support the global scale of the Netflix catalog, our analysis chain applies sophisticated <a href="https://en.wikipedia.org/wiki/Stemming">stemming</a> across multiple languages. This ensures a query for <em>“running”</em> automatically intersects with scenes tagged with <em>“run”</em> or <em>“ran”</em> collapsing grammatical variations into a single, unified search intent.</li><li><strong>Levenshtein Fuzzy Matching:</strong> To account for transcription anomalies or phonetic misspellings, we incorporate fuzzy search capabilities based on <a href="https://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein</a> distance algorithms. This intelligent soft-matching approach ensures that high-value shots are never lost to minor data-entry errors or imperfect queries.</li></ul><h4>Aggregations and Flexible Grouping</h4><p>The architecture operates at immense scale, seamlessly executing queries within a single title or across thousands of assets simultaneously. To combat result fatigue, the system leverages custom aggregations to intelligently cluster and group outputs based on specific parameters, such as isolating the top 5 most relevant clips of an actor per episode. This guarantees a diverse, highly representative return set, preventing any single asset from dominating the search results.</p><h4>Search Response Curation</h4><p>While temporal buckets are the internal mechanism for search efficiency, the system post-processes Elasticsearch results to reconstruct original time boundaries. The reconstruction process ensures results reflect narrative scene context rather than arbitrary intervals. 
Depending on the query intent, the system generates results based on two logic types:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*JamOz6AmYyLfsCOBtUY3Bg.png" /><figcaption><em>Figure 6: Depiction of Temporal Union vs Intersection</em></figcaption></figure><ul><li><strong>Union:</strong> Returns the full span of all matching annotations <em>(3–8 sec),</em> which prioritizes breadth, capturing any instance where a specified feature occurs.</li><li><strong>Intersection:</strong> Returns only the exact overlapping duration of matching signals <em>(4–6 sec).</em> The intersection logic focuses on co-occurrence, isolating moments when multiple criteria align.</li></ul><pre>{<br>  &quot;entity_id&quot;: {<br>    &quot;entity_type&quot;: &quot;ASSET&quot;,<br>    &quot;id&quot;: &quot;1bba97a1–3562–4426–9cd2-dfbacddcb97b&quot;<br>  },<br>  &quot;range_intervals&quot;: [<br>    {<br>      &quot;intersection_time_range&quot;: {<br>        &quot;start_time_ns&quot;: 4000000000,<br>        &quot;end_time_ns&quot;: 8000000000<br>      },<br>      &quot;union_time_range&quot;: {<br>        &quot;start_time_ns&quot;: 2000000000,<br>        &quot;end_time_ns&quot;: 9000000000<br>      },<br>      &quot;source_annotations&quot;: [<br>        {<br>          &quot;annotation_id&quot;: &quot;fc1525d0–93a7–11ef-9344–1239fc3a8917&quot;,<br>          &quot;annotation_type&quot;: &quot;SCENE_SEARCH&quot;,<br>          &quot;metadata&quot;: {<br>            &quot;label&quot;: &quot;kitchen&quot;<br>          }<br>        },<br>        {<br>          &quot;annotation_id&quot;: &quot;5974fb01–93b0–11ef-9344–1239fc3a8917&quot;,<br>          &quot;annotation_type&quot;: &quot;CHARACTER_SEARCH&quot;,<br>          &quot;metadata&quot;: {<br>            &quot;character_name&quot;: [<br>              &quot;Joey&quot;<br>            ]<br>          }<br>        }<br>      ]<br>    }<br>  ]<br>}</pre><p><em>Figure 7: Sample Response for “Joey” + “Kitchen” Query</em></p><h3><strong>Future Extensions</strong></h3><p>While our current architecture establishes a highly resilient and scalable foundation, it represents only the first phase of our multi-modal search vision. To continuously close the gap between human intuition and machine retrieval, our roadmap focuses on three core evolutions:</p><ul><li><strong>Natural Language Discovery:</strong> Transitioning from structured JSON payloads to fluid, conversational interfaces (e.g., <em>“Find the best tracking shots of Tom Holland running on a roof”</em>). This will abstract away underlying query complexity, allowing creatives to interact with the archive organically.</li><li><strong>Adaptive Ranking:</strong> Implementing machine learning feedback loops to dynamically refine scoring algorithms. By continuously analyzing how editorial teams interact with and select clips, the system will self-tune its mathematical definition of semantic relevance over time.</li><li><strong>Domain-Specific Personalization:</strong> Dynamically calibrating search weights and retrieval behaviors to match the exact context of the user. 
The platform will tailor its results depending on whether a team is cutting high-action marketing trailers, editing narrative scenes, or conducting deep archival research.</li></ul><p>Ultimately, these advancements will elevate the platform from a highly optimized search engine into an intelligent creative partner, fully equipped to navigate the ever-growing complexity and scale of global video media.</p><h3><strong>Acknowledgements</strong></h3><p>We would like to extend our gratitude to the following teams and individuals whose expertise and collaboration were instrumental in the development of this system:</p><ul><li><strong>Data Science Engineering:</strong> <a href="https://www.linkedin.com/in/nagendrak/">Nagendra Kamath</a>, <a href="https://www.linkedin.com/in/chao-pan-02791824/">Chao Pan</a>, <a href="mailto:prachees@netflix.com">Prachee Sharma</a>, <a href="https://www.linkedin.com/in/ying-liao-nyu/">Ying Liao</a> and <a href="mailto:csoo@netflix.com">Carolyn Soo</a> for the critical media model insights that informed our architectural design.</li><li><strong>Product Management:</strong> <a href="https://www.linkedin.com/in/nimesh-narayan/">Nimesh Narayan</a>, <a href="https://www.linkedin.com/in/ian-krabacher-06816710/">Ian Krabacher</a>, <a href="https://www.linkedin.com/in/ananya-ani-poddar/">Ananya Poddar</a>, <a href="https://www.linkedin.com/in/meghanbailey02/">Meghan Bailey</a> and <a href="https://www.linkedin.com/in/anitakuc/">Anita Kuc</a> for defining the user requirements and product vision.</li><li><strong>Media Production Suite Team:</strong> <a href="https://www.linkedin.com/in/szymon-borodziuk/">Szymon Borodziuk</a>, <a href="https://www.linkedin.com/in/mike-czarnota/">Mike Czarnota</a>, <a href="https://www.linkedin.com/in/dsarkowicz/?locale=en">Dominika Sarkowicz</a>, <a href="mailto:bkoval@netflix.com">Bohdan Koval</a> and <a href="https://www.linkedin.com/in/sabov/">Sasha Sabov</a> for their work in engineering the end-user search experience.</li><li><strong>Asset Management Platform Team:</strong> For their collaborative efforts in operationalizing this design and bringing the system into production.</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=3e0020cf1202" width="1" height="1" alt=""><hr><p><a href="https://netflixtechblog.com/powering-multimodal-intelligence-for-video-search-3e0020cf1202">Powering Multimodal Intelligence for Video Search</a> was originally published in <a href="https://netflixtechblog.com">Netflix TechBlog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Smarter Live Streaming at Scale: Rolling Out VBR for All Netflix Live Events]]></title>
            <link>https://netflixtechblog.com/smarter-live-streaming-at-scale-rolling-out-vbr-for-all-netflix-live-events-c8f833b238cc?source=rss----2615bd06b42e---4</link>
            <guid isPermaLink="false">https://medium.com/p/c8f833b238cc</guid>
            <category><![CDATA[encoding]]></category>
            <category><![CDATA[content-delivery-platform]]></category>
            <category><![CDATA[adaptive-bitrate]]></category>
            <category><![CDATA[live-streaming]]></category>
            <dc:creator><![CDATA[Netflix Technology Blog]]></dc:creator>
            <pubDate>Thu, 02 Apr 2026 21:46:27 GMT</pubDate>
            <atom:updated>2026-04-02T21:46:26.036Z</atom:updated>
            <content:encoded><![CDATA[<p>By Renata Teixeira, Zhi Li, Reenal Mahajan, and Wei Wei</p><p>On January 26, 2026, we flipped an important switch for Live at Netflix: <strong>all Live events are now encoded using VBR (Variable Bitrate) instead of CBR (Constant Bitrate)</strong>. It sounds like a small configuration change, but it required us to revisit some of the foundational assumptions behind how we deliver Live video at global scale.</p><p>VBR lets us tailor the bitrate to the actual complexity of the scene, instead of sending every second of video at roughly the same bitrate. When a scene is simple, VBR “shaves off” bits that wouldn’t improve what you see on screen; when a scene is complex, it spends more bits to preserve quality. (The more general idea is often referred to as capped variable bitrate, or capped VBR.) That makes our encodes more efficient and our network more scalable. But it also makes traffic much less predictable: large bitrate swings can overload servers and CDNs, and our old assumptions about “what bitrate equals what quality” no longer hold. As a result, we have to rethink both how we manage delivery and capacity, and which bitrates we offer for each version of the stream. In our live pipeline, we currently use AWS Elemental MediaLive, where this “capped” VBR is implemented using the QVBR (Quality‑Defined Variable Bitrate) setting.</p><h3>| Why Move Live from CBR to VBR?</h3><p>Our initial Live encoding pipeline used constant bitrate (CBR). For each encoded stream, we configured a resolution and a nominal bitrate — for example, a 1080p stream targeting 5 Mbps — and the actual bitrate stayed close to that target over time. This predictability made both capacity planning and day‑to‑day operations easier. If a server could safely deliver around 100 Gbps of Live traffic, and each stream averaged close to its nominal rate, we could admit on the order of twenty thousand concurrent sessions per server and be confident we were operating within limits. During an event, the total traffic sent by a server would change mainly when members joined or left; as long as concurrency was stable, traffic stayed relatively flat. The network saw a smooth, easy‑to‑reason‑about load profile, and large changes in throughput almost always reflected a real change in usage, not just a different scene on screen.</p><p>The problem is that content isn’t constant. A talking‑head segment in a studio or a simple animation is much easier to compress than a sequence of rapid camera moves and fast‑moving athletes in front of a highly detailed crowd. With CBR, both the easy and the hard segments get the same bitrate. In simple scenes, we spend more bits than we need; in complex scenes, we sometimes don’t spend enough.</p><p>VBR flips the objective. Rather than aiming for a fixed bitrate, the encoder aims for a target quality and is allowed to raise or lower the bitrate according to scene complexity. When the picture is easy to encode, VBR can drop the bitrate substantially below the old CBR level while keeping quality constant. When the action heats up, it can temporarily use more bits to avoid visible artifacts.</p><p>The figure below shows the per‑segment bitrate over time for the same episode of WWE RAW, encoded once with CBR and once with VBR at a nominal 8 Mbps. With CBR (blue), the bitrate wobbles a bit from segment to segment but stays close to the target; if you average it over a minute, it’s practically a straight line. 
We end up spending roughly the same number of bits on simple scenes, like the waiting room at the start of the stream (shaded region), as on complex scenes, like the confetti‑filled shot later in the show. With VBR (orange), the encoder can drop the bitrate for the waiting room to a small fraction of the nominal rate, while allowing the confetti‑filled shot to use a much higher bitrate to preserve quality.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_MoqLhu01wArcD1n58SZyg.png" /><figcaption>Per-segment bitrate over a WWE RAW episode for CBR and VBR encodes at a nominal of 8 Mbps.</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/512/1*huWI8zAfmmwWq5IsxKIL3Q.png" /><figcaption><strong><em>“Waiting room” scene:</em></strong><em> visually simple and easy to compress, so VBR can safely use a low bitrate.</em></figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/512/1*jBCcIjhCgSi9kPXLrsd1FA.png" /><figcaption><strong><em>“Confetti-filled” shot:</em></strong><em> visually complex and noisy, so VBR spends many more bits to maintain quality.</em></figcaption></figure><p>At Netflix scale, the shift to VBR has several important effects. The most important is efficiency on the network: we reduce the average number of bytes we need to deliver a full event, which in turn reduces the traffic needed to fill all of the servers in Open Connect, our content delivery network (CDN), and the traffic needed to serve segments to members. The second is quality of experience. Because VBR sends fewer bits for similar video quality, we see fewer rebuffers and lower start‑up delay. Across multiple A/B tests on different Live events, we observed about 5% fewer rebuffers per hour, while transferring roughly 15% fewer bytes on average and around a 10% reduction in traffic at the peak minute.</p><h3>| When Efficiency Fights Stability</h3><p>The central challenge with VBR for Live is not that bitrate varies at all (it does under CBR as well), but that VBR can have much deeper, longer dips in bitrate, tightly coupled to what’s happening in the content.</p><p>Under CBR, a 5 Mbps stream is effectively that: on a per‑minute basis, traffic is remarkably flat. A server that is comfortably handling, say, ten thousand such sessions now is likely to still be comfortable in a minute, barring a wave of new joins. It’s safe for our steering logic to look at current traffic, see plenty of headroom, and route additional sessions to that server.</p><p>Under VBR, the same stream behaves very differently. During a slow, easy sequence, the encoder might only generate 2 Mbps for the 5 Mbps stream — or even less — to maintain its target quality, and it can stay at that lower level for an extended period. The server then appears to have plenty of unused capacity: per‑session bitrate is low and aggregate traffic is well below its limits. Our steering systems naturally interpret this as a signal that the server is under‑utilized and can accept more sessions.</p><p>The problem surfaces when the content changes. A fight starts, confetti begins to fall, or the camera cuts to a highly detailed, fast‑moving shot. To preserve quality, VBR may increase the bitrate to 6, 7, or 8 Mbps on the very next segments. If the server has admitted many additional sessions during the preceding low‑bitrate period, the aggregate traffic can suddenly exceed what the network link or NIC can sustain. 
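</p><p>To make the headroom math concrete, here is a deliberately simplified Python sketch; the capacity figure, per-session bitrates, and the sizing calculation are illustrative, not our actual steering logic:</p><pre>
# Illustrative numbers only; real traffic steering considers many more signals.
LINK_CAPACITY_GBPS = 100.0   # what one server can safely serve
QUIET_SCENE_MBPS = 2.0       # observed per-session bitrate during an easy scene
COMPLEX_SCENE_MBPS = 8.0     # per-session bitrate once the action picks up

def sessions_that_fit(per_session_mbps):
    """How many sessions appear to fit if we size by the observed per-session bitrate."""
    return int(LINK_CAPACITY_GBPS * 1000 / per_session_mbps)

admitted = sessions_that_fit(QUIET_SCENE_MBPS)          # 50,000 sessions look safe
aggregate_gbps = admitted * COMPLEX_SCENE_MBPS / 1000   # ...until the scene changes
print(admitted, aggregate_gbps)                         # prints 50000 and 400.0: four times the 100 Gbps link
</pre><p>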
Latency rises, packets are dropped, and devices start to experience stalls or quality downshifts. In extreme cases, this pattern of “bitrate dips followed by spikes” can destabilize parts of the system.</p><h3>| Making Servers Aware of Bitrate Variability</h3><p>These long VBR bitrate dips are great for efficiency, but — as we just saw — they can trick our delivery systems into thinking servers are under‑utilized and safe to load up. Under CBR, that behavior was predictable enough that current traffic was a good proxy for how “full” a server was; under VBR, it isn’t.</p><p>Our fix was to change how we decide whether a server can take more sessions. Instead of basing that decision only on current traffic, we reserve capacity based on each stream’s nominal bitrate, not just what it happens to be using at that moment. Even if a VBR stream is currently in a very cheap, low‑bitrate phase, we still treat it as something that can quickly return to its nominal rate.</p><p>This keeps our traffic‑steering behavior consistent between CBR and VBR and avoids the key failure mode where a server accepts too many sessions during a long low‑bitrate period and then becomes overloaded when bitrate rises again.</p><h3>| Tuning VBR Nominal Bitrates to Match CBR Quality</h3><p>The WWE example above already hints at why this isn’t automatic. We looked at a single 8 Mbps stream, encoded once with CBR and once with VBR. Both encodes have the same nominal bitrate, but the figure shows how differently they behave. The CBR encode stays clustered around 8 Mbps with frequent short spikes, while the VBR encode often drops far below that level and only spikes up when the content gets complex. That’s more efficient, but it also means that “same nominal bitrate” does not imply “same average number of bits” anymore — so simply reusing our CBR settings risks giving VBR less bitrate on average and losing some quality.</p><p>In practice, of course, we don’t just encode a single stream; we produce a set of streams at different resolutions and nominal bitrates — often called a bitrate ladder — so devices can adapt to their current network conditions by switching between them. When we first applied VBR using the existing CBR ladder, offline analysis with <a href="https://github.com/Netflix/vmaf">VMAF</a> (a <a href="https://netflixtechblog.com/toward-a-practical-perceptual-video-quality-metric-653f208b9652">perceptual video quality metric</a>) confirmed the concern from the WWE example: time‑averaged quality dropped slightly on a few streams, especially at the lowest bitrates. Early A/B tests showed the same pattern: overall VMAF about one point lower than CBR, with most of the gap at the bottom of the ladder.</p><p>To fix this, we compared CBR and VBR encodes rung by rung and looked at per‑stream VMAF. Wherever VBR fell more than about one VMAF point below CBR, we increased its nominal bitrate just enough to close the gap. Higher‑bitrate streams, where VBR quality was already very close to CBR, were left largely unchanged, including the 8 Mbps stream from the figure.</p><p>The result is a VBR ladder with slightly higher nominal bitrates on a few low‑end streams, but lower overall traffic, because VBR still drops the bitrate on simple scenes. 
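</p><p>As a rough illustration, a minimal sketch of that rung-by-rung adjustment might look like the following, assuming stand-in helpers for offline VBR encoding and VMAF measurement; the Rung structure, step size, and function names are hypothetical:</p><pre>
from dataclasses import dataclass

MAX_VMAF_GAP = 1.0   # let VBR trail CBR by at most about one VMAF point
STEP_KBPS = 100      # how much to raise a rung's nominal bitrate per attempt (assumed)

@dataclass
class Rung:
    source: str          # source clip used for the offline comparison
    nominal_kbps: int    # the rung's configured nominal bitrate

def tune_rung(rung, cbr_vmaf, encode_vbr, measure_vmaf):
    """Raise a rung's nominal bitrate until VBR quality is within MAX_VMAF_GAP of CBR.

    encode_vbr and measure_vmaf are stand-ins for offline encoding and VMAF tooling.
    """
    nominal_kbps = rung.nominal_kbps
    while True:
        vbr_vmaf = measure_vmaf(encode_vbr(rung.source, nominal_kbps))
        if cbr_vmaf - vbr_vmaf > MAX_VMAF_GAP:
            nominal_kbps += STEP_KBPS   # VBR is still noticeably behind CBR; spend more bits
        else:
            return nominal_kbps         # gap closed; keep this (possibly higher) nominal
</pre><p>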
This lets us match the quality of our CBR ladder while keeping the efficiency and stability gains that motivated the switch to VBR in the first place.</p><h3>| What’s Next for Live VBR</h3><p>With VBR in production for all Live events, our focus now is on using it more intelligently.</p><p>First, we are testing how to use the actual sizes of upcoming segments in our adaptive bitrate algorithms on devices, instead of relying only on nominal bitrates. This should help devices pick streams that better match how VBR will behave in the next few seconds, not just how a stream is labeled on paper.</p><p>Second, we are experimenting with making our capacity reservation less conservative. Today we reserve based on nominal bitrates to keep servers safe; by carefully applying a “discount” informed by real VBR behavior, we hope to free up additional headroom without sacrificing stability.</p><p>This work was the result of a broad, cross‑team effort. We’d like to thank Mariana Afonso, Dave Andrews, Mark Brady, Jake Freeland, Te-Yuan Huang, Ivan Ivanov, Yeshwenth Jayaraman, Patrick Kunka, Zheng Lu, Anirudh Mendiratta, Chris Pham, David Pfitzner, Jon Rivas, Garett Singer, Brenda So, Stan Surmay, Bowen Tan, Devashish Thakur, and Allan Zhou for their many contributions to making Live VBR a reality at Netflix.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c8f833b238cc" width="1" height="1" alt=""><hr><p><a href="https://netflixtechblog.com/smarter-live-streaming-at-scale-rolling-out-vbr-for-all-netflix-live-events-c8f833b238cc">Smarter Live Streaming at Scale: Rolling Out VBR for All Netflix Live Events</a> was originally published in <a href="https://netflixtechblog.com">Netflix TechBlog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Scaling Global Storytelling: Modernizing Localization Analytics at Netflix]]></title>
            <link>https://netflixtechblog.com/scaling-global-storytelling-modernizing-localization-analytics-at-netflix-816f47290641?source=rss----2615bd06b42e---4</link>
            <guid isPermaLink="false">https://medium.com/p/816f47290641</guid>
            <category><![CDATA[localization]]></category>
            <category><![CDATA[big-data]]></category>
            <category><![CDATA[analytics]]></category>
            <category><![CDATA[data]]></category>
            <category><![CDATA[tech-debt]]></category>
            <dc:creator><![CDATA[Netflix Technology Blog]]></dc:creator>
            <pubDate>Fri, 06 Mar 2026 15:01:27 GMT</pubDate>
            <atom:updated>2026-03-06T15:01:28.509Z</atom:updated>
            <content:encoded><![CDATA[<p>By <a href="https://www.linkedin.com/in/valentingeffrier/">Valentin Geffrier</a>, <a href="https://www.linkedin.com/in/tanguycornuau/">Tanguy Cornuau</a></p><p><em>Each year, we bring the Analytics Engineering community together for an Analytics Summit — a multi-day internal conference to share analytical deliverables across Netflix, discuss analytic practice, and build relationships within the community. This post is one of several topics presented at the Summit highlighting the breadth and impact of Analytics work across different areas of the business.</em></p><p>At Netflix, our goal is to entertain the world, which means we must speak the world’s languages. Given the company’s growth to serving 300 million+ members in more than 190 countries and 50+ languages, the Localization team has had to scale rapidly in creating more dubs and subtitle assets than ever before. However, this growth created technical debt within our systems: a fragmented landscape of analytics workflows, duplicated pipelines, and siloed dashboards that we are now actively modernizing.</p><h4>The Challenge: “Who Made This Dub?”</h4><p>Historically, business logic for localization metrics was replicated across isolated domains. A question as simple as “<em>Who made this dub/subtitle?”</em> is actually complex — it requires mapping multiple data sources through intricate and constantly changing logic, which varies depending on the specific language asset type and creation workflow.</p><p>When this logic is copied into isolated pipelines for different use cases, it creates two major risks: inconsistency in reporting and a massive maintenance burden whenever upstream logic changes. We realized we needed to move away from these vertical silos.</p><h4>Our Modernization Strategy</h4><p>To address this, we defined a vision centered on consolidation, standardization, and trust, executed through three strategic pillars:</p><p>1. The Audit and Consolidation Playbook</p><p>We initiated a comprehensive audit of over 40 dashboards and tools to assess usage and code quality. Our focus has shifted from patching frontend visualizations to consolidating backend pipelines. For example, we are currently merging three legacy dashboards related to dubbing partner KPIs (around operational performance, capacity, and finances), focusing first on a unified data and backend layer that can support a variety of future frontend iterations.</p><p>2. Reducing “Not-So-Tech” Debt</p><p>Technical debt isn’t just about code; it is also about the user experience. We define “Not-So-Tech Debt” as the friction stakeholders feel when tools are hard to interpret or can benefit from better storytelling. To fix this, we revamped our Language Asset Consumption tool — instead of reporting dub and subtitle metrics independently, we combine audio and text languages into one consumption language that helps differentiate Original Language versus Localized Consumption and measure member preferences between subtitles, dubs, or a combination of both for a given language. This unlocks more intuitive insights based on actual recurring stakeholder use cases.</p><p>3. Investing in Core Building Blocks</p><p>We are shifting to a <em>write once, read many</em> architecture. By centralizing business logic into unified tables — such as a “Language Asset Producer” table — we solve the “<em>Who made this dub?”</em> problem once. 
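</p><p>Here is a toy sketch of the <em>write once, read many</em> principle; in reality the shared logic lives in data tables and pipelines rather than application code, and every name below is hypothetical:</p><pre>
# One shared attribution function, reused by every downstream metric.
def resolve_producer(asset):
    """Single, shared answer to 'who made this dub/subtitle?' (greatly simplified)."""
    if asset.get("workflow") == "in_house":
        return asset["studio"]
    return asset["vendor"]

def dub_quality_by_producer(assets):
    # Downstream consumer: reuses the shared attribution instead of re-implementing it.
    return [(resolve_producer(a), a["dub_quality_score"]) for a in assets if a["type"] == "dub"]

def translation_quality_by_producer(assets):
    # Another consumer: same attribution logic, different metric.
    return [(resolve_producer(a), a["translation_score"]) for a in assets if a["type"] == "subtitle"]
</pre><p>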
This centralized source now feeds into multiple downstream domains, including our Dub Quality and Translation Quality metrics, ensuring that any logic update propagates instantly across the ecosystem.</p><h4>The Future: Event-Level Analytics</h4><p>Looking ahead, we are moving beyond asset-level metrics to event-level analytics. We are building a generic data model to capture granular timed-text events, such as individual subtitle lines. This data helps us understand how subtitle characteristics (e.g. reading speed) affect member engagement and, in turn, refine the style guidelines we provide to our subtitle linguists to improve the member experience with localized content.</p><p>Ultimately, this modernization effort is about scaling our ability to measure and enhance the joy and entertainment we deliver to our diverse global audience, ensuring that every member, regardless of their language, has the best possible Netflix experience.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=816f47290641" width="1" height="1" alt=""><hr><p><a href="https://netflixtechblog.com/scaling-global-storytelling-modernizing-localization-analytics-at-netflix-816f47290641">Scaling Global Storytelling: Modernizing Localization Analytics at Netflix</a> was originally published in <a href="https://netflixtechblog.com">Netflix TechBlog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>