Netflix TechBlog | Medium Note

Netflix TechBlog | Medium

Netflix Tech Blog offers insights into how Netflix handles technology. They provide research on data science, engineering, design, and technology innovations. They showcase their innovations, like their proprietary content delivery network and provide insights into their service reliability efforts.

Thread Of Notes

Dynamically Splitting Wide Partitions in Cassandra for Time Series Workloads

Netflix's TimeSeries Abstraction ingests and queries petabytes of temporal event data with millisecond latency, using Apache Cassandra as its storage. Wide partitions, where a single partition accumulates a large volume of events over time, pose a significant challenge for TimeSeries workloads. This leads to high read latencies, timeouts, increased CPU utilization, and garbage collection pauses in Cassandra clusters. To address this, TimeSeries data is partitioned into discrete time chunks, creating manageable segments. The initial provisioning strategy relied on user-specified workload characteristics and Monte Carlo simulations to determine optimal infrastructure and partition configurations. However, this approach proved insufficient when workloads were unknown, inaccurately estimated, evolved over time, or contained data outliers. To automate adjustments, a background worker was introduced to monitor partition histograms and dynamically re-partition future time slices based on observed data density. This Time Slice Re-Partitioning strategy effectively reduces read latencies and timeouts when most data exhibits similar wide partition behavior. However, this strategy doesn't address scenarios where only a small percentage of IDs within a table are wide. For such cases, and when callers require all data even with elevated latencies, Dynamic Partitioning per ID was developed. This asynchronous pipeline detects wide partitions during read operations and transparently splits them into optimal sizes. The process involves detection, planning and splitting, and serving reads by re-routing queries to the split partitions. Detection occurs when a read operation exceeds a configured byte threshold, emitting an event to Kafka. The system initially focuses on immutable partitions for simplicity. The planning stage reads the entire partition to create a split plan, using checkpointing to handle failures. Splitting involves delegating the data division to specific strategies, like assigning more event buckets to a time bucket. Validating splits is crucial, with checksums ensuring data integrity before marking a split as complete. Finally, the TimeSeries servers use in-memory Bloom filters to efficiently divert read queries to the split partitions, making the diversion practically invisible to callers.
CdXz5zHNQW_JhVMWuRvRR.png

High-Throughput Graph Abstraction at Netflix: Part I

Netflix built a Graph Abstraction to handle high-throughput, low-latency graph operations, particularly for use cases like Real-Time Distributed Graphs and Social Graphs. It operates within two categories: OLAP for in-depth analysis and OLTP for streaming user experiences. The abstraction uses a Property Graph model with strongly typed nodes and edges, organized into isolated namespaces. Each namespace has a predefined graph schema managed through a Data Gateway Control Plane. This schema enables optimizations like data quality enforcement and efficient query planning. The real-time index uses key-value storage for nodes and edges, employing separate indexes for links and properties. Edge links are indexed by source-destination relationships. To ensure access regardless of direction, this organizes identifiers lexicographically. Caching is used to minimize write and read amplification. The Abstraction architecture prioritizes high performance and incorporates strategies like write-aside caching.
CdXz5zHNQW_cFaJPOpvqm.png

From Silos to Service Topology: Why Netflix Built a Real-Time Service Map

Netflix developed a "living map" of its distributed infrastructure, called Service Topology, to help engineers understand service dependencies and troubleshoot issues. The map addresses the need for quicker identification of service relationships and potential impact during outages. It answers crucial questions like which services depend on each other and identifies the source of problems. They gathered data from three sources: eBPF network flows, IPC metrics, and end-to-end tracing. Each data source provides a unique perspective – network connectivity, application-level details, and actual request flows. This multi-layered architecture combines these views into a unified, real-time map. The system uses separate graph databases for each data source to maintain independence and enable parallel querying. Engineers can view each graph individually or combine them for a comprehensive understanding of service interactions. The system ingests flow logs from Kafka in multiple regions, processes them, and stores the data in graph databases. This living map helps Netflix engineers quickly diagnose and resolve problems, ensuring a smooth streaming experience.
CdXz5zHNQW_fZOTLLTM4l.png

Scaling ArchUnit with Nebula ArchRules

Netflix uses a polyrepo strategy with thousands of Java repositories, requiring efficient build logic sharing. They built the Nebula suite of Gradle plugins, including ArchRules, to manage dependencies and enforce code standards. This initiative arose from a need to improve Java library lifecycle management following a backwards-incompatible change incident. Netflix utilizes API lifecycle annotations (@Deprecated, @Public, @Experimental) to identify potential issues with deprecated code. ArchUnit, a popular library for enforcing architectural rules, was chosen to detect the misuse of these annotations and other technical debt issues. Nebula ArchRules plugins enable sharing and applying ArchUnit rules across many repositories, providing enhanced capabilities. ArchRules uses bytecode analysis (ASM) for cross-language support and an easy to use builder pattern simplifying rule creation. Rules can be bundled with libraries or defined in standalone rule libraries, automatically detected and run by the plugin. The ArchRules Runner plugin evaluates rules against source sets and generates JSON and console reports, enabling better reporting capabilities. By using ArchRules, Netflix provides a platform for library authors to track API usage and detect deprecated API usage.
CdXz5zHNQW_UkqZEpugr5.png

Democratizing Machine Learning at Netflix: Building the Model Lifecycle Graph

Netflix utilizes machine learning across various business domains like personalization, studio production, payments, and advertising. As ML adoption grew, a challenge emerged: a fragmented landscape where models and data were siloed, hindering collaboration and discovery. ML practitioners struggled to understand model lineage, feature sources, and impact across different systems. This fragmentation prevented easy answers to questions about existing features, data sources, pipeline dependencies, and the effects of changes. The core difficulty lay in connecting disparate ML infrastructure components that generated metadata. Dozens of systems, from pipeline orchestrators to experimentation platforms and feature stores, produced data in various formats. Solving this required collecting heterogeneous metadata, transforming it into a unified model, and building a connected graph for exploration. The solution is the Metadata Service (MDS), which constructs a Model Lifecycle Graph to interconnect ML entities at Netflix. MDS ingests ML metadata in real-time, enabling cross-domain queries like identifying experiments using a specific model or models sharing certain features. The vision is to make all ML assets discoverable, understandable, and reusable across the company. MDS operates on core abstractions: Components, each with a unique AIP URI; Entities, which are ML-specific assets with properties; Entity Types, defining data shapes; Domains, which group related entity types; and Providers, concrete implementations of domains from source systems. This URI-based addressing allows any service to reference any ML asset universally. The process of building the graph involves several stages. First, MDS integrates with source systems via Kafka and AWS SNS/SQS to consume thin events indicating changes. Dedicated event handlers process events from systems like the Pipeline Orchestration, Model Registry, Feature Store, Experimentation Platform, and Identity Platform. Second, MDS implements a hydration contract, validating events and calling source system APIs to fetch complete state, which is then transformed into a normalized entity. This "notification of change" pattern ensures robustness against event order issues but places read load on source systems. Third, raw events are transformed into a unified entity model with standardized fields, creating a consistent interface for downstream consumers. Normalized entities standardize field names, formats, and convert platform-specific IDs into global AIP URIs. Finally, normalized entities are persisted to Datomic for caching and relationship storage, and simultaneously indexed in Elasticsearch. Datomic, with its immutable fact model, supports complex graph traversals and entity relationships, enabling queries across multiple domains without inefficient N+1 query patterns.
CdXz5zHNQW_EXeyjNVmx8.png

State of Routing in Model Serving

The blog post discusses the technical insights into how Netflix's ML model serving infrastructure powers personalized experiences at scale across various domains. The central ML model serving platform exposes a domain-independent API abstraction and traffic routing capabilities to several domain-specific microservices for model inference. This singular API has increased the speed of innovation for iterating on newer versions of existing ML experiences and enabling new product experiences with ML. The success of the ML model serving infrastructure depends on enabling researchers to rapidly experiment with new hypotheses and safely release their models into production. The platform serves hundreds of model types and versions, netting 1 million requests per second, and operates at the level of workflows, not just individual scoring functions. The model definition contains a list of facts that it needs to compute features, and it relies on the model serving platform to supply these facts at serving time by calling several other microservices. The calling services only need to provide standard request context, and the relevant domain context, and the model can itself compute features and perform inference as part of the execution flow. The platform acts as an enabler of rapid ML innovation and limits the exposure of ML model iterations to the client apps. The key principles of the platform include model innovation independent of client apps, decoupling clients from model sharding, and flexible traffic routing rules. The platform uses a custom service called Switchboard, which serves as a flexible proxy layer for all traffic, handling over 1 million requests per second while maintaining high availability and reliability. Switchboard provides a single point of contact for all clients' model needs and can route a request based on a rich set of contextual features. The platform also introduces the concept of an "Objective", which is an enumeration defined by the serving platform that every request into the system must provide. The Objective decouples clients from concrete models and guides the platform's routing and model selection decisions. Switchboard Rules is a JavaScript configuration that allows researchers to attach model variants, experiments, and traffic splits to Objectives without changing client code. The rules dictate the default model to use for a given Objective, A/B experiments to configure for a set of Objectives, and customizations to gradually shift traffic to a new model. The rules are consumed by both Switchboard and the Model Serving clusters, and the serving platform components can take various actions based on these rules. Overall, the platform provides a scalable and flexible solution for serving ML models at Netflix, enabling rapid innovation and experimentation while minimizing the impact on client apps.
CdXz5zHNQW_XdCPf6klQA.png

Scaling Camera File Processing at Netflix

Netflix built its Media Production Suite (MPS) to streamline media workflows for global productions, aiming for efficiency and consistency. The core of MPS leverages FilmLight's API (FLAPI) for image processing, a collaboration that avoids building everything from scratch. MPS addresses challenges like file wrangling and inconsistent media handling, automating tasks and minimizing errors. The FLAPI is used to inspect camera file metadata, normalize it, and make it searchable, ensuring consistency. FLAPI also generates VFX plates and deliverables with accurate color management and debayering. MPS is cloud-integrated using Cosmos, allowing for CPU-only instances in a Dockerized environment, maximizing efficiency. Production workloads are handled elastically to scale on demand, ensuring quick turnaround times. MPS is designed for both experienced teams and those needing guidance, with FLAPI handling the complexities. This partnership allows for open communication, joint validation, and benefits the standardization ecosystem. The overall impact is fewer delays, faster turnaround, and more efficient production processes.
CdXz5zHNQW_aUoM5pxXdJ.jpeg

The Human Infrastructure: How Netflix Built the Operations Layer Behind Live at Scale

Netflix's live streaming operations have rapidly expanded since their first live show, now streaming multiple events daily. Initial live shows relied on engineers and improvised setups, lacking dedicated operational teams. The evolution required building dedicated Broadcast Operations Centers (BOCs) to manage the live broadcast process. To ensure signal reliability, Netflix enforces strict specifications on video and audio feed contributions from venues. Netflix's operational model evolved through several phases, from engineering-led operations to specialized engineering teams. The Transmission Operations Center (TOC) model was developed to improve efficiency and manage a high volume of events concurrently. Within the TOC model distinct roles like Transmission, Streaming, and Broadcast Control Operators are crucial. For high-profile events, a specialized "Big Bet" model provides dedicated resources. The Live Command Center (LCC) monitors the entire live streaming pipeline, offering real-time data to operators. The LCC uses a custom observability stack to manage massive data volumes and react to issues promptly. LCC Operations Leads and Technical Launch Managers are responsible for incident response.
CdXz5zHNQW_kK6Y54I9YV.jpeg

Evaluating Netflix Show Synopses with LLM-as-a-Judge

Netflix aims to improve the synopsis selection process for its vast content library. Thousands of titles complicate the user's choice, making synopses crucial for decision-making. High-quality synopses enhance user engagement, while poor ones lead to frustration and abandonment. The challenge lies in scaling quality validation across hundreds of thousands of synopses. Netflix developed an LLM-based approach to evaluate synopsis quality across four key dimensions. This system achieves over 85% agreement with creative writers, ensuring expert-driven quality standards. Synopsis quality is assessed through both creative quality defined by writers and member implicit feedback via streaming metrics. The LLM-as-a-Judge system employs techniques like tiered rationales, consensus scoring, and factuality agents to optimize accuracy. These LLM-derived quality scores are correlated with key streaming metrics like take fraction and abandonment rate. This allows Netflix to proactively identify and fix synopsis issues, improving the member experience and content discovery.
CdXz5zHNQW_s8spGD1mD0.png

Stop Answering the Same Question Twice: Interval-Aware Caching for Druid at Netflix Scale

Netflix implemented an experimental caching layer to improve Druid query performance at its massive scale. This caching system addresses the problem of redundant queries from dashboards, especially during high-profile events. The core idea is to cache portions of results and only query Druid for the freshest data, introducing a small staleness trade-off. Exponential time-to-live (TTL) values are used, with older data cached longer to handle late-arriving events. The caching system uses a map-of-maps structure for efficient range scans of cached data. The cache intercepts requests at the Druid Router, serving results or querying Druid as needed. The solution combines cached and fresh data, returning a complete result and caching the fresh data asynchronously. Negative caching is used for empty buckets, with precautions against caching trailing empty buckets. The system uses Netflix's Key-Value Data Abstraction Layer (KVDAL), backed by Cassandra, which offers independent TTLs on each data point. The caching layer has significantly reduced Druid query load and improved query times, particularly during high-volume events. The caching system is still experimental, with future goals including integration into Druid itself.
CdXz5zHNQW_ZIJYgW0wPc.png

Powering Multimodal Intelligence for Video Search

The article discusses the challenges and solutions for building an advanced video search engine. The core problem is the overwhelming amount of video footage and the difficulty in extracting relevant moments. The solution involves a multimodal approach, integrating various AI models to analyze different aspects of video content. This approach requires unifying diverse data from specialized models into a cohesive, real-time intelligence system. The system's architecture focuses on processing at scale, handling billions of data points efficiently. A three-stage process involving transactional persistence, offline data fusion, and indexing for real-time search ensures data integrity and responsiveness. The process uses a decoupled pipeline to avoid bottlenecking during the ingestion process. The search service offers features like query preprocessing, fine-tuning semantic search, and advanced textual analysis for precise results. It also includes phrase matching, N-gram analysis, and fuzzy matching to enhance search accuracy. The system provides aggregations and flexible grouping to allow for more nuanced searching. Overall, the aim is to empower filmmakers by providing a powerful tool for discovering relevant video content.
CdXz5zHNQW_YdZx5oYSdr.png

Smarter Live Streaming at Scale: Rolling Out VBR for All Netflix Live Events

Netflix transitioned its Live events from Constant Bitrate (CBR) to Variable Bitrate (VBR) encoding for improved efficiency. VBR dynamically adjusts bitrate based on scene complexity, leading to reduced average traffic and improved quality of experience. A major challenge with VBR is the unpredictability of traffic, which can lead to server overload and instability due to large bitrate swings. To address this, Netflix implemented a new capacity reservation system based on nominal bitrates, mitigating the risks associated with VBR's variable nature. They also adjusted nominal bitrates across their streaming ladder to match the quality of CBR streams, using VMAF as a guiding metric. The shift to VBR allowed for efficiency gains like lower rebuffering and initial start delays. Netflix is now working on improving its adaptive bitrate algorithms and capacity reservation to further optimize VBR. The implementation of VBR was a company-wide project involving several teams and contributors.
CdXz5zHNQW_f7OyE9DYza.png

Scaling Global Storytelling: Modernizing Localization Analytics at Netflix

Netflix's Analytics Engineering community gathered for an Analytics Summit to share insights across the business. The core focus is on enhancing the localization process for a global audience of over 300 million members. This involves scaling dubbing and subtitling to meet the needs of diverse language preferences. Previously, localization metrics were fragmented, leading to inconsistencies and maintenance issues. Netflix is modernizing its analytics through consolidation, standardization, and trust. The first pillar involves auditing and consolidating dashboards and backend pipelines. The second pillar tackles "Not-So-Tech Debt" by improving user experience and reporting on unified metrics. The third pillar is investing in a write-once, read-many architecture by centralizing business logic. This approach creates unified tables, solving problems once and propagating updates instantly. The future involves event-level analytics, analyzing granular timed-text data like subtitle lines. The goal is to understand how subtitle characteristics impact member engagement and refine guidelines. Ultimately, this modernization effort aims to improve the Netflix experience for all members worldwide.

Optimizing Recommendation Systems with JDK’s Vector API

Netflix's Ranker service, responsible for personalized recommendations, had a CPU hotspot in its video serendipity scoring feature. Initially, the scoring utilized a costly, nested loop structure for calculating similarities. The first optimization involved batching calculations to leverage matrix multiplication, improving the algorithm. However, an initial implementation using matrix multiplication introduced overhead, like GC pressure and poor cache behavior. Further optimization involved switching to flat buffers, improving data layout and cache locality. Then, a thread-local buffer reuse strategy eliminated per-request allocations. BLAS libraries were explored but didn't provide significant gains due to JNI overhead. Finally, the JDK Vector API was integrated, enabling SIMD, with a fallback to optimized scalar code for compatibility. This streamlined approach resulted in a 7% drop in CPU utilization and a 12% drop in average latency. Ultimately, the CPU consumption of the scoring feature was reduced from 7.5% down to roughly 1%, and the CPU per request significantly decreased by 10%.
CdXz5zHNQW_TmUGwTILLl.png

Mount Mayhem at Netflix: Scaling Containers on Modern CPUs

Netflix faced container scaling bottlenecks during their container runtime modernization. The issue arose because the new runtime used user namespaces for enhanced container security, employing kernel idmap features for efficient resource allocation, triggering frequent mount operations. These mount operations heavily contended for global kernel locks, particularly affecting nodes with many container layers. This contention was amplified on r5.metal instances, which have dual NUMA architecture. Benchmarks revealed instance type differences, with older Intel architectures showing higher container launch times. The bottleneck was rooted in Linux VFS path lookup code and was linked to the kernel's mount table's global locking. NUMA architecture amplified the problem due to memory access latency. Disabling hyperthreading improved performance by reducing resource competition. Centralized cache architectures in some CPUs exacerbated lock contention due to a single queueing structure. Distributed cache architectures, like those in AMD chips, showed better performance as contention was spread across multiple domains. Microbenchmarks validated the impact of NUMA, hyperthreading, and CPU architecture.
CdXz5zHNQW_w2NFMfjNQ5.png

MediaFM: The Multimodal AI Foundation for Media Understanding at Netflix

Netflix developed MediaFM, an in-house multimodal content embedding model to understand its vast catalog. This model aims to create a machine-readable understanding of content by integrating audio, video, and text. MediaFM uses a Transformer-based encoder to generate shot-level embeddings, capturing temporal relationships. The model's inputs are shot-level embeddings from video, audio, and timed text, concatenated and processed. A Masked Shot Modeling objective trains the model to predict original embeddings. Evaluation involves task-specific linear layers, proving MediaFM's superior performance compared to baselines. The model significantly improves in tasks demanding nuanced understanding, such as ad relevancy. Contextualization plays a critical role with an added modality. MediaFM benefits applications across Netflix, including ad relevancy and clip popularity prediction. The team is now exploring how to use pre-trained multimodal LLMs for future model development. The blog post concludes with gratitude for contributors and a glimpse into future work.
CdXz5zHNQW_C2skKyzkh4.png

Scaling LLM Post-Training at Netflix

Netflix developed a post-training framework to adapt large language models (LLMs) for their specific needs in recommendation, personalization, and search. This framework addresses the complexities of post-training at scale, including data pipelines, distributed computing, and workflow orchestration. The framework aims to help model developers focus on model innovation rather than infrastructure. Key challenges in post-training include data preparation, model sharding, and setting up training loops for production. The Netflix framework provides modular components for data, model, compute, and workflow management to simplify training. The framework supports SFT, DPO, RL, and knowledge distillation training recipes with reusable utilities. The framework evolved to support RL by integrating open-source tools and adopting a hybrid execution model. It prioritizes compatibility with the Hugging Face ecosystem to facilitate easy integration of models and tools. The post-training framework allows teams to experiment with advanced techniques and iterate more quickly.
CdXz5zHNQW_vgbgklctA7.png

Automating RDS Postgres to Aurora Postgres Migration

In 2024, Netflix standardized on Amazon Aurora PostgreSQL for its relational database needs after evaluating various technologies. This decision was driven by PostgreSQL's existing prevalence, industry momentum, and Aurora's scalability and feature advantages. A key initiative was migrating existing users to Aurora PostgreSQL, starting with RDS migrations. Migrations involve more than data transfer; they require orchestrated data and functionality transitions. Netflix faced substantial operational challenges with nearly 400 PostgreSQL clusters, necessitating a self-service migration workflow. This workflow addressed issues like zero data loss, minimal downtime, and the lack of direct access to credentials, promoting a seamless process. The chosen strategy utilized Aurora Read Replicas, trading complexity for reduced client downtime. Netflix's Data Access Layer managed secure connections between applications and databases, facilitating transparent migrations. A fully automated, self-service workflow was developed, orchestrating the entire migration process without manual intervention. The process included enabling backups, porting parameters, and creating an Aurora read replica for continuous synchronization. The quiescence phase focuses on a smooth transition to the Aurora cluster.
CdXz5zHNQW_HceAyprWfz.png

The AI Evolution of Graph Search at Netflix

Netflix's Graph Search platform is evolving to incorporate natural language search using AI, addressing limitations of its structured query language. Natural language search simplifies data retrieval, improving user experience across various applications. The core challenge involves translating user queries into valid Graph Search Filter statements. This process uses Large Language Models (LLMs) to generate queries, incorporating context engineering for accuracy. Context engineering leverages GraphQL schema information and controlled vocabularies to inform the LLM. Retrieval-Augmented Generation (RAG) is employed to manage context, focusing on relevant fields and controlled vocabulary values. Field RAG and Controlled Vocabularies RAG enhance precision by matching user intent to index elements. The process involves vector search and deduplication for efficient context management. The system provides instructions to the LLM to generate syntactically, semantically, and pragmatically correct filters. The effectiveness of the solution requires careful parameter tuning for RAG components. The described work paves the potential for building a RAG system on top of Graph Search. Future articles will detail the implementation, performance evaluation, and platform evolution.
CdXz5zHNQW_OC77NMATw9.png

How Temporal Powers Reliable Cloud Operations at Netflix

Netflix adopted Temporal, a durable execution platform, to address deployment failures in its Spinnaker continuous delivery platform. Spinnaker, used for most Netflix deployments, faced issues in its Orca and Clouddriver services, particularly with Cloud Operations. Cloud Operations, responsible for cloud infrastructure changes, experienced transient failures and operational complexity within Clouddriver. This led to a 4% failure rate in deployments, impacting engineering productivity. Temporal offers Workflows and Activities, enabling resilient execution despite failures. Netflix migrated Cloud Operations to Temporal, simplifying the process and reducing the failure rate. The migration involved transitioning Cloud Operations into Temporal workflows, leveraging the platform's retry mechanisms and state management. The integration required abstracting and dynamically configuring the system. The result of this migration was a significant reduction in deployment failures, down to 0.0001%. This improved reliability and engineering efficiency.
CdXz5zHNQW_jmfliRvvgd.png

Netflix Live Origin

The Netflix Live Origin is a custom-built origin server that plays a crucial role in the company's live streaming pipeline, acting as a broker between the cloud live streaming pipelines and the distribution system, Open Connect. The Live Origin is a multi-tenant microservice operating on EC2 instances within the AWS cloud, using standard HTTP protocol features to communicate with the Live Origin. The architecture of the Live Origin is influenced by key technical decisions, including resilience achieved through redundant regional live streaming pipelines and the implementation of epoch locking at the cloud encoder. The Live Origin features multi-pipeline and multi-region awareness, allowing it to select the first valid segment from each pipeline in a deterministic order. The system also detects segment defects via lightweight media inspection at the packager and provides this information as metadata when the segment is published to the Live Origin. To optimize interactions with the Origin Server, the proxy-caching functionality of nginx has been extended to address Live-specific needs, including millisecond grain caching and the ability to hold open requests for segments that are not yet available. The Live Origin also provides streaming metadata enhancement through the use of HTTP headers, which can be used to convey notifications of events within the stream to client devices. The system includes an invalidation system that can be used to flush all content associated with an event, as well as an enhanced cache invalidation system that takes into account the encoding pipeline and region used to generate each segment. The Origin storage architecture was initially based on AWS S3 but was later optimized to meet the unique latency and workload requirements of live streaming, using a KeyValue Storage Abstraction that leverages Apache Cassandra to provide chunked storage of large values.
CdXz5zHNQW_BNFU9LegnF.png

AV1 — Now Powering 30% of Netflix Streaming

Netflix has been working to deliver the best possible entertainment experience to its members, and one key technology enabling this is AV1, a modern open video codec. AV1 powers approximately 30% of all Netflix viewing, marking a major milestone in efforts to bring more efficient and higher-quality streaming to members. Netflix co-founded the Alliance for Open Media to develop and promote next-generation open source media technologies, with AV1 being the first major project. The AV1 codec was officially released in 2018, with goals to deliver significant improvements in compression efficiency and introduce rich features that enable new use cases. Netflix first brought AV1 streaming to Android devices in 2020, which proved valuable for mobile users who are mindful of their data usage and network conditions. The success of AV1 on Android motivated Netflix to expand support to smart TVs and other large-screen devices, where most members watch their favorite shows. Today, AV1 accounts for approximately 30% of all Netflix streaming, making it the second most-used codec, and it's on track to become number one soon. AV1's superior compression efficiency has allowed Netflix to provide high-quality streaming experiences using less data, making it more accessible and reliable. The company is also exploring AV1's unique features to unlock advanced and immersive experiences for members, including high-dynamic-range and cinematic film grain. Netflix sees significant opportunities for AV1 beyond traditional video-on-demand streaming, including live streaming and cloud gaming, and is working to productize AV1 for these use cases.
CdXz5zHNQW_jHTWfYR1sl.png

Behind the Streams: Live at Netflix. Part 1

Three years ago, Netflix asked how they could entertain the world through live streaming, a format almost as old as television itself. This question led to the development of hundreds of live events, including comedy shows, sports, and WWE events. In a series called "Behind the Streams," Netflix will share the technical journey of building live streaming capabilities. Live streaming introduced new considerations for architecture and technology choices, requiring significant building to make it work well on Netflix. The key pillars of Netflix's live architecture include dedicated broadcast facilities, cloud-based redundant transcoding and packaging pipelines, scaling live content delivery, optimizing live playback, and running discovery and playback control services in the cloud. Netflix also centralized real-time metrics in the cloud with specialized tools and facilities. Building live functionality brought fresh challenges and opportunities to learn, and Netflix is still learning every day how to deliver live events more effectively. Key learnings so far include the importance of extensive testing, regular practice, viewership predictions, graceful degradation, and retry storms. Netflix has made significant progress in building a robust live streaming system, but there is still more to learn and improve.
CdXz5zHNQW_JBiedm3Ejv.png

Netflix Tudum Architecture: from CQRS with Kafka to CQRS with RAW Hollow

Tudum.com is Netflix's official fan destination, offering exclusive content, behind-the-scenes insights, and interactive experiences to over 20 million members each month. The platform's architecture is designed to be maintainable, extensible, and flexible, using a server-driven UI approach similar to Command Query Responsibility Segregation (CQRS). Tudum's editorial team creates content, which is stored in a write database, and then converted into a read-optimized format for consumption by users. The initial architecture used Kafka to separate the write and read databases, allowing for independent scaling and eventual consistency. However, this approach introduced a delay between content edits and their reflection on the website. The team identified the source of the delay as the Page Data Service, which used a near cache to accelerate page building. To address this issue, Netflix developed RAW Hollow, an in-memory, co-located, compressed object database that provides strong read-after-write consistency and low latency. Tudum was a perfect fit to test RAW Hollow, which significantly reduced I/O and enabled synchronous data access in O(1) time. The updated architecture eliminated the need for Kafka infrastructure and reduced data propagation times, allowing writers and editors to preview changes in seconds. The migration also led to faster request times, with homepage construction time decreasing from 1.4 seconds to 0.4 seconds.
CdXz5zHNQW_0cHOBDLN2F.jpeg

Driving Content Delivery Efficiency Through Classifying Cache Misses

Netflix's Open Connect program is a content delivery network (CDN) that aims to provide the best quality of experience (QoE) to its members by localizing content delivery through partnerships with internet service providers (ISPs) worldwide. The program uses custom-built servers called Open Connect Appliances (OCAs) that are designed for efficiency and cost-effectiveness. To evaluate the efficiency of Open Connect, Netflix uses a framework to identify sources of inefficiencies, specifically cache misses, which occur when bytes are not served from the best available OCA for a given client. Cache misses are classified into three categories: content misses, health misses, and other misses. Content misses occur when files are not found on OCAs in the local site, while health misses occur when the local site's OCA hardware resources are saturated. Netflix logs two critical data components to compute cache misses: steering playback manifest logs and OCA server logs. These logs are joined to compute detailed cache miss metrics at different aggregation levels. The system architecture for computing cache misses involves log emission, log consolidation, log enrichment, and streaming window-based joins. The data model used to evaluate cache misses allows for replaying logic offline and in simulations with variable parameters to test new conditions and features without impacting production traffic.
CdXz5zHNQW_u3pdU106U1.png

AV1 @ Scale: Film Grain Synthesis, The Awakening

Netflix has introduced AV1 Film Grain Synthesis (FGS) streams, which preserve the artistic integrity of film grain while optimizing data efficiency. Film grain is a key element in storytelling, adding depth and realism to films, but it's difficult to compress using traditional algorithms. The AV1 FGS tool models film grain through two components: film grain pattern and film grain intensity. The film grain pattern is replicated using an auto-regressive model, while the film grain intensity is controlled by a scaling function. The encoding process removes film grain from the video, compresses it, and transmits the grain's pattern and intensity alongside the compressed video data. During playback, the film grain is recreated and reintegrated into the video using a block-based method. Enabling AV1 FGS has led to significant bitrate reduction, allowing for high-quality video streaming with less data. The technology has also improved visual quality, with synthesized noise effectively masking compression artifacts. Netflix has rolled out FGS across its platform, and users can now enjoy FGS-enabled streams on supported devices. The rollout has resulted in a smoother and more reliable Quality of Experience for Netflix members, with lower bitrate, decreased playback errors, and improved playback stability.
CdXz5zHNQW_VP7NReszvY.png

Model Once, Represent Everywhere: UDA (Unified Data Architecture) at Netflix

As Netflix's offerings grow, so does the complexity of the systems that support it, leading to duplicated and inconsistent models, inconsistent terminology, data quality issues, and inconsistent relationships between data. To address these challenges, a new foundation is needed to define a model once and reuse those definitions everywhere, connecting concepts to real systems and projecting those definitions outward, generating schemas and enforcing consistency across systems. This led to the development of UDA (Unified Data Architecture), which enables modeling domains once and representing them consistently across systems, powering automation, discoverability, and semantic interoperability. UDA allows users and systems to register and connect domain models, catalog and map domain models to data containers, transpile domain models into schema definition languages, move data faithfully between data containers, discover and explore domain concepts, and programmatically introspect the knowledge graph. UDA is a knowledge graph that connects business concepts to schemas and data containers, grounded in an in-house metamodel called Upper, which defines the language for domain modeling in UDA. Upper is a language for formally describing domains and their concepts, and it is used to model data container representations and mappings. UDA adopts a named-graph-first information model, ensuring resolution, modularity, and governance across the entire graph.
CdXz5zHNQW_aXb9bCvUTK.png

FM-Intent: Predicting User Session Intent with Hierarchical Multi-Task Learning

Recommender systems are essential components of e-commerce, streaming media, and social networks, driving significant product and business impact. At Netflix, these systems connect members with relevant content at the right time. The recommendation foundation model has made substantial progress in understanding user preferences, but there is an opportunity to further enhance its capabilities. By extending the foundation model to incorporate the prediction of underlying user intents, the model can enrich its understanding of user sessions beyond next-item prediction. Recent research has highlighted the importance of understanding user intent in online platforms, leading to more accurate and personalized recommendations. FM-Intent, a novel recommendation model, captures a user's latent session intent using short-term and long-term implicit signals as proxies, then leverages this intent prediction to improve next-item recommendations. The model establishes a hierarchical relationship between intent predictions and next-item recommendations, creating a more coherent and effective recommendation pipeline. FM-Intent makes three key contributions: a novel recommendation model, a hierarchical multi-task learning approach, and comprehensive experimental validation showing significant improvements over state-of-the-art models. FM-Intent has been successfully integrated into Netflix's recommendation ecosystem and can be leveraged for several downstream applications, including personalized UI optimization, analytics, and enhanced recommendation signals.
CdXz5zHNQW_QduYXjmsjM.png

Behind the Scenes: Building a Robust Ads Event Processing Pipeline

At Netflix, a robust event processing platform was built to monitor, measure, and optimize ad campaigns. The ad serving system relies on a steady stream of ad events to adjust decisions, frequency capping, pacing, and personalization. The initial ad event handling system consisted of three main components: the Microsoft Ad Server, Netflix Ads Manager, and Ad Event Handler. The system was designed to ensure the feedback loop functioned effectively, providing insights on impressions, frequency capping, and monetization processes. As the business expanded, a new persistence layer using Key-Value abstraction was introduced to address challenges such as growth in data volume and third-party tracking URLs. The event processing pipeline was further evolved to support in-house advertising technology, incorporating features such as frequency capping, pricing information, and robust reporting system. A centralized ad event collection system was planned, providing a single unified data contract to consumers and separating concerns between upstream systems and consumers. The new pipeline supported various functions such as measurement, finance/billing, reporting, frequency capping, and maintaining an essential feedback loop back to the ad server. The development of the ads event processing systems has been a carefully orchestrated journey, showcasing teamwork, planning, and coordination across various teams. The new system has significantly accelerated the ability to launch new capabilities for the business, supporting programmatic buying capabilities, sharing opt-out signals, and ensuring accurate reporting and measurement.
CdXz5zHNQW_x4VyedRjxa.png

Measuring Dialogue Intelligibility for Netflix Content

Netflix prioritizes enhancing member experience by collaborating with technology partners to improve tools and workflows. This collaboration focuses on improving dialogue intelligibility throughout the production process, addressing issues from set to screen. A key initiative is the Dialogue Integrity Pipeline, which identifies and mitigates factors impacting clarity, such as noisy environments and audio mixing. Netflix uses industry-standard loudness meters and developed an eSTOI-based measurement system to analyze dialogue intelligibility. To improve pre-delivery optimization, Netflix partnered with Fraunhofer IDMT and Nugen Audio to create the DialogCheck plugin. This plugin, integrated into Digital Audio Workstations (DAWs), provides real-time feedback to sound engineers. The collaboration leverages Fraunhofer's machine learning-based speech intelligibility solution and Nugen Audio's expertise in audio plugins. The DialogCheck plugin allows for early detection and resolution of dialogue clarity issues, ensuring artistic intent isn't compromised. The ultimate goal is delivering immersive and accessible storytelling to all viewers, regardless of listening environment. This collaborative approach underscores Netflix's commitment to audio excellence and innovation in storytelling.
CdXz5zHNQW_u9iv4wSU0L.png

How Netflix Accurately Attributes eBPF Flow Logs

Netflix uses eBPF to capture TCP flow logs at scale for enhanced network insights, but accurately attributing flow IP addresses to workload identities was a significant challenge. The initial attribution approach relied on Sonar, an internal IP address tracking service, but it led to misattribution due to delays and failures in distributed systems. Misattribution rendered the flow data unreliable for decision-making, and a workaround of holding received flows for 15 minutes before attribution did not eliminate the issue. To solve this problem, Netflix developed a new attribution method that attributes local IP addresses by determining the local workload identity from its environment. For container workloads, Netflix leveraged IPMan, a container IP address assignment service, to attribute local IP addresses. Once local IP addresses are attributed, remote IP addresses can be attributed by learning the time ranges during which each workload owns a given IP address. FlowCollector maintains an in-memory hashmap to represent this knowledge and shares learned time ranges with other nodes using Kafka. The new method achieves accurate attribution and handles transient issues gracefully, and it is also cost-effective due to its simplicity and in-memory lookups. The method is extended to attribute cross-regional IP addresses by forwarding flows to nodes in the corresponding region. Finally, the method is further extended to attribute non-workload IP addresses, such as those belonging to Netflix's content delivery network.
CdXz5zHNQW_ODQpwXb03K.png

Globalizing Productions with Netflix’s Media Production Suite

The film industry's shift to cloud-based workflows faces challenges in global implementation. Netflix aims to solve these issues with its Media Production Suite (MPS), designed for filmmakers. The MPS streamlines media management, eliminating tedious tasks and boosting creative focus. Traditional workflows using physical tapes are slow and cumbersome, hindering collaboration. Digital workflows also face challenges in distribution and standardization. The cloud offers solutions but requires overcoming operational and technological hurdles. Netflix addresses these issues by bringing people and applications to the media instead of the reverse. MPS tackles global disparities in technology and standardization, considering diverse market needs. The suite automates processes like color management and framing using industry standards. The infrastructure combines cloud and physical capabilities optimized for user performance. MPS includes tools for ingest, media library, dailies, remote workstations, and more. Over 350 titles have utilized MPS tools, with feedback from diverse global regions. The Brazilian series "Senna" adopted MPS, showcasing its ability to overcome geographical barriers.

Foundation Model for Personalized Recommendation

Netflix is developing a foundation model for recommendations to centralize member preference learning and improve efficiency across various recommendation models. The current system has specialized models that are costly to maintain and struggle to share innovations. The foundation model aims to learn from extensive user interaction histories and content data, distributing these learnings to other models. Inspired by large language models, the approach shifts to a data-centric strategy using semi-supervised learning. Netflix tokenizes user interactions, balancing data granularity and sequence compression to handle long interaction histories. Sparse attention mechanisms and sliding window sampling are used to manage computational efficiency during training. Each token contains rich, heterogeneous information about the action and content, utilizing request-time and post-action features. The model employs an autoregressive next-token prediction objective, similar to GPT, but with modifications to account for varying interaction importance. The model predicts multiple tokens and uses auxiliary prediction objectives to capture long-term dependencies and improve accuracy. Addressing entity cold-starting, the model is built with incremental training and the ability to infer with unseen entities by using metadata information of entities and inputs.
CdXz5zHNQW_VZmJPzyqeS.png

HDR10+ Now Streaming on Netflix

Netflix has started streaming HDR10+ content on AV1-enabled devices, providing a better viewing experience for certified HDR10+ devices. This is an enhancement to the existing HDR10 content, which only used static metadata. The dynamic metadata in HDR10+ improves the quality and accuracy of the picture. Netflix has been a pioneer in adopting HDR technology, and in the last five years, HDR streaming has increased by over 300%. The company now has over 11,000 hours of HDR titles available. Netflix enabled HDR10+ using the AV1 video codec, which is one of the most efficient codecs available. AV1 is already the second most streamed codec at Netflix, and with the addition of HDR10+ streams, it is expected to become the most streamed codec soon. The company is adding HDR10+ streams to both new releases and existing popular HDR titles, with the goal of providing an HDR10+ experience for all HDR titles by the end of the year. HDR10+ is one of the three prevalent HDR formats, along with Dolby Vision and HDR10. HDR10+ and Dolby Vision use dynamic metadata, which provides content image statistics on a per-frame basis, enabling optimized tone mapping adjustments for each scene. This achieves greater perceptual fidelity to the original, preserving creative intent. To receive HDR10+, members must have a Netflix Premium plan subscription, the title must be available in HDR10+ format, and the member device must support AV1 and HDR10+. The launch of HDR10+ was a collaborative effort involving multiple teams at Netflix, and the company is grateful to everyone who contributed to making this idea a reality. The commitment to innovation and quality underscores Netflix's dedication to delivering an immersive and authentic viewing experience for all its members.
CdXz5zHNQW_x7iQNmTXt1.png

Title Launch Observability at Netflix Scale

To achieve comprehensive title observability at Netflix, the company introduced observability endpoints across all services within its Personalization and Discovery Stack. Each microservice involved in the stack had to introduce a new "Title Health" endpoint that adheres to principles of accurately reflecting production behavior, standardization, and answering the Insight Triad. The Insight Triad requires the endpoint to answer whether a title is eligible for promotion, why it's not eligible, and what can be done to fix any problems. A stable proto request/response format was developed to standardize communication between the observability service and the personalization stack's observability endpoints. The high-level architecture of the solution involves establishing observability endpoints, proactive monitoring, tracking real-time title impressions, storing data in an optimized datastore, and offering easy-to-integrate APIs for the dashboard. The Title Health microservice runs a scheduled collector job every 30 minutes to retrieve relevant title health information from assigned row services. Real-time title impressions data is processed from a Kafka queue, which is polled every two minutes to retrieve impressions data. The data is then aggregated and presented as an additional health status indicator for stakeholders. The data is stored in a dedicated Hollow feed for each collector, which allows for high-performance read-only access and enables the monitoring of overall health and tracking of title history. The observability dashboard utilizes the title health service to present the status of titles to stakeholders, and the system also has a "Time Travel" capability that simulates traffic ahead of time to catch and fix issues before they impact members. The system's architecture and strategies have enabled Netflix to enhance title launch observability, ensuring a thrilling viewing experience and fostering trust with content creators and partners. The solution also lays the groundwork for future innovations, ensuring that every story reaches its intended audience and that every member enjoys their favorite titles on Netflix.
CdXz5zHNQW_bbTwixEEHc.png

Introducing Impressions at Netflix

At Netflix, images on the platform are called "impressions" and play a crucial role in personalizing the user experience. Capturing and processing these impressions is a complex task that requires a sophisticated system. The system tracks and processes billions of impressions daily, maintaining a detailed history of each profile's exposure. This impression history is essential for enhanced personalization, frequency capping, highlighting new releases, and analytical insights. The first step in managing impressions is creating a Source-of-Truth (SOT) dataset, which supports various downstream workflows and enables multiple use cases. Raw impression events are collected from the client side and processed through a custom event extractor, Apache Kafka, and Apache Iceberg. The data is then filtered, enriched, and structured using Apache Flink, establishing a definitive source of truth for Netflix's impression data. The system ensures high-quality impressions by gathering detailed metrics and alerting the team of any potential issues. The architecture is designed to handle a massive volume of impression events in real-time, with a focus on scalability, flexibility, and high availability. Future work includes addressing unschematized events, automating performance tuning, and improving data quality alerts.

Title Launch Observability at Netflix Scale

To ensure seamless title launches and discoverability at Netflix, it's essential to take a step back and understand the broader context before diving into solutions. This thoughtful approach builds resilience and scalability for the future. The first step is to understand the bigger picture by identifying stakeholders, mapping the current landscape, clarifying the core problem, and assessing business priorities. The main stakeholders involved are title launch operators, personalization system engineers, product managers, and creative representatives. Mapping the current landscape revealed that there was no established solution for title launch observability, presenting both challenges and opportunities. The core problem was to ensure every title was treated fairly by the personalization stack. To address this, a shared understanding of "Title Health" was introduced, encompassing various metrics and indicators that reflect a title's performance in terms of discoverability and member engagement. Title Health provided a framework to monitor and optimize each title's lifecycle, allowing for alignment with partners on principles and requirements before building solutions. To build a robust plan for title launch observability, issues were categorized into three primary areas: title setup, personalization systems, and algorithms. By categorizing issues, challenges can be systematically addressed, and a reliable, personalized experience can be delivered for every title. Issue analysis revealed that setup issues are the most common but easiest to fix, while algorithm issues are rare but difficult to address. Evaluating options led to the decision to focus on proactive issue detection first, catching problems before launch to ensure smoother launches, better member experiences, and stronger system reliability. This decision laid the foundation for a scalable, robust system that can grow with the complexities of the ever-evolving platform.
CdXz5zHNQW_aUcN7HOcn2.png

Part 3: A Survey of Analytics Engineering Work at Netflix

The article concludes a multi-part series on Analytics Engineering at Netflix, focusing on technical craft. It discusses dashboard design tips, emphasizing the importance of understanding user needs and leveraging existing patterns. The article also covers the design of dashboards, including how to structure applications to match user mental models, and provides guidelines for page layouts, interactive charts, and onboarding. Additionally, it shares learnings from deploying an analytics API at Netflix, highlighting key takeaways such as measuring the impact and necessity of real-time results, exploring all available solutions, aligning on performance expectations, and the importance of rigorous testing and collaboration.
CdXz5zHNQW_QJdCCQ1JwJ.png

Part 2: A Survey of Analytics Engineering Work at Netflix

Netflix's Analytics Engineering team has been working on various projects, including game analytics to measure the effectiveness of user acquisition campaigns for Netflix games. The team uses a synthetic control framework to estimate the counterfactual scenario for these campaigns, and has developed an interactive tool to provide insights into incrementality results. They are also working on an Incremental Return on Investment model to guide the design and budgeting of future campaigns. In addition, the team is focusing on measuring and validating incremental signups for Netflix games, using an approach that estimates incremental signups by leveraging other frameworks like Incremental Account Lifetime Valuation. Netflix Games' player journey is also being modeled using a state machine to track the daily state machine showing the probability of account transitions between states. This model helps in evaluating progress towards engagement goals and identifying areas to boost Monthly Active Accounts. The article also discusses content cash modeling, where Netflix forecasts cash needs for "To Be Determined" slots by modeling cash spend per day leading up to and beyond a title's launch date. Finally, the team is working on improving the efficiency of the transcription process in dubbing workflows using Assistive Speech Recognition technology, with a multi-layered measurement framework to evaluate its performance.
CdXz5zHNQW_x65HF2Cxou.png

Introducing Configurable Metaflow

Metaflow empowers teams at Netflix to efficiently develop and manage ML/AI projects. Configs, a new Metaflow feature, enables configurable flows by allowing users to define flow parameters, resource requirements, and application-specific settings in human-readable configuration files. Configs extend existing Metaflow artifacts and Parameters, providing greater flexibility and customization options. They can be dynamically loaded and modified, even when using remote execution or production deployments. Configs unlock advanced use cases such as mixing fixed deployments with runtime configurability, validating configurations, managing hierarchies of configuration files, and generating configurations on the fly. In combination with configuration managers like Hydra, Configs enable orchestrating experiments over multiple configurations and sweeping over parameter spaces. Metaboost, an internal Netflix tool, exemplifies the use of Configs in practice, providing a single interface to manage ML projects across different platforms and enhance project coherence and reduce risk.

Part 1: A Survey of Analytics Engineering Work at Netflix

Netflix's Analytics Engineering team works to empower the company to efficiently produce and effectively deliver high-quality, actionable insights across the company. They recently held an annual internal Analytics Engineering conference, which covered various topics such as DataJunction, LORE, and leveraging foundational platform data to enable cloud efficiency analytics. DataJunction is an open-source solution that allows metric definitions to be standardized and accessible, while LORE is a chatbot that uses LLMs to provide real insights to end users. The team also focuses on democratizing analytics through initiatives like Analytics Enablement, which aims to empower Netflix's analytic practitioners to efficiently produce and effectively deliver high-quality insights. The team's goal is to enable the engineering organization to make efficiency-conscious decisions when building and maintaining services that allow Netflix to function as a streaming service.

Cloud Efficiency at Netflix

Netflix uses Amazon Web Services (AWS) for its cloud infrastructure needs, including compute, storage, and networking. The company's diverse technological landscape generates extensive data from various infrastructure entities, which data engineers and analysts use to provide actionable insights to the engineering organization. The Data & Insights organization partners with engineering teams to share key efficiency metrics, empowering stakeholders to make informed business decisions. The Platform DSE team has developed a two-component solution: Foundational Platform Data (FPD) and Cloud Efficiency Analytics (CEA). FPD provides a centralized data layer for all platform data, featuring a consistent data model and standardized data processing methodology. CEA offers an analytics data layer that provides time series efficiency metrics across various business use cases. The team aims to continue onboarding platforms to FPD and CEA, striving for nearly complete cost insight coverage in the upcoming year. They also plan to extend FPD to other areas of the business, such as security and availability, and move towards proactive approaches via predictive analytics and ML for optimizing usage and detecting anomalies in cost.

Title Launch Observability at Netflix Scale

At Netflix, managing over a thousand global content launches each month is a top priority, requiring robust systems that deliver comprehensive observability. The company aims to connect every story with the right audience, but traditional system metrics like error rates and CPU utilization don't capture the nuances of title success. To bridge this gap, Netflix needs to design systems that recognize these nuances and empower every title to shine. In the early days of Netflix Originals, the launch team manually verified title placements, but this approach couldn't scale with the company's global expansion. As a result, Netflix faced operational challenges in providing accurate and timely answers to complex queries about title performance and discoverability. The company needed a scalable solution to ensure every title launches flawlessly, with metadata and assets correctly configured, data flowing seamlessly, and algorithms functioning as intended. To address these challenges, Netflix considered two options: log processing and observability endpoints in personalization systems. Log processing offers a straightforward solution for monitoring and analyzing title launches, but it has limitations, such as catching issues ahead of time and ensuring accuracy. Observability endpoints, on the other hand, provide real-time monitoring, proactive issue detection, and enhanced accuracy, but require significant initial investment and synchronization efforts. Netflix ultimately adopted a comprehensive observability strategy that includes real-time monitoring, proactive issue detection, and source of truth reconciliation. This approach has significantly enhanced the company's ability to ensure successful title launches and discovery across the platform. In the next part of the series, Netflix will share key technical insights and details on how they achieved this.

Netflix’s Distributed Counter Abstraction

Netflix's Distributed Counter Abstraction is a service designed to store and query large volumes of temporal event data with low millisecond latencies. It supports two main categories of use cases: Best-Effort and Eventually Consistent. The Best-Effort counter uses EVCache for high throughput and low latency within a single region, but lacks cross-region replication and consistency guarantees. The Eventually Consistent counter uses a durable queuing system like Apache Kafka for accurate and durable counts, but can lead to delays and challenges in rebalancing partitions. Netflix's approach combines logging each counting activity as an event and continuously aggregating these events to meet requirements for auditing and recounting.

Investigation of a Workbench UI Latency Issue

A user reported that their JupyterLab UI on their Workbench becomes slow and unresponsive when running certain notebooks. To quantify the slowness, we held down a key for 15 seconds while running the user's notebook and observed latencies ranging from 1 to 10 seconds, averaging 7.4 seconds. We ruled out CPU oversubscription and network issues as potential causes. Using py-spy, we found that a lot of CPU time was spent on a function called __parse_smaps_rollup in the jupyter-lab process. This function is part of the jupyter_resource_usage extension, which is used to get resource usage information. The function is triggered by an API endpoint called periodically from the UI. The code gets all children processes of the jupyter-lab process recursively, including both the ipykernel Notebook process and all processes created by the Notebook. This function's cost is linear to the number of all children processes. In the reproduction code, we create 96 processes, which is more than the 64 CPUs allocated to the container. This discrepancy causes the parent process to be slow.

Introducing Netflix TimeSeries Data Abstraction Layer

Netflix has developed the TimeSeries Abstraction to efficiently store and query large volumes of temporal event data with low millisecond latencies. This system is designed to handle high-throughput writes, efficient querying in large datasets, global reads and writes, tunable configuration, and cost efficiency. The TimeSeries Abstraction is built around core design principles including partitioned data, flexible storage, configurability, scalability, and sharded infrastructure. The data model consists of event items, events, time series IDs, and namespaces. Event items are key-value pairs that store data for a given event, while events are structured collections of one or more event items. Time series IDs are collections of events over a dataset's retention period, and namespaces are collections of time series IDs and event data. The TimeSeries Abstraction provides APIs for interacting with event data, including WriteEventRecordsSync, WriteEventRecords, ReadEventRecords, SearchEventRecords, and AggregateEventRecords. The storage layer consists of a primary data store and an optional index data store, with Apache Cassandra and Elasticsearch being the preferred choices for storing durable data and indexing, respectively. The primary data store uses a temporal partitioning scheme to divide data into manageable chunks based on time intervals, allowing for efficient querying of specific time ranges and optimizing storage and query performance. The data is further partitioned into time buckets and event buckets to facilitate effective range scans and manage high-throughput write operations. The TimeSeries Abstraction is designed to handle the challenges of storing and querying temporal event data at scale, including high throughput, efficient querying, global reads and writes, tunable configuration, and cost efficiency. By using a unique event data model and a scalable storage layer, the TimeSeries Abstraction provides a versatile and cost-effective solution for managing temporal data at Netflix.

Introducing Netflix’s Key-Value Data Abstraction Layer

Netflix's Key-Value (KV) Data Abstraction Layer (DAL) is a foundational abstraction service that simplifies data access and enhances infrastructure reliability. It offers a consistent interface to developers, regardless of the underlying database. The KV abstraction employs a two-level map architecture, supporting both simple and complex data models. It provides four basic CRUD APIs (PutItems, GetItems, DeleteItems) and complex MutateItems and ScanItems APIs for diverse use cases. A namespace defines where and how data is stored, providing logical and physical separation. This allows different use cases to be routed to the most suitable storage system based on specific performance, durability, and consistency needs. The PutItems and DeleteItems APIs leverage idempotency tokens to guarantee data integrity and ensure correct order of operations. Client-generated monotonic tokens are preferred for reliability. To handle large blobs, the KV abstraction uses chunking. This technique breaks large data into smaller chunks, which are then staged and committed with appropriate metadata. Pagination is a critical feature for managing large datasets. The GetItems API supports pagination using a next_page_token, ensuring efficient data retrieval across multiple requests. Tombstones, which indicate deleted data, can impact performance. KV optimizes record and range deletes to generate a single tombstone, reducing load spikes and maintaining consistent performance. Item-level deletes are handled using TTL-based deletes with jitter. This technique hides storage engine complexity and minimizes the impact of deletes on read pagination. Idempotency and chunking are essential for handling tail latencies and ensuring predictable low-latency performance. These design philosophies contribute to the reliability and performance required by Netflix's global operations.

Pushy to the Limit: Evolving Netflix’s WebSocket proxy for the future

Pushy is Netflix's WebSocket server maintaining persistent connections with devices running the Netflix app. It evolved from a message delivery service to an integral part of the Netflix ecosystem. Pushy handles hundreds of millions of concurrent WebSocket connections, delivering hundreds of thousands of messages per second, with a 99.999% message delivery reliability rate. Initially developed for FireTVs, Pushy's reach has expanded to nearly a billion devices, including mobile devices and older devices with limited capabilities. To handle this growth, Pushy has undergone several improvements, including rewriting the message processor using Netflix paved-path components, migrating the Push Registry to KeyValue for better performance and scalability, and implementing exponential scaling policies based on the number of connections. These enhancements have made Pushy more reliable, stable, and efficient, enabling it to support the growing number of devices and meet the demands of new features.

Noisy Neighbor Detection with eBPF

Netflix uses eBPF to continuously monitor run queue latency, an indicator of noisy neighbors (containers heavily utilizing server resources, degrading performance in adjacent containers). eBPF hooks (sched_wakeup, sched_wakeup_new, sched_switch) are used to capture run queue latency and associate it with containers (cgroup IDs) using kfuncs (kernel functions) for safe RCU-protected data access. A rate limiter in eBPF balances observability with performance by limiting data points sent to userspace. Userspace processes events from the eBPF ring buffer and emits metrics to Atlas, including run queue latency (runq.latency) and preemption count (sched.switch.out) by cgroup ID. Both runq.latency and sched.switch.out metrics are necessary to identify noisy neighbors, as runq.latency alone can be misleading when containers are at their CPU limit. A case study demonstrates a noisy neighbor issue causing a spike in run queue latency and preemptions when a new container fully utilizes host CPUs. System processes were identified as the noisy neighbors using the sched.switch.out metric. Optimizations to the eBPF code, including the use of BPF_MAP_TYPE_HASH, direct task struct member access, and ignoring kernel tasks, minimize overhead. A Linux kernel patch was submitted and accepted to improve the calculation of statistics in the kernel. BPFtop, an open-source eBPF process monitoring tool, was used to measure the overhead of the eBPF code.

Recommending for Long-Term Member Satisfaction at Netflix

Netflix employs personalization algorithms to recommend content to members, aiming to enhance long-term satisfaction. Recommendations are viewed as a contextual bandit problem, where the system selects actions based on context and feedback from members. Traditional recommender systems optimize for short-term metrics like clicks, which may not fully capture long-term satisfaction. Optimizing for retention alone has drawbacks, so Netflix utilizes proxy reward functions aligned with long-term member satisfaction. Click-through rate (CTR) is a simple proxy reward, but Netflix expands beyond CTR to consider various user actions and their implications on satisfaction. Reward engineering is an iterative process of refining the proxy reward function to align with long-term member satisfaction, involving hypothesis formation, reward definition, bandit policy training, and A/B testing. Netflix addresses the challenge of delayed feedback by predicting missing feedback, enabling the use of all feedback in the proxy reward function. Despite offline model improvements, online-offline metric disparity can occur when the proxy reward is not fully aligned with long-term member satisfaction. Netflix resolves this by further refining the proxy reward definition. Open questions remain, such as automating proxy reward function learning, determining the optimal waiting time for delayed feedback, and leveraging Reinforcement Learning for alignment with long-term satisfaction.