The Airbnb Tech Blog | Medium Note


Airbnb Engineering is a collection of articles from Airbnb's engineering team covering technologies, innovations, and industry case studies. It gives readers in-depth analysis of the approaches the team takes and the challenges it faces in software engineering, product development, scalability, performance, and more. The blog also covers trends in technology, leadership, and team collaboration, with lessons that other tech companies can apply.

Thread Of Notes

Scaling beyond one: How Airbnb evolved its data architecture for a multi-product world

Airbnb's expansion into Homes, Experiences, and Services necessitated a revamped data modeling framework for their offline data warehouse. The core challenge was balancing consistency with flexibility to avoid data silos and technical debt. They opted for a balanced approach combining centralized principles with decentralized modeling guidelines. Three foundational principles ensured consistency: no hybrid models, consistent identifier naming, and clear namespace organization. Modeling guidelines empowered teams to choose between separate or monolithic models based on shared vs. unique attributes, future evolution, and downstream consumers. Product-specific domains like Listings, Availability, Location, and Guest interactions opted for separate models due to distinct attributes. Conversely, cross-cutting concepts like Messaging, Payments, and Customer Support benefited from a monolithic model for a unified view. The offline data warehouse acts as a crucial translation layer, standardizing raw production data for analytics. Managing data debt, including migrating legacy tables, was a significant undertaking requiring careful communication and validation. This framework provided a scalable and consistent foundation for Airbnb's evolving data needs.

Sitar-agent: Building a reliable dynamic configuration sidecar at scale

Airbnb developed the sitar agent, a lightweight Kubernetes sidecar, to reliably deliver dynamic configuration changes to thousands of service instances. The configuration delivery begins with developers creating or updating values through Git or a UI, which are then stored in the Sitar Service. Periodically, the full state of configurations is packaged into compressed snapshots and uploaded to AWS S3. When a service pod starts, the sitar agent first downloads these S3 snapshots to a mounted disk, enabling a quick bootstrap. It then synchronizes with the Sitar Service for any changes made since the snapshot was created, signaling readiness to the main application container. After startup, the agent continuously polls the Sitar Service for updates every few seconds. The main application container reads configurations from the mounted disk using a Sitar client library that caches values and detects file changes. A key design decision was maintaining the sitar agent as a separate sidecar container rather than integrating it into the main container, prioritizing reliability, operational safety, and multi-language support over minor cost savings. The system uses a pull model where the agent polls the Sitar Service, optimized with a server-side cache and token-based database access to reduce load. For its local on-disk key-value store, Airbnb opted for SQLite over the legacy Sparkey-backed implementation due to SQLite's superior concurrency, performance, and multi-language support. SQLite's built-in Write-Ahead Logging allows concurrent reads during writes, and its simpler operational model was preferred over RocksDB's higher performance but greater complexity. This robust sidecar design ensures that critical configurations are delivered quickly and reliably across Airbnb's vast service fleet.
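
The on-disk store described above can be illustrated with a minimal sketch using Python's standard `sqlite3` module. The table name `config`, the helper names, and the key `feature.enabled` are assumptions made for illustration, not Sitar's actual schema or API. The sketch shows only the property the summary calls out: with Write-Ahead Logging enabled, a reader connection (standing in for the application) keeps seeing the last committed value while a writer connection (standing in for the agent) has an update in flight.

```python
import sqlite3


def open_store(path):
    """Open (or create) a key-value store backed by SQLite in WAL mode."""
    conn = sqlite3.connect(path)
    # WAL lets readers proceed while a writer holds an open transaction.
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS config (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def put(conn, key, value):
    """Insert or overwrite a key and commit immediately."""
    with conn:  # commits on success, rolls back on exception
        conn.execute(
            "INSERT OR REPLACE INTO config (key, value) VALUES (?, ?)", (key, value)
        )


def get(conn, key, default=None):
    """Return the committed value for a key, or the default if absent."""
    rows = conn.execute("SELECT value FROM config WHERE key = ?", (key,)).fetchall()
    return rows[0][0] if rows else default
```

In this sketch the agent and the application would each call `open_store` on the same file under the mounted disk and use separate connections.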

When history fails you, borrow from geography

The COVID-19 pandemic presented a forecasting challenge because it disrupted historical patterns, making traditional models unreliable. Demand recovery was asynchronous and geographically uneven, with varying timelines for vaccine rollouts and border reopenings. Waiting for local data in each recovering market would have meant forecasting blind for extended periods. Airbnb's solution was to look sideways across geographies instead of just backward in time. They observed that recovery unfolded sequentially, with some corridors experiencing changes before others. A key signal was the mean booking lead time, which compressed during disruptions and lengthened during recovery. By comparing the timing of these lead time shifts between regions like Europe and North America, they could infer how one market would likely respond to reopening based on the experience of another. This allowed them to use the posterior from an early-affected corridor as an informative prior for a later-affected one. This prior propagation mechanism enabled real-time learning and forecasting, even in markets with scarce local data. The approach required a global footprint with granular data and a hierarchical Bayesian modeling framework. This methodology is applicable beyond crises, such as for predicting the impact of new product feature rollouts or regulatory changes. It allows for immediate learning from early-adopting markets to inform forecasts for those yet to experience the change.
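
The prior propagation idea can be sketched with a deliberately simplified model. The article describes a hierarchical Bayesian framework; the code below instead assumes a single parameter (say, the shift in mean booking lead time after reopening) with a normal prior and normally distributed observations of known variance, so the update has a closed form. The class and function names, and the `widen` step that inflates uncertainty before reusing a posterior in another corridor, are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Normal:
    mean: float
    var: float


def update(prior, observations, noise_var):
    """Conjugate normal-normal update with known observation variance."""
    n = len(observations)
    if n == 0:
        return prior
    precision = 1.0 / prior.var + n / noise_var
    mean = (prior.mean / prior.var + sum(observations) / noise_var) / precision
    return Normal(mean, 1.0 / precision)


def widen(dist, factor):
    """Inflate variance to reflect that another corridor may behave differently."""
    return Normal(dist.mean, dist.var * factor)


def propagate(early_obs, late_obs, vague_prior, noise_var, widen_factor=4.0):
    """Fit the early corridor, then reuse its posterior as the late corridor's prior."""
    early_posterior = update(vague_prior, early_obs, noise_var)
    late_prior = widen(early_posterior, widen_factor)
    return update(late_prior, late_obs, noise_var)
```

With only a handful of observations in the later corridor, the resulting posterior stays close to what the earlier corridor experienced and moves toward local data as it accumulates.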

Scaling Airbnb’s identity graph with a unified knowledge graph infrastructure

Airbnb transitioned from a vendor-managed graph database to an internal knowledge graph infrastructure for improved performance. The identity graph, crucial for trust and safety, initially relied on relational databases and then on the vendor-managed graph service, where it faced scalability and latency challenges. The new, internally managed infrastructure, built on JanusGraph and DynamoDB, provides a unified platform that supports multi-tenant operations with schema enforcement. Key optimizations such as custom transactions and parallel query execution were implemented to meet Airbnb's performance needs, and client-side query optimizations, including rewriting Gremlin queries, improved performance further. The new system scaled to 10x the write QPS of the previous solution, and the migration considerably reduced P99 latency for complex queries. Performance gains across various query patterns led to faster read API response times, and system stability improved. Because the system is self-managed, incidents can be resolved more quickly. Airbnb's knowledge graph now supports critical use cases.

Viaduct 1.0 and the future of Airbnb’s data mesh

Viaduct is Airbnb's data-oriented service mesh, now released as a community-driven project with a stable API. It serves as a GraphQL-based interface for accessing and interacting with various data sources. The goal of Viaduct is to provide a central schema for an organization's data while enabling decentralized development. Unlike GraphQL Federation, Viaduct distributes development through modules hosted on a shared multi-tenant runtime. Teams contribute by creating modules with SDL and resolvers, focusing on domain logic. Viaduct can also function as a subgraph within a federated architecture. The 1.0 release includes new features, stability through annotations, and automated releases. Airbnb is committed to community involvement, using GitHub for architectural discussions. The article highlights upcoming discussions at GraphQLConf 2026 by Airbnb engineers, showcasing Viaduct's capabilities. These include probabilistic testing, monitoring, sharding solutions, and generating mock data using LLMs. The announcement encourages involvement from those seeking to unify data layers or contribute to the project.

Monitoring reliably at scale

The core problem is that observability systems can fail when the infrastructure they monitor fails, creating circular dependencies. Airbnb, like many organizations, faced this issue where their metrics pipeline depended on the same systems it observed. This dependency chain needed to be broken to ensure reliable monitoring, especially during outages. To solve this, Airbnb isolated compute by using dedicated Kubernetes clusters managed by the Cloud team. They rethought networking, building a custom Envoy-based Layer 7 ingress layer to bypass the service mesh for telemetry, ensuring prioritization and isolation. Metrics are uniquely high volume, so a dedicated network path avoids congestion and potential disruption. Airbnb also implemented meta-monitoring, monitoring the observability stack itself to detect potential issues. A crucial part of meta-monitoring is the use of a "Dead Man's Switch" mechanism to detect failures in the monitoring system. This overall approach creates a robust signal chain that protects against silent failures in the observability setup. The key takeaway is treating monitoring as a production system, ensuring its reliability surpasses that of the systems it observes. This is crucial for enabling prompt incident response and maintaining user and business confidence. The principles apply universally and involve isolating failure domains for robust system design.
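
A "Dead Man's Switch" inverts the usual alerting logic: the monitoring stack continuously emits a heartbeat, and an independent watcher raises an alarm when the heartbeat stops. The sketch below is a minimal illustration under that assumption; the class name, the timeout, and the choice to pass timestamps explicitly are not taken from the article.

```python
class DeadMansSwitch:
    """Trips when no heartbeat has arrived within the timeout."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_heartbeat = None

    def heartbeat(self, now):
        """Record that the monitored pipeline is still alive at time `now`."""
        self.last_heartbeat = now

    def is_tripped(self, now):
        """Return True if the pipeline has been silent for too long."""
        if self.last_heartbeat is None:
            # Never heard from the pipeline: treat as failed, not healthy.
            return True
        return now - self.last_heartbeat > self.timeout_seconds
```

The watcher only helps if it runs outside the failure domain of the stack it watches, which matches the isolation theme of the article.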

Skipper: Building Airbnb’s embedded workflow engine

Airbnb faced the challenge of durable execution for complex workflows, like insurance claims. Existing solutions such as external orchestration engines had drawbacks, including operational complexity and vendor lock-in. To address these issues, Airbnb built Skipper, an embedded workflow engine that runs within each service. Skipper prioritizes succinctness, no single points of failure, and reuse of the service's existing database. The core of Skipper's design consists of workflows, which define the logic, and actions, which encapsulate individual operations. Durability is achieved through a replay mechanism that checkpoints action results to the database. Replay allows workflows to resume from where they left off after interruptions. State fields are persisted for efficiency, using signals to update a workflow's state. Testing workflows with Skipper is simplified, as there are no queues to set up and no infrastructure to mock. This approach ensures durable execution without adding dependencies or compromising performance.
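
The replay mechanism can be sketched as follows. This is not Skipper's API: `WorkflowContext`, `run_action`, and the toy claim workflow are hypothetical names, and a plain dictionary stands in for the checkpoint table in the service's database. The point is only that re-running the workflow function from the top skips any action whose result was already checkpointed.

```python
class WorkflowContext:
    """Runs actions once and replays their checkpointed results afterwards."""

    def __init__(self, checkpoints):
        self.checkpoints = checkpoints  # stand-in for a database table
        self._step = 0

    def run_action(self, name, fn, *args):
        key = f"{self._step}:{name}"
        self._step += 1
        if key in self.checkpoints:
            # Replay: reuse the stored result instead of re-executing.
            return self.checkpoints[key]
        result = fn(*args)
        self.checkpoints[key] = result  # checkpoint before moving on
        return result


call_log = []  # records which actions really executed


def validate_claim(claim_id):
    call_log.append("validate")
    return f"validated:{claim_id}"


def pay_claim(claim_id, fail):
    call_log.append("pay")
    if fail:
        raise RuntimeError("payment provider unavailable")
    return f"paid:{claim_id}"


def claim_workflow(ctx, claim_id, fail_payment=False):
    validated = ctx.run_action("validate", validate_claim, claim_id)
    paid = ctx.run_action("pay", pay_claim, claim_id, fail_payment)
    return [validated, paid]
```

If the first run fails during payment, a second run with a fresh `WorkflowContext` over the same checkpoints resumes at the payment step without validating again.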

Building a fault-tolerant metrics storage system at Airbnb

Airbnb developed an internal storage system to handle 50 million samples per second and 2.5 petabytes of time series data. This shift was necessary due to the immense volume of data generated by extensive code instrumentation across their evolving products and infrastructure. The primary engineering challenge was to persist and serve this massive dataset performantly. To manage this scale, Airbnb adopted a multi-tenant architecture, isolating tenants by service or process for stable grouping and attribution. They implemented shuffle sharding to isolate tenant workloads, improving fault tolerance by ensuring tenants only write to and are queried from a subset of nodes. Operational complexity, especially tenant onboarding and configuration management, was addressed by a consolidated control plane that automated onboarding and simplified configuration updates. Key requirements for the system included handling over 50 million samples per second, supporting numerous dashboards and alerts, and maintaining low query execution times. Initial validation using shadow clusters revealed reliability issues, compaction delays, and slow query performance, especially with large data payloads. Addressing these challenges began with ensuring the reliability of a single cluster, focusing on stabilizing writes, reads, and compaction through benchmarking, guardrails, and isolation of query paths. The system was made fault-tolerant with zone-aware stateful components deployed across three zones. Per-replica limits and tenant-level controls were implemented for effective fleet management and system protection. Subsequently, a multi-cluster architecture was adopted to reduce the blast radius of failures and enhance flexibility. This multi-cluster approach, however, introduced complexities in metrics discovery, querying, and operational overhead. These were mitigated by tooling for tenant-cluster mapping and automated deployment strategies using Kubernetes operators. 
The introduction of Promxy with custom enhancements facilitated cross-cluster querying and alerting. Key learnings from this journey include the significant cost of cross-cluster querying and the importance of deployment consistency, achieved through automation and standardized deployments. The philosophy evolved towards treating clusters as disposable resources, similar to "cattle," rather than critical, unique "pets," allowing for easier scaling and maintenance. Ultimately, building this platform required a blend of architectural innovation, operational rigor, and a cultural shift in managing expectations.
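
Shuffle sharding, mentioned above, assigns each tenant a small, pseudo-random subset of nodes so that two tenants rarely share their whole subset. The sketch below is one common way to do this and is an assumption, not Airbnb's implementation: it seeds a random generator from a hash of the tenant name so the assignment is stable across processes.

```python
import hashlib
import random


def shuffle_shard(tenant, nodes, shard_size):
    """Deterministically pick `shard_size` nodes for a tenant."""
    if shard_size > len(nodes):
        raise ValueError("shard_size cannot exceed the number of nodes")
    # Derive a stable seed from the tenant name (hash() is salted per process).
    digest = hashlib.sha256(tenant.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    # Sort first so the result does not depend on the caller's node order.
    return sorted(rng.sample(sorted(nodes), shard_size))
```

With 12 nodes and a shard size of 3 there are 220 possible shards, so a noisy tenant fully overlaps with only a small fraction of other tenants.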

Privacy-first connections: Empowering social experiences at Airbnb

Airbnb is evolving into a more social platform, emphasizing community connections through Experiences. The platform prioritizes user privacy by creating a distinction between internal user data and public-facing profiles. Users have control over their profile visibility, choosing what information is shared on Experiences. To manage this, Airbnb uses unique User IDs and Profile IDs for distinct contexts, enabling flexible privacy settings. This allows users to have multiple profiles, such as host and guest profiles, kept separate when appropriate. Airbnb employs a permission system called Himeji, ensuring least-privileged access and strong data security. The implementation involved automated audits, team collaboration, and AI-assisted refactoring to ensure proper identifier usage. Type safety and rigorous testing were crucial in maintaining data integrity during the migration. Ultimately, Airbnb's privacy principles aim to balance social features with user control over personal information. This approach is designed to foster both connection and privacy within the Airbnb community. This system is in place to allow users to confidently connect and share with others.

Building a high-volume metrics pipeline with OpenTelemetry and vmagent

The migration involved moving a large metrics pipeline from StatsD to OpenTelemetry and Prometheus, aiming for frontloaded metric collection. The initial focus was on instrumenting with the OpenTelemetry Protocol (OTLP) for internal services and Prometheus for OSS workloads, with StatsD as a fallback. A dual-write approach, leveraging a shared metrics library, facilitated the transition to OTLP. This resulted in improved performance and access to Prometheus features like histograms. However, high-cardinality metrics caused performance regressions, addressed by using delta temporality for selected services. A Prometheus-based aggregation pipeline using vmagent was implemented to reduce costs and enable transformations. An issue arose where Prometheus's rate() function undercounted certain counters due to reset events. They implemented zero injection during aggregation to fix the discrepancies, which was done transparently to the user. This provided a solution to the counter problem and ensured an accurate representation of the data. Finally, the team achieved a future-ready metrics infrastructure by implementing these changes.

My Journey to Airbnb — Jonathan Woodard

Jonathan Woodard, a former professional football player, transitioned his career to software engineering after developing a passion for coding. He joined Airbnb's Connect Engineering Apprenticeship program, finding a new environment to learn and grow. The apprenticeship program provided structured learning, team collaboration, and mentorship, crucial for his transition. During the program, he explored backend engineering, an area he hadn't initially considered. He excelled, eventually securing a full-time position on Airbnb's secure development engineering team. His work focuses on identity and access management and vulnerability management, facing new challenges with LLM integration. Woodard sees parallels between football and security engineering, especially in high-pressure situations and teamwork. He credits his team and the collaborative culture at Airbnb for his success. He shares his experience to encourage non-traditional tech career paths. He's now able to support new apprentices as an alumni host. He finds the work exciting, drawing on his experience to make quick decisions.

What COVID did to our forecasting models (and what we built to handle the next shock)

Airbnb's forecasting models, crucial for financial planning, faced severe disruption during the COVID-19 pandemic. The core challenge was that the relationship between bookings and travel dates, historically stable, became highly volatile. To address this, Airbnb separated its forecasting into two components: gross booking volume and lead-time composition. They developed B-DARMA models specifically designed to handle the changing proportions associated with booking lead times. Surprisingly, even after gross bookings recovered, lead-time compositions exhibited persistent shifts, not returning to pre-pandemic patterns. Airbnb used a distributional divergence metric to monitor and quantify these long-term changes, which became a vital tool for model health diagnosis. These durable shifts directly affected revenue forecasting, cash flow, and operational decisions, highlighting the importance of accurate distributional modeling. To improve forecasting they developed the capacity to learn from structural breaks, allowing the models to adapt to changes. By separating and analyzing these components, Airbnb created more resilient forecasting models, able to detect and adapt to structural shifts.
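
The summary does not say which distributional divergence metric was used. As an illustration only, the sketch below uses the Jensen-Shannon divergence (base 2, so values fall between 0 and 1) to compare two lead-time compositions, each a list of proportions over the same lead-time buckets. The bucket layout and function names are assumptions.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two compositions over the same buckets."""
    if len(p) != len(q):
        raise ValueError("compositions must have the same number of buckets")
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)
```

Tracking such a value between a pre-pandemic baseline composition and each new period's composition gives a single number that stays elevated when the mix of short and long lead times has not returned to its old shape.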

From vendors to vanguard: Airbnb’s hard-won lessons in observability ownership

Airbnb migrated its observability platform from a vendor-managed service to an in-house solution based on open-source technology. The primary motivation was to reduce costs and gain greater control over data and workflows. The initial migration strategy involved tackling a complex service first, but this proved inefficient. A revised strategy focused on migrating a simpler service to prove feasibility and establish a solid foundation. The new approach prioritized complete migration over immediate query improvements to avoid overwhelming users with too many changes. They built translator tooling to map existing dashboards and alerts to the new system. They focused on migrating query intent instead of direct translations, correcting flawed patterns. Implementing a metadata engine ensured accurate metric typing in the new system. They adopted PromQL, a widely understood query language, and integrated AI-powered tooling to aid in query generation. The migration also enabled the replacement of an outdated alert framework. This comprehensive overhaul led to superior tooling, consistent data, and a significantly improved developer experience.

Recommending Travel Destinations to Help Users Explore

Airbnb's team developed a destination recommendation model to assist users in the early stages of trip planning. This model addresses the challenge of users who haven't yet decided on a destination or travel dates. The model predicts users' destination intent by analyzing their actions on the Airbnb platform, like searches and bookings. A key innovation is integrating diverse signals, balancing active and dormant user behaviors, and incorporating location knowledge. The model uses a transformer architecture, treating user actions as tokens to understand user preferences. Training data is specifically designed to accommodate both active users near booking and dormant users in the initial planning phase. Multi-task learning is used to predict both region-level and city-level destinations, improving location understanding. The model powers autosuggest and abandoned search email notifications, helping users discover potential destinations. In autosuggest, it offers city recommendations, leading to booking gains, particularly in non-English speaking regions. Abandoned search emails feature listings in recommended areas, encouraging booking completion. By focusing on the exploration stage, the model aims to spark inspiration, reduce decision friction, and enhance user engagement. The framework provides a foundation for personalization across the entire trip planning experience, including travel times and price preferences.

It Wasn’t a Culture Problem: Upleveling Alert Development at Airbnb

Observability as Code (OaC) at Airbnb defines alerts, dashboards, and SLOs through code, mirroring software development practices. While this process ensures disciplined alert definitions, validating alert behavior in production was a significant challenge. This gap led to either excessive alert noise or missed incidents, hindering developer workflow. To address this, Airbnb built accessible feedback loops for previewing and validating alert behavior before code submission. This innovation dramatically reduced development cycles from weeks to minutes. The company's OaC goal is for product teams to receive out-of-the-box monitoring from platform teams, achieving "zero touch" adoption. However, managing 300,000 alerts made iterating and validating OaC changes costly and risky. Traditional code reviews and unit tests could not predict how alerts would behave with real-world data, leading to a weeks-long validation process. Airbnb rebuilt its OaC platform focusing on local-first development, Change Reports, and bulk backtesting against historical data. This allowed engineers to validate alert templates quickly and efficiently. Key learnings included prioritizing compatibility over novelty, implementing robust guardrails, and owning the full development surface area for improved developer experience. The impact has been a successful migration of 300,000 alerts to Prometheus, collapsed development cycles, and a significant reduction in alert noise, fostering a culture of alert hygiene.
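
Backtesting an alert means replaying its rule over historical samples to see when it would have fired. The sketch below assumes a very simple rule shape (value above a threshold for a number of consecutive samples, similar in spirit to a Prometheus `for` clause); the function name and parameters are illustrative and not part of Airbnb's platform.

```python
def backtest_threshold_alert(samples, threshold, for_points):
    """Return the sample indexes at which the alert would have started firing.

    The alert fires once the value has exceeded `threshold` for `for_points`
    consecutive samples, and resolves as soon as one sample drops back.
    """
    firing_starts = []
    streak = 0
    firing = False
    for index, value in enumerate(samples):
        if value > threshold:
            streak += 1
            if streak >= for_points and not firing:
                firing = True
                firing_starts.append(index)
        else:
            streak = 0
            firing = False
    return firing_starts
```

Running the same history through two candidate settings shows how a longer hold period trades noise for detection delay, which is the kind of feedback the article describes getting in minutes rather than weeks.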

Academic Publications & Airbnb Tech: 2025 Year in Review

Airbnb significantly advanced its AI and machine learning research in 2025, focusing on travel and living platform improvements. They actively participated in key machine learning conferences like KDD and CIKM, while expanding their presence at NLP and optimization-focused events. Researchers published papers, presented findings, and fostered collaborations, contributing to the academic community. KDD focused on search and ranking, with papers on interleaving, counterfactual evaluation, and audience expansion. CIKM showcased advancements in recommendations, maps optimization, and ranking methods for listings. EMNLP saw Airbnb present research on LLM-based customer support, including agent-in-the-loop and summarization techniques. At COLING, Airbnb debuted with a paper on knowledge representation for customer support. Airbnb also presented at MIT CODE, exploring long-term ranking dynamics and data-driven decision-making. Their results demonstrated progress in search, recommendations, and customer experience through innovative AI applications. Airbnb's research initiatives involved mentoring emerging researchers and fostering open-source technology exploration.

Safeguarding Dynamic Configuration Changes at Scale

Airbnb's Sitar platform manages dynamic configuration changes to avoid service restarts and enable swift responses. This platform is built upon four core elements: a developer-facing layer, a control plane, a data plane, and client components. The developer layer facilitates config creation and review, using a Git-based workflow for version control and collaboration. The control plane orchestrates change rollouts, ensuring schema validation, authorization, and staged releases. The data plane provides scalable storage and reliable distribution of configuration updates to service agents. Agents in each service fetch configs and cache them locally for fast access and resilience. Key design choices include Git-based workflows, staged rollouts, and separation of control and data planes. This architecture allows for safer, more predictable rollouts, giving teams more flexibility in managing configs. Incident mitigation is faster, with integrated observability tools for locating and resolving issues. The platform streamlines developer experience and continuously improves safety, observability, and usability. Airbnb is continuously enhancing Sitar, focusing on rollout strategies, testing, and incident response. This work is crucial for maintaining a robust, adaptable, and developer-friendly infrastructure.

My Journey to Airbnb — Anna Sulkina

Anna Sulkina, an Airbnb Senior Director of Engineering, shares her journey in the tech industry. She was inspired by her brother and the early days of computing in Ukraine, leading her to study programming. After immigrating to America, she faced language barriers, which were initially more challenging than programming itself. Her career progressed from hardware diagnostics to front-end and eventually to infrastructure, with increasing leadership responsibilities. She worked at Twitter during significant events, learning the importance of designing for inevitable failure in complex systems. Anna also spearheaded the adoption of GraphQL at Twitter, leading to increased productivity. She joined Airbnb because she loved the product and saw an opportunity to improve developer experience. At Airbnb, she focuses on building reliable platforms for engineers, fostering a strong culture. She highlights the company's balance of established systems and innovative problem-solving, along with the ability to work remotely. Finally, Anna draws parallels between her career, long-distance trail running, strategy and teamwork.

My Journey to Airbnb: Peter Coles

Peter Coles, Airbnb's Head Economist for Policy and Director of Data Science, journeyed from a public school education in Milwaukee to a PhD in economics from Stanford. His early fascination with marketplaces began with a childhood rock stand venture. An avid competitor in academic contests, he later majored in math at Princeton before pursuing economics. His graduate studies at Stanford were influenced by Jon Levin, who emphasized simplification in research. Before Airbnb, Coles taught at Harvard Business School, co-teaching with Nobel laureate and market design pioneer Al Roth. There, he researched matching mechanisms and wrote case studies on companies like Zillow and Craigslist. Transitioning to the tech industry, he joined eBay, focusing on market design and data science initiatives. At Airbnb, he established an economics team to analyze short-term rentals and their impact on cities. He also co-founded Central Strategy & Insights (CSI) to address broader organizational questions and conduct forensic-style investigations. Currently, he leads efforts to model policy considerations and evaluate Airbnb's societal impact, fostering collaborations with academic researchers.

Pay As a Local

Airbnb successfully launched over 20 locally relevant payment methods worldwide in just 14 months to enhance accessibility and reduce friction for guests. These Local Payment Methods (LPMs) include digital wallets, bank transfers, and local payment schemes, moving beyond traditional card payments. Offering LPMs boosts conversion, unlocks new markets with low credit card usage, and provides access for unbanked individuals. Through extensive research, Airbnb identified over 300 payment options and used a structured framework to select the top performers for integration. Airbnb's modernized payments platform, built on a domain-driven architecture, decoupled payment logic for flexibility and scalability. This replatforming effort, known as Payments LTA, shifted from a monolithic system to a services-oriented one, speeding up time to market. The processing subdomain, crucial for integrating third-party providers, adopted a connector and plugin-based architecture. This strategy, along with the introduction of Multi-Step Transactions (MST), significantly reduced integration time for new PSPs and standardized complex payment flows. The integration of LPMs presented challenges due to diverse APIs and the need for external app interactions. Airbnb analyzed end-to-end behaviors and standardized them into three foundational flows: Redirect, Async, and Direct. This unified framework enabled significant code reusability and reduced engineering effort for new payment method integrations. Asynchronous payment orchestration was redesigned to manage external user actions and webhook notifications for successful payments. A config-driven approach, utilizing a central YAML configuration, streamlined payment method integration by consolidating logic and enabling automated code generation. This made integration largely declarative, reducing launch times from months to weeks. 
The payment widget dynamically renders UI and validation rules based on backend configurations, ensuring a tailored checkout experience. Enhanced testability through an in-house PSP Emulator allowed developers to thoroughly test payment scenarios without relying on unstable external sandboxes.

GraphQL Data Mocking at Scale with LLMs and @generateMock

Generating realistic mock GraphQL data has been a persistent industry challenge. Airbnb developed a solution using a new GraphQL client directive, @generateMock, to automate this process. This directive allows engineers to add context and design URLs to their GraphQL operations. The Niobe command-line tool integrates this directive into the code generation workflow. Niobe collects query details, schema context, and design information to create prompts for Large Language Models. The Gemini 2.5 Pro model is used due to its large context window and efficient performance. After LLM generation, the mock data undergoes validation against the GraphQL schema to ensure type safety. If validation fails, the errors are fed back to the LLM for correction, creating a self-healing mechanism. This approach eliminates manual mock creation, saving engineers time and effort. The generated mocks are highly realistic, matching design mockups and improving prototyping and testing capabilities.
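
The validate-and-retry loop can be sketched without any real LLM or GraphQL library. In the code below, `generate` is a caller-supplied function standing in for the model call, the schema is reduced to a flat mapping from field name to Python type, and all names are hypothetical. It shows only the control flow: validation errors from one attempt are passed into the next.

```python
def validate_mock(mock, schema):
    """Return a list of human-readable errors; an empty list means valid."""
    errors = []
    for field, expected_type in schema.items():
        if field not in mock:
            errors.append(f"missing field: {field}")
        elif not isinstance(mock[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}")
    return errors


def generate_valid_mock(generate, schema, prompt, max_attempts=3):
    """Call `generate(prompt, previous_errors)` until the result validates."""
    errors = []
    for _ in range(max_attempts):
        mock = generate(prompt, errors)
        errors = validate_mock(mock, schema)
        if not errors:
            return mock
    raise ValueError(f"no valid mock after {max_attempts} attempts: {errors}")
```

Bounding the number of attempts keeps a persistently failing generation from looping forever during code generation.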

From Static Rate Limiting to Adaptive Traffic Management in Airbnb’s Key-Value Store

Airbnb's key-value store, Mussel, originally used simple QPS rate-limiting to prevent single clients from overwhelming the system. As traffic grew and became more complex, this approach proved insufficient due to cost variance and traffic skew. To address this, Mussel evolved to implement a multi-layered quality of service (QoS) system. The first layer, Resource-Aware Rate Control (RARC), charges requests in Request Units (RU) that account for rows, bytes, and latency, reflecting the actual backend cost. This system uses token buckets with static RU quotas for each caller. The second layer, load shedding, provides real-time protection when capacity is strained or hotspots develop. It combines traffic criticality, a latency ratio indicating system stress, and a CoDel-inspired queueing policy. This allows high-priority traffic to remain responsive and gracefully backs off other traffic when latency increases. The third layer, hot-key detection and DDoS defense, identifies and mitigates surges of identical requests targeting specific data. It uses an in-memory top-k counter for real-time detection, local caching on dispatcher pods, and request coalescing to send only one request to the storage layer for duplicate hot-key lookups. These layered controls have significantly improved Mussel's ability to handle traffic spikes and maintain reliability. Key takeaways include the value of early impact for validating concepts, preferring local control loops for scalability, and employing mechanisms that operate on different time scales. This sophisticated QoS stack ensures Mussel remains fast and dependable, even under extreme and volatile traffic conditions.
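
The first layer, a token bucket charged in Request Units, can be sketched as below. The cost weights in `request_cost_ru` are invented for illustration (the summary says only that rows, bytes, and latency are accounted for), and timestamps are passed in explicitly so the behavior is deterministic.

```python
class RequestUnitBucket:
    """Token bucket whose tokens are Request Units (RU) rather than requests."""

    def __init__(self, capacity_ru, refill_ru_per_second, now=0.0):
        self.capacity_ru = capacity_ru
        self.refill_ru_per_second = refill_ru_per_second
        self.tokens = capacity_ru
        self.last_refill = now

    def try_consume(self, cost_ru, now):
        """Refill based on elapsed time, then charge the request if affordable."""
        elapsed = now - self.last_refill
        self.tokens = min(
            self.capacity_ru, self.tokens + elapsed * self.refill_ru_per_second
        )
        self.last_refill = now
        if cost_ru <= self.tokens:
            self.tokens -= cost_ru
            return True
        return False


def request_cost_ru(rows, payload_bytes, latency_ms):
    """Hypothetical cost formula: a base charge plus rows, bytes, and latency."""
    return 1.0 + 0.1 * rows + payload_bytes / 4096 + latency_ms / 10
```

Compared with a plain QPS limit, a single large scan here can cost as much as many point reads, which is the cost variance the article says QPS limiting could not capture.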

Building a Next-Generation Key-Value Store at Airbnb

Airbnb rearchitected its core key-value store, Mussel, from V1 to V2 to meet evolving real-time data demands. Mussel V2 addresses V1's operational complexity, capacity limitations, and consistency issues with a cloud-native NewSQL backend. Key improvements include automated Kubernetes deployment, dynamic range sharding for better capacity management, and flexible consistency options. The new architecture features a stateless Dispatcher service for translating API calls and an event-driven write path using Kafka for durability. Bulk loading capabilities are preserved, enhanced with stateless controllers and stateful worker fleets for high throughput. Automated Time-To-Live (TTL) expiration is now topology-aware and more efficient at scale. The migration from V1 to V2 employed a complex blue/green strategy with dual writes and shadow reads to ensure zero downtime and data loss. Critical lessons learned include managing consistency complexities and the importance of presplitting for range-based partitioning. Kafka played a crucial role in maintaining eventual consistency during the migration process. Mussel V2 successfully merges bulk ingestion, high-speed writes, and low-latency reads in a single platform, simplifying data infrastructure for engineering teams.
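The dual-write and shadow-read pattern used during the migration can be illustrated with a small sketch. It is deliberately simplified: both stores are plain dictionaries, writes are synchronous rather than routed through Kafka, and `MigratingStore` is a hypothetical name.

```python
class MigratingStore:
    """Serves from the old store while mirroring writes and comparing reads."""

    def __init__(self, v1: dict, v2: dict):
        self.v1 = v1
        self.v2 = v2
        self.mismatches = []  # keys whose shadow read disagreed with V1

    def put(self, key, value):
        self.v1[key] = value  # V1 stays the source of truth during migration
        self.v2[key] = value  # dual write keeps V2 in step

    def get(self, key):
        primary = self.v1.get(key)
        shadow = self.v2.get(key)  # shadow read: compared, never returned
        if shadow != primary:
            self.mismatches.append(key)
        return primary
```

Keys written before dual writes began show up as mismatches until they are backfilled, which is the kind of signal a team would presumably watch before cutting reads over to the new store.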

Viaduct, Five Years On: Modernizing the Data-Oriented Service Mesh

Viaduct, Airbnb's data-oriented service mesh, is now open-source. Over five years, Viaduct usage at Airbnb has grown significantly, with traffic increasing eightfold and the number of teams doubling. It continues to be guided by three core principles: a central schema, hosted business logic, and re-entrancy. The central schema integrates data across the company, making Viaduct a go-to data mesh. Hosting business logic directly in Viaduct, enabled by a serverless platform, simplifies operations for developers. Re-entrancy allows hosted logic to compose with other logic via GraphQL fragments and queries, maintaining modularity. A recent overhaul, "Viaduct Modern," addresses past complexities by simplifying the developer-facing Tenant API to two resolver types: node and field resolvers. This modernization also introduces tenant modularity, formalizing "tenant modules" as units of schema and code owned by a single team, composed via GraphQL. Framework modularity has also been improved, creating stronger abstraction boundaries between the GraphQL execution engine, the tenant API, and hosted code. The new engine API is dynamically typed, while the tenant API is statically typed, allowing independent evolution. Viaduct offers gradual migration by running both the Classic and Modern Tenant APIs on the new engine simultaneously. Other improvements include enhanced observability, faster build times through schema-first development and direct-to-bytecode generation, and a dispatcher for Kubernetes scaling and blast radius mitigation. Airbnb is open-sourcing Viaduct to benefit from community contributions and believes it can be valuable for both large-scale and nascent GraphQL projects. The Modern API is currently in alpha, but the new engine is in full production.

Taming Service-Oriented Architecture Using A Data-Oriented Service Mesh

Airbnb developed Viaduct, a data-oriented service mesh, to improve modularity in its microservices-based architecture. Traditional service meshes are procedure-oriented, while Viaduct focuses on data, using a GraphQL schema as its central organizing principle. This schema defines types, queries, and mutations, abstracting service dependencies from consumers. Viaduct allows data consumers to access information from multiple microservices without direct dependencies, simplifying the architecture. The central schema enables easier API and database schema changes, enhancing data agility and reducing coordination efforts. Viaduct integrates serverless functions for derived data, minimizing the number and complexity of microservices. The system, built with graphql-java, offers features like fine-grained field selection, data observability, and intra-request caching. It leverages the GraphQL ecosystem's tooling and powers a significant portion of Airbnb's API traffic. Viaduct addresses challenges related to spaghetti-like dependency graphs prevalent in large Service-Oriented Architectures. The shift towards a data-oriented approach aims to streamline data access and improve overall system maintainability. The architecture started with a clean schema and has grown to include numerous core entities. The solution, already deployed in production, illustrates Airbnb's commitment to evolving its SOA.

Migrating Airbnb’s JVM Monorepo to Bazel

Airbnb migrated its massive JVM monorepo from Gradle to Bazel, a process spanning 4.5 years. The primary reasons for the switch were Bazel's superior speed, reliability, and unified build infrastructure. Bazel's caching and remote execution significantly accelerated build and test times, improving developer productivity. The migration addressed Gradle's unreliability stemming from non-hermetic builds and resource contention. A proof-of-concept with Airbnb's Viaduct service demonstrated Bazel's effectiveness, leading to broader adoption. A crucial component was an automated build file generator to minimize developer effort and maintain co-existing build systems. This generator efficiently parsed source files, managed dependency cycles, and supported fine-grained build graphs. The migration also involved addressing challenges like multi-version third-party library support and ensuring deployment compatibility. Finally, rigorous testing, particularly startup and integration tests for services, validated the correctness of Bazel-built deployments. The overall result is a dramatically improved build system with increased developer satisfaction.

Seamless Istio Upgrades at Scale

Airbnb has successfully upgraded its Istio service mesh 14 times, managing tens of thousands of pods across dozens of Kubernetes clusters and thousands of VMs. Their upgrade process prioritizes zero downtime and gradual rollouts, allowing independent upgrades without user intervention. The architecture involves a management cluster for Istiod and multiple workload clusters. Upgrades follow a canary model, running current and new Istio versions concurrently. This is achieved by coordinating control plane (Istiod) and data plane (istio-proxy) updates. Crucially, older istio-proxy versions are not used with newer Istiod; they are updated atomically. A central management file, rollouts.yml, dictates the desired Istio version distribution across namespaces. For Kubernetes, an in-house tool called Krispr injects Istio revision labels into deployments during CI and pod admission. This mechanism ensures workloads are upgraded even if they don't deploy frequently. For virtual machines, upgrades are managed by an on-host daemon, mxagent, which installs artifacts based on VM tags. A central controller, mxrc, updates these tags to align with rollouts.yml. Mxrc also monitors VM health, ensuring a controlled upgrade process. This approach effectively decouples infrastructure upgrades from application deployments. Airbnb’s continuous investment in maintainability and safety has enabled these complex, large-scale Istio upgrades.
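One way a version distribution could be turned into a stable per-namespace assignment is sketched below. The actual format of rollouts.yml and the assignment logic are not described in the summary, so the weighted-hash scheme, the `pick_revision` function, and the revision names in the usage note are assumptions.

```python
import hashlib


def pick_revision(namespace: str, distribution: dict[str, int]) -> str:
    """Deterministically map a namespace to a revision according to integer weights."""
    total = sum(distribution.values())
    if total <= 0:
        raise ValueError("distribution must have a positive total weight")
    # A stable hash keeps a namespace on the same revision across runs.
    bucket = int(hashlib.sha256(namespace.encode()).hexdigest(), 16) % total
    cumulative = 0
    for revision, weight in distribution.items():
        cumulative += weight
        if bucket < cumulative:
            return revision
    raise AssertionError("unreachable: bucket is always below the total weight")
```

With a distribution such as `{"1.20": 90, "1.21": 10}`, roughly a tenth of namespaces would land on the canary revision, and the same namespaces stay there as the weights are raised.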

Achieving High Availability with distributed database on Kubernetes at Airbnb

Organizations traditionally used costly standalone servers with sharding for database scaling, but this approach proved problematic for maintenance as data demands grew. Running horizontally scalable, open-source databases reliably in the cloud with high availability, low latency, and scalability at a reasonable cost is a significant challenge. Airbnb adopted an innovative strategy of deploying a distributed database cluster across multiple Kubernetes clusters for improved reliability and operability. Managing stateful services like databases on Kubernetes is difficult, especially concerning node replacement and upgrades, as Kubernetes lacks data distribution awareness. To mitigate this, Airbnb attached storage volumes to nodes using AWS EBS, enabling automatic reattachment to new virtual machines via Kubernetes Persistent Volume Claims. Custom Kubernetes operators were developed to manage node replacement events, categorizing them into database-initiated, proactive infrastructure, and unplanned failures. For database-initiated and proactive events, operators ensure all nodes are running before replacement and intercept pod evictions to coordinate safe deletions. Unplanned failures cannot be coordinated, but ongoing maintenance is protected by blocking replacements until failed hardware is fixed. To ensure high regional availability, Airbnb deploys each database across three independent Kubernetes clusters in different AWS availability zones, limiting the blast radius of issues. Overprovisioning database clusters guarantees sufficient capacity even if an entire AZ, Kubernetes cluster, or all storage nodes in a zone go down. AWS EBS provides rapid reattachment for node replacements and superior durability, allowing a highly available cluster with only three replicas. 
EBS tail-latency spikes are mitigated with storage read timeouts and by allowing reads from replicas, which reduces latency and avoids cross-AZ costs; stale reads further improve read performance. This multi-cluster Kubernetes strategy, built on AWS EBS and custom operators, lets open-source distributed storage systems achieve high availability, low latency, and scalability in cloud environments.
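The coordination rule for planned node replacements can be sketched as a small admission check. This is an illustration of the idea only: `may_evict`, the status strings, and the single-replacement-at-a-time policy are assumptions, not the behavior of Airbnb's operators.

```python
def may_evict(pod: str, node_status: dict[str, str], in_progress: set[str]) -> bool:
    """Allow a planned eviction only when the cluster is fully healthy and idle."""
    if pod not in node_status:
        return False  # unknown pod: refuse rather than guess
    all_running = all(status == "Running" for status in node_status.values())
    # Block while another replacement is ongoing or any node is down.
    return all_running and not in_progress
```

An unplanned failure would surface here as a node that is not "Running", which blocks further planned replacements until it recovers.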

Understanding and Improving SwiftUI Performance

Airbnb adopted SwiftUI in 2022, which improved engineer productivity but introduced new performance challenges. To address these issues, Airbnb created new tooling to identify and validate performance-critical code patterns. The company's feature architecture uses declarative UI patterns and unidirectional data flow systems, which simplified the adoption of SwiftUI. However, SwiftUI features using this architecture didn't perform as well as expected, and understanding SwiftUI's performance characteristics is crucial for building performant features. SwiftUI's view diffing algorithm, which determines when a view's body needs to be re-evaluated, is often overlooked and not officially documented. The algorithm compares each of the view's stored properties, but common code patterns can confound it, leading to unnecessary view body evaluations. To address this, Airbnb created a new @Equatable macro that generates Equatable conformances for views, allowing engineers to selectively decide which properties should be compared when diffing. This approach ensures that views are diffable and prevents regressions from creeping in later. Additionally, Airbnb implemented a custom SwiftLint rule to help engineers identify when a view body is too complex and needs to be refactored into smaller, diffable pieces. By breaking down views into smaller pieces, SwiftUI can efficiently update only the parts of the view that actually changed, maintaining performance as features grow more complex.

Load Testing with Impulse at Airbnb

Airbnb's system-level load testing is crucial for reliability and efficiency, identifying bottlenecks, evaluating capacity, establishing performance baselines, and detecting errors. Impulse is an internal load-testing-as-a-service framework that provides tools to generate synthetic loads, mock dependencies, and collect traffic data from production environments. Impulse includes four main components: a load generator, a traffic collector, a dependency mocker, and a testing API generator. The load generator allows service owners to conduct context-aware load tests, generating requests on the fly and mocking dependencies. The traffic collector captures both upstream and downstream traffic, allowing Impulse to accurately replay production traffic during load testing. The dependency mocker mocks downstream responses with latency, eliminating interference between services and reducing communication costs. The testing API generator creates HTTP APIs during the CI stage, enabling load testing tools to send traffic to these synthetic APIs, allowing asynchronous flows to be exercised as if they were synchronous. Impulse is designed to minimize manual effort, seamlessly integrate with Airbnb's observability stack, and empower teams to proactively address potential issues. The framework has received positive feedback, helping teams identify and address potential issues in their services. Impulse is currently being implemented in several customer support backend services and is under review with teams across the company.

Listening, Learning, and Helping at Scale: How Machine Learning Transforms Airbnb’s Voice Support…

Airbnb has improved its Interactive Voice Response (IVR) system using machine learning to understand users and assist agents more effectively. The IVR system listens, understands, and responds in real time, allowing callers to describe their problems naturally and receive instant support. It combines automated speech recognition, contact reason detection, and language models. The speech recognition system has been improved to reduce errors, and a contact reason detection model identifies the intent behind a caller's query. A help article retrieval system provides relevant information to users, and a paraphrasing model ensures users understand the solution before being routed to an agent. The system has improved user satisfaction, reduced the need for human intervention, and provided a more intuitive voice support experience.

How Airbnb Measures Listing Lifetime Value

Airbnb uses a framework to determine the value of listings for guests, calculating listing lifetime value (LTV). This framework includes baseline, incremental, and marketing-induced incremental LTV. Baseline LTV estimates total bookings over 365 days, using machine learning and listing data. Incremental LTV addresses the challenge of cannibalization, estimating value added by each listing. Marketing-induced LTV measures the impact of Airbnb's initiatives on listing value. Accurate measurement of baseline LTV is crucial, requiring model training and evaluation over time. Accounting for incrementality involves estimating a production function to link supply and demand to bookings. Handling uncertainty, particularly during the pandemic, involved updating LTV estimates with realized bookings. This LTV framework helps Airbnb identify valuable listings, inform supply strategies, and evaluate internal initiatives. The approach can also be extended to other areas such as Airbnb Experiences. The framework's ongoing development aims to improve the experience for both hosts and guests.

Embedding-Based Retrieval for Airbnb Search

Airbnb's search function plays a crucial role in helping guests find the perfect stay, but it's a challenging task due to the large number of available homes and complex search queries. To tackle this, Airbnb built an Embedding-Based Retrieval (EBR) search system to narrow down the initial set of eligible homes into a smaller pool. The EBR system consists of three key components: constructing training data, designing the model architecture, and developing an online serving strategy. The training data pipeline leverages contrastive learning to map homes and search queries into numerical vectors. The model architecture follows a traditional two-tower network design, with one tower processing features about the home listing and the other processing features related to the search query. The listing tower is computed offline daily, reducing online latency. For online serving, Airbnb explored approximate nearest neighbor (ANN) solutions and chose an inverted file index (IVF) due to its better trade-off between speed and performance. The IVF solution clusters listings beforehand and retrieves homes from the top clusters by treating cluster assignments as a standard search filter. The EBR system led to a statistically significant gain in overall bookings when A/B tested, effectively incorporating query context and ranking homes more accurately during retrieval. The system has been fully launched in both Search and Email Marketing production.
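The IVF idea of treating cluster assignment as a search filter can be shown in a few lines. This is a toy sketch under stated assumptions: centroids are given rather than learned, similarity is a plain dot product, and the function names are hypothetical.

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def build_ivf(listings: dict[str, list[float]],
              centroids: list[list[float]]) -> dict[int, list[str]]:
    """Offline step: assign each listing embedding to its most similar centroid."""
    index = {i: [] for i in range(len(centroids))}
    for listing_id, vec in listings.items():
        best = max(range(len(centroids)), key=lambda i: dot(vec, centroids[i]))
        index[best].append(listing_id)
    return index


def retrieve(query: list[float], listings: dict[str, list[float]],
             centroids: list[list[float]], index: dict[int, list[str]],
             n_probe: int = 1, k: int = 2) -> list[str]:
    """Online step: keep listings in the top clusters, then rank by similarity."""
    top_clusters = sorted(range(len(centroids)),
                          key=lambda i: dot(query, centroids[i]),
                          reverse=True)[:n_probe]
    candidates = [lid for c in top_clusters for lid in index[c]]
    return sorted(candidates, key=lambda lid: dot(query, listings[lid]),
                  reverse=True)[:k]
```

Raising `n_probe` scans more clusters, trading speed for recall, which is the trade-off the summary attributes to the IVF choice.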

Accelerating Large-Scale Test Migration with LLMs

Airbnb used an LLM-driven approach to migrate nearly 3.5K React component test files from Enzyme to React Testing Library (RTL). They initially estimated this would take 1.5 years but completed it in just six weeks. The migration involved refactoring test files while preserving the intent of the original tests and code coverage. They built an automated pipeline with validation, refactor steps, and retry loops. The LLM was given increasing context, including component code and related tests, to improve success rates. They used a "sample, tune, and sweep" approach to improve prompts and scripts, increasing completion from 75% to 97%. Automation efficiently handled the vast majority of the migration, leaving a small percentage for manual fixes. The project maintained test intent and code coverage and proved to be much faster and cheaper than a manual migration. This experience highlights the power of LLMs for large-scale code transformation. Airbnb plans to expand their use of LLMs for developer productivity. They are also hiring engineers who love solving complex problems at scale.

Improving Search Ranking for Maps

Airbnb has been adapting its ranking algorithm for its map interface to better connect guests with hosts. Initially, the ranking algorithm was based on booking probabilities, but that approach does not carry over to maps, where user attention is spread equally across pins. To improve the user experience, Airbnb tested different models of user attention, including restricting the number of map pins and creating two tiers of pins based on booking probabilities. These changes led to significant improvements in bookings and user satisfaction. The company also developed an algorithm that re-centers the map to showcase listings with the highest booking probabilities, which further improved bookings and reduced map moves. Representing the full range of available listings on the map remains a challenge and is a focus for future work.
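The two-tier pin idea can be sketched as a simple split on booking probability. The tier names "regular" and "mini", the fixed cutoff `n_regular`, and the function name are assumptions made for illustration.

```python
def tier_pins(booking_prob: dict[str, float], n_regular: int) -> dict[str, str]:
    """Give the most bookable listings prominent pins and the rest smaller ones."""
    ranked = sorted(booking_prob, key=booking_prob.get, reverse=True)
    return {
        listing_id: ("regular" if position < n_regular else "mini")
        for position, listing_id in enumerate(ranked)
    }
```

Every listing still appears on the map; only its visual prominence depends on the model's score.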

Airbnb at KDD 2024

Airbnb had a significant presence at the 2024 KDD conference in Barcelona, Spain, with three full ADS track papers, one workshop, and seven workshop papers and invited talks accepted into the main conference proceedings. The topics of the work spanned Deep Learning & Search Ranking, Online Experimentation & Measurement, and Two-sided Marketplaces. The company's contributions to the workshop on Two-sided Marketplace Optimization discussed the evolution of content ranking, recommendation systems, and data mining in solving for producers and consumers on these platforms. Airbnb's work on guest intention modeling for personalization uses a deep learning approach to predict travel plans and produces multiple user intent signals. The company also presented a paper on guest demand understanding, combining economic modeling with causal inference techniques to estimate price sensitivity among guest segments.

My Journey To Airbnb | Vijaya Kaza

Vijaya Kaza is the Chief Security Officer and Head of Engineering for Trust and Safety at Airbnb, leading teams that develop technology to safeguard the community and secure infrastructure. She is also the executive co-sponsor of Airbnb Tech's Diversity Council. Vijaya grew up in a large family in India, where she was expected to excel academically, and she developed a strong affinity for science and math, studying electrical engineering in college. After college, she landed a job at Cisco as a software engineer and accidentally stumbled into the security field, following a manager she liked. She spent 17 years at Cisco, leading product development for a $1 billion security product portfolio, and later worked at FireEye and Lookout, a startup in San Francisco focused on mobile security. Vijaya was approached for the CSO role at Airbnb, which she was initially hesitant about, but was impressed by the company's vision and mission. She joined Airbnb in 2019, drawn to the company's dedication to delivering a positive user experience and its mission-driven approach. Vijaya leads two teams, Trust and Safety and Security, which share the common mission of safeguarding users and the platform, but have different techniques, threats, and focus areas. Outside of work, Vijaya has pursued improv comedy, which has taught her valuable leadership lessons, such as thinking on her feet and responding to new scenarios in the moment. She advises others to maintain focus, keep a steady head, and persist forward undeterred in the face of professional setbacks.

From Data to Insights: Segmenting Airbnb’s Supply

Airbnb uses data-driven segmentation to understand the availability patterns of its hosts. This process involves analyzing the availability rate, streakiness, and seasonality of listings to differentiate between hosts with similar profiles. By applying a K-means clustering algorithm, Airbnb identifies eight distinct clusters of hosts based on their availability patterns. These clusters include "Always On," "Short Seasonal," "Event Motivated," and others, each with unique characteristics and preferences. The company validates these segments through A/B testing, correlates them with known attributes, and conducts UX research to ensure they align with real-world behavior. This segmentation model is then scaled to all listings using a decision tree algorithm and integrated into the data warehouse for use by various teams. This approach helps Airbnb develop targeted strategies, products, and messaging to better support its hosts and improve the overall user experience.
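Before any clustering, each listing's calendar has to be turned into features. The sketch below computes two of the three named features from a list of nightly availability flags. The streakiness definition used here (number of available streaks divided by available nights) is an assumption, and seasonality is left out.

```python
from itertools import groupby


def availability_features(calendar: list[bool]) -> tuple[float, float]:
    """Return (availability rate, streakiness) for one listing's nightly calendar."""
    if not calendar:
        raise ValueError("calendar must contain at least one night")
    available = sum(calendar)
    rate = available / len(calendar)
    # Count runs of consecutive available nights.
    streaks = sum(1 for is_open, _ in groupby(calendar) if is_open)
    # Assumed definition: near 0 means long blocks, 1.0 means isolated nights.
    streakiness = streaks / available if available else 0.0
    return rate, streakiness
```

Two listings with the same 50% availability rate can differ sharply on the second feature, which is why the rate alone cannot separate hosts with similar profiles. Feature vectors like these would then be passed to a K-means implementation.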

Building a User Signals Platform at Airbnb

Airbnb developed a stream processing platform called User Signals Platform (USP) to enhance user experience through personalization. The platform leverages user engagement data to provide tailored interactions during the booking process. It consists of a data pipeline layer and an online serving layer, with Flink streaming for near real-time processing and batch processing for data correction and backfill. USP supports various user event processing, including User Signals, User Segments, and Session Engagements. It also uses metrics such as event latency, ingestion latency, job latency, and transform latency to measure the performance of streaming jobs. The platform allows developers to define transforms and user segments without worrying about streaming components. To improve Flink job stability, Airbnb uses standby Task Managers to ensure continuous processing when a Task Manager fails.

Airbnb’s AI-powered photo tour using Vision Transformer

Airbnb has developed an AI-powered photo tour feature to enhance the guest experience by providing detailed information about listings. The feature uses vision transformers to classify and organize listing photos into 16 different room types. To improve model accuracy, Airbnb employed pretraining, multi-task learning, ensemble learning, and knowledge distillation. The pretraining process involved training a Vision Transformer model on millions of Airbnb listing photos. Multi-task learning utilized a diverse dataset to improve the model's ability to interpret visuals. Ensemble learning combined the strengths of multiple models for robust predictions, and knowledge distillation enabled efficient deployment without sacrificing accuracy. The AI-powered photo tour was launched as part of Airbnb's 2023 Winter Release, and the company continues to refine the models for a better user experience.

Adopting Bazel for Web at Scale

Airbnb recently adopted Bazel, Google's open source build tool, as their universal build system across backend, web, and iOS platforms. The company's large-scale web monorepo, consisting of over 11 million lines of code, presented challenges with bespoke build scripts and caching logic that were difficult to maintain and scaled poorly. To address these issues, Airbnb migrated to Bazel, which offered sophistication, parallelism, caching, and performance. The migration process began in 2021, but there was no publicized industry precedent for integrating Bazel with web at scale outside of Google. The team had to overcome performance issues when transmitting large files to the remote environment and established migration principles that included improving or maintaining overall performance and reducing the impact on developers contributing to the monorepo during the transition. To prepare the repository for Bazel, the team performed cycle breaking and automated BUILD.bazel file generation. They also migrated CI jobs to Bazel, starting with type checking, linting, and unit testing. The team enabled TypeScript, ESLint, and Jest, and introduced caching to reduce input size and improve performance. To prevent backsliding, the team moved tests from "hidden" to "required" via a rule attribute and ensured a single source of truth by not running tests under the Jest setup being replaced. They also wrote a script to compare before and after Bazel to determine migration-readiness using metrics such as test runtime, code coverage stats, and failure rate. In tandem with the CI migration, the team ensured that developers can run Bazel locally to reproduce and iterate on CI failures. They delivered a local Bazel experience that is on par with or superior to the existing developer experience and performance, allowing developers to continue using familiar tools and opt into Bazel when beneficial.

Transforming Location Retrieval at Airbnb: A Journey from Heuristics to Reinforcement Learning

Airbnb has transformed the way people travel around the globe, but providing guests with relevant options in their search results has become increasingly complex due to the diverse locations and property types in their inventory. To address this challenge, Airbnb shifted from using simple heuristics to advanced machine learning and reinforcement learning techniques to transform their location retrieval process. Initially, Airbnb relied on heuristics to define map areas based on the type of search, but these heuristics had limitations and couldn't differentiate between different types of searches or adapt well to new data. Airbnb then explored statistics to improve location retrieval by building a dataset for each travel destination that recorded where guests booked listings when searching for that destination. However, this statistical approach still had limitations and treated all searches for a location the same, regardless of specific search parameters. This led Airbnb to believe that location retrieval may require more advanced techniques such as machine learning. Airbnb constructed a machine learning model that could learn from various search parameters, such as the number of guests and stay duration, and predict more relevant map areas for each search. The machine learning system increased the recall of booked listings by 7.12% and reduced the size of the retrieval map area by 40.83%, resulting in a cumulative impact of +1.8% in uncancelled bookers on the platform. Airbnb then introduced reinforcement learning to the location retrieval process, allowing the system to continuously learn from guest interactions and adjust the retrieval map area based on guest booking behavior. The reinforcement learning system successfully explored more for less-traveled locations and explored less for locations that are often searched and booked, resulting in a cumulative 0.51% increase in uncanceled bookers and 0.71% increase in 5-star trip rate. 
Airbnb's journey from simple heuristics to sophisticated machine learning and reinforcement learning models demonstrates the power of data-driven approaches in transforming complex systems. Cumulatively, the transformation produced a 2.66% increase in uncanceled bookers, a major achievement for a company operating at Airbnb's scale.

Automation Platform v2: Improving Conversational AI at Airbnb

Airbnb's Automation Platform v2 is a conversational AI platform designed to support emerging large language model (LLM) applications. The platform allows developers to build LLM applications that enhance customer support efficiency and response times. It includes several key components such as Chain of Thought workflow, context management, and guardrails framework. The Chain of Thought workflow uses LLMs as reasoning engines to determine which tools to use and in which order. Context management ensures the LLM has access to necessary contextual information, while the guardrails framework monitors communications with the LLM to ensure it is helpful, relevant, and ethical. The platform is evolving to accommodate transformative technologies, expand Chain of Thought tool capabilities, and investigate LLM application simulation.

Sandcastle: data/AI apps for everyone

Airbnb developed a platform called Sandcastle to help data scientists and ML practitioners bring their data- and AI-powered product ideas to life in an interactive way. Sandcastle integrates with Onebrain, a packaging framework for data science and prototyping code, and kube-gen, a code-generation layer on top of Kubernetes. This platform allows developers to create and share interactive web applications within 10-15 minutes of checking in their code. Over 175 live prototypes were developed in the last year, with 6 of them being used for high-impact use cases. These prototypes were visited by over 3.5k unique internal visitors across over 69k distinct active days.

Riverbed Data Hydration — Part 1

Riverbed, part of Airbnb's tech stack, optimizes data consumption from system-of-record stores to update read-optimized stores. It uses a Lambda architecture with streaming and batch components. The streaming aspect focuses on constructing materialized views from change data capture (CDC) events. The Notification Pipeline consumes notification events and queries dependent data sources to build materialized views, which are then written to sink stores. The Join operation uses a DAG-like structure to efficiently join data sources, leveraging JoinConditionsDag for metadata and JoinResultsDag for storing results. The Stitch operation transforms joined results into a usable model, the StitchModel. Riverbed supports multiple sinks, including Apache Hive and Kafka, for flexibility. The streaming system efficiently updates materialized views from CDC events, enabling scalability, efficient data fetching, and enhanced filtering and search capabilities. The Source Pipeline, discussed in the next blog post, plays a crucial role in concurrency and versioning. By leveraging DAG-based data structures, Riverbed optimizes streaming data joins, reducing memory usage and improving efficiency.
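The streaming flow of notification, join, stitch, and sink write can be illustrated with a toy handler. Everything here is hypothetical: the source tables, the event shape, and `handle_cdc_event` are invented for the sketch, and the join is a plain lookup rather than Riverbed's DAG-based structures.

```python
def handle_cdc_event(event: dict, sources: dict, sink: dict) -> None:
    """Rebuild the materialized view row affected by one change event."""
    listing_id = event["listing_id"]
    listing = sources["listings"].get(listing_id)
    if listing is None:
        sink.pop(listing_id, None)  # source row deleted: drop the view row
        return
    # Join: fetch the dependent rows for this listing.
    reviews = [r for r in sources["reviews"] if r["listing_id"] == listing_id]
    # Stitch: shape the joined data into the read-optimized document.
    sink[listing_id] = {"title": listing["title"], "review_count": len(reviews)}
```

The event carries only the key of what changed; the handler re-queries the sources so the sink row always reflects their current state.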

Building Postcards for “Airbnb” Scale

Airbnb's Media team developed a postcard generation system for group travel bookings, leveraging a novel destination matching algorithm. The system required localized text layout, design flexibility, and high performance. To address localized text layout, a compromise was made to manually format translations for top booking destinations. A flexible template data model allowed for easy configuration of text positioning and color. To improve performance, an asynchronous postcard creation flow was implemented, minimizing latency and utilizing existing media serving infrastructure. A matching algorithm was developed to match postcards to destinations based on listing-specific artwork, popular destinations, taxonomy tags, and a default fallback. Pre-generation of postcards for top destinations minimized the use of generic postcards. The solution highlights the need for internal tooling, image and text processing capabilities, and destination matching logic. Postcards have been well-received and showcase the power of media capabilities in enhancing Airbnb's group travel experience.
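The matching order described above amounts to a fallback chain. A minimal sketch follows; the catalog layout and the name `match_postcard` are assumptions.

```python
def match_postcard(listing_id: str, destination: str, tags: list[str],
                   catalog: dict) -> str:
    """Pick the most specific postcard available, falling back step by step."""
    if listing_id in catalog["by_listing"]:
        return catalog["by_listing"][listing_id]  # listing-specific artwork
    if destination in catalog["by_destination"]:
        return catalog["by_destination"][destination]  # popular destination
    for tag in tags:
        if tag in catalog["by_tag"]:
            return catalog["by_tag"][tag]  # taxonomy tag match
    return catalog["default"]  # generic postcard as the last resort
```

Pre-generating entries for the most booked destinations shrinks the share of bookings that ever reach the final fallback.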

Personal Data Classification

Airbnb's data classification system identifies and protects personal data, ensuring trust and compliance. The system comprises three pillars: cataloging to locate data, detection to identify personal data, and reconciliation to verify classifications. Automated detection uses metadata, content, and machine learning to classify personal data. Human input confirms classifications to minimize false positives and facilitate resolution. Quality metrics assess recall, precision, and speed to ensure effectiveness. Challenges include post-processing classification, inconsistent classifications, and process costs. Airbnb advocates "shifting left" by integrating data classification into data schemas at creation to address these challenges. This approach empowers data owners to manage and annotate their data, leveraging lineage information for automated annotation and reducing manual effort. Airbnb's data classification system provides a comprehensive framework for organizations facing similar challenges, promoting data protection and compliance.
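
The recall and precision metrics mentioned above can be made concrete with a short calculation. This is a generic sketch of the standard definitions, not Airbnb's tooling; the column names are hypothetical, and returning 0.0 when a denominator is empty is a convention chosen here.

```python
from typing import Set, Tuple


def precision_recall(predicted: Set[str], actual: Set[str]) -> Tuple[float, float]:
    """Compare columns flagged as personal data against a verified ground truth."""
    true_positives = len(predicted & actual)
    # Precision: of everything flagged, how much really is personal data.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: of all real personal data, how much was flagged.
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical detector output and human-confirmed labels.
flagged = {"users.email", "users.phone", "orders.note"}
confirmed = {"users.email", "users.phone", "users.address"}

precision, recall = precision_recall(flagged, confirmed)
```

Two of the three flagged columns are confirmed, and two of the three confirmed columns were flagged, so both values come out to two thirds. Low precision translates into extra human review work, while low recall means personal data goes unprotected.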

Apache Flink® on Kubernetes

Airbnb's stream processing architecture evolved from Hadoop YARN to Kubernetes, with Flink replacing Spark Streaming as the primary platform. The transition involved eliminating Airflow as the job scheduler and introducing a lightweight streaming job scheduler. Moving to Kubernetes offered improved developer experience, enhanced monitoring, and streamlined service discovery. The current architecture involves five primary components: job configurations, image management, CI/CD, Flink portal, and Flink job runtime. Benefits of the Kubernetes-based architecture include faster developer velocity, improved job availability and latency, and cost savings. Future work focuses on enhancing job availability, enabling job autoscaling, and exploring the Flink Kubernetes Operator.

How Airbnb Smoothly Upgrades React

Airbnb's frontend has upgraded to React 18 using the React Upgrade System, which allows for incremental and testable upgrades without requiring a long-running feature branch. Module aliasing and environment targeting split React versions into separate build artifacts and runtime environments. TypeScript discrepancies were handled using shims, type augmentation, and progressive TypeScript error fixes. Comprehensive testing included visual regression, integration, and unit testing, with unit tests run under both React 16 and 18. Progressive rollout controlled traffic to both React 16 and 18 environments, allowing for internal testing and gradual surface upgrades. The system allowed for the testing of React 19 canary releases without pointing to React 18. Performance improvements were observed after adopting React 18 features like new root APIs and concurrent rendering. The system promotes continuous upgrade efforts, avoiding large, one-off changes. The React team's focus on backwards compatibility facilitated this upgrade approach. Airbnb's frontend is now running on React 19 beta, providing a head start for future React upgrades.
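
The summary does not say how traffic was divided between the React 16 and React 18 environments. One common way to implement a progressive rollout is deterministic hash bucketing, sketched below. The function name, the per-surface salt, and the environment labels are all assumptions made for illustration.

```python
import hashlib


def react_environment(user_id: str, surface: str, rollout_percent: float) -> str:
    """Assign a user to an environment, stably, for a given surface.

    rollout_percent is the share of traffic (0 to 100) sent to React 18.
    """
    # Hash the surface together with the user so that each surface
    # gets its own independent split of users.
    digest = hashlib.sha256(f"{surface}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return "react18" if bucket < rollout_percent * 100 else "react16"


# Hypothetical usage: ramp one surface to 25% of traffic.
assignments = [react_environment(f"user-{i}", "search", 25) for i in range(1000)]
share_on_18 = assignments.count("react18") / len(assignments)
```

Because the assignment depends only on the user, the surface, and the percentage, a given user sees the same React version on every request, and raising the percentage only ever moves users from React 16 to React 18.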

Rethinking Text Resizing on Web

Airbnb's efforts to improve web accessibility focused on ensuring that text remains legible when resized up to 200% (the WCAG Resize Text criterion). Browser zoom, while effective on desktop, proved challenging on mobile devices due to the limited viewport. Airbnb opted instead for font scaling, which adjusts text size independently of overall page zoom. To support font scaling, Airbnb chose rem units, which are relative to the root element's font size and therefore scale with it. This approach provided consistent text scaling without affecting other layout elements. To simplify the transition to rem units, Airbnb automated the conversion process and provided tools for designers to simulate font scaling during the design phase. Managing the conversion across two CSS-in-JS systems, react-with-styles and Linaria, presented additional challenges. Airbnb leveraged Linaria's custom property support and PostCSS plugins to convert most font-related properties. An escape hatch allowed developers to keep px units when necessary. These accessibility improvements significantly reduced the number of reported Resize Text issues and enhanced the experience for users with vision impairments. By prioritizing user needs and adopting best practices, Airbnb demonstrated its commitment to inclusive web design.
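
The automated px-to-rem conversion can be sketched in a few lines. Airbnb's actual conversion ran inside its CSS-in-JS tooling; the Python below only illustrates the arithmetic, and the set of converted properties, the 16px root size (the usual browser default), and the `keep_px` escape-hatch flag are assumptions made for this example.

```python
import re

ROOT_FONT_SIZE_PX = 16.0  # assumed root font size (common browser default)
FONT_PROPERTIES = {"font-size", "line-height"}  # assumed list, for illustration

_PX_VALUE = re.compile(r"(\d*\.?\d+)px\b")


def px_to_rem(value: str, root_px: float = ROOT_FONT_SIZE_PX) -> str:
    """Rewrite every px length in a CSS value as the equivalent rem length."""
    def replace(match):
        rem = float(match.group(1)) / root_px
        return f"{rem:g}rem"
    return _PX_VALUE.sub(replace, value)


def convert_declaration(prop: str, value: str, keep_px: bool = False) -> str:
    """Convert font-related declarations only; keep_px is the escape hatch."""
    if keep_px or prop not in FONT_PROPERTIES:
        return value
    return px_to_rem(value)


converted = {
    "font-size": convert_declaration("font-size", "18px"),
    "line-height": convert_declaration("line-height", "24px"),
    "margin": convert_declaration("margin", "8px"),
}
```

With a 16px root, `18px` becomes `1.125rem` and `24px` becomes `1.5rem`, while the margin stays in px so that spacing does not grow when a user increases their font size.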