Migrating to Apache Iceberg

Adobe Experience Platform is an open system for driving real-time personalized experiences. Customers use it to centralize and standardize their data across the enterprise, resulting in a 360-degree view of their data of interest. That view can then be used with intelligent services to drive experiences across multiple devices, run targeted campaigns, classify profiles and other entities into segments, and leverage advanced analytics. At the center of our Data Lake architecture is the underlying storage. Data Lake relies on a Hadoop Distributed File System (HDFS) compatible backend for data storage, which today is Azure's cloud-based storage, Azure Data Lake Storage Gen2 (ADLS).

Figure 1: Adobe Experience Platform Architecture

The Problem

Adobe Experience Platform Catalog Service provides a way of listing, searching, and provisioning a DataSet, which is our equivalent of a Table in a relational database. It provides information such as name, description, schema, and permissions, along with all other metadata recorded on Adobe Experience Platform. As more data is ingested over time, it becomes difficult to query metadata from Catalog. With the introduction of Iceberg, we see a transitional shift in how metadata is captured and recorded: Iceberg records its metadata (manifests, manifest lists, and snapshots) on the Data Lake, co-located with the data files themselves. Adobe Experience Platform's single-tenant storage architecture exposes us to some interesting challenges when migrating customers to Iceberg. For each tenant, we can have one of three scenarios:

Figure 2: Adobe Experience Platform with Apache Iceberg

Here’s a snapshot of differently sized datasets across all the clients we’ve migrated.

Figure 3: Size and Count of production datasets migrated to Apache Iceberg

Motivations for Migration

Iceberg migration is the process of moving data from one table format to Iceberg or introducing Iceberg as a table format to the data. While this might seem pretty straightforward, it involves a change in storage and table format. Apart from the benefits that Apache Iceberg provides out-of-the-box, there were several factors that motivated us to migrate customers to Iceberg:

Customer Categorizations

When clients interact with Data Lake, they read and write data to DataSets based on their unique GUID and/or name. The metadata for datasets is stored in Catalog and is used to configure partitioning behavior, schema information, policy enforcement, and other behaviors within Adobe Experience Platform. We needed to devise a plan that not only catered to each customer's downtime and availability constraints but also considered their need for maintaining metadata in Catalog or data on ADLS. Each customer had a different degree of comfort with dropping data and/or metadata. Based on our evaluations, we bucketed customers into four categories:

Figure 4: Type of Migrations

Depending on total account activity and volume (size), we prioritized each bucket and devised a migration plan for each customer. For the purposes of this blog, we will focus on how we migrated customers (data and metadata) to Iceberg.

Migration Best Practices

Regardless of the type of data we are migrating, the goal of a migration strategy is to enhance performance and competitiveness. Less successful migrations can result in inaccurate data that contains redundancies or is corrupted in the process. This can happen even when the source data is fully usable and adherent to data policies. Furthermore, any issues that did exist in the source data can be amplified when it is ported over to Iceberg.

A complete data migration strategy prevents a subpar experience that ends up creating more problems than it solves. Aside from missing deadlines and exceeding budgets, incomplete plans can cause migration projects to fail altogether. While planning and strategizing the work, we set foundational criteria to govern the overall migration framework. A strategic data migration plan should include consideration of these critical factors:

Migration Strategies

There is more than one way to build a migration strategy, and most strategies fall into one of two categories: "big bang" or "trickle". In a big bang migration, the full dataset transfer is completed within a limited window of time, and data can be unavailable for the duration of the migration. In contrast, trickle migrations complete the migration process in phases. During implementation, the old system and the new one run in parallel, which eliminates downtime and operational interruptions. In the case of Iceberg migration, we had to be more creative about the migration model we chose because:

Based on our specific business needs and requirements, we explored two strategies to migrate customer datasets to Iceberg. Both strategies are tailored to our specific use case, selectively borrowing from the big bang and trickle migration approaches.

In-Place Upgrade

An in-place migration strategy is one where datasets are upgraded to the Iceberg table format without the need for data restatement or reprocessing. This implies data files are not modified during migration, and all Iceberg metadata (manifests, manifest lists, and snapshots) is generated outside of the purview of the data. We are essentially recreating all metadata in an isolated environment and co-locating it with the data files. This utility requires that the dataset being migrated be registered with Spark's session catalog.

Figure 5: Adobe Experience Platform Architecture with in-place upgrade
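As an illustration of what such an in-place upgrade can look like, here is a minimal sketch using Iceberg's Spark SQL extensions; the catalog wiring and the table name `db.events` are assumptions for the example, not our production configuration.

```scala
// Minimal sketch of an in-place upgrade using Iceberg's Spark SQL extensions.
// The catalog settings and the table name `db.events` are illustrative assumptions.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("iceberg-in-place-upgrade")
  // Iceberg SQL extensions expose the `migrate` procedure.
  .config("spark.sql.extensions",
          "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
  // Route Spark's session catalog through Iceberg so existing tables stay visible.
  .config("spark.sql.catalog.spark_catalog",
          "org.apache.iceberg.spark.SparkSessionCatalog")
  .config("spark.sql.catalog.spark_catalog.type", "hive")
  .getOrCreate()

// `migrate` leaves the existing data files untouched and generates Iceberg
// metadata (manifests, manifest lists, snapshots) that points at them.
spark.sql("CALL spark_catalog.system.migrate('db.events')")
```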

Pros

Cons

Shadow Migration

In the case of shadow migration, we followed a hydration model: we create a new dataset that shadows the source dataset batch by batch. Once the shadow has caught up, we flip a switch that swaps the shadow dataset with the source. Here's our migration architecture showcasing the critical pieces of the overall workflow.

Figure 6: Adobe Experience Platform Architecture with shadow migration

Migration Service

The Migration Service (MS) is a stateless, scalable, and tenant-agnostic migration engine. MS is designed for migrating data on the Data Lake that has been collected by touch-points enabled through Adobe's Digital Marketing Solutions. MS maintains a pool of migrant workers, and each worker is responsible for migrating an Experience Data Model (XDM) compliant dataset to Iceberg. Each worker maintains an army of dedicated helpers tasked with fanning out the migration workflow. For instance, we migrate multiple batches of a dataset simultaneously, and at the same time, MS can handle the migration of many such datasets belonging to a single client. A client's datasets on the Data Lake can be migrated by a single MS instance or by a pool of them. Let's go over a few terms used in describing the migration workflow:

Figure 7: Migration definitions

Here’s a step-by-step representation:

Figure 8: Shadow and Source workflow
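To make the fan-out concrete, here is a hypothetical sketch of the worker and helper pattern described above. `CatalogClient`, `Batch`, and `copyBatchToShadow` are simplified placeholders, not the actual Migration Service interfaces.

```scala
// Hypothetical sketch of the shadow-migration fan-out: one worker per dataset,
// helpers copying batches into the shadow in parallel, then a swap in Catalog.
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

case class Batch(id: String)

trait CatalogClient {
  def listBatches(datasetId: String): Seq[Batch]
  def createShadowDataset(sourceId: String): String          // returns the shadow dataset id
  def swapShadowWithSource(sourceId: String, shadowId: String): Unit
}

def copyBatchToShadow(shadowId: String, batch: Batch): Unit = {
  // Placeholder: re-ingest the batch's files into the shadow Iceberg dataset.
}

def migrateDataset(catalog: CatalogClient, sourceId: String)
                  (implicit ec: ExecutionContext): Unit = {
  val shadowId = catalog.createShadowDataset(sourceId)

  // Helpers fan out: multiple batches of the dataset are migrated simultaneously.
  val copies = catalog.listBatches(sourceId).map { batch =>
    Future(copyBatchToShadow(shadowId, batch))
  }
  Await.result(Future.sequence(copies), Duration.Inf)

  // Once the shadow has caught up, flip the switch: swap shadow and source.
  catalog.swapShadowWithSource(sourceId, shadowId)
}
```

Each worker runs `migrateDataset` for one dataset; its helpers copy batches concurrently, and the swap happens only after the shadow has fully caught up.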

Pros

Finally, data corruption and loss are highly unlikely because:

Cons

Choosing a Strategy

Weighing our options, we decided to choose Shadow Migration instead of the In-Place Migration strategy. Here’s why:

Figure 9: Migration strategy analysis

Lessons Learnt

Migrating datasets to Apache Iceberg with the Shadow Migration strategy definitely had its challenges. Here are a few pitfalls we hit.

Traffic Isolation and Scalability

The design of the Migration Service reused the existing ingestion architecture to execute its business logic. We repurposed the existing ingestion framework, its compute power and plumbing, to materialize this workflow. This meant there was contention for resources between ingestion generated by the migration process and customer-triggered ingestion traffic, either live data or a periodic backfill. We needed knobs to isolate migration-related traffic and to scale individual pieces of the workflow.
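One generic way to get such knobs in a shared Spark-based ingestion framework is to tag the two kinds of traffic into separate scheduler pools. The sketch below is a hedged illustration of that idea rather than the mechanism we built; the pool names and the allocation file path are assumptions.

```scala
// Hedged illustration only: isolating migration traffic from customer ingestion
// on a shared Spark cluster with FAIR scheduler pools. Pool names, weights, and
// the allocation file are assumptions, not our production setup.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ingestion-with-isolated-migration-traffic")
  .config("spark.scheduler.mode", "FAIR")
  // fairscheduler.xml would define e.g. an "ingestion" pool with a higher weight
  // and a "migration" pool with a capped share.
  .config("spark.scheduler.allocation.file", "/etc/spark/fairscheduler.xml")
  .getOrCreate()

// Customer-triggered ingestion jobs run in the high-priority pool...
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "ingestion")
// ingestJob(spark)

// ...while migration-driven writes are tagged into their own pool so they can
// be throttled or scaled independently.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "migration")
// migrateBatches(spark)
```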

Rewriting Iceberg Metadata

One of the critical design elements of the Shadow migration workflow was our ability to successfully regenerate Iceberg metadata for the Source table based on that of the Shadow. Depending on the nature of the data and our configurations for buffering writes, some tables had a large number of snapshots, and the execution of this logic was done on a single node of the cluster (the driver). This exposed us to two challenges:
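As a hypothetical sketch (not our actual implementation), regenerating the Source table's metadata from the Shadow could look like the loop below, written against the Iceberg core API with placeholder table locations. Every step, from walking snapshots to committing appends, runs on the Spark driver, which is exactly where a table with a large number of snapshots becomes painful.

```scala
// Hypothetical sketch of regenerating the Source table's Iceberg metadata from
// the Shadow table's snapshots. Table locations are placeholders. Everything
// here (walking snapshots, building appends, committing) runs on the driver.
import org.apache.iceberg.Table
import org.apache.iceberg.hadoop.HadoopTables
import scala.jdk.CollectionConverters._

val tables = new HadoopTables()
val shadow: Table = tables.load("adl://container/path/to/shadow")
val source: Table = tables.load("adl://container/path/to/source")

// Replay the shadow's snapshots in commit order, re-committing the data files
// each one added into the source table, one snapshot at a time.
shadow.snapshots().asScala.foreach { snapshot =>
  val append = source.newAppend()
  snapshot.addedDataFiles(shadow.io()).asScala.foreach(append.appendFile)
  append.commit() // single-threaded, driver-side commit per snapshot
}
```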

References

#apache-iceberg #azure #big-data #migration #query-latency