50% – 70% Compute Cost Savings: EMR to Databricks Migration for a Global AdTech Enterprise

About the Client

A global AdTech enterprise operating a centralized ad data platform that powers ad auctions, real-time bidding (RTB), header bidding, and visitor identity workflows across digital advertising channels. The platform processes more than 20 million ad events per hour, with peak streaming throughput reaching 500,000 rows per second and large batch cycles processing approximately 2.3 TB of data per run. Its data ecosystem is built on AWS infrastructure (S3, MSK), Apache Airflow for orchestration, and Snowflake as the analytics serving layer. Downstream consumers span Looker dashboards, ad server systems, third-party data shares, and internal data science workloads. As ad event volumes climbed and platform demands grew, the foundational Spark compute layer became a structural concern.

Impact Delivered

50 to 70%
compute cost savings
Zero idle
infrastructure cost
Sub-15-minute
end-to-end SLA
500K
rows per second peak streaming
2.3 TB
per large batch run
15+
production pipelines unified on Databricks

Standing at a Turning Point: Compute as a Structural Constraint

The Cost and Brittleness of Always-On Spark

The platform’s Spark workloads originally ran on AWS EMR clusters orchestrated by Airflow. As volumes grew, the EMR operating model began to surface compounding limitations. Long-running clusters consumed compute capacity continuously, accumulating idle infrastructure spend during low-traffic hours. Cluster tuning required manual intervention at each scaling boundary, and long-lived nodes carried the risk of accumulating stale runtime state. Batch and streaming workloads lived in separate codebases, doubling engineering effort. Unified monitoring across pipelines was absent, slowing incident response. As workloads reached hundreds of GB to TB scale per run, these operational gaps translated directly into reduced predictability for downstream reporting SLAs.

The Risk of Staying the Course

Continuing on the existing EMR architecture meant accepting a compounding tradeoff: rising unit cost per insight, or sustained engineering toil to keep pace with growing volumes and new data sources. Each new pipeline required additional rework, slowing the platform team in supporting business demand. With the centralized ad data platform sitting at the core of revenue-generating workflows and downstream BI, a compute layer that grew more expensive and more brittle with scale was no longer acceptable. The organization needed a structural reset of its Spark compute model, not an incremental tuning exercise.

Solutioning

Zimetrics approached the engagement as a redesign of how Spark compute itself should operate at scale. The architectural principle adopted was ephemeral, job-scoped compute: clusters that exist only for the duration of work and shut down cleanly on completion. This removed idle cost as a structural problem. Databricks Job Clusters fit this model because they spin up per job, auto-scale workers based on data volume, and tear down on their own when the job finishes.

On top of that, the team layered three more design decisions. AWS Spot instances were used for most worker nodes, cutting compute spend by 50 to 70 percent, with automatic fallback to on-demand instances if Spot capacity was unavailable. Snowflake was kept as the analytics layer, keeping compute (Databricks) and reporting (Snowflake) cleanly separated. Airflow stayed as the orchestration backbone, with a standard pattern of resetting job configuration before each run and triggering Databricks through the Jobs API. A single PySpark and Scala codebase replaced the two separate codebases for batch and streaming, so the platform team no longer had to maintain two pipelines where one would do.

Engineering the Transformation

Raw ad events flow continuously from the ad server into Kafka topics on AWS MSK. The ad events topic is partitioned across 12 partitions to support parallel consumption. Kafka Connect connectors land Avro files in S3 approximately every five minutes. Airflow S3 sensors detect file arrival and trigger downstream Databricks jobs with the latest JAR version, eliminating manual intervention from the ingestion-to-compute handoff.

Each Spark workload runs on a short-lived Databricks Job Cluster created fresh per run and destroyed on completion. Spark applications are compiled as Scala JARs, versioned on S3, and deployed via the Databricks Jobs API. Notebook tasks (Python and Scala) are used for ML scoring, metadata exports, and bid price optimization. Cluster sizing is workload-aware: light hourly batches run on 3 to 5 workers, while heavy daily aggregations such as header bidding reporting scale to 60 workers. For RTB workloads, Structured Streaming processes 40-second micro-batches at up to 500,000 rows per second peak. AWS Spot instances cover most worker nodes, with automatic on-demand fallback on Spot interruption.

• JAR Tasks: Pre-compiled Scala/Spark code packaged as versioned JARs on S3. Rolling back is a single configuration change pointing to a previous JAR version, which simplifies incident recovery.
• Auto-Scale Workers: Clusters grow and shrink with data volume. A light hour runs on 3 workers; a heavy backfill scales to 60, with no manual tuning between the two.
• Structured Streaming: Continuous reads from Kafka in 40-second micro-batches power RTB, ad event, video, and page-level streams using the same codebase that handles batch workloads.

Processed data flows through a medallion pattern split across S3 and Snowflake. The Bronze layer sits in S3, holding raw Avro and Parquet files landed from Kafka and written by Databricks. Snowflake holds the Silver and Gold layers, loaded via BULK COPY: Silver applies deduplication and enrichment, while Gold holds attribution-ready datasets that feed downstream consumers including Looker dashboards, ad server systems, third-party data shares (such as Datorama and Instavails), and internal data science workloads.

Airflow remains the orchestration plane. Sensors validate upstream completeness, job configurations are reset before each run, and post-job Snowflake loads are triggered only on success. Versioned JARs enable one-config rollbacks to a previous build. The standardized execution pattern, combined with centralized logging and monitoring, gave the platform team a single operational interface across more than 15 production pipelines covering ad auctions, RTB, header bidding, visitor identity, and ML model scoring (CTR, viewability, and bid price optimization).

Future Outlook

With ephemeral Databricks compute now operating as the platform Spark layer, the organization is positioned to extend the architecture into adjacent workloads. The medallion-based data architecture and unified codebase reduce the marginal cost of onboarding new data sources or analytical workloads. Future evolution paths may include expanded ML scoring patterns built on the existing notebook-based bid price optimization and CTR models, and deeper observability across the integration landscape as additional downstream systems are onboarded.

Zimetrics Team Perspective

“Migrating Spark compute was an opportunity to redesign how cost, scale, and reliability behave together. With ephemeral clusters and a unified codebase, platform teams stop paying for capacity they are not using and stop maintaining two pipelines where one will do just fine.”

Related Stories

Let's Talk

To find out more about us, email
biz@zimetrics.com or complete the form below.