The classification platform that powers the client’s contextual advertising business had been built on a fleet of ECS-based distributed services running custom Spark and ETL frameworks. As adoption grew, the architecture hit hard structural limits. Scaling for peak ad traffic required manual intervention, ML inference response times slowed under load, and reporting pipelines for ad exchange data became increasingly difficult to scale and maintain as volumes climbed.
These constraints surfaced at exactly the moment the business was running a company-wide infrastructure optimization effort, with compute-intensive workloads growing faster than infrastructure efficiency. Fragmented batch and streaming environments multiplied operational complexity, AI experimentation was throttled by the lack of a unified platform, and every new model or use case added more bespoke services to maintain. Without a structural reset, the platform would have continued to incur cost faster than it absorbed traffic, and the latency of contextual signals, the core product, would have degraded as scale grew.
Zimetrics framed the engagement as a unified AI platform reset. The architectural principle was to consolidate streaming ETL, batch processing, ML inference, and ad-hoc data science onto a single Spark-native lakehouse, while leaving genuinely latency-sensitive serving paths (real-time bidding lookups, ad decisioning) on the purpose-built systems already engineered for millisecond response.
Databricks was selected as the central data processing and lakehouse layer because it offered four capabilities the legacy stack could not deliver together:
• Managed Spark compute that removed the burden of self-managed clusters
• Unified batch and Structured Streaming on a common runtime
• Integrated MLflow for model tracking and registry
• Unity Catalog for governance across the data estate
Zimetrics approached it as an AI platform engineering problem, deliberately leaving DynamoDB, DAX, Redis, Kafka, and GPU inference clusters in place for workloads where they were already the right choice, and using Databricks for the data and ML workloads where it was the better fit.
The new end-to-end flow takes the shape: web pages → Kafka → Databricks Structured Streaming → ML classification → signal API and S3. Spark-based processing on Databricks replaces the fragmented ECS service mesh that previously orchestrated content extraction and classification, while Kafka and S3 integration preserves the existing event backbone the rest of the platform depends on.
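The flow above can be sketched in PySpark roughly as follows. This is a minimal illustration, not the production pipeline: the topic name, the storage paths, the Delta output format, the assumption that the Kafka key carries the page URL and the value carries the page text, and the keyword-based `classify_text` stand-in for the real classification models are all hypothetical choices made for the example. The signal API leg of the flow is not shown.

```python
from typing import Dict


def kafka_source_options(bootstrap_servers: str, topic: str) -> Dict[str, str]:
    """Options for Spark's built-in Kafka source."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "latest",
    }


def classify_text(text: str) -> str:
    """Hypothetical stand-in for the real ML classifiers (keyword rule only)."""
    lowered = (text or "").lower()
    if "football" in lowered or "league" in lowered:
        return "sports"
    return "uncategorized"


def start_signal_stream(spark, bootstrap_servers, topic, output_path, checkpoint_path):
    """Read pages from Kafka, classify them, and append the results to storage."""
    # Imported lazily so the helpers above can be used without a Spark install.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    classify_udf = F.udf(classify_text, StringType())

    pages = (
        spark.readStream.format("kafka")
        .options(**kafka_source_options(bootstrap_servers, topic))
        .load()
    )

    # Assumption: the Kafka key holds the page URL and the value holds its text.
    signals = pages.select(
        F.col("key").cast("string").alias("page_url"),
        F.col("value").cast("string").alias("page_text"),
    ).withColumn("category", classify_udf(F.col("page_text")))

    # The checkpoint location lets the query resume from its last committed offsets.
    return (
        signals.writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)
        .outputMode("append")
        .start(output_path)
    )
```

On a Databricks cluster this would be started with the notebook's existing `spark` session and an S3 location for both `output_path` and `checkpoint_path`. A row-at-a-time Python UDF is used here only for brevity; GPU-backed models would more plausibly be applied in batches.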
• GPU-Enabled Inference: Model inference for NLP and computer vision workloads (including IAB classification, threat detection, sentiment, keyword extraction, and image threat classification) runs on GPU-enabled Databricks clusters, replacing CPU-bound ECS services for the workloads where GPU economics win.
• Structured Streaming for Real-Time Signals: Contextual signal generation moved to Databricks Structured Streaming, allowing signals to be produced continuously rather than through bespoke micro-batch services.
• Operational Streaming Coverage: Streaming coverage now extends across four operational modules running in production on Databricks: Ad Events, Inventory, Real Time Bidding (RTB), and RTB User Sync Statistics. Together these handle the high-volume ad serving, inventory, real-time bidding, and user sync workloads that underpin the contextual advertising business.
• Databricks Workflows for Orchestration: Workflow orchestration was centralized on Databricks Workflows, replacing scattered scheduling and dependency logic spread across the ECS environment.
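To illustrate what centralized orchestration can look like, the sketch below builds a multi-task job definition in the shape the Databricks Jobs API expects (`tasks`, `task_key`, `depends_on`, `notebook_task`). The job name, task keys, and notebook paths are hypothetical, and the dependency check is an illustrative local safeguard rather than part of the API.

```python
from typing import Dict, List, Sequence


def notebook_task(task_key: str, notebook_path: str,
                  depends_on: Sequence[str] = ()) -> Dict:
    """Build one notebook task entry for a multi-task job definition."""
    task: Dict = {
        "task_key": task_key,
        "notebook_task": {"notebook_path": notebook_path},
    }
    if depends_on:
        task["depends_on"] = [{"task_key": key} for key in depends_on]
    return task


def build_job(name: str, tasks: List[Dict]) -> Dict:
    """Assemble the job payload, rejecting dependencies on undefined tasks."""
    known = {task["task_key"] for task in tasks}
    for task in tasks:
        for dep in task.get("depends_on", []):
            if dep["task_key"] not in known:
                raise ValueError(
                    f"{task['task_key']} depends on unknown task {dep['task_key']}"
                )
    return {"name": name, "tasks": tasks}


# Hypothetical three-step pipeline: extract, then classify, then report.
job = build_job(
    "contextual-signals-nightly",
    [
        notebook_task("extract_content", "/Pipelines/extract_content"),
        notebook_task("classify_pages", "/Pipelines/classify_pages",
                      depends_on=["extract_content"]),
        notebook_task("exchange_report", "/Pipelines/exchange_report",
                      depends_on=["classify_pages"]),
    ],
)
```

Declaring dependencies in one job definition is what replaces scheduling logic scattered across separate services: the ordering lives in a single reviewable artifact instead of in each service's own cron or trigger code.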
The platform layer was extended for the data science organization: MLflow handles model tracking, experiment management, and the model registry; notebooks support multi-language exploration in Python, Scala, and SQL on the same governed data; and the lakehouse pattern unifies the data substrate that contextual, NLP, computer vision, brand safety, and page classification models all draw from. The net result is that new models can be built, evaluated, and promoted on shared infrastructure instead of bespoke per-team stacks.
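As a rough sketch of the build, evaluate, and promote loop described above, the following uses MLflow's tracking and registry calls. The promotion rule (`should_promote`), the `f1` metric name, and the convention that the caller logs its model under the artifact path `model` are assumptions for illustration, not details taken from the engagement.

```python
from typing import Callable, Dict


def should_promote(candidate: Dict[str, float], baseline: Dict[str, float],
                   metric: str = "f1", min_gain: float = 0.0) -> bool:
    """Hypothetical gate: promote only if the candidate beats the baseline."""
    return candidate[metric] - baseline[metric] > min_gain


def track_and_maybe_register(experiment: str, registered_name: str,
                             params: Dict[str, str],
                             metrics: Dict[str, float],
                             baseline: Dict[str, float],
                             log_model: Callable[[], None]):
    """Log one run to MLflow and register the model if it clears the gate."""
    # Imported lazily so the gate above can be used without MLflow installed.
    import mlflow

    mlflow.set_experiment(experiment)
    with mlflow.start_run() as run:
        mlflow.log_params(params)
        mlflow.log_metrics(metrics)
        # Assumption: the caller logs the model under the artifact path "model".
        log_model()
        if should_promote(metrics, baseline):
            model_uri = f"runs:/{run.info.run_id}/model"
            return mlflow.register_model(model_uri, registered_name)
    return None
```

Keeping the gate as a plain function means the same promotion rule can be applied by every team that shares the registry, which is the point of moving off per-team stacks.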
Unity Catalog was introduced for data governance and metastore management across the organization, replacing fragmented metadata practices. Alongside the platform build, Zimetrics ran an explicit cost engineering workstream covering AWS Spot-based workloads for elastic compute, storage tiering, Savings Plans optimization, and GPU migration for inference. The hybrid posture was deliberate: DynamoDB with DAX continues to serve ultra-low-latency lookups in the 2-3 millisecond range, Redis and Kafka remain the operational substrate for serving and event handling, and Databricks is positioned for the processing, analytics, and ML workloads where Spark-native compute is the right tool.