Databricks

From Manual EC2 to Continuous Learning: ML Platform Modernization for a Global AdTech Leader

About the Client

The client is a leading Global AdTech Company operating at the intersection of contextual AI and digital advertising. The organization runs production-grade machine learning models that classify web content and detect unsafe material across the open web, powering brand-safe contextual targeting for advertisers, publishers, and ad networks.

Its AI capability spans three production model families:

• An NLP threat classifier for unsafe textual content
• An NLP IAB classifier that maps page content to standard IAB taxonomy categories
• A computer vision threat classifier that detects unsafe imagery in Open Graph images

These models are central to the company’s commercial offering and require continuous retraining to keep pace with evolving content patterns across the internet.

Its underlying ML infrastructure ran primarily on AWS. EC2-based compute powered model training, while experiment tracking was distributed across multiple disconnected tools.

Impact Delivered

• Eliminated manual infrastructure overhead
• Faster model iteration cycles
• Increased training frequency
• Scalable foundation for NLP and CV workloads
• Scalable handling of growing data volumes
• Improved collaboration and reproducibility

The Operational Drag on AI Innovation

When ML Training Infrastructure Becomes the Bottleneck

Each of the three production model families required regular retraining to stay accurate against changing content patterns on the web. Yet the platform that trained those models was anchored in manual workflows. Training jobs ran on standalone EC2 instances. Engineers handled environment setup, dependency installation, and compute provisioning by hand, and scaling compute meant manual reconfiguration. Experiment tracking, monitoring, and workflow orchestration were fragmented across multiple disconnected tools.

The Drag of Fragmented Infrastructure

The operational consequences were structural. Engineers were absorbed in infrastructure work rather than model improvement. Iteration cycles were long, collaboration across teams was limited because development environments were isolated, and there was no centralized view of experiments, model lineage, or job status. As training data volumes grew, the EC2-based setup became increasingly difficult to scale without proportional manual effort.

In a contextual AI business, the speed and reliability of model retraining are a direct competitive lever. Continuing on the existing platform would have meant slower adaptation to new content patterns on the web, growing platform fragility, and rising operational cost per training run, all at a time when the company was looking to expand model coverage and onboard new model types.

Solutioning

Design & Stack Decisions

Zimetrics positioned the engagement as building a governed ML platform that turns retraining into a continuous, repeatable enterprise capability. The architectural principle was deliberate: separate the work of running ML pipelines from the work of managing infrastructure, and make the entire training lifecycle observable, reproducible, and team-shareable by default.

Databricks was selected as the foundation for reasons specific to this engagement. Preconfigured ML runtimes eliminated the dependency and environment overhead that had consumed engineering time on EC2. Autoscaling GPU and CPU clusters allowed compute to flex with workload demand rather than being statically provisioned. Native notebook orchestration and Databricks Workflows enabled modular pipelines that could be authored, scheduled, and retried as governed enterprise jobs. The Databricks File System (DBFS) on top of Amazon S3 provided a single storage layer for raw data, annotations, model artifacts, and metrics, replacing fragmented storage across the legacy environment.

Designing an Active Continuous Learning Pipeline

We designed a closed-loop Active Learning pipeline that turns each production inference cycle into the source of the next training cycle. Model predictions identify the most informative samples, a monthly annotation step labels them, quarterly retraining incorporates them, and evaluation gates the updated model for deployment. This shifts the platform from scheduled retraining to continuous learning, where classification accuracy improves systematically with each cycle.
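The sample selection step can be illustrated with a minimal sketch. The case study does not state which selection criterion was used, so the example below assumes entropy-based uncertainty sampling, one common active learning strategy. The function names, sample ids, and probabilities are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def predictive_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_informative(predictions: Dict[str, Sequence[float]], budget: int) -> List[str]:
    """Return the ids of the `budget` most uncertain samples, most uncertain first."""
    ranked = sorted(
        predictions,
        key=lambda sample_id: predictive_entropy(predictions[sample_id]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical production predictions: sample id -> class probabilities.
predictions = {
    "page-a": [0.98, 0.01, 0.01],  # confident prediction, low entropy
    "page-b": [0.40, 0.35, 0.25],  # uncertain prediction
    "page-c": [0.70, 0.20, 0.10],
    "page-d": [0.34, 0.33, 0.33],  # nearly uniform, highest entropy
}

# Assumed annotation budget of two samples for this cycle.
to_annotate = select_informative(predictions, budget=2)
print(to_annotate)
```

In this toy run the two samples whose predictions are closest to uniform are sent for annotation, while the confidently classified page is left out. Other criteria, such as margin sampling or diversity-aware selection, would slot into the same place.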

Weights & Biases (W&B) was integrated for centralized experiment tracking and monitoring, aligning with the client’s existing data science team practices. Airflow was retained as the enterprise orchestration layer, with DAGs triggering Databricks notebooks and workflows from within existing enterprise pipelines, complete with retry and alerting. The result is an ML platform that fits cleanly into the rest of the client’s engineering operating model.

Engineering the Transformation

Execution was organized around a five-stage closed loop, with two operating cadences. The first three stages run monthly. The final two stages run quarterly. Together they form a continuous improvement cycle in which production traffic feeds the next generation of training data.

• Sample Selection: Model predictions identify informative new data points from incoming production data, prioritizing samples that are most likely to improve classifier performance.
• Annotation Platform Integration: Selected samples are routed for labeling, producing newly annotated datasets on a monthly schedule.
• Dataset Preparation (Train, Validation, Test): Newly annotated data is combined with existing labeled data, cleaned, pre-processed, and split into training, validation, and test sets.
• Model Training: The model is fine-tuned on the prepared dataset, with model artifacts and training metadata saved for reproducibility.
• Model Evaluation: The updated model is evaluated against validation and test sets, with metrics tracked and error analysis performed before the model is either promoted as ready for deployment or sent back for further iteration.
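The final stage acts as a gate between retraining and deployment. The promotion criteria are not described in this case study, so the sketch below assumes one plausible rule: the candidate must improve a primary metric by a minimum margin and must not regress any other tracked metric beyond a tolerance. The metric names, thresholds, and values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class GateDecision:
    promote: bool
    reason: str


def evaluation_gate(
    candidate: Dict[str, float],
    production: Dict[str, float],
    primary_metric: str = "macro_f1",  # assumed primary metric
    min_gain: float = 0.005,           # assumed minimum improvement
    max_regression: float = 0.01,      # assumed tolerance on other metrics
) -> GateDecision:
    """Compare a retrained model with the production model on held-out metrics."""
    gain = candidate[primary_metric] - production[primary_metric]
    if gain < min_gain:
        return GateDecision(False, f"{primary_metric} gain {gain:+.3f} is below {min_gain}")
    for name, production_value in production.items():
        if name == primary_metric:
            continue
        # A metric missing from the candidate is treated as 0.0, which blocks promotion.
        drop = production_value - candidate.get(name, 0.0)
        if drop > max_regression:
            return GateDecision(False, f"{name} regressed by {drop:.3f}")
    return GateDecision(True, f"{primary_metric} improved by {gain:+.3f}")


# Hypothetical test-set metrics.
production = {"macro_f1": 0.90, "recall_unsafe": 0.95}
blocked = evaluation_gate({"macro_f1": 0.92, "recall_unsafe": 0.90}, production)
promoted = evaluation_gate({"macro_f1": 0.92, "recall_unsafe": 0.96}, production)
print(blocked)
print(promoted)
```

The first candidate improves the primary metric but loses recall on the unsafe class, so it is held back for further iteration. The second clears both checks and is marked ready for deployment.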

The training platform was assembled largely from Databricks-native capabilities, each selected to remove a specific manual operation from the legacy setup.

• Databricks Workflows: Centralized automation for end-to-end ML pipelines, replacing scattered cron jobs and manual triggers.
• Notebook orchestration: Modular, sequenced notebooks that run as governed pipelines rather than one-off scripts, making workflows easier to manage and extend.
• GPU and CPU cluster autoscaling: Compute clusters provisioned and scaled automatically based on workload demand, eliminating manual EC2 sizing decisions.
• Scheduled ML jobs: Training and retraining executed automatically at predefined intervals without engineering intervention.
• Databricks Repos for Git-based version control: Code tracked, reviewed, and reproduced through standard Git workflows, supporting collaboration across the data science team.
• Airflow DAG integration: Existing enterprise Airflow DAGs trigger Databricks notebooks and workflows, with full retry and alerting embedded into the orchestration layer.
• DBFS on Amazon S3: A unified, secure storage layer for raw data, annotations, processed datasets, model artifacts, metrics, and logs. It acts as a single source of truth across clusters, users, and pipelines, with access control, auditing, and versioning inherited from S3.
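Several of the capabilities above (scheduled jobs, sequenced notebooks, autoscaling clusters, and retries) come together in a single job definition. The sketch below builds such a definition as a plain Python dictionary. The field names follow the Databricks Jobs API 2.1 job settings as commonly documented and should be checked against the current API reference. The job name, notebook paths, runtime version, instance type, worker counts, and cron schedule are all illustrative assumptions, not the client’s configuration.

```python
import json
from typing import Any, Dict, List

CLUSTER_KEY = "training_cluster"  # hypothetical shared job cluster key


def notebook_task(task_key: str, notebook_path: str, depends_on: List[str]) -> Dict[str, Any]:
    """One notebook step in the workflow, retried on failure."""
    return {
        "task_key": task_key,
        "depends_on": [{"task_key": key} for key in depends_on],
        "job_cluster_key": CLUSTER_KEY,
        "notebook_task": {"notebook_path": notebook_path},
        "max_retries": 2,                      # assumed retry policy
        "min_retry_interval_millis": 300_000,  # wait five minutes between retries
    }


def quarterly_retraining_job() -> Dict[str, Any]:
    """Job settings for the quarterly training and evaluation stages."""
    return {
        "name": "quarterly-retraining",  # hypothetical job name
        "schedule": {
            # Quartz cron: 02:00 on the first day of every third month.
            "quartz_cron_expression": "0 0 2 1 1/3 ?",
            "timezone_id": "UTC",
        },
        "job_clusters": [
            {
                "job_cluster_key": CLUSTER_KEY,
                "new_cluster": {
                    "spark_version": "14.3.x-gpu-ml-scala2.12",  # illustrative ML runtime
                    "node_type_id": "g4dn.xlarge",               # illustrative GPU instance
                    "autoscale": {"min_workers": 1, "max_workers": 8},
                },
            }
        ],
        "tasks": [
            # Notebook paths are hypothetical.
            notebook_task("model_training", "/Repos/ml-platform/pipelines/train", []),
            notebook_task(
                "model_evaluation",
                "/Repos/ml-platform/pipelines/evaluate",
                ["model_training"],
            ),
        ],
    }


payload = quarterly_retraining_job()
print(json.dumps(payload, indent=2))
```

The resulting JSON is the kind of payload that could be submitted when creating a job, or kept in version control alongside the notebooks. Expressing the autoscaling range in the job definition is what replaces a manual instance sizing decision for each run.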

Although Databricks offers MLflow natively, the engagement adopted Weights & Biases for experiment tracking and monitoring because it matched the client’s existing data science team practices. The choice was deliberate: it minimized change-management friction for data scientists while still consolidating experiments into a single, queryable system of record. MLflow remains a candidate future option as the platform matures.

Future Outlook

The platform is now positioned for several next-phase expansions. Advanced MLOps capabilities such as drift detection and automatic retraining beyond the scheduled cadence were not part of the initial build and can be added in a future phase. End-to-end deployment automation can be extended so that validated models flow from evaluation into production without manual approval steps. The annotation stage in the active learning pipeline, currently manual, is a strong target for partial automation through pre-labeling and weak supervision techniques.

The active learning architecture also generalizes well beyond the current three model families. The same five-stage pattern (sample selection, annotation, dataset preparation, training, evaluation) can support new model types as they enter production, with no platform rework required. MLflow, which Databricks provides natively, remains available as a future tracking option alongside W&B.
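One way to picture how the five-stage pattern extends to a new model family is a small registry in which each family supplies its own implementation of the same five named stages. This is a sketch under stated assumptions, not the platform’s actual code: the class, the placeholder stages, and the rule that every third monthly cycle closes a quarter are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Stage = Callable[[str], str]  # takes a model family name, returns a status message

MONTHLY = ("sample_selection", "annotation", "dataset_preparation")
QUARTERLY = ("model_training", "model_evaluation")


@dataclass
class FamilyPipeline:
    """The five stages for one model family, keyed by stage name."""

    family: str
    stages: Dict[str, Stage]

    def run_cycle(self, cycle: int) -> List[str]:
        """Run the stages due in a 1-based monthly cycle.

        Assumption: every third cycle also runs the quarterly stages.
        """
        due = MONTHLY + (QUARTERLY if cycle % 3 == 0 else ())
        return [self.stages[name](self.family) for name in due]


def placeholder(stage_name: str) -> Stage:
    """Stand-in for real stage logic; it only reports what would run."""
    return lambda family: f"{family}:{stage_name}"


def register_family(family: str) -> FamilyPipeline:
    """Onboard a new family by wiring it to the same five stage names."""
    return FamilyPipeline(family, {name: placeholder(name) for name in MONTHLY + QUARTERLY})


pipeline = register_family("new-model-family")
print(pipeline.run_cycle(1))  # monthly stages only
print(pipeline.run_cycle(3))  # monthly stages plus training and evaluation
```

Because the loop only depends on the stage names and the cadence, onboarding another model type means providing five stage implementations rather than changing the orchestration.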

Zimetrics Team Perspective

“With this project, the client moved from a modernized ML infrastructure towards building a foundation for continuous AI. A foundation where production traffic itself feeds the next training cycle, where new model families scale into the same governed platform, and where each iteration of the loop compounds into a more accurate model.”