Each of the three production model families required regular retraining to stay accurate against changing content patterns on the web. Yet the platform that trained those models was anchored in manual workflows. Training jobs ran on standalone EC2 instances. Engineers handled environment setup, dependency installation, and compute provisioning by hand. Scaling compute meant manual reconfiguration. Experiment tracking, monitoring, and workflow orchestration were fragmented across multiple disconnected tools.
The operational consequences were structural: engineers were absorbed in infrastructure work rather than model improvement. Iteration cycles were long, collaboration across teams was limited because development environments were isolated, and there was no centralized view of experiments, model lineage, or job status. As training data volumes grew, the EC2-based setup became increasingly difficult to scale without proportional manual effort.
In a contextual AI business, the speed and reliability of model retraining are a direct competitive lever. Continuing on the existing platform would have meant slower adaptation to new content patterns on the web, growing platform fragility, and rising operational cost per training run, all at a time when the company was looking to expand model coverage and onboard new model types.
Zimetrics positioned the engagement as building a governed ML platform that turns retraining into a continuous, repeatable enterprise capability. The architectural principle was deliberate: separate the work of running ML pipelines from the work of managing infrastructure, and make the entire training lifecycle observable, reproducible, and team-shareable by default.
Databricks was selected as the foundation for reasons specific to this engagement. Preconfigured ML runtimes eliminated the dependency and environment overhead that had consumed engineering time on EC2. Autoscaling GPU and CPU clusters allowed compute to flex with workload demand rather than being statically provisioned. Native notebook orchestration and Databricks Workflows enabled modular pipelines that could be authored, scheduled, and retried as governed enterprise jobs. The Databricks File System (DBFS) on top of Amazon S3 provided a single storage layer for raw data, annotations, model artifacts, and metrics, replacing fragmented storage across the legacy environment.
We designed a closed-loop Active Learning pipeline that turns each production inference cycle into the source of the next training cycle. Model predictions identify the most informative samples, a monthly annotation step labels them, quarterly retraining incorporates them, and evaluation gates the updated model for deployment. This shifts the platform from scheduled retraining to continuous learning, where classification accuracy improves systematically with each cycle.
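The sample-selection idea can be illustrated with a minimal sketch. The criterion used to rank samples is not stated here, so this example assumes entropy-based uncertainty sampling, one common Active Learning strategy; the function names, sample ids, and probabilities are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def predictive_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_for_annotation(
    predictions: Dict[str, Sequence[float]], budget: int
) -> List[str]:
    """Return the ids of the `budget` most uncertain samples, most uncertain first."""
    ranked = sorted(
        predictions,
        key=lambda sample_id: predictive_entropy(predictions[sample_id]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical class probabilities produced by the production classifier.
predictions = {
    "page-a": [0.98, 0.01, 0.01],  # confident prediction, little to learn
    "page-b": [0.40, 0.35, 0.25],  # near-uniform prediction, most informative
    "page-c": [0.70, 0.20, 0.10],  # moderately uncertain
}

# Assumed monthly annotation budget of two samples.
selected = select_for_annotation(predictions, budget=2)
```

In this sketch the annotation budget caps how many samples are sent for labeling each month, so the labeling effort goes to the predictions the model is least sure about.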
Weights and Biases (W&B) was integrated for centralized experiment tracking and monitoring, aligning with the client’s existing data science team practices. Airflow was retained as the enterprise orchestration layer, with DAGs triggering Databricks notebooks and workflows from within existing enterprise pipelines, complete with retry and alerting. The result is an ML platform that fits cleanly into the rest of the client’s engineering operating model.
Execution was organized around a five-stage closed loop, with two operating cadences. The first three stages run monthly. The final two stages run quarterly. Together they form a continuous improvement cycle in which production traffic feeds the next generation of training data.
• Sample Selection: Model predictions identify informative new data points from incoming production data, prioritizing samples that are most likely to improve classifier performance.
• Annotation Platform Integration: Selected samples are routed for labeling, producing newly annotated datasets on a monthly schedule.
• Dataset Preparation (Train, Validation, Test): Newly annotated data is combined with existing labeled data, cleaned, pre-processed, and split into training, validation, and test sets.
• Model Training: The model is fine-tuned on the prepared dataset, with model artifacts and training metadata saved for reproducibility.
• Model Evaluation: The updated model is evaluated against the validation and test sets, with metrics tracked and error analysis performed before the model is either promoted for deployment or sent back for further iteration.
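Two of the stages above lend themselves to a short sketch: dataset preparation and the evaluation gate. The example below is illustrative only and rests on assumptions not stated in this write-up: a hash-based split, so that a sample keeps the same split each time new annotations are merged in, and a promotion rule built on hypothetical metric names (`macro_f1`, `accuracy`) and thresholds.

```python
import hashlib
from typing import Dict


def assign_split(sample_id: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Map a sample id to 'train', 'val', or 'test' deterministically.

    Hashing the id keeps each sample in the same split across monthly merges,
    so held-out test samples never drift into the training set.
    """
    digest = hashlib.sha256(sample_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"


def should_promote(
    candidate: Dict[str, float],
    production: Dict[str, float],
    primary: str = "macro_f1",      # hypothetical primary metric
    min_gain: float = 0.005,        # assumed minimum improvement
    max_regression: float = 0.01,   # assumed tolerance on other metrics
) -> bool:
    """Gate: the primary metric must improve and no other metric may regress too far."""
    if candidate[primary] - production[primary] < min_gain:
        return False
    return all(
        production[name] - candidate[name] <= max_regression
        for name in production
        if name != primary
    )


# Hypothetical test-set metrics for the current and retrained models.
production_metrics = {"macro_f1": 0.82, "accuracy": 0.91}
candidate_metrics = {"macro_f1": 0.84, "accuracy": 0.905}
promote = should_promote(candidate_metrics, production_metrics)
```

A gate of this shape makes the "promote or iterate" decision explicit and repeatable; the actual metrics and thresholds would come from the team's own evaluation criteria.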
The training platform was assembled from Databricks-native capabilities, each selected to remove a specific manual operation from the legacy model.
• Databricks Workflows: Centralized automation for end-to-end ML pipelines, replacing scattered cron jobs and manual triggers.
• Notebook orchestration: Modular, sequenced notebooks that run as governed pipelines rather than one-off scripts, making workflows easier to manage and extend.
• GPU and CPU cluster autoscaling: Compute clusters provisioned and scaled automatically based on workload demand, eliminating manual EC2 sizing decisions.
• Scheduled ML jobs: Training and retraining executed automatically at predefined intervals without engineering intervention.
• Databricks Repos for Git-based version control: Code tracked, reviewed, and reproduced through standard Git workflows, supporting collaboration across the data science team.
• Airflow DAG integration: Existing enterprise Airflow DAGs trigger Databricks notebooks and workflows, with full retry and alerting embedded into the orchestration layer.
• DBFS on Amazon S3: A unified, secure storage layer for raw data, annotations, processed datasets, model artifacts, metrics, and logs. It serves as a single source of truth across clusters, users, and pipelines, with access control, auditing, and versioning inherited from S3.
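As a rough illustration of how the capabilities above fit together, the sketch below builds two job specifications, one per operating cadence, as plain Python dictionaries. The field names follow the Databricks Jobs API 2.1 job settings as we understand them; the notebook paths, runtime versions, node types, worker counts, retry counts, and cron times are placeholders, not values from the engagement.

```python
import json
from typing import Dict, List

# Stage names mirror the five-stage loop; the task keys themselves are hypothetical.
MONTHLY_STAGES = ["sample_selection", "annotation_sync", "dataset_preparation"]
QUARTERLY_STAGES = ["model_training", "model_evaluation"]


def build_job(
    name: str,
    stages: List[str],
    quartz_cron: str,
    spark_version: str,
    node_type: str,
) -> Dict:
    """Build a multi-task job spec in which each stage depends on the previous one."""
    tasks = []
    for index, stage in enumerate(stages):
        task = {
            "task_key": stage,
            # Placeholder path to a notebook checked out through Databricks Repos.
            "notebook_task": {"notebook_path": f"/Repos/ml-platform/pipelines/{stage}"},
            "job_cluster_key": "pipeline_cluster",
            "max_retries": 2,  # assumed retry policy
        }
        if index > 0:
            task["depends_on"] = [{"task_key": stages[index - 1]}]
        tasks.append(task)
    return {
        "name": name,
        "tasks": tasks,
        "job_clusters": [
            {
                "job_cluster_key": "pipeline_cluster",
                "new_cluster": {
                    "spark_version": spark_version,
                    "node_type_id": node_type,
                    # Autoscaling bounds are illustrative.
                    "autoscale": {"min_workers": 1, "max_workers": 8},
                },
            }
        ],
        "schedule": {
            "quartz_cron_expression": quartz_cron,
            "timezone_id": "UTC",
            "pause_status": "UNPAUSED",
        },
    }


# Monthly loop: 02:00 UTC on the first day of every month (placeholder time).
monthly_job = build_job(
    "active-learning-monthly", MONTHLY_STAGES, "0 0 2 1 * ?",
    spark_version="14.3.x-cpu-ml-scala2.12", node_type="m5.xlarge",
)
# Quarterly loop: 04:00 UTC on the first day of January, April, July, and October.
quarterly_job = build_job(
    "active-learning-quarterly", QUARTERLY_STAGES, "0 0 4 1 1,4,7,10 ?",
    spark_version="14.3.x-gpu-ml-scala2.12", node_type="g5.xlarge",
)
payload = json.dumps(monthly_job, indent=2)  # body that could be sent to the jobs/create endpoint
```

Where Airflow owns the schedule instead, the `schedule` block could be dropped and the job triggered from a DAG, for example with the Databricks provider's run-now operator; which of the two approaches the engagement used for each job is not specified here.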
Although Databricks offers MLflow natively, the engagement standardized on Weights and Biases for experiment tracking and monitoring, in line with the client’s existing data science team practices. The choice was deliberate: it minimized change-management friction for data scientists while still consolidating experiments into a single, queryable system of record. MLflow remains a candidate for future adoption as the platform matures.