Agentic AI

39.5% search relevancy uplift delivered for a Global AdTech Platform

About the Client

The client is a US-based AdTech company and operates at the intersection of search technology and performance advertising, offering a proprietary white-label search engine platform that publishers and e-commerce websites can embed directly into their own properties.

At the core of the business is a search-as-a-service model. Advertisers rely on the platform to connect with high-intent users through keyword-driven targeting, while publishers use it to monetize organic traffic. The platform serves product discovery across multiple integrated retail storefronts, with a catalog spanning categories aligned to Google taxonomy.

The technology backbone is built on Elasticsearch for indexing and retrieval, Redis for caching and runtime configuration, and a Java-based backend running on GCP infrastructure. The platform integrates proprietary keyword intent algorithms and a dynamic pricing mechanism to optimize advertiser cost-per-click performance. Despite significant investment in platform capabilities, search relevancy had not kept pace with the complexity and scale of the product catalog, creating a widening gap between user intent and search output.

Impact Delivered

39.5%

overall relevancy improvement

80 to 90%

Reduction in daily manual performance validation effort

96.5%

reduction in the document retrieval pool per query

AI models deployed

15-47ms avg.

model response time

The Relevancy Gap

For a platform whose commercial value rests entirely on connecting buyers with the right products at the right moment, search relevancy is not a feature. It is the product. When a user types a query and receives irrelevant results, every downstream metric suffers: click-through rates fall, advertiser cost-per-click performance declines, publisher fill rates weaken, and platform credibility erodes.

The client’s search engine relied on conventional Elasticsearch techniques including text matching, stop word handling, plural normalization, and synonym expansion. These approaches worked for precise queries but broke down when users expressed intent in categorical or descriptive terms. A search for “portable pet food container” would surface results from Home and Garden rather than Animals and Pet Supplies. The root cause was structural: without an intelligent category-awareness layer, the Elasticsearch query had no mechanism to understand which product domain a user was shopping in before constructing the retrieval logic.

The scale of the problem made it acute. An unfiltered Elasticsearch query against the full catalog returned upwards of 648,000 results for a single search term. Category match accuracy, measured against a curated set of test keywords, stood at 57% at the time Zimetrics began the engagement. Nearly half of all search results were landing in the wrong category bucket entirely.

Standard Elasticsearch tuning had reached its ceiling. Boosting rules, synonym libraries, and query weight adjustments could nudge relevancy at the margins but could not solve the underlying intent recognition gap. The platform needed a fundamentally different approach: one that could classify user intent before the retrieval engine ran, narrow the result space intelligently, and do so at production speed without introducing latency or instability into a live search environment.

Solutioning

Zimetrics reframed the challenge from a search configuration problem into a search intelligence problem. The team proposed introducing an AI-powered category classification layer that would sit upstream of the Elasticsearch query, predicting the most likely product categories for any given search term before the retrieval engine ran its matching logic.

The governing architectural principle was: understand intent first, then retrieve. By classifying a user’s search query into the correct position within the Google taxonomy hierarchy before querying Elasticsearch, the system could apply targeted boosting to surface results from contextually appropriate product buckets while suppressing irrelevant ones. Critically, this would reduce the effective result space from hundreds of thousands of documents to tens of thousands, improving both precision and response time simultaneously.

The solution was designed around a two-tier classification architecture. A primary BERT-based model would map any incoming search query to one of 72 root categories. A second layer of 37 subcategory-specific models would refine the prediction further within the matched primary category. The top predicted categories would be passed to the Elasticsearch query as category filters, with configurable boosting factors applied to signal result priority.

Two inference frameworks were evaluated for production deployment: TensorFlow and NVIDIA Triton. Both produced identical category match results at 75%, confirming that Triton could be adopted for its performance and scalability advantages without any sacrifice in model accuracy.

The design prioritized resilience from the outset. The classification service was implemented as an independent microservice called in parallel with the existing synonyms flow, ensuring zero disruption to the search pipeline. A production timeout threshold of 1,000 milliseconds was enforced so that if the classification service did not respond in time, the system would fall back gracefully to standard search behavior. Confidence thresholds ensured that low-confidence predictions would not pollute results. This governance-first approach distinguished the solution from a naive model deployment and made it safe to operate in a live production environment.

Engineering the Transformation

The classification system was built using BERT-based embeddings to capture semantic similarity between product titles and search queries. Large language model-generated queries were used during training to simulate diverse user intent patterns, ensuring the models could handle the natural variation and ambiguity present in real production search logs.

The full training pipeline covered the complete product catalog. A multi-class classifier was trained using Small BERT for the primary category model, with bert-mini models used for the 37 subcategory classifiers. The ONNX model format was adopted to optimize inference performance in production. The complete technology stack included TensorFlow, Keras, scikit-learn, the Hugging Face Transformers library, LangChain for prompt engineering, and ONNX runtime for deployment.

The result was a production system comprising 38 models in total: one primary category model covering 72 categories and 37 subcategory models providing hierarchical classification depth within each primary category.

A rigorous confidence threshold framework was implemented to ensure that only high-confidence category predictions influenced live search results. The default production confidence threshold was set at 0.65, with the testing environment configured at 0.6 to allow broader evaluation coverage. The maximum permitted threshold was capped at 0.8.

The threshold calibration process followed a structured protocol:

10 representative keywords per category were identified and validated against Elasticsearch, with keywords only accepted when search data relevancy reached 80% or above
Predicted category probabilities were aggregated across all categories to calculate an overall model probability average
A model was approved for production only when 80% of its prediction probabilities met or exceeded the overall probability average
Models falling below this threshold were automatically flagged for re-training rather than deployed with degraded accuracy

If the new average fell below the lower threshold of 0.5 after adjustments, the model was submitted for full re-training

The Elasticsearch query was updated to include a category classification clause within the bool filter must structure. A configurable category clause template was introduced, enabling the predicted category name to be injected into the query at runtime using a prefix match approach. This architecture reduced the effective document pool from over 648,000 results for an unfiltered query to approximately 22,700 results when exact category filtering was applied, a reduction of over 96%.

Boosting factors were determined through systematic pre- and post-deployment testing across three configurations. The primary category boosting factor started at 600 and was increased iteratively by 200 until the target category match percentage exceeded 80%. The subcategory boosting factor was fixed at 200. All boosting values were made configurable via Redis to allow real-time tuning in production without requiring redeployment.

The query design also included a safety mechanism: if a custom category filter was already passed in the API request by a publisher, the category classification layer was automatically disabled to prevent conflicts with explicit client configurations.

The production rollout followed a structured phased strategy. The initial deployment used a Katapult key with category classification switched off, allowing the full test suite to run against the live environment before the feature was activated. A second deployment with category classification enabled was then validated through sanity testing, plural testing, stopword testing, regression testing, and the full 720-keyword category prediction test suite before being promoted. Load testing on the AI models informed the decision on whether server splitting was required before full activation.

This phased approach ensured that any issues could be identified and rolled back without customer impact, and that the production infrastructure was validated under realistic load conditions before the classification layer went live.

Alongside the AI model development, Zimetrics built a comprehensive automated testing infrastructure to support ongoing validation and reduce manual overhead. Key components included:

A Playwright-based regression suite covering 18 or more search parameters including typo tolerance, plural handling, multi-keyword queries, and stop word behavior
A relevancy scoring automation utility using approximately 200 real production search keywords rated against a 0-to-5 scoring matrix, enabling quantified relevancy benchmarking across every release
Automated JMeter and Python-based performance testing pipelines capable of simulating up to 3,040 concurrent model requests with automated result aggregation and root cause analysis tooling
A category classification evaluation pipeline that automatically fetched API responses, parsed JSON output, extracted keyword-level predictions, and compared predicted versus expected categories without manual intervention

This automation layer reduced daily performance validation effort by 80 to 90 percent, compressing what had previously taken several days of manual work into a process that completed in minutes. It also enabled multiple validation runs per day during the on-premises migration period, significantly accelerating the iteration cycle for model improvements.

Zimetrics brought both technical depth and the right lens to the problem. Search relevancy is core to everything we do; they moved the needle in ways that made the product measurably more resilient and accurate.

Future Outlook

With a production-validated AI classification foundation in place, the next phase of the search intelligence roadmap is already defined. The client can focus on updating the Elasticsearch query to apply the top two category predictions simultaneously as filters, extending the system’s ability to handle ambiguous queries that sit across category boundaries. Alongside this, boosting factors can be moved to Redis-managed runtime controls, removing the need for deployment cycles when tuning is required. JUnit test coverage across the classification logic can also be expanded as the system scales.

The longer-term trajectory points toward fully autonomous search orchestration, where intent classification moves from filtering results to anticipating them. The architectural decisions made in this engagement were designed with that evolution in mind.

Zimetrics Team Perspective

Production AI is not about building the best model. It is about building a model that is fast enough, accurate enough, and resilient enough to create meaningful impact.