For a platform whose commercial value rests entirely on connecting buyers with the right products at the right moment, search relevancy is not a feature. It is the product. When a user types a query and receives irrelevant results, every downstream metric suffers: click-through rates fall, advertiser cost-per-click performance declines, publisher fill rates weaken, and platform credibility erodes.
The client’s search engine relied on conventional Elasticsearch techniques including text matching, stop word handling, plural normalization, and synonym expansion. These approaches worked for precise queries but broke down when users expressed intent in categorical or descriptive terms. A search for “portable pet food container” would surface results from Home & Garden rather than Animals & Pet Supplies. The root cause was structural: without an intelligent category-awareness layer, the Elasticsearch query had no mechanism to determine which product domain a user was shopping in before constructing the retrieval logic.
The scale of the problem made it acute. An unfiltered Elasticsearch query against the full catalog returned upwards of 648,000 results for a single search term. Category match accuracy, measured against a curated set of test keywords, stood at 57% at the time Zimetrics began the engagement. Nearly half of all search results were landing in the wrong category bucket entirely.
Standard Elasticsearch tuning had reached its ceiling. Boosting rules, synonym libraries, and query weight adjustments could nudge relevancy at the margins but could not solve the underlying intent recognition gap. The platform needed a fundamentally different approach: one that could classify user intent before the retrieval engine ran, narrow the result space intelligently, and do so at production speed without introducing latency or instability into a live search environment.
Zimetrics reframed the challenge from a search configuration problem into a search intelligence problem. The team proposed introducing an AI-powered category classification layer that would sit upstream of the Elasticsearch query, predicting the most likely product categories for any given search term before the retrieval engine ran its matching logic.
The governing architectural principle was: understand intent first, then retrieve. By classifying a user’s search query into the correct position within the Google taxonomy hierarchy before querying Elasticsearch, the system could apply targeted boosting to surface results from contextually appropriate product buckets while suppressing irrelevant ones. Critically, this would reduce the effective result space from hundreds of thousands of documents to tens of thousands, improving both precision and response time simultaneously.
The solution was designed around a two-tier classification architecture. A primary BERT-based model would map any incoming search query to one of 72 root categories. A second layer of 37 subcategory-specific models would refine the prediction further within the matched primary category. The top predicted categories would be passed to the Elasticsearch query as category filters, with configurable boosting factors applied to signal result priority.
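As a sketch, the two-tier routing described above might look like the following. Here `primary_model` and `subcategory_model` are hypothetical stand-ins for the production BERT classifiers, and the category names and scores are illustrative:

```python
def primary_model(query: str) -> dict:
    # Placeholder for the Small BERT primary classifier (72 root categories).
    return {"Animals & Pet Supplies": 0.81, "Home & Garden": 0.12}

def subcategory_model(category: str, query: str) -> dict:
    # Placeholder for one of the 37 bert-mini subcategory classifiers.
    return {"Pet Containers & Carriers": 0.74}

def classify(query: str, top_n: int = 2) -> list:
    """Route a query through both tiers and return category predictions
    suitable for passing to Elasticsearch as boosted filters."""
    ranked = sorted(primary_model(query).items(), key=lambda kv: -kv[1])[:top_n]
    predictions = []
    for category, score in ranked:
        sub, sub_score = max(subcategory_model(category, query).items(),
                             key=lambda kv: kv[1])
        predictions.append({"category": category, "subcategory": sub,
                            "score": score, "sub_score": sub_score})
    return predictions
```

The key design point is that the second tier only runs within the primary category already matched, keeping per-query inference cost bounded.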
Two inference frameworks were evaluated for production deployment: TensorFlow and NVIDIA Triton. Both produced identical category match results at 75%, confirming that Triton could be adopted for its performance and scalability advantages without any sacrifice in model accuracy.
The design prioritized resilience from the outset. The classification service was implemented as an independent microservice called in parallel with the existing synonyms flow, ensuring zero disruption to the search pipeline. A production timeout threshold of 1,000 milliseconds was enforced so that if the classification service did not respond in time, the system would fall back gracefully to standard search behavior. Confidence thresholds ensured that low-confidence predictions would not pollute results. This governance-first approach distinguished the solution from a naive model deployment and made it safe to operate in a live production environment.
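The timeout-and-fallback behavior can be sketched with a hard deadline around the classifier call. `classify_query` is a placeholder for the HTTP call to the classification microservice, and the 1.0-second value mirrors the 1,000 ms production threshold:

```python
import concurrent.futures

CLASSIFIER_TIMEOUT_S = 1.0  # mirrors the 1,000 ms production threshold

def classify_query(query: str) -> list:
    # Placeholder for the HTTP call to the classification microservice.
    return [{"category": "Animals & Pet Supplies", "score": 0.81}]

def search_with_fallback(query: str) -> dict:
    """Call the classifier with a hard deadline; on timeout, fall back
    gracefully to standard search behavior with no category boosts."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(classify_query, query)
        try:
            predictions = future.result(timeout=CLASSIFIER_TIMEOUT_S)
        except concurrent.futures.TimeoutError:
            predictions = []  # graceful degradation: plain search, no boosts
    return {"query": query, "category_boosts": predictions}
```

In production the call would be made in parallel with the synonyms flow rather than inline, but the failure semantics are the same: a slow or unavailable classifier never blocks the search response.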
The classification system was built using BERT-based embeddings to capture semantic similarity between product titles and search queries. Large language model-generated queries were used during training to simulate diverse user intent patterns, ensuring the models could handle the natural variation and ambiguity present in real production search logs.
The full training pipeline covered the complete product catalog. A multi-class classifier was trained using Small BERT for the primary category model, with bert-mini models used for the 37 subcategory classifiers. The ONNX model format was adopted to optimize inference performance in production. The complete technology stack included TensorFlow, Keras, scikit-learn, the Hugging Face Transformers library, LangChain for prompt engineering, and ONNX runtime for deployment.
The result was a production system comprising 38 models in total: one primary category model covering 72 categories and 37 subcategory models providing hierarchical classification depth within each primary category.
A rigorous confidence threshold framework was implemented to ensure that only high-confidence category predictions influenced live search results. The default production confidence threshold was set at 0.65, with the testing environment configured at 0.6 to allow broader evaluation coverage. The maximum permitted threshold was capped at 0.8.
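A minimal sketch of the gating logic, assuming each prediction carries a `score` field (the function and field names are illustrative, not the production API):

```python
PROD_THRESHOLD = 0.65   # default production confidence threshold
TEST_THRESHOLD = 0.60   # broader evaluation coverage in testing
MAX_THRESHOLD = 0.80    # hard cap on any configured threshold

def apply_confidence_gate(predictions, threshold=PROD_THRESHOLD):
    """Keep only predictions confident enough to influence live results."""
    threshold = min(threshold, MAX_THRESHOLD)  # enforce the 0.8 cap
    return [p for p in predictions if p["score"] >= threshold]

preds = [{"category": "Animals & Pet Supplies", "score": 0.81},
         {"category": "Home & Garden", "score": 0.12}]
```

Under this scheme a prediction like the 0.12-confidence entry above is simply discarded, so an uncertain classifier degrades to standard search rather than steering results into the wrong bucket.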
The threshold calibration process followed a structured protocol. If, after adjustments, the new average confidence fell below the lower threshold of 0.5, the model was submitted for full re-training.
The Elasticsearch query was updated to include a category classification clause within the bool filter must structure. A configurable category clause template was introduced, enabling the predicted category name to be injected into the query at runtime using a prefix match approach. This architecture reduced the effective document pool from over 648,000 results for an unfiltered query to approximately 22,700 results when exact category filtering was applied, a reduction of over 96%.
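A hedged sketch of what the updated query body might look like, assuming a `category.keyword` field and a prefix clause carrying the boost; the exact field names and clause placement in the production mapping are not shown in the source:

```python
def build_query(search_term: str, predicted_category: str,
                category_boost: int = 600) -> dict:
    """Assemble an Elasticsearch query body that injects the predicted
    category at runtime as a boosted prefix clause."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"title": search_term}}],
                "should": [{
                    "prefix": {
                        "category.keyword": {   # field name is an assumption
                            "value": predicted_category,
                            "boost": category_boost,
                        }
                    }
                }],
            }
        }
    }
```

The prefix match is what allows a root-category prediction such as "Animals & Pet Supplies" to cover its entire subtree in the taxonomy hierarchy.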
Boosting factors were determined through systematic pre- and post-deployment testing across three configurations. The primary category boosting factor started at 600 and was increased iteratively by 200 until the target category match percentage exceeded 80%. The subcategory boosting factor was fixed at 200. All boosting values were made configurable via Redis to allow real-time tuning in production without requiring redeployment.
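The iterative calibration can be sketched as a simple loop. `measure_match_pct` stands in for a run of the category match test suite at a given boosting factor, the example scores are illustrative, and the ceiling value is a hypothetical safety stop:

```python
def calibrate_primary_boost(measure_match_pct, start=600, step=200,
                            target=80.0, ceiling=2000):
    """Raise the primary boosting factor in increments of `step` until
    the category match percentage clears `target`."""
    boost = start
    while measure_match_pct(boost) <= target and boost < ceiling:
        boost += step
    return boost

# Illustrative stand-in for runs of the category match test suite.
fake_scores = {600: 72.0, 800: 78.5, 1000: 81.2}
```

Because the chosen factor lives in Redis rather than in code, a recalibrated value can take effect in production immediately, without a redeployment.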
The query design also included a safety mechanism: if a custom category filter was already passed in the API request by a publisher, the category classification layer was automatically disabled to prevent conflicts with explicit client configurations.
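A one-line guard captures this behavior; the `category_filter` key is an illustrative name for the publisher-supplied filter field, not the actual API parameter:

```python
def should_classify(request: dict) -> bool:
    """Skip AI category classification when the publisher already passed
    an explicit category filter in the API request."""
    return "category_filter" not in request  # key name is illustrative
```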
The production rollout followed a structured phased strategy. The initial deployment used a Katapult key with category classification switched off, allowing the full test suite to run against the live environment before the feature was activated. A second deployment with category classification enabled was then validated through sanity testing, plural testing, stopword testing, regression testing, and the full 720-keyword category prediction test suite before being promoted. Load testing on the AI models informed the decision on whether server splitting was required before full activation.
This phased approach ensured that any issues could be identified and rolled back without customer impact, and that the production infrastructure was validated under realistic load conditions before the classification layer went live.
Alongside the AI model development, Zimetrics built a comprehensive automated testing infrastructure to support ongoing validation and reduce manual overhead.
This automation layer reduced daily performance validation effort by 80 to 90 percent, compressing what had previously taken several days of manual work into a process that completed in minutes. It also enabled multiple validation runs per day during the on-premises migration period, significantly accelerating the iteration cycle for model improvements.
“Zimetrics brought both technical depth and the right lens to the problem. Search relevancy is core to everything we do; they moved the needle in ways that made the product measurably more resilient and accurate.”