B2B Relevancy Improvements with Vector Search

Russell Proud
Co-Founder @ Decided.AI

Michael Cizmar
Managing Director @ MC+A

Introduction to Vectors and Search

Search has come a long way from the days of using keywords to identify the most relevant results. The most recent iteration is semantic search: extracting meaning from the content of documents (images and text) and from queries at search time to generate more relevant results. An added benefit is fewer zero-result searches. A search for "red running shoes with white stripes" that previously returned no results can, with semantic search, return similar products, e.g. a pair of blue running shoes with white stripes. This is also a great way to build "related product" lists.

Companies and search engineers can take a number of approaches to deliver semantic search across their search infrastructure, and with the explosion of ChatGPT and LLMs in general, more and more companies are looking at how to implement it.

In this article we explore and compare some of the methods available in Elasticsearch. Specifically, we look at ELSER [reference], Elastic's one-click semantic search function, and alternative approaches via Eland, deploying and using some of the most recent text transformer models from Hugging Face in Elastic.

The Importance of the Judgment Set in B2B Relevancy

Before we delve into the details and approaches, it is important to call out the value of having a query-and-result judgment set. Without one, measuring relevancy is a slow and tedious process that tends to produce anecdotal evidence rather than empirical results. The results in this article come from a real-world engagement with a client of ours for whom we explored semantic search methods in Elastic. This client is a large B2B provider of goods with an eCommerce interface their customers use to search for and buy goods. They provided a judgment set at the start of the engagement, and we used two tools to score relevancy:
  1. Elastic _rank_eval service [reference]
  2. Quepid [reference]
Rank eval is great, quick, and simple to use, though it is hard to visualize the results without building additional interfaces, which were out of scope for this undertaking. We opted for Quepid as a visualization tool that let the team see the results of each query iteration and approach. As will become evident throughout this article, a static judgment set based on past searches (as we had for this engagement) is great for a baseline and some improvements; but without constantly reviewing the result set from each approach and updating the judgment set, you end up relying on less empirical measurement approaches such as A/B testing and manual review.
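For concreteness, here is a minimal sketch of what a rank evaluation call looks like from Python, using nDCG@10 as the metric. The index name, document IDs, and ratings below are hypothetical placeholders, not our client's data:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local dev cluster

# One entry per judged query; ratings are the graded judgments
# (document IDs and grades below are illustrative placeholders).
response = es.rank_eval(
    index="products",  # hypothetical index name
    requests=[
        {
            "id": "ap_flour",
            "request": {"query": {"match": {"title": "ap flour"}}},
            "ratings": [
                {"_index": "products", "_id": "sku-123", "rating": 3},
                {"_index": "products", "_id": "sku-456", "rating": 1},
            ],
        }
    ],
    metric={"dcg": {"k": 10, "normalize": True}},  # normalized DCG = nDCG@10
)

print(response["metric_score"])          # aggregate score across all requests
print(response["details"]["ap_flour"])  # per-query breakdown
```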

Knowing Your Data

Implementing semantic search, whether via ELSER, Eland, or another method, requires you to make decisions early on that will impact the value and results that are returned. It is important that you understand which fields are relevant to the search terms. In the case of an eCommerce store, these will likely be:
  • Title
  • Description
  • Brand
  • Category Taxonomy
Why is it important to understand this? At ingestion time, you need to decide which fields' values are going to be converted to vectors, or to rank_features [reference] in the case of ELSER; your ingestion pipeline will contain that configuration. While we and our client had a very strong understanding of the important fields in their documents, we still experimented with combinations of these fields initially. We landed on the following fields:
  • Title
  • Description
  • Department (L0)
  • Category (L1)
  • Sub Category (L2)
Further, we experimented with concatenating these fields into a single field and vector, as sketched below.
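As a rough illustration of the concatenation approach, a single set processor in an ingest pipeline can stitch the chosen fields into one text field for a downstream embedding step to consume. Pipeline and field names here are hypothetical:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local dev cluster

# Hypothetical pipeline: concatenate the selected fields into a single
# "combined_text" field that the embedding model will later consume.
es.ingest.put_pipeline(
    id="concat-product-fields",
    processors=[
        {
            "set": {
                "field": "combined_text",
                "value": "{{title}} {{description}} {{department}} {{category}} {{sub_category}}",
            }
        }
    ],
)
```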

Baseline Results

Prior to any work being undertaken, the judgment set and the current lexical query were scored via the rank evaluation endpoint. This client's use case is B2B, which typically means a set of very knowledgeable users who are constantly 'active buyers'. That gave us a starting relevancy baseline that was strong overall, but unpredictable on queries that should have been easy. The results of the baseline query are below.

| Search Method | Rank Eval Score |
| --- | --- |
| Current lexical query | 0.6347583710026408 |
We will progressively add to this table as we move through this article.

One-Click Semantic Search?

With the release of Elasticsearch 8.8, Elastic added an out-of-the-box semantic search model called ELSER. ELSER is a text expansion model: it takes fields within a document and creates a list of learned words that are relevant based on the input fields, along with a numerical weight representing how relevant each learned word is. Elastic is very clear in explaining that these are not synonyms for your document keywords, but learned words that are related and connected to the source text.
[Figure: example text expansion results from ELSER]
At query time, ELSER passes the input query through the model and returns the most relevant documents based on the expanded query and document fields. Elastic refers to ELSER as one-click semantic search, in that you deploy the model, index the documents through the pipeline, and then search, and realistically it is. It's very well packaged and easy to set up on the cluster (assuming you have ML nodes). You can follow the tutorial from Elastic to get it up and running here.
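A pure ELSER query is a single text_expansion clause. The sketch below follows the shape of Elastic's 8.8 tutorial; the index name is hypothetical, and ml.tokens is the rank_features field populated by the ELSER ingest pipeline:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local dev cluster

# ELSER expands the query text into weighted tokens and matches them
# against the tokens written at ingest time.
response = es.search(
    index="products",  # hypothetical index name
    query={
        "text_expansion": {
            "ml.tokens": {                     # rank_features field from the pipeline
                "model_id": ".elser_model_1",  # ELSER model id as of 8.8
                "model_text": "ap flour",
            }
        }
    },
)
```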

How Does it Perform?

Elastic has published specific examples of the performance of ELSER vs. other search methods (lexical and semantic), which you can see here. In their cases, ELSER always increased relevancy, and in its purest form versus those other methods we agree it does, as our results will show. We compared the current lexical query to ELSER, then to ELSER combined with the current lexical query, and measured the results.
| Search Method | Rank Eval Score |
| --- | --- |
| Current lexical query | 0.6347583710026408 |
| ELSER | 0.3742842674545238 |
| ELSER combined with lexical | 0.6318659471766467 |
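The combined run blends the two approaches in a single bool query. A minimal sketch, reusing the client from the previous snippet; the boosts are illustrative only and would need tuning against your own judgment set:

```python
# Hedged sketch of a combined lexical + ELSER query; the match clause
# stands in for the client's (much richer) production lexical query.
response = es.search(
    index="products",  # hypothetical index name
    query={
        "bool": {
            "should": [
                {"match": {"title": {"query": "ap flour", "boost": 1.0}}},
                {
                    "text_expansion": {
                        "ml.tokens": {
                            "model_id": ".elser_model_1",
                            "model_text": "ap flour",
                            "boost": 0.5,  # illustrative weighting only
                        }
                    }
                },
            ]
        }
    },
)
```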

In our client's case, ELSER showed a large reduction in relevancy, though is this really the case? As we touched on, the judgment set is static; if products weren't in the prior result set of the lexical query, they won't add to the score even if they are highly relevant.

We used Quepid to execute the searches and manually inspected the results and re-scored some known low performing queries.

To give an example of one of these queries, consider the query "AP Flour". This term currently scores 0.51 using the nDCG@10 scoring method. When we manually reviewed and rescored for ELSER, this query's score improved to 0.81. Finally, the combined lexical and ELSER query scored 1.00, meaning the top 10 results were all highly relevant.

| Search Term | Search Method | Quepid Score |
| --- | --- | --- |
| ap flour | Lexical | 0.51 |
| ap flour | ELSER | 0.81 |
| ap flour | ELSER & lexical combined | 1.00 |
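As a refresher on the metric: nDCG@10 discounts each result's graded relevance by its rank position and normalizes by the best achievable ordering, so a score of 1.00 means the top 10 could not be ordered better. In the standard formulation (whose gain function matches Elastic's rank eval default):

$$
\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}, \qquad
\mathrm{nDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}
$$

where $rel_i$ is the judged relevance grade of the result at rank $i$ and IDCG@$k$ is the DCG of the ideal ranking.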
Taking another example, the search term "bowls". As is evident below, lexical significantly outperforms the ELSER-only query. Why would this be the case? ELSER is a text expansion model: it derives relationships and additional related words from the input data. Unlike "ap flour", which is shorthand for "all purpose flour", "bowls" is a generic term for which text expansion adds no additional value given our customer's catalog.
| Search Term | Search Method | Quepid Score |
| --- | --- | --- |
| ap flour | Lexical | 0.51 |
| ap flour | ELSER | 0.81 |
| ap flour | ELSER & lexical combined | 1.00 |
| bowls | Lexical | 0.75 |
| bowls | ELSER | 0.23 |
| bowls | ELSER & lexical combined | 0.77 |

Taking It to the Next Level: Eland and Transformers

As we've shown, ELSER is a strong out-of-the-box solution that can deliver a level of semantic search for Elastic customers, especially when combined with a lexical base query that performs adequately. It isn't the perfect solution, and in truth there isn't one; it is a combination of approaches that builds the most relevant search system.

Eland [reference] is Elastic's Python client for analyzing data and deploying custom machine learning models to Elasticsearch. As part of the work stream, we explored sentence transformer models from Hugging Face and analyzed the results. Based on prior work and understanding, we focused on two specific models for this analysis: all-minilm-l12-v2 [reference] and all-mpnet-base-v2 [reference]. Both are designed to convert text to vectors that can then be used for semantic search, classification, and clustering. MiniLM produces a 384-dimensional dense vector space, whereas MPNet produces a 768-dimensional one. In theory, a higher dimensional space allows for greater accuracy, though it comes at a cost in performance.

The process of deploying these models to Elastic and using them to search is quite simple and straightforward (a sketch follows the list). Specifically, you must:
  1. Import the model into your Elastic Cluster
  2. Deploy the model so it is available to use
  3. Create an index with the relevant knn fields and ensure you have set the model via the query_vector_builder parameters on those fields.
  4. Create an ingestion pipeline that passes the relevant fields through the model and outputs the vectors for storing alongside the document
  5. Index the documents via the pipeline, either by re-indexing from source or copying an existing index
  6. Modify or create a new query that searches the index
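A condensed sketch of those steps from Python is below. Index, pipeline, and field names are hypothetical; the model id is the one Eland derives from the Hugging Face name, and steps 1 and 2 use Eland's import CLI rather than the search client:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local dev cluster

# Steps 1-2: import and start the model with Eland's CLI (run in a shell):
#   eland_import_hub_model --url http://localhost:9200 \
#     --hub-model-id sentence-transformers/all-MiniLM-L12-v2 \
#     --task-type text_embedding --start
MODEL_ID = "sentence-transformers__all-minilm-l12-v2"  # id Eland assigns on import

# Step 3: index with a dense_vector field sized to the model (384 dims for MiniLM).
es.indices.create(
    index="products-vectors",  # hypothetical destination index
    mappings={
        "properties": {
            "combined_text": {"type": "text"},
            "text_embedding.predicted_value": {
                "type": "dense_vector",
                "dims": 384,
                "index": True,
                "similarity": "cosine",
            },
        }
    },
)

# Step 4: ingest pipeline that runs each document through the deployed model.
es.ingest.put_pipeline(
    id="minilm-embeddings",
    processors=[
        {
            "inference": {
                "model_id": MODEL_ID,
                "target_field": "text_embedding",
                "field_map": {"combined_text": "text_field"},
            }
        }
    ],
)

# Step 5: copy the existing index through the pipeline.
es.reindex(
    source={"index": "products"},  # hypothetical source index
    dest={"index": "products-vectors", "pipeline": "minilm-embeddings"},
    wait_for_completion=False,
)

# Step 6: kNN query that builds the query vector with the same model.
response = es.search(
    index="products-vectors",
    knn={
        "field": "text_embedding.predicted_value",
        "query_vector_builder": {
            "text_embedding": {"model_id": MODEL_ID, "model_text": "ap flour"},
        },
        "k": 10,
        "num_candidates": 100,
    },
)
```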

How Did They Perform?

We undertook testing similar to that of ELSER, experimenting with fields and concatenations of data to find the right combination for this use case. We then ran the same three-query comparison: the current lexical query, each model independently, and finally a combination of lexical and the model. Below we add these results to our running performance table.
| Search Method | Rank Eval Score |
| --- | --- |
| Current lexical query | 0.6347583710026408 |
| ELSER | 0.3742842674545238 |
| ELSER combined with lexical | 0.6318659471766467 |
| all-minilm-l12-v2 | 0.35326441721855384 |
| Combined lexical and all-minilm-l12-v2 | 0.6343283590195509 |
| all-mpnet-base-v2 | 0.3503362897757086 |
| Combined lexical and all-mpnet-base-v2 | 0.6342153512985733 |
Before we delve into discussing the results, what is very interesting at this point is that MiniLM and MPNet perform very closely in all combinations, while the computational cost of MPNet is far higher. For reference, indexing 23,032 documents with MiniLM took 8 minutes via the _reindex API; with MPNet it took 32 minutes. Search inference is far slower with MPNet as well. The computational cost and overhead did not match the increase in relevancy achieved, so we will not evaluate MPNet further in this article.

It's important to call out again that the judgment sets were static and therefore biased towards the lexical query. Using Quepid to visualize results and manually re-score some known problem queries, we explored the impact of both models. Continuing our table from above:
| Search Term | Search Method | Quepid Score |
| --- | --- | --- |
| ap flour | Lexical | 0.51 |
| ap flour | ELSER | 0.81 |
| ap flour | ELSER & lexical combined | 1.00 |
| ap flour | all-minilm-l12-v2 | 0.89 |
| ap flour | all-minilm-l12-v2 & lexical combined | 1.00 |
| bowls | Lexical | 0.75 |
| bowls | ELSER | 0.23 |
| bowls | ELSER & lexical combined | 0.77 |
| bowls | all-minilm-l12-v2 | 0.81 |
| bowls | all-minilm-l12-v2 & lexical combined | 0.92 |
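For completeness, the combined lexical-plus-model runs can be expressed as a single search carrying both a query and a knn section; in 8.x the final score is the boosted sum of the two. A minimal sketch continuing from the deployment snippet above, with illustrative boosts:

```python
# Hybrid search: final score = (lexical score * 0.9) + (kNN score * 0.1).
# The boosts are illustrative only and would need tuning per catalog.
response = es.search(
    index="products-vectors",  # hypothetical index from the deployment sketch
    query={"match": {"combined_text": {"query": "bowls", "boost": 0.9}}},
    knn={
        "field": "text_embedding.predicted_value",
        "query_vector_builder": {
            "text_embedding": {
                "model_id": "sentence-transformers__all-minilm-l12-v2",
                "model_text": "bowls",
            }
        },
        "k": 10,
        "num_candidates": 100,
        "boost": 0.1,
    },
)
```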

TL;DR Summary

What we’ve undertaken here shows us that:

  • Combining lexical search and a machine learning model yields results comparable to pure lexical when scored against a static judgment set
  • Anecdotally, all-minilm-l12-v2 outperforms both ELSER and a good lexical query, based on our manual re-scoring of some known low-performing terms
  • ELSER is simple to deploy and yields better results when combined with lexical, but has its limitations for our use case
  • all-minilm-l12-v2 is likely superior to ELSER

This undertaking was not aimed at providing a definitive answer for our client to follow; hence, time was not invested in rescoring every single term in the judgment set. It was intended to give our client a path forward that they could implement internally and then measure using their internal analytics.

At the end of the day, there is only so much you can do measuring results programmatically; the proof is in the pudding, as they say. With strong indications like those above (and more than we have shown as part of this work), the customer will take the work and implement an A/B test, then measure the performance of each search term via click-through rates, click position, conversions, zero-result search terms, and other internal metrics, and continually refine.

If you're looking to improve your B2B or eCommerce search, we have the experience, tools, and capabilities to deliver, independently or alongside your team, the process and systems needed to implement semantic search and improve your customer experience.

