
How to conduct vector similarity search with Elasticsearch

by Adarsh M S on Thu Sep 17

In this article, we'll learn how to do vector similarity search using Elasticsearch, with a worked example. Before jumping into the tutorial, let's brush up our knowledge a bit and familiarise ourselves with the basics of semantic search, vector similarity, similarity search, etc. You're welcome to skip the intro and jump to the topics that interest you from the index below.

  1. What is Elasticsearch?
  2. What is Semantic search?
  3. What is vector similarity?
  4. Conducting semantic search
  5. Using embeddings for similarity search
  6. Tutorial: Implementing a QA system
  7. Setting up Elasticsearch
  8. Choosing an embedding model
  9. Setting up the environment
  10. Building a search API

What is Elasticsearch?

Elasticsearch is an open-source, distributed engine for search and analytics, built on Apache Lucene. It enables users to store, search, and analyze large volumes of data quickly and in near real time. Typically, Elasticsearch is utilized as an underlying technology that powers applications with complex search features and requirements. 

Elasticsearch, along with its ecosystem of components known as the Elastic Stack, has found applications in a variety of areas, including simple search on websites or documents, collecting and analyzing log data, and data analysis and visualization.

What is semantic search?

An easy way to carry out similarity search would be to rank documents based on how many words they share with the query. But a document may be similar to the query even if they have very few words in common. This is why semantic search is important. 

Semantic search is the kind of search often seen in modern search engines: it retrieves content after comprehending the intent and meaning of the user's search query. It is much more advanced than traditional text and keyword matching, which does not consider lexical variants or conceptual matches to the user's search phrase; if the precise wording used in the query cannot be found in the content, the wrong results may be returned to the user.

[Figure: semantic search]

This type of search is based on two concepts:

  • Search intent of the user: This refers to the intention behind the user's search, i.e., the reason behind the question. That reason could be anything from wanting to gain more knowledge to finding a particular item to purchase. By understanding the intent behind the query, search engines can retrieve the most accurate results for the users.
  • Relationship between the words in the search phrase: It is essential to decode the meaning of all the words in the search phrase together, rather than each word individually. This means understanding the relationship between those words, and thus displaying results that are conceptually similar to the user's query.


Use-cases of Semantic Similarity Search:

  • Question-answering system – Given a collection of frequently asked questions, the search can find questions with the same meaning as the user's new query and return the stored answers of those similar questions.
  • Image search – In a dataset of captioned images, it can find images whose caption is similar to the user's description.
  • Document content search – Allows searching through several documents to find one that matches the user's requirements.
  • Article search – In a collection of articles, it can return articles whose title is closely related to the user's query.

What is vector similarity?

Vector similarity measures how similar two vectors of an inner product space are. It is typically measured by the cosine of the angle between the two vectors, which determines whether they are pointing in roughly the same direction. It is often used to measure document similarity in text analysis.

A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document. Thus, each document is an object represented by what is called a term-frequency vector. Each element of the vector is associated with a word in the document and the value is the number of times that word is found in the document in question. The vector similarity is then computed between the two documents.
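To make this concrete, here is a minimal sketch in plain Python (with hypothetical word counts) that computes the cosine similarity between two term-frequency vectors:

import math

def cosine_similarity(a, b):
    # Cosine of the angle between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical term-frequency vectors over the vocabulary
# ["covid", "end", "soon", "vaccine"]
doc1 = [2, 1, 1, 0]
doc2 = [1, 1, 0, 3]

print(cosine_similarity(doc1, doc2))  # ~0.37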

Conducting semantic search

Semantic search can be implemented through a variety of approaches. A widely used technique from NLP is text embeddings: converting words and sentences into fixed-size, dense numeric vectors. This means that any kind of unstructured text can be converted to vectors. These vectors capture the contextual meaning of the text and can be used to find the similarity between the user query and the web pages. If the embeddings of two texts are similar, the two texts are semantically similar. These vectors can be indexed in Elasticsearch to perform semantic similarity searches.

Using embeddings for similarity search

Text embeddings can be used to retrieve questions that are similar to a user's query. This is done through the following process:

  • During indexing, each question is run through a sentence embedding model to produce a numeric vector.
  • When a user enters a query, it is run through the same sentence embedding model to produce a vector. To rank the responses, we calculate the vector similarity between each question and the query vector. When comparing embedding vectors, it is common to use cosine similarity.
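Putting the two steps together, here is a minimal sketch with numpy; embed() is a hypothetical stand-in for the sentence embedding model:

import numpy as np

def embed(text):
    # Stand-in for a real sentence embedding model; a pseudo-random
    # vector derived from the text keeps the example self-contained
    rng = np.random.default_rng(abs(hash(text)) % (2 ** 32))
    return rng.normal(size=512)

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Indexing time: embed every stored question
questions = ["will covid end soon", "how does covid spread"]
question_vecs = np.array([embed(q) for q in questions])

# Query time: embed the query with the same model, rank by cosine similarity
query_vec = embed("when will covid be over")
scores = normalize(question_vecs) @ normalize(query_vec)
print(questions[int(np.argmax(scores))])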

Tutorial: Implementing a QA system

Let's implement a simple question-answering search engine using Elasticsearch and a sentence embedding model.

Setting up Elasticsearch

  • Download and extract archive
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.1-linux-x86_64.tar.gz

tar -xzf elasticsearch-7.9.1-linux-x86_64.tar.gz
  • Launch elasticsearch server
cd elasticsearch-7.9.1
./bin/elasticsearch

All done! Now let's see if the Elasticsearch node is up and running by sending an HTTP request to port 9200 (the default port of an Elasticsearch node).

  • Run the following command from your terminal
curl -X GET "localhost:9200/?pretty"

The above command returns something similar to this:

{
  "name" : "desktop-pc",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "Qr8lzRyZQnuNlcmp4P_OPB",
  "version" : {
    "number" : "7.9.1",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "f27399d",
    "build_date" : "2020-03-26T06:34:37.794943Z",
    "build_snapshot" : false,
    "lucene_version" : "8.6.2",
    "minimum_wire_compatibility_version" : "1.2.3",
    "minimum_index_compatibility_version" : "1.2.3"
  },
  "tagline" : "You Know, for Search"
}

Note: If you are using a non-Linux distribution, download your OS-compatible package from the Elasticsearch Download Page.

Choosing an embedding model

For performing semantic search, we need to generate embeddings for our textual information. Since we are dealing with questions, we will be using a sentence embedding model to generate the embeddings. Sentence embedding techniques represent whole sentences and their semantic information as vectors. This helps in understanding the context, intent, and other nuances throughout the text. Some of the state-of-the-art sentence embedding techniques are:

  • Doc2Vec
  • SentenceBERT
  • InferSent
  • Universal Sentence Encoder

We will be using the Universal Sentence Encoder to generate sentence embeddings. The universal encoder supports several downstream tasks and thus can be adopted for multi-task learning, i.e., the generated sentence embeddings can be used for multiple tasks like sentiment analysis, text classification, and sentence similarity.
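As a quick sketch, generating embeddings with the TF Hub model looks like this (assuming the tensorflow and tensorflow_hub packages are installed):

import tensorflow_hub as hub

# Load Universal Sentence Encoder v4 from TF Hub
model = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

sentences = ["will covid end soon", "when will the pandemic be over"]
embeddings = model(sentences)

print(embeddings.shape)  # (2, 512): one 512-dimensional vector per sentence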

Architecture

The Universal Sentence Encoder comes in two variants, one based on the Transformer encoder and one on the Deep Averaging Network (DAN). Multi-task learning is possible because both models are general-purpose.

The input sentence is tokenized according to the PTB method (Penn Treebank) and passed through one of these models. 

  • Transformer

The Transformer architecture was developed by Google in 2017. It leverages multiple blocks of self-attention to learn context-aware word representations.

[Figure: Transformer architecture]

  • Deep Averaging Network (DAN)

The Deep Averaging Network (DAN) is a very simple model in which the word embeddings of the input text are averaged and then fed to a feed-forward neural network.

[Figure: Deep Averaging Network]

The Transformer variant performs better but requires more resources to train. DAN doesn't work quite as well, but its advantage is that it is a simple model requiring far fewer training resources.

Setting up the environment

  • Step 1 – Clone the project repo
git clone https://github.com/adarsh-ops/semantic-qa.git
cd semantic-qa
  • Step 2 – Install dependencies
pip install -r requirements.txt

Indexing the dataset

The dataset used in this project is from COVID-Q, a dataset of 1,690 questions about COVID-19 from thirteen online sources. The dataset is annotated by classifying questions into 15 question categories and by grouping questions that ask the same thing into 207 question classes.

To create a search space, let's first index our dataset in the Elasticsearch node.

Dataset Overview:

Category | Question ID | Question | Source | Answers
Speculation - Pandemic Duration | 42 | will covid end soon | Google Search | May 1st, I think, is completely unrealistic
Speculation - Pandemic Duration | 42 | will covid end | Yahoo Search | May 1st, I think, is completely unrealistic
Speculation - Pandemic Duration | 42 | when covid will be over | Google Search | May 1st, I think, is completely unrealistic
Speculation - Pandemic Duration | 42 | when covid lockdown ends | Google Search | May 1st, I think, is completely unrealistic
Speculation - Pandemic Duration | 42 | will covid go away | Google Search | May 1st, I think, is completely unrealistic

We will only be using the Question ID, Question, and Answers fields of the dataset.

  • Step 1 – Define the Elasticsearch index mapping
{
    "mappings": {
        "properties": {
            "question": {
                "type": "text"
            },
            "answer": {
                "type": "text"
            },
            "question_vec": {
                "type": "dense_vector",
                "dims": 512
            },
            "q_id": {
                "type": "long"
            }
        }
    }
}

a. Mappings – Defines the structure of the index

b. Properties – Defines the field/type relations

c. Fields –

    1. question – text field holding the questions in the dataset
    2. answer – text field holding the answer to the respective question
    3. question_vec – 512-dimensional vector representation (embedding) of the question
    4. q_id – id of type long representing the question id
  • Step 2 – Define the required configs in config.py
# Universal Sentence Encoder Tf Hub url
MODEL_URL = "https://tfhub.dev/google/universal-sentence-encoder/4"

# Elasticsearch ip and port
ELASTIC_IP = "localhost"
ELASTIC_PORT = 9200
# Min score for the match
SEARCH_THRESH = 1.2
  • Step 3 – Index the question vectors and answers

Run the dump_qa.py file to index the dataset at data/COVID-QA.csv

python dump_qa.py

This creates an index named “covid-qa” in the elasticsearch node with the mapping defined in step 1.
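
Index creation itself is handled inside dump_qa.py; for reference, a minimal sketch of creating an equivalent index with the official Python client (the mapping is copied from step 1) could look like this:

from elasticsearch import Elasticsearch

es = Elasticsearch([{"host": "localhost", "port": 9200}])

mapping = {
    "mappings": {
        "properties": {
            "question": {"type": "text"},
            "answer": {"type": "text"},
            "question_vec": {"type": "dense_vector", "dims": 512},
            "q_id": {"type": "long"}
        }
    }
}

es.indices.create(index="covid-qa", body=mapping)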

Code Breakdown
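
The fragments below are taken from dump_qa.py and assume imports along these lines (connect_elastic and insert_qa are helper functions defined in the repo):

import numpy as np
import pandas as pd
import tensorflow_hub as hub
from tqdm import tqdm

import config  # MODEL_URL, ELASTIC_IP, ELASTIC_PORT, SEARCH_THRESH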

  • Load the universal-sentence-encoder model
model = hub.load(config.MODEL_URL)
  • Connect to the elasticsearch node
connect_elastic(config.ELASTIC_IP, config.ELASTIC_PORT)
  • Read and clean the dataset
df = pd.read_csv("data/COVID-QA.csv")
df.dropna(inplace=True, subset=["Answers", "Question"])
  • Index each QA pair in the dataset, along with the Question ID and generated embedding, into the covid-qa index
for _, row in tqdm(df.iterrows()):
    insert_qa({
        'question': row['Question'],
        'answer': row['Answers'],
        'question_vec': np.asarray(model([row['Question']])[0]).tolist(),
        'q_id': row['Question ID']
    })

Here, model([row['Question']]) generates a 512-dimensional embedding for the given question.

Building a search API

With the search space created, all that is left to do is define the semantic search criteria. For this, let's build an API that returns the 'top_n' results for a given search query (question).

  • Step 1 – Define the API server and load configs
import numpy as np
import tensorflow_hub as hub
from flask import Flask, request

# connect_elastic and semantic_search are helpers defined in the repo

app = Flask(__name__)
app.config.from_object('config')
  • Step 2 – Load the universal-sentence-encoder and connect to es node
model = hub.load(app.config['MODEL_URL'])
connect_elastic(app.config['ELASTIC_IP'], app.config['ELASTIC_PORT'])
  • Step 3 – Define the semantic search criteria
s_body = {
        "query": {
            "script_score": {
                "query": {
                    "match_all": {}
                },
                "script": {
                    "source": "cosineSimilarity(params.query_vector, 'question_vec') + 1.0",
                    "params": {"query_vector": query_vec}
                }
            }
        }
    }

To estimate the nearest 'n' records, the cosine similarity between the query vector and each indexed question vector is calculated. Cosine similarity lies in the range [-1, 1]; since Elasticsearch requires scores to be non-negative, adding 1.0 to the score shifts the range to [0, 2].

result = es_conn.search(index="covid-qa", body=s_body)

for hit in result["hits"]["hits"]:
    print("--\nscore: {} \n question: {} \n answer: {}\n--".format(hit["_score"], hit["_source"]['question'], hit["_source"]['answer']))
  • Step 4 – Define the API to perform semantic search
@app.route("/query", methods=["GET"])
def qa():
    if request.args.get("query"):
        query_vec = np.asarray(model([request.args.get("query")])[0]).tolist()
        records = semantic_search(query_vec, app.config['SEARCH_THRESH'])
    else:
        return {"error": "Couldn't process your request"}, 422
    return {"data": records}

To perform the semantic vector match, we need to represent the raw text query as an embedding; model([request.args.get("query")]) generates a 512-dimensional embedding for the input query.

  • Step 5 – Run the API server
app.run(host="0.0.0.0", port=5000)

The server will be up and running on port 5000 of your machine.
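
As a quick check, you can hit the endpoint from another terminal (the query text is arbitrary):

curl "localhost:5000/query?query=will%20covid%20end%20soon"

The response is a JSON object of the form {"data": [...]} containing the matching question-answer records.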

So far, we've discussed semantic similarity, its applications, and implementation techniques, and we've built a simple QA engine using Elasticsearch and the Universal Sentence Encoder.


Semantic matching is helpful in applications like paraphrase identification, question answering, natural language generation, and intelligent tutoring systems. In short, if we want a system to be robust to variations in how a query is phrased, we need to add a semantic analysis technique to it. In some use cases, a hybrid approach may perform better. The QA engine we built earlier can be made more efficient by introducing text pre-processing techniques like cleansing and stop-word removal.

Author

Adarsh M S

Technology enthusiast with an urge to explore vast areas of advancing technologies. Experienced in domains like computer vision, natural language processing, and big data. Believes in open-source contributions and loves to provide support to the community. Actively involved in building open-source tools related to information retrieval.
