In this article, we’ll learn how to do vector similarity search using Elasticsearch, with an example. Before jumping into the tutorial, let’s brush up on the basics of semantic search, vector similarity, and similarity search. You’re welcome to skip the intro and jump to the topics that interest you from the index below.
Elasticsearch is an open-source, distributed engine for search and analytics, built on Apache Lucene. It enables users to store, search, and analyze large volumes of data quickly and in near real time. Typically, Elasticsearch is utilized as an underlying technology that powers applications with complex search features and requirements.
Elasticsearch, along with its ecosystem of components known as the Elastic Stack, has found applications in a variety of areas, including simple search on websites or documents, collecting and analyzing log data, and data analysis and visualization.
An easy way to carry out similarity search would be to rank documents based on how many words they share with the query. But a document may be similar to the query even if they have very few words in common. This is why semantic search is important.
Semantic search is the kind of search often seen in modern search engines. It retrieves content by comprehending the intent and meaning of the user’s search query, which makes it far more advanced than traditional text and keyword matching. Traditional keyword search does not consider lexical variants or conceptual matches to the user’s search phrase; if the precise wording used in the query cannot be found in the content, irrelevant results are returned to the user.
This type of search is based on two concepts: the intent behind the user’s query and the contextual meaning of the query terms.
A question-answering system is a good example: given a collection of frequently asked questions, the search can find questions with the same meaning as the user’s new query and return the stored answers of those similar questions.
Vector similarity measures how similar two vectors in an inner product space are. A common measure is cosine similarity: the cosine of the angle between two vectors, which indicates whether they point in roughly the same direction. It is often used to measure document similarity in text analysis.
A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document. Thus, each document is represented by what is called a term-frequency vector: each element of the vector corresponds to a word, and its value is the number of times that word appears in the document. Vector similarity can then be computed between two such documents.
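To make this concrete, here is a minimal sketch of cosine similarity computed between two small term-frequency vectors (the toy vocabulary and counts are made up purely for illustration):

import numpy as np

# Toy vocabulary: ["covid", "end", "soon", "vaccine"]
# Term-frequency vectors for two short documents (illustrative counts)
doc_a = np.array([2, 1, 1, 0], dtype=float)
doc_b = np.array([1, 1, 0, 1], dtype=float)

# Cosine similarity = dot(a, b) / (||a|| * ||b||); it ranges from -1 to 1
cosine = np.dot(doc_a, doc_b) / (np.linalg.norm(doc_a) * np.linalg.norm(doc_b))
print(round(cosine, 3))  # closer to 1 means the documents share more terms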
Semantic search can be implemented through a variety of approaches. NLP specialists have developed a technique known as text embeddings: converting words and sentences into fixed-size, dense numeric vectors. This means that any kind of unstructured text can be converted to vectors that capture its contextual meaning, and these vectors can be used to find the similarity between the user query and the web pages. If the text embeddings of two texts are similar, the two texts are semantically similar. These vectors can be indexed in Elasticsearch to perform semantic similarity searches.
Text embeddings can be used to retrieve questions that are similar to a user’s query. This is done through the following process: generate an embedding for every question in the collection and index the questions along with their embeddings, convert the user’s query into an embedding with the same model, compute the vector similarity between the query embedding and the indexed question embeddings, and return the closest matches.
Let’s implement a simple question-answering search engine using Elasticsearch and a sentence embedding architecture.
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.1-linux-x86_64.tar.gz
tar -xzf elasticsearch-7.9.1-linux-x86_64.tar.gz
cd elasticsearch-7.9.1
./bin/elasticsearch
All done! Now let’s check whether the Elasticsearch node is up and running by sending an HTTP request to port 9200 (the default port of an Elasticsearch node).
curl -X GET "localhost:9200/?pretty"
The above command returns something similar to this:
{ "name" : "desktop-pc", "cluster_name" : "elasticsearch", "cluster_uuid" : "Qr8lzRyZQnuNlcmp4P_OPB", "version" : { "number" : "7.9.1", "build_flavor" : "default", "build_type" : "tar", "build_hash" : "f27399d", "build_date" : "2020-03-26T06:34:37.794943Z", "build_snapshot" : false, "lucene_version" : "8.6.2", "minimum_wire_compatibility_version" : "1.2.3", "minimum_index_compatibility_version" : "1.2.3" }, "tagline" : "You Know, for Search" }
Note: If you are using a non-Linux distribution, download the package compatible with your OS from the Elasticsearch Download Page.
For performing semantic search, we need to generate embeddings for our textual information. Since we are dealing with questions, we will use a sentence embedding model to generate the embeddings. Sentence embedding techniques represent whole sentences and their semantic information as vectors, which helps capture the context, intent, and other nuances of the text. Some state-of-the-art sentence embedding techniques include InferSent, Sentence-BERT, and the Universal Sentence Encoder.
We will be using the Universal Sentence Encoder for generating sentence embeddings. The universal encoder supports several downstream tasks and can thus be adopted for multi-task learning, i.e., the generated sentence embeddings can be used for multiple tasks like sentiment analysis, text classification, sentence similarity, etc.
The Universal Sentence Encoder is based on two encoders, the Transformer and the Deep Averaging Network (DAN). Multi-task learning is possible because both models are general-purpose.
The input sentence is tokenized according to the PTB (Penn Treebank) method and passed through one of these models.
The Transformer architecture was introduced by Google in 2017. It uses stacked blocks of multi-head self-attention to learn context-aware word representations.
The Deep Averaging Network (DAN) is a much simpler model in which the word embeddings of the input text are simply averaged and then fed through a feedforward neural network.
The Transformer encoder performs better but requires more resources to train. DAN does not match the Transformer’s accuracy, but its advantage is simplicity: it requires far fewer resources to train.
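Before wiring the encoder into Elasticsearch, it helps to see it in isolation. The sketch below loads the Universal Sentence Encoder from TF Hub (the same model URL used later in the project’s config) and embeds two example sentences; each sentence comes back as a 512-dimensional vector:

import numpy as np
import tensorflow_hub as hub

# Load the Universal Sentence Encoder from TF Hub
model = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# The model returns one 512-dimensional vector per input sentence
embeddings = np.asarray(model(["will covid end soon", "when will the pandemic be over"]))
print(embeddings.shape)  # (2, 512)

# Cosine similarity between the two sentence vectors
a, b = embeddings
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))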
git clone https://github.com/adarsh-ops/semantic-qa.git
cd semantic-qa
pip install -r requirements.txt
Indexing the dataset
The dataset used in this project is from COVID-Q, a dataset of 1,690 questions about COVID-19 from thirteen online sources. The dataset is annotated by classifying questions into 15 question categories and by grouping questions that ask the same thing into 207 question classes.
For creating a search space, let’s first index our dataset in the elasticsearch node.
Dataset Overview [download]
| Category | Question ID | Question | Source | Answers |
|---|---|---|---|---|
| Speculation - Pandemic Duration | 42 | will covid end soon | Google Search | May 1st, I think, is completely unrealistic |
| Speculation - Pandemic Duration | 42 | will covid end | Yahoo Search | May 1st, I think, is completely unrealistic |
| Speculation - Pandemic Duration | 42 | when covid will be over | Google Search | May 1st, I think, is completely unrealistic |
| Speculation - Pandemic Duration | 42 | when covid lockdown ends | Google Search | May 1st, I think, is completely unrealistic |
| Speculation - Pandemic Duration | 42 | will covid go away | Google Search | May 1st, I think, is completely unrealistic |
We will only be using the Question ID, Question and Answer fields of the dataset.
{ "mappings": { "properties": { "question": { "type": "text" }, "answer": { "type": "text" }, "question_vec": { "type": "dense_vector", "dims": 512 }, "q_id": { "type": "long" } } } }
a. Mappings – Defines the structure of the index
b. Properties – Defines the field/type relations
c. Fields –
- question (text): the raw question text
- answer (text): the stored answer for the question
- question_vec (dense_vector, 512 dims): the sentence embedding of the question generated by the Universal Sentence Encoder
- q_id (long): the question ID from the dataset
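The dump_qa.py script takes care of creating the index for you; purely for reference, a minimal sketch of creating it manually with the official elasticsearch-py 7.x client (the client object name here is just for illustration) could look like this:

from elasticsearch import Elasticsearch

es = Elasticsearch([{"host": "localhost", "port": 9200}])

mapping = {
    "mappings": {
        "properties": {
            "question": {"type": "text"},
            "answer": {"type": "text"},
            "question_vec": {"type": "dense_vector", "dims": 512},
            "q_id": {"type": "long"},
        }
    }
}

# Create the "covid-qa" index with the mapping defined above
es.indices.create(index="covid-qa", body=mapping, ignore=400)  # ignore "index already exists"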
# Universal Sentence Encoder TF Hub url
MODEL_URL = "https://tfhub.dev/google/universal-sentence-encoder/4"

# Elasticsearch ip and port
ELASTIC_IP = "localhost"
ELASTIC_PORT = 9200

# Min score for the match
SEARCH_THRESH = 1.2
Run the dump_qa.py file to index the dataset at data/COVID-QA.csv
python dump_qa.py
This creates an index named “covid-qa” in the elasticsearch node with the mapping defined in step 1.
Code Breakdown
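The snippets below are excerpts from dump_qa.py; they rely on imports along these lines (inferred from the calls shown, so treat them as assumptions rather than the exact file contents):

import numpy as np
import pandas as pd
import tensorflow_hub as hub
from tqdm import tqdm

import config  # MODEL_URL, ELASTIC_IP, ELASTIC_PORT, SEARCH_THRESH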
model = hub.load(config.MODEL_URL)
connect_elastic(config.ELASTIC_IP, config.ELASTIC_PORT)
df = pd.read_csv("data/COVID-QA.csv")
df.dropna(inplace=True, subset=["Answers", "Question"])
for _, row in tqdm(df.iterrows()):
    insert_qa({
        'question': row['Question'],
        'answer': row['Answers'],
        'question_vec': np.asarray(model([row['Question']])[0]).tolist(),
        'q_id': row['Question ID']
    })
Here, model([row['Question']]) generates a 512-dimensional embedding for the given question, and insert_qa indexes the resulting document into the covid-qa index.
With the search space created, all that is left to do is define the semantic search criteria. For this, let’s build an API that returns the top ‘n’ results for a given search query (question).
from flask import Flask, request

app = Flask(__name__)
app.config.from_object('config')
model = hub.load(app.config['MODEL_URL'])
connect_elastic(app.config['ELASTIC_IP'], app.config['ELASTIC_PORT'])
s_body = { "query": { "script_score": { "query": { "match_all": {} }, "script": { "source": "cosineSimilarity(params.query_vector, 'question_vec') + 1.0", "params": {"query_vector": query_vec} } } } }
To find the nearest ‘n’ records, the cosine similarity between the query vector and each indexed question vector is calculated. Cosine similarity ranges over [-1, 1], but Elasticsearch does not allow a script_score query to produce negative scores, so adding 1 to the result shifts the range to [0, 2].
result = es_conn.search(index="covid-qa", body=s_body)

for hit in result["hits"]["hits"]:
    print("--\nscore: {} \n question: {} \n answer: {}\n--".format(
        hit["_score"], hit["_source"]['question'], hit["_source"]['answer']))
@app.route("/query", methods=["GET"]) def qa(): if request.args.get("query"): query_vec = np.asarray(model([request.args.get("query")])[0]).tolist() records = semantic_search(query_vec, app.config['SEARCH_THRESH']) else: return {"error": "Couldn't process your request"}, 422 return {"data": records}
For performing the semantic vector match, we need to represent the raw text query as an embedding; model([request.args.get("query")]) generates a 512-dimensional embedding for the input query.
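The semantic_search helper called in the route above comes from the repository and is not reproduced here. As a rough sketch, it could combine the script_score query with the SEARCH_THRESH cutoff along these lines (the function body, the es_conn connection object, and the exact return shape are assumptions for illustration):

def semantic_search(query_vec, thresh):
    # Score every indexed question by cosine similarity to the query vector (+1.0 shift)
    s_body = {
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": "cosineSimilarity(params.query_vector, 'question_vec') + 1.0",
                    "params": {"query_vector": query_vec}
                }
            }
        }
    }
    result = es_conn.search(index="covid-qa", body=s_body)

    # Keep only hits whose shifted cosine score clears the threshold (1.2 by default)
    return [
        {
            "question": hit["_source"]["question"],
            "answer": hit["_source"]["answer"],
            "score": hit["_score"],
        }
        for hit in result["hits"]["hits"]
        if hit["_score"] >= thresh
    ]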
app.run(host="0.0.0.0", port=5000)
The server will be up and running on port 5000 of your machine.
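With the server running, you can query the endpoint directly, for example:

curl "localhost:5000/query?query=will%20covid%20end%20soon"

The response is a JSON object with a data list containing the matched questions, answers, and scores that cleared the SEARCH_THRESH cutoff; the exact contents depend on the data you indexed.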
So far, we’ve discussed semantic similarity, its applications, and implementation techniques, and built a simple QA engine using Elasticsearch and the Universal Sentence Encoder.
Semantic matching is helpful in applications like paraphrase identification, question answering, natural language generation, and intelligent tutoring systems. In short, if we want our system to be robust in terms of semantic reasoning, we need to add a semantic analysis technique to it. In some use cases, a hybrid approach may perform better. The QA engine we built earlier can be made more efficient by introducing text pre-processing techniques such as cleansing and stop-word removal.