How to build a Semantic Search Engine in Rust

Sacha Arbonel
6 min readNov 5, 2022

--

“A Crab Searching for books in a book shelf using a magnifying glass”

Introduction

A semantic search engine is a type of recommender system that relies on the meaning of words to provide better search results. It is different from a traditional full text search engine, which relies on keyword matching to provide results.

A semantic search engine allows you to search for concepts, not just keywords. It understands meaning and the relationships between different concepts and can provide more relevant results based on those relationships.

In this article, we will discuss how to build a semantic search engine in Rust. We’ll explain what are embeddings, transformers, and nearest neighbor searches and how to perform them using KD-trees.

What Are Embeddings?

Embedding is a term used in machine learning and statistics. Embeddings are mathematical representations of data that can be used for dimensionality reduction, similarity measurement, and many other tasks.

In natural language processing (NLP), text embedding is a mapping of a word or phrase into a vector of real numbers. The vector represents the word or phrase in a high-dimensional space.

The idea of word embeddings was first proposed in the early 2000s, but it wasn’t until 2013 that they became widely used with the release of the word2vec algorithm and then BERT in 2018.

What Is a Transformer?

Embeddings are used in NLP tasks such as machine translation, question answering, and text classification. They are also used in recommender systems and knowledge graphs.

A transformer is a type of neural network that can be used for NLP tasks such as machine translation and question answering. The transformer was first proposed in 2017 in the paper Attention Is All You Need.

Transformers are different from traditional recurrent neural networks (RNNs) in that they do not require sequential data. This makes them well suited for tasks like machine translation, where the input and output are not necessarily in the same order.

Transformers are also more parallelizable than RNNs, which makes them faster to train.

Sentence Transformers is a state-of-the-art model for text and image embeddings.

How to Use Sentence Transformers for Text Embeddings?

The rust-bert crate provides ready-to-use NLP pipelines and transformer-based models (BERT, DistilBERT, GPT2,…) in Rust. The crate also provides a Sentence Embeddings model.

In this article, we will use the rust-bert library to generate text embeddings. We will use the all-MiniLM-L12-v2 model, which is a pre-trained model that uses the MiniLM architecture.

For example, to get the text embedding for the sentence “this is an example sentence” and “each sentence is converted”:

Nearest Neighbor Search

A nearest neighbor search is a method for finding the closest datapoint to a given query point. The query point can be any datapoint, not just a point in the dataset.

The goal of a nearest neighbor search is to find the datapoint that is most similar to the query point. Similarity can be measured using any distance metric, such as the Euclidean distance or the cosine similarity.

What Is a KD-Tree?

A KD-tree is a type of data structure that can be used for efficient nearest neighbor searches. A KD-tree is a binary tree where each node is divided into two children based on a splitting criterion.

The splitting criterion can be any function that maps a datapoint to a real number. The most common splitting criterion is the Euclidean distance.

KD-trees are very efficient for nearest neighbor searches because they reduce the search space by half at each level of the tree.

For example, if we want to find the datapoint that is nearest to the point (3.1, 0.9, 2.1), we can use a KD-tree to reduce the search space from the entire dataset to just the datapoints that are closest to (3.1, 0.9, 2.1).

In Rust, we can use the kd-tree crate to build a KD-tree. We can use the build_by_ordered_float function to create a KD-tree from a list of points.

Implementing a Semantic Search Engine

In this section, we will discuss how to build a semantic search engine in Rust. We will use the Rust NLP crates rust-bert and kd-tree to perform similarity searches on books. In applications, it is a feature usually referred to as “Similar Books”.

Now we can write the code for our semantic search engine. We will start by loading our data stored in json, encode the embeddings and feed them to our kdtree.

Prerequisites

Having Rust installed

Depedencies

Let’s create our project called “semantic-search”

cargo new semantic-search

And add our dependencies we need

cd semantic-search 
cargo add serde --features derive
cargo add rust-bert
cargo add anyhow
cargo add kd-tree
cargo add typenum

Representing Our Books in Json

Create a file called books.json to store our json data.

Deserializing Our Json Data Into A Struct

Our program will load the json data into a struct. We can use the serde crate to deserialize the json data into our struct.

We can now read the json from a file using the std::fs module and deserialize the json into our Book struct.

Encoding Book Summaries Into Embeddings

We can now use our model to encode each of our book summaries into embeddings. We will use the rust-bert crate to perform the encoding. We'll also need a convenient method to convert our book model into an Embedded book model.

Finally, call encode for each book and add it to the embeddedbooks vector:

Creating the KD-tree

Before performing our search, first, we need to create a KD-tree from our embeddedbooks vector. We can do this using the sort_by function from the kd-tree crate.

We also need to implement the KdPoint trait for our EmbeddedBook struct. This trait allows us to use our EmbeddedBook struct with the kd-tree crate.

Now we can use the sort_by function to create a KD-tree from our embeddedbooks vector.

Performing the Nearest Neighbour Search

Now that we have encoded our books into embeddings, we can use a nearest neighbor search to find the book that is most similar to a given query.

We need a convenient topic method on EmbeddedBook to create a dummy book with authors, title etc to for topic embedding.

We can use the kd-tree crate’s nearests function to perform the search. This function takes a query point and a number of results to return.

In this case, we will use the query “rich” and return 10 results.

After running our program, this is what we are getting

Conclusion

In this article, we’ve discussed how to build a basic semantic search engine in Rust. We’ve explained what are embeddings, transformers, and nearest neighbor searches, and how to perform them using KD-trees.

We’ve also discussed how to use the rust-bert crate to generate text embeddings, and how to use the kd-tree crate to perform nearest neighbor searches.

You can find the source code in this repository.

Where to go from here? What can we improve?

Well, keeping those embeddings in memory and running the model everytime we run the program is not a very scalable solution, in a real world scenario. Also this kd-tree crate doesn’t support removing or updating nodes in our tree. You could try this rs-annoy project. Annoy is what is used at Spotify to store embeddings of songs. There is also this hora completely written in Rust.

You could serialize the embedded books using the bincode crate on the EmbeddedBook struct along with the serde-big-array create for the serialization on our embedding field. There is an example on how to do serialize structs into binary in this Rust by example guide, the struct just have to derive the Serialize trait from serde.

Or you could use the pgvector Postgres extension and the corresponding Rust crate to respectively store them on postgres and query them using their builtin operator <-> for performing neighbour search.

Instead of our cli, we can also serve our recommendations to end users by wrapping our engine using a REST api with axum for example.

--

--

Sacha Arbonel
Sacha Arbonel

Responses (2)