Recently, I’ve been working on a side project where I use OpenAI’s text-embedding-ada-002 model to generate vector embeddings for text snippets. While this model is inexpensive, the cost can add up when dealing with thousands or millions of text snippets. Therefore, I decided to explore alternatives, particularly those that would allow me to run similar models locally instead of relying on OpenAI’s API. In this post, I’ll share my experience using the sentence-transformers library for this purpose and discuss the pros and cons.

Choosing the Right Model

I was mostly interested in lightweight models, since I plan to run this on my own laptop or a cheap VPS. After looking at the available models in sentence-transformers, I chose the all-MiniLM-L6-v2 model. It's a small BERT-based MiniLM model, fine-tuned on a dataset of over 1 billion training pairs. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search. A few other models I considered were all-mpnet-base-v2, e5-base-v2, and hkunlp/instructor-xl, but they are larger and heavier to run.

Text Length Matters

It's crucial to note that for Transformer models like BERT, runtime and memory requirements grow quadratically with input length, so there's a limit on how long an input these models can handle. Any text that exceeds the model's limit gets truncated to the first N word pieces.
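To see where the quadratic growth comes from: self-attention compares every token with every other token, so an n-token input produces an n-by-n score matrix per attention head. Here's a toy numpy sketch (the dimensions are illustrative, not the model's actual sizes):

```python
import numpy as np

def attention_scores(n_tokens, dim=32):
    """Illustrative self-attention score matrix for n_tokens inputs."""
    rng = np.random.default_rng(0)
    q = rng.random((n_tokens, dim))  # stand-in query vectors
    k = rng.random((n_tokens, dim))  # stand-in key vectors
    # Every token attends to every other token: an n x n matrix.
    return q @ k.T

# Doubling the input length quadruples the attention work:
print(attention_scores(128).size)  # 16384
print(attention_scores(256).size)  # 65536
```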

Using sentence-transformers

Once you've installed the library (pip install sentence-transformers), here's how to load a sentence-transformer model and adjust the maximum sequence length:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

model.max_seq_length = 256

Generating embeddings

Generating embeddings for sentences is quite straightforward:

# Our sentences we like to encode
sentences = ["this is the first sample sentence", "this is the second sample sentence"]

# Sentences are encoded by calling model.encode()
embeddings = model.encode(sentences, normalize_embeddings=True)

Setting normalize_embeddings to True ensures the returned vectors have unit length (an L2 norm of 1). This allows you to use the faster dot product instead of cosine similarity.
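The reason this works: cosine similarity is just the dot product divided by the two vectors' norms, so once the vectors are normalized to unit length, the two measures coincide. A quick numpy check with made-up vectors:

```python
import numpy as np

a = np.array([1.0, 2.0, 2.0])
b = np.array([2.0, 1.0, 2.0])

# Normalize to unit length, as normalize_embeddings=True does
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot = np.dot(a_n, b_n)

# For unit vectors, the dot product equals cosine similarity
print(np.isclose(cosine, dot))  # True
```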

Searching for Similar Sentences

Here’s how you can identify the most similar sentences to your query within a corpus:

import torch

# Corpus with example sentences
corpus = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'The girl is carrying a baby.',
          'A man is riding a horse.',
          'A woman is playing violin.',
          'Two men pushed carts through the woods.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'A cheetah is running behind its prey.']
corpus_embeddings = model.encode(corpus, normalize_embeddings=True)

query = "What is the man eating?"
query_embedding = model.encode(query, normalize_embeddings=True)

# Since the embeddings are normalized, we can use dot_score to find the highest 5 scores
dot_scores = util.dot_score(query_embedding, corpus_embeddings)[0]
top_results = torch.topk(dot_scores, k=5)

for score, idx in zip(top_results[0], top_results[1]):
    print(corpus[idx], "(Score: {:.4f})".format(score))

It will give you the following:

A man is eating food. (Score: 0.7354)
A man is eating a piece of bread. (Score: 0.6634)
A man is riding a horse. (Score: 0.2435)
A man is riding a white horse on an enclosed ground. (Score: 0.2230)
A cheetah is running behind its prey. (Score: 0.1960)

My Use Case and Observations

I tested this approach with my dataset of approximately 75,000 short text snippets, intending to identify the most relevant snippets for a given query. While generating embeddings took roughly 25 minutes on my MacBook Air — significantly slower than using OpenAI — I found this acceptable since I only needed to perform this process once. Another limitation of running this locally is the maximum input length of 256 tokens (for the all-MiniLM-L6-v2 model). However, this wasn't an issue for me as my text snippets were short.
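Since the embeddings only need to be generated once, it's worth persisting them to disk so later runs skip the slow encoding step. Here's a small caching sketch (the helper name and cache path are my own; with sentence-transformers, encode_fn could be something like lambda t: model.encode(t, normalize_embeddings=True, batch_size=64, show_progress_bar=True)):

```python
import numpy as np
from pathlib import Path

def embed_with_cache(texts, encode_fn, cache_path="embeddings.npy"):
    """Encode texts once and reuse the saved matrix on later runs."""
    path = Path(cache_path)
    if path.exists():
        # Skip the slow model call entirely on subsequent runs
        return np.load(path)
    embeddings = np.asarray(encode_fn(texts))
    np.save(path, embeddings)
    return embeddings
```

For 75,000 snippets, the saved 384-dimensional float32 matrix is only about 110 MB, so loading it back is near-instant compared to re-encoding.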

I'm also thinking about using this method to build a local LLM assistant that can search my private notes and documents on my laptop or my own server. This way, it's much cheaper than using a service like OpenAI, and I don't have to worry about the privacy of my data. Overall, I think that for many use cases where you have a smaller dataset and don't need real-time responses, running your own sentence-transformer model can be a viable option.
