Enhancing Text-to-SQL With Synthetic Summaries

People are experimenting with LLMs for so many things today, and one of the more compelling use cases is getting their help to generate insights from data. What if you could find the answer to your data question without begging a data analyst in your company? But this is easier said than done. To perform this task properly, LLMs need to know about your datasets: the tables, their schemas, and the values stored in them. You can provide this information in the prompt itself if your dataset is tiny, but that isn't possible in most real-life scenarios, since the information will be huge and either won't fit in the LLM's context window or will be prohibitively expensive. ...
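
A minimal sketch of the general idea, not the post's implementation: keep a short, pre-generated summary per table and inject only those into the prompt instead of the full schema. The table names, summaries, question, and model name below are placeholders for illustration.

```python
# Sketch: prompt an LLM for SQL using compact per-table summaries instead of full schemas.
# Table names, summaries, and the model name are made up for illustration.
from openai import OpenAI

table_summaries = {
    "orders": "One row per order: order_id, customer_id, order_date, total_amount (USD).",
    "customers": "One row per customer: customer_id, signup_date, country, marketing_channel.",
}

question = "What was the average order value per country last month?"

prompt = (
    "You write SQLite-compatible SQL.\n\n"
    "Available tables:\n"
    + "\n".join(f"- {name}: {summary}" for name, summary in table_summaries.items())
    + f"\n\nQuestion: {question}\nReturn only the SQL query."
)

client = OpenAI()  # expects OPENAI_API_KEY in the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```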

2025-03-18 · 2 min

Label-Studio: Annotate Text and Image Data for AI and ML training

A few months ago I used Streamlit to build a simple UI so I could collect manually labeled data for an LLM fine-tuning task at work. Streamlit is fine, but the full process of building a nice UI with the required functionality for data annotation and data storage management wasn't trivial. Today I found out about label-studio, an easy-to-use framework (backend and frontend) for data annotation tasks. It provides various annotation templates for text, image, audio, and video data! ...
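
For a feel of how it can be scripted, here is a rough sketch using the label-studio Python SDK client against a locally running server; the URL, API key, project title, labels, and example task are placeholders, not anything from the post.

```python
# Sketch: create a Label Studio project and import a task via the legacy
# label_studio_sdk Client API (assumes a server at localhost:8080).
from label_studio_sdk import Client

ls = Client(url="http://localhost:8080", api_key="YOUR_API_KEY")
ls.check_connection()

# A simple single-choice text classification template.
label_config = """
<View>
  <Text name="text" value="$text"/>
  <Choices name="label" toName="text" choice="single">
    <Choice value="positive"/>
    <Choice value="negative"/>
  </Choices>
</View>
"""

project = ls.start_project(title="Text labeling demo", label_config=label_config)
project.import_tasks([{"text": "The new release fixed the login bug."}])
```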

2024-12-19 · 2 min

Understanding Input Masking in LLM Finetuning

I’ve been using conversational alpaca or sharegpt formats for fine-tuning LLMs with Axolotl, but it always felt unnecessary to limit the model to a conversational format when the use case doesn’t require it. I’m currently working on a project to classify pull requests in my company’s code repositories. The model needs to look at the PR title, description, and code changes, then categorize the PR and explain its reasoning. I figured there must be a way to fine-tune these models with whatever format fits this specific use case, and sure enough there is: Template-free Axolotl ...
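
As I understand Axolotl's template-free `input_output` dataset format, each row is a list of segments and segments with `label: false` are masked out of the loss. A small sketch of preparing such a JSONL follows; the PR fields and prompt wording are made up, not the post's actual data.

```python
# Sketch: build a JSONL for Axolotl's template-free "input_output" format.
# Segments with label=False are masked (no loss); label=True segments are trained on.
import json

examples = [
    {
        "title": "Fix flaky login test",
        "description": "Retries the auth request in CI.",
        "diff": "--- a/tests/test_login.py\n+++ b/tests/test_login.py\n...",
        "category": "bugfix",
        "reasoning": "The change only touches a test and adds a retry.",
    }
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        prompt = (
            "Classify this pull request.\n"
            f"Title: {ex['title']}\nDescription: {ex['description']}\nDiff:\n{ex['diff']}\n"
        )
        answer = f"Category: {ex['category']}\nReasoning: {ex['reasoning']}"
        row = {
            "segments": [
                {"label": False, "text": prompt},  # input: masked from the loss
                {"label": True, "text": answer},   # output: the part the model learns
            ]
        }
        f.write(json.dumps(row) + "\n")
```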

2024-06-29 · 5 min · Saeed Esmaili

To Chunk or Not to Chunk With the Long Context Single Embedding Models

In his excellent write-up on state-of-the-art embedding models, Aapo Tanskanen compares retrieval scores when the source documents are split into chunks and when they’re not: Transformer-based single embedding models have traditionally had 512 token context windows because of the usage of the original BERT encoder. Newer models, like the BGE-M3, have expanded the token window to much larger scales. It could be tempting to forget chunking and just embed long texts as they are. However, that would mean mashing many topics and entities you might want to search for into a single vector representation, which doesn’t sound like a good idea. ...
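
A toy illustration of the trade-off, not taken from the article: embed a long document once versus once per paragraph-sized chunk, then score both against a query. It assumes BGE-M3 loads through sentence-transformers; the file path, query, and naive chunking rule are placeholders.

```python
# Sketch: whole-document embedding vs. per-chunk embeddings, scored against a query.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-m3")  # long-context single embedding model

document = open("report.txt").read()                        # placeholder long document
chunks = [c for c in document.split("\n\n") if c.strip()]   # naive paragraph chunks

query_emb = model.encode("refund policy for late deliveries")
doc_emb = model.encode(document)     # one vector for the whole document
chunk_embs = model.encode(chunks)    # one vector per chunk

print("whole-document score:", util.cos_sim(query_emb, doc_emb).item())
print("best chunk score:    ", util.cos_sim(query_emb, chunk_embs).max().item())
```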

2024-06-02 · 2 min

Running Python on a serverless GPU instance for machine learning inference

I was experimenting with some speech-to-text work using OpenAI’s Whisper models today, and transcribing a 15-minute audio file with the Whisper tiny model on AWS Lambda (3 vCPUs) took 120 seconds. I was curious how much faster this could be if I ran the same transcription model on a GPU instance, and with a quick search, modal.com seemed like a nice option to spin up a GPU machine, run the code, and shut down the machine, similar to how AWS Lambda works. ...
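
A rough sketch of the modal.com pattern with the current Modal Python API; the GPU type, timeout, image packages, and file names are placeholders, not the exact setup from the post.

```python
# Sketch: run Whisper transcription inside a Modal GPU container on demand.
import modal

image = (
    modal.Image.debian_slim()
    .apt_install("ffmpeg")
    .pip_install("openai-whisper")
)
app = modal.App("whisper-transcribe", image=image)


@app.function(gpu="T4", timeout=600)
def transcribe(audio_bytes: bytes) -> str:
    import whisper

    with open("/tmp/audio.mp3", "wb") as f:
        f.write(audio_bytes)
    model = whisper.load_model("tiny")
    return model.transcribe("/tmp/audio.mp3")["text"]


@app.local_entrypoint()
def main():
    # Runs remotely on the GPU container, prints the transcript locally.
    print(transcribe.remote(open("meeting.mp3", "rb").read()))
```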

2024-04-22 · 5 min · Saeed Esmaili

Topic Classification of Texts Locally Using BERTopic

I’ve recently been working on survey response data that, in addition to aggregatable question types like Likert-scale and multiple-choice questions, includes optional free-text questions. Although we are lucky that thousands of respondents take the time to elaborate and leave comprehensive free-text responses, getting insights from these text responses is challenging. While investigating how to enrich this text data with metadata about its topics, I came across BERTopic, which describes itself as a topic modeling technique that creates clusters allowing for easily interpretable topics. In this post, I’ll explore BERTopic and go through an example to explain what adjustments worked for me. ...
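
As a starting point, the standard BERTopic quickstart looks like this; the survey responses are placeholders and the post's own adjustments are not reproduced here.

```python
# Sketch: fit BERTopic on free-text responses and inspect the resulting topics.
from bertopic import BERTopic

responses = [
    "The onboarding process was confusing and took too long.",
    "Great support team, they answered within an hour.",
    "I wish the mobile app had dark mode.",
    # ... in practice, thousands of free-text survey responses
]

topic_model = BERTopic(min_topic_size=10)        # merge tiny clusters into larger topics
topics, probs = topic_model.fit_transform(responses)

print(topic_model.get_topic_info().head())       # topic sizes and representative keywords
print(topic_model.get_topic(0))                  # top terms of the largest topic
```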

2023-09-12 · 7 min · Saeed Esmaili

Generating text embeddings locally using sentence-transformers

Recently, I’ve been working on a side project where I use OpenAI’s text-embedding-ada-002 model to generate vector embeddings for text snippets. While this model is inexpensive, the cost can add up when dealing with thousands or millions of text snippets. Therefore, I decided to explore alternatives, particularly those that would allow me to run similar models locally instead of relying on OpenAI’s API. In this post, I’ll share my experience using the sentence-transformers library for this purpose and discuss the pros and cons. ...
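
For context, a minimal local alternative to the OpenAI embeddings API with sentence-transformers looks like this; the model is just a common small default and the snippets are placeholders, not necessarily what the post uses.

```python
# Sketch: generate text embeddings locally with sentence-transformers.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # small, fast, 384-dimensional vectors

snippets = [
    "How do I reset my password?",
    "Forgot password: steps to recover your account.",
    "Shipping times for international orders.",
]

embeddings = model.encode(snippets, normalize_embeddings=True)
print(embeddings.shape)               # (3, 384)
print(embeddings[0] @ embeddings[1])  # cosine similarity of the first two snippets
```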

2023-07-02 · 4 min · Saeed Esmaili

Exploring OpenAI's Whisper with Non-English Voices

TL;DR: Whisper.cpp is the fastest option when you want to use the large Whisper model on a Mac. For top-quality results with languages other than English, I recommend asking the model to translate into English. About Whisper: Whisper is OpenAI’s speech-to-text model, well known for its impressive results. Although I had known about it for a while, I didn’t get to test its real-world performance until recently. So I spent a weekend seeing how it could handle converting speech, in both English and other languages, into text. ...
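
A small sketch of the translate recommendation using the reference openai-whisper package (the post itself also benchmarks whisper.cpp); the audio file name is a placeholder.

```python
# Sketch: transcribe non-English audio and translate the output into English.
import whisper

model = whisper.load_model("large")

# task="transcribe" keeps the original language; task="translate" outputs English,
# which the TL;DR recommends for best quality on non-English audio.
result = model.transcribe("interview_fa.mp3", task="translate")
print(result["text"])
```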

2023-06-21 · 6 min · Saeed Esmaili