Understanding Input Masking in LLM Finetuning

I’ve been using conversational alpaca or sharegpt formats for fine-tuning LLMs with Axolotl, but it always felt unnecessary to constrain the model to a conversational format when the use case doesn’t require it. I’m currently working on a project to classify pull requests in my company’s code repositories. The model needs to look at the PR title, description, and code changes, then categorize the PR and explain its reasoning. I thought there must be a way to fine-tune these models with any format that fits this specific use case, and sure enough there is: Template-free Axolotl ...
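As a rough sketch of what a template-free training row can look like (field names per Axolotl's `input_output` format; the PR fields and texts here are hypothetical, not from the project):

```python
import json

# One training row in Axolotl's template-free ("input_output") format.
# Segments with label=False are masked out of the loss, so the model is
# only trained to produce the labeled (True) span — no chat template needed.
row = {
    "segments": [
        {"label": False, "text": "PR title: Fix race condition in job scheduler\n"},
        {"label": False, "text": "Diff summary: locks added around queue access\n"},
        {"label": True,  "text": "Category: bugfix. Reasoning: the change addresses a concurrency defect.\n"},
    ]
}

# Rows like this are written one-per-line to a JSONL dataset file.
line = json.dumps(row)
```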

2024-06-29 · 5 min · Saeed Esmaili

To Chunk or Not to Chunk With the Long Context Single Embedding Models

In his excellent write-up on state-of-the-art embedding models, Aapo Tanskanen compares retrieval scores when the source documents are split into chunks versus left whole: Transformer-based single embedding models have traditionally had 512 token context windows because of the usage of the original BERT encoder. Newer models, like the BGE-M3, have expanded the token window to much larger scales. It could be tempting to forget chunking and just embed long texts as they are. However, that would mean mashing many topics and entities you might want to search for into a single vector representation, which doesn’t sound like a good idea. ...
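A minimal word-window chunker illustrates the trade-off the quote describes: each chunk gets its own embedding, so one vector covers (roughly) one topic instead of the whole document. The window and overlap sizes below are arbitrary placeholders, not recommendations:

```python
def chunk_text(text, max_words=128, overlap=16):
    """Split text into overlapping word-window chunks.

    Overlap keeps sentences that straddle a boundary retrievable
    from at least one of the two neighboring chunks.
    """
    words = text.split()
    if len(words) <= max_words:
        return [text]
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        window = words[start:start + max_words]
        chunks.append(" ".join(window))
        if start + max_words >= len(words):
            break
    return chunks

chunks = chunk_text("word " * 300, max_words=128, overlap=16)
```

In a real pipeline you would chunk on tokens (using the embedding model's tokenizer) rather than whitespace words, but the windowing logic is the same.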

2024-06-02 · 2 min

Topic Classification of Texts Locally Using BERTopic

I’ve recently been working on survey response data that, in addition to aggregatable question types like Likert-scale and multiple-choice questions, includes optional free-text questions. Although we are lucky that thousands of respondents spend time elaborating and leaving comprehensive free-text responses, getting insights from these responses is challenging. While investigating how to enrich this text data with metadata about its topics, I came across BERTopic, which introduces itself as a topic modeling technique that creates clusters allowing for easily interpretable topics. In this post, I’ll explore BERTopic and go through an example to explain what adjustments worked for me. ...

2023-09-12 · 7 min · Saeed Esmaili

Text Chunking and Headings Grouping: A Guide to Parsing Documents with Pandoc and Python

In my previous blog post I explored using the unstructured Python library for loading and parsing documents. As I mentioned in that post, although unstructured seems a very useful library, it has a few issues. Since I’m planning to run semantic search on the paragraphs and feed the relevant ones to a large language model, the library’s inability to reliably identify headings and paragraphs was a big problem for me. ...
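One way to group paragraphs under their headings is to walk pandoc's JSON AST (what `pandoc -t json` emits). The fragment below is a tiny hand-written slice of that AST, just to show the shape; in practice you'd `json.loads` pandoc's actual output:

```python
# Hand-written slice of pandoc's JSON AST. A Header's "c" field is
# [level, [id, classes, key-vals], inlines]; a Para's "c" is just inlines.
ast_blocks = [
    {"t": "Header", "c": [2, ["intro", [], []], [{"t": "Str", "c": "Intro"}]]},
    {"t": "Para", "c": [{"t": "Str", "c": "First"}, {"t": "Space"}, {"t": "Str", "c": "paragraph."}]},
    {"t": "Header", "c": [2, ["setup", [], []], [{"t": "Str", "c": "Setup"}]]},
    {"t": "Para", "c": [{"t": "Str", "c": "Second."}]},
]

def inlines_to_text(inlines):
    """Flatten pandoc inline nodes (Str/Space) into a plain string."""
    parts = []
    for node in inlines:
        if node["t"] == "Str":
            parts.append(node["c"])
        elif node["t"] == "Space":
            parts.append(" ")
    return "".join(parts)

def group_by_heading(blocks):
    """Map each heading's text to the paragraphs that follow it."""
    sections, current = {}, None
    for block in blocks:
        if block["t"] == "Header":
            current = inlines_to_text(block["c"][2])
            sections[current] = []
        elif block["t"] == "Para" and current is not None:
            sections[current].append(inlines_to_text(block["c"]))
    return sections

sections = group_by_heading(ast_blocks)
```

Real documents contain more inline types (Emph, Link, Code, ...) that `inlines_to_text` would need to handle, but the grouping logic stays this simple.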

2023-07-08 · 7 min · Saeed Esmaili

Demystifying Text Data with the unstructured Python Library (+alternatives)

In the world of data, textual data stands out as being particularly complex. It doesn’t fall into neat rows and columns like numerical data does. As a side project, I’m in the process of developing my own personal AI assistant. The objective is to use the data within my notes and documents to answer my questions. The important benefit is all data processing will occure locally on my computer, ensuring that no documents are uploaded to the cloud, and my documents will remain private. ...

2023-07-05 · 6 min · Saeed Esmaili

Generating text embeddings locally using sentence-transformers

Recently, I’ve been working on a side project where I use OpenAI’s text-embedding-ada-002 model to generate vector embeddings for text snippets. While this model is inexpensive, the cost can add up when dealing with thousands or millions of text snippets. Therefore, I decided to explore alternatives, particularly those that would allow me to run similar models locally instead of relying on OpenAI’s API. In this post, I’ll share my experience using the sentence-transformers library for this purpose and discuss the pros and cons. ...

2023-07-02 · 4 min · Saeed Esmaili

Exploring OpenAI's Whisper with Non-English Voices

TL;DR: Whisper.cpp is the fastest option when you’re trying to use the large Whisper model on a Mac. For top-quality results with languages other than English, I recommend asking the model to translate into English. Whisper is OpenAI’s speech-to-text model, well known for its impressive results. Although I had known about it for a while, I didn’t get to test its real-world performance until recently. So I spent a weekend seeing how it could handle converting speech, in both English and other languages, into text. ...
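With whisper.cpp, translating non-English speech straight into English is a single flag on the example binary (flag names here are from the builds I tried; check `--help` for yours, and the model and audio paths are placeholders):

```shell
# -m: downloaded ggml model file, -f: 16 kHz WAV input,
# --translate: emit English text instead of a same-language transcript.
./main -m models/ggml-large.bin -f interview.wav --translate
```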

2023-06-21 · 6 min · Saeed Esmaili