Saeed Esmaili

I’m a data scientist based in Amsterdam, currently working at Spotify. I work on projects aimed at enhancing the productivity and experience of developers, using research and data analysis to inform our platform strategies.

Beyond work, I enjoy pursuing personal side projects focused on LLMs and recommendation systems, playing tennis, and taking photos. I also write short notes and share interesting links regularly about what I learn and discover.

I’m always open to collaboration or providing assistance on projects related to my areas of interest. If you think I can contribute to your work or if you’re keen on collaborating, don’t hesitate to reach out.

Understanding Input Masking in LLM Finetuning

I’ve been using the conversational alpaca or sharegpt formats for fine-tuning LLMs with Axolotl, but it always felt unnecessary to limit the model to a conversational format when the use case doesn’t require it. I’m currently working on a project to classify pull requests in my company’s code repositories. The model needs to look at the PR title, description, and code changes, then categorize the PR and explain its reasoning. I thought there must be a way to fine-tune these models with any format I see fit for this specific use case, and sure enough there is: Template-free Axolotl. This seemed exactly what I was looking for, but the emphasis on “masking inputs” confused me:...

2024-06-29 · 5 min · Saeed Esmaili

Running Python on a serverless GPU instance for machine learning inference

I was experimenting with some speech-to-text work using OpenAI’s Whisper models today, and transcribing a 15-minute audio file with the Whisper tiny model on AWS Lambda (3 vCPUs) took 120 seconds. I was curious how much faster this could be if I ran the same transcription model on a GPU instance, and with a quick search, modal.com seemed like a nice option to spin up a GPU machine, run the code, and shut down the machine, similar to how AWS Lambda works....

2024-04-22 · 5 min · Saeed Esmaili

Migrating From Gatsby to Hugo

I’ve been using GatsbyJS for publishing my blog posts here, but I wanted to move to another static site generator that is more automation-friendly (more on this later). That’s why I decided to migrate this blog to Hugo, which has a very active community and is developed with Go. At first, I was scared of this move, since I don’t know how to code in Go, but to my surprise the whole migration process didn’t require me to write any Go, and everything is handled via YAML, HTML, and jinja....

2024-02-17 · 6 min · Saeed Esmaili

Hand-drawn xkcd style charts with matplotlib

I’m a big fan of unique charting styles and I avoid using the default matplotlib style whenever possible, as I find it boring and soulless. This preference is not limited to charts, and I also like hand-drawn styles for fonts and diagrams (Excalidraw is a fantastic tool that I use frequently). The hand-drawn style is especially useful when presenting a proof-of-concept idea. Something very interesting that I’ve recently stumbled upon is an xkcd chart style for matplotlib....

2023-12-09 · 2 min · Saeed Esmaili

Why you should try Alfred

Alfred has definitely boosted my productivity since I started using it almost a year ago. Sharing some fragments of my experience using Alfred with a few friends and colleagues made me realize I need to write down a summary of what I find helpful about Alfred, so I can share it with people in the future when needed. What’s Alfred? Alfred is an app for macOS which boosts your efficiency with hotkeys, keywords, text expansion and more....

2023-10-26 · 5 min · Saeed Esmaili

Getting started with developing browser extensions

For a personal project, I was looking for some guidance on how to develop a simple browser extension, but the information on this topic was fragmented and difficult to grasp. I then came across a book named Building Browser Extensions, which is focused on the same topic. I found it useful for getting an overall understanding of how to develop a simple extension, but it’s quite long, so I started summarizing the important parts for myself....

2023-09-25 · 8 min · Saeed Esmaili

Topic Classification of Texts Locally Using BERTopic

I’ve recently been working on survey response data that, in addition to aggregatable question types like Likert-scale and multiple-choice questions, includes optional free-text questions. Although we are lucky that thousands of the respondents spend time elaborating on questions and leaving comprehensive free-text responses, getting insights from these text responses is challenging. While investigating how to enrich this text data with proper metadata related to their topics, I came across BERTopic, which introduces itself as a topic modeling technique to create clusters allowing for easily interpretable topics....

2023-09-12 · 7 min · Saeed Esmaili

Text Chunking and Headings Grouping: A Guide to Parsing Documents with Pandoc and Python

In my previous blog post I explored using the unstructured Python library for loading and parsing documents. As I mentioned in the post, although unstructured seems to be a very useful library, it has a few issues. Since I’m planning to do a semantic search on the paragraphs and feed the relevant ones to a large language model, the library’s inability to reliably identify headings and paragraphs was a big problem for me....

2023-07-08 · 7 min · Saeed Esmaili

Demystifying Text Data with the unstructured Python Library (+alternatives)

In the world of data, textual data stands out as being particularly complex. It doesn’t fall into neat rows and columns like numerical data does. As a side project, I’m in the process of developing my own personal AI assistant. The objective is to use the data within my notes and documents to answer my questions. The important benefit is that all data processing will occur locally on my computer, ensuring that no documents are uploaded to the cloud, and my documents will remain private....

2023-07-05 · 6 min · Saeed Esmaili

Generating text embeddings locally using sentence-transformers

Recently, I’ve been working on a side project where I use OpenAI’s text-embedding-ada-002 model to generate vector embeddings for text snippets. While this model is inexpensive, the cost can add up when dealing with thousands or millions of text snippets. Therefore, I decided to explore alternatives, particularly those that would allow me to run similar models locally instead of relying on OpenAI’s API. In this post, I’ll share my experience using the sentence-transformers library for this purpose and discuss the pros and cons....

2023-07-02 · 4 min · Saeed Esmaili

TIL: Simplifying URL Parsing with Python's urlparse Library

Background: For quite a while now, I’ve been using Pocket as my go-to read-it-later app. A few months back, I found myself wanting a solution to listen to my saved articles. This led me to explore the text-to-speech feature in the Reader app. Reader gave me the convenience of linking with my Pocket account, synchronizing its inbox automatically with my saved articles. This system has worked well. I’ve been listening to my saved articles on Reader while continuing to save new articles on Pocket....

2023-06-25 · 2 min · Saeed Esmaili

Exploring OpenAI's Whisper with Non-English Voices

TL;DR: Whisper.cpp is the fastest when you’re trying to use the large Whisper model on a Mac. For top-quality results with languages other than English, I recommend asking the model to translate into English. About Whisper: Whisper is OpenAI’s speech-to-text model and it’s well-known for its impressive results. Although I knew about it for a while, I didn’t get to test its real-world performance until recently. So, I spent a weekend seeing how it could handle converting speech, in both English and other languages, into text....

2023-06-21 · 6 min · Saeed Esmaili

Cleaning up incorrect and duplicate items in a 1Password account using its CLI

Recently, I decided to switch my primary password manager to 1Password, using it across my MacBooks and Pixel phones on various browsers, including Firefox and Chrome. One bonus feature that I particularly enjoy is its compatibility with Alfred, an app that helps me be more productive on my Mac. However, the transition was not all smooth sailing. When I transferred my passwords from Firefox and Chrome into 1Password, a couple of things didn’t go quite right....

2023-06-09 · 3 min · Saeed Esmaili

Why do we need to have a central place for TODOs

I keep hearing from people that they find it difficult to keep track of what needs to be done. Every one of us has a list of different things to remember, work on, and accomplish. This list includes not only the very well-defined tasks that we call work, but also any Slack messages and emails that we have to reply to, documents that we have to read and review, and vacation time that we need to submit....

2023-03-26 · 5 min · Saeed Esmaili

What I dislike about Google Docs (and what I like about it)

I use Google Docs every day at work to communicate with various groups of people. It’s a very effortless tool to start writing and share the output with whoever I want once I’m finished. But it definitely lacks a few critical features that hinder my productivity. What I like: Convenience and Collaboration. Google Docs makes it easy to write and share documents with others and to invite them to review and collaborate....

2023-03-14 · 3 min · Saeed Esmaili

My RSS-based Content Consumption Workflow

TLDR: Here is a simplified diagram of how I keep reading articles, books, and other materials without overwhelming myself or having a never-ending pile of to-read items. In this blog post, I will go into the details of each part of the workflow, the tools I currently use, the alternatives I have tried, why they work for me, and what I can improve in the process. Feel free to use the above table of contents to jump to the section that is more interesting to you....

17 min · Saeed Esmaili