Every morning, my RSS reader greets me with hundreds of new posts. Tech blogs, indie developers’ journals, photography content - they all compete for attention. While I’ve gotten good at quickly scanning through these feeds, I keep wondering about all the great content I might be missing from sources I’ve had to ignore, simply because their signal-to-noise ratio doesn’t justify daily checking.
On the other hand, the posts I shortlist from my RSS feeds and actually read or listen to end up in a curated repository of articles that have passed my personal quality threshold, so I have access to a valuable collection of content (on Pocket) that is relevant to my interests. This made me wonder: can I use this data to create a content recommendation system tailored to my preferences? Can I build a system that reviews new posts from feeds where only 1 in 20 posts might match my content priorities, and filters them for me?
Over the following weeks, my goal is to build this content recommendation system and to document my thoughts, process, and lessons learned in a blog post series, of which this is the first. The rough process in my mind: analyze the corpus of previously liked content, build a model of my personal interests based on content embeddings, and compare new content against this model to predict the likelihood of interest. If you want to follow along, subscribe to my blog’s RSS feed or my newsletter to get notified about new posts.
The process
The heart of this project is experimenting with different recommendation algorithms, but solid groundwork comes first. Before diving into the mechanics of the recommender system, I need to establish a clean, reliable dataset. Here’s my initial roadmap:
- Get a list of the liked articles from Pocket
  - I can use the Pocket API to fetch the data and store it in a SQLite database or a CSV file.
- Extract and clean up the text content from liked articles
  - Content extraction tools like r.jina.ai or readability strip away advertisements and formatting while preserving the core article text.
- Summarize the text content into a few paragraphs using a large language model
  - The articles I’ve saved on Pocket range from a few short paragraphs to tens of pages. The corpus needs to be normalized somehow, and I can use Gemini Flash or a local LLM to summarize each article in 2-5 paragraphs.
  - The extracted texts and summaries will need to be inspected and cleaned up, as I anticipate some URLs returning 404s or blocking my scraper.
- Generate embeddings for each content summary using an embeddings API (Gemini or OpenAI) or a local embedding model with sentence-transformers
- Store the embeddings for comparison with new content
  - Probably in a SQLite database using sqlite-vec.
- Build the content recommendation system
  - Honestly, I’m not quite sure how this will be done yet.
- Check for new content from desired sources every day
  - Using GitHub Actions or some other sort of cron job.
- Extract the text, summarize, and generate embeddings for the new content
- Find the ones that are close to my interests and add them to an RSS feed
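The first two steps can be sketched with nothing but the standard library. This is an untested sketch, and it makes assumptions: that Pocket’s `/v3/get` endpoint accepts a `favorite: 1` filter and returns items keyed by item ID under `"list"`, and that prefixing any URL with `r.jina.ai/` returns a cleaned-up text rendering. The function names and credential placeholders are mine.

```python
import json
import urllib.request

POCKET_GET = "https://getpocket.com/v3/get"  # assumed endpoint
JINA_READER = "https://r.jina.ai/"

def fetch_favorites(consumer_key: str, access_token: str) -> dict:
    """POST to Pocket's retrieve endpoint, asking only for favorited items."""
    payload = json.dumps({
        "consumer_key": consumer_key,
        "access_token": access_token,
        "favorite": 1,            # only liked/favorited items
        "detailType": "simple",
    }).encode()
    req = urllib.request.Request(
        POCKET_GET,
        data=payload,
        headers={"Content-Type": "application/json",
                 "X-Accept": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def parse_favorites(payload: dict) -> list[dict]:
    """Flatten the assumed response shape (items keyed by ID) into rows."""
    items = payload.get("list", {})
    return [
        {
            "url": item.get("resolved_url") or item.get("given_url"),
            "title": item.get("resolved_title") or item.get("given_title", ""),
        }
        for item in items.values()
    ]

def extract_text(article_url: str) -> str:
    """Fetch a cleaned, markdown-ish rendering of the page via r.jina.ai."""
    with urllib.request.urlopen(JINA_READER + article_url) as resp:
        return resp.read().decode("utf-8")
```

From here, `parse_favorites(fetch_favorites(...))` would give the list of URLs to run through `extract_text`.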
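For the summarization step, a local LLM served by Ollama is one option. Here is a minimal sketch, assuming an Ollama server is running on its default port with its `/api/generate` endpoint; the prompt wording and the `summarize_local` helper are my own placeholders, not a settled design.

```python
import json
import urllib.request

def build_summary_prompt(article_text: str, max_paragraphs: int = 5) -> str:
    """Prompt asking the model to normalize any article down to a few paragraphs."""
    return (
        f"Summarize the following article in 2 to {max_paragraphs} paragraphs. "
        "Preserve the main arguments and any concrete technical details.\n\n"
        f"{article_text}"
    )

def summarize_local(article_text: str, model: str = "llama3.1") -> str:
    """Send the prompt to a locally running Ollama server (assumed setup)."""
    payload = json.dumps({
        "model": model,
        "prompt": build_summary_prompt(article_text),
        "stream": False,          # get the whole summary in one response
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]
```

Swapping in Gemini Flash would mean replacing `summarize_local` with a call to Google’s API while keeping the same prompt.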
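sqlite-vec would give indexed vector search inside SQLite; until that’s wired up, the storage and comparison steps can be prototyped with a plain table and a brute-force cosine scan. A standard-library-only sketch (the schema and helper names are mine):

```python
import json
import math
import sqlite3

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def init_db(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    # Embeddings stored as JSON text; sqlite-vec would use a real vector column.
    db.execute(
        "CREATE TABLE IF NOT EXISTS articles "
        "(url TEXT PRIMARY KEY, summary TEXT, embedding TEXT)"
    )
    return db

def store(db: sqlite3.Connection, url: str, summary: str,
          embedding: list[float]) -> None:
    db.execute(
        "INSERT OR REPLACE INTO articles VALUES (?, ?, ?)",
        (url, summary, json.dumps(embedding)),
    )

def most_similar(db: sqlite3.Connection, query: list[float], k: int = 5):
    """Brute-force scan; sqlite-vec would replace this with an indexed search."""
    rows = db.execute("SELECT url, embedding FROM articles").fetchall()
    scored = [(cosine(query, json.loads(emb)), url) for url, emb in rows]
    return sorted(scored, reverse=True)[:k]
```

The embeddings themselves would come from sentence-transformers or an API; this sketch only cares that they arrive as lists of floats.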
Admittedly, the later stages of building the recommendation system are unclear, but that’s exactly why I’m documenting this process. The immediate focus is clear: preparing a clean, reliable dataset. Meanwhile, I’ll be diving deeper into recommendation system architectures to prepare for the more complex challenges ahead.
Open questions
Thinking about the challenges ahead, I’m faced with some questions I currently don’t have answers to. If you can provide any guidance or feedback, feel free to reach out via email, bsky, or twitter.
- How do I build the content recommendation system?! No, but seriously, how do we compare the embeddings of liked content to those of new, unknown content?
- Calculate a similarity score (cosine similarity) between the average of the liked-content embeddings and the new content? But my liked content is diverse, so a single average may not represent it well.
- Should I cluster the liked content into categories and figure out if the new content is close enough to one of the clusters?
- What if a new piece of content is close to a cluster (e.g. Photography) but still not relevant to my interests at all (e.g. I read photography posts only if they are about street or landscape photography, or about the Sony camera I own)?
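To make the first open question concrete: here is a toy comparison of the centroid approach against one simple alternative, scoring a candidate by its maximum similarity to any single liked item. This is exploration on my part, not a chosen design; the vectors are made up to show why averaging a diverse corpus can mislead.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def centroid_score(liked: list[list[float]], candidate: list[float]) -> float:
    """Similarity to the average of all liked embeddings."""
    n = len(liked)
    centroid = [sum(v[i] for v in liked) / n for i in range(len(liked[0]))]
    return cosine(centroid, candidate)

def max_sim_score(liked: list[list[float]], candidate: list[float]) -> float:
    """Similarity to the single closest liked item - robust to a diverse corpus."""
    return max(cosine(v, candidate) for v in liked)

# Two unrelated interests: the centroid sits between them, so a candidate
# squarely inside one interest scores lower than a candidate in the "middle"
# that matches neither interest well.
liked = [[1.0, 0.0], [0.0, 1.0]]
```

Clustering (the second question) is essentially a middle ground: score against the nearest cluster centroid instead of the nearest single item, which trades some of max-similarity’s sensitivity for robustness to outliers.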
Read part two of this blog series: Data Processing and Cleaning