Text Chunking and Headings Grouping: A Guide to Parsing Documents with Pandoc and Python

In my previous blog post, I explored using the unstructured Python library for loading and parsing documents. As I mentioned in that post, although unstructured seems like a very useful library, it has a few issues. Since I’m planning to do a semantic search on the paragraphs and feed the relevant ones to a large language model, the library’s inability to reliably identify headings and paragraphs was a big problem for me. ...

2023-07-08 · 7 min · Saeed Esmaili

Demystifying Text Data with the unstructured Python Library (+alternatives)

In the world of data, textual data stands out as being particularly complex. It doesn’t fall into neat rows and columns like numerical data does. As a side project, I’m in the process of developing my own personal AI assistant. The objective is to use the data within my notes and documents to answer my questions. The key benefit is that all data processing will occur locally on my computer, so no documents are uploaded to the cloud and my documents remain private. ...

2023-07-05 · 6 min · Saeed Esmaili