Text Chunking and Headings Grouping: A Guide to Parsing Documents with Pandoc and Python

In my previous blog post, I explored using the unstructured Python library for loading and parsing documents. As I mentioned in that post, although unstructured seems to be a very useful library, it has a few issues. Since I’m planning to run a semantic search over the paragraphs and feed the relevant ones to a large language model, the library’s inability to reliably identify headings and paragraphs was a big problem for me. ...

2023-07-08 · 7 min · Saeed Esmaili