In his excellent write-up on state-of-the-art embedding models, Aapo Tanskanen compares retrieval scores when the source documents are split into chunks versus when they’re not:

Transformer-based single embedding models have traditionally had 512 token context windows because of the usage of the original BERT encoder. Newer models, like the BGE-M3, have expanded the token window to much larger scales. It could be tempting to forget chunking and just embed long texts as they are. However, that would mean mashing many topics and entities you might want to search for into a single vector representation, which doesn’t sound like a good idea.

I conducted a quick experiment to find out if the advertised long context window truly works well. I embedded all test set articles without chunking using the BGE-M3 dense single embedding representation with the maximum 8192 context window setting. Then, we can compare test set results against the chunked version (maximum of 512 tokens per chunk).

Model                          MRR@1  MRR@3  MRR@5  MRR@10  MRR@20  MRR@50
BAAI/bge-m3 dense chunked      0.842  0.890  0.892  0.893   0.894   0.895
BAAI/bge-m3 dense no chunking  0.739  0.807  0.820  0.825   0.826   0.826

As we can see, long context single embedding retrieval gets very poor MRR@K scores compared to the chunked version. … Mashing the whole article into a single embedding clearly does not perform well when there are targeted questions about different parts of the content.
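For reference, the MRR@K metric used in the table above is the mean reciprocal rank with a cutoff: each query contributes 1/rank of its first relevant result if that result appears in the top K, and 0 otherwise. A minimal sketch (the example ranks are hypothetical, not from the experiment):

```python
def mrr_at_k(ranks, k):
    """Mean Reciprocal Rank at cutoff K.

    ranks: 1-based rank of the first relevant hit per query,
           or None if no relevant result was retrieved.
    """
    total = 0.0
    for rank in ranks:
        if rank is not None and rank <= k:
            total += 1.0 / rank
    return total / len(ranks)

# Four hypothetical queries: hits at ranks 1, 3, 2, and one miss.
ranks = [1, 3, None, 2]
print(mrr_at_k(ranks, 1))   # only the rank-1 hit counts -> 0.25
print(mrr_at_k(ranks, 5))   # (1 + 1/3 + 0 + 1/2) / 4 ~= 0.458
```

This is why the scores in the table climb as K grows and then flatten: raising the cutoff only helps queries whose first relevant chunk sits just below the previous cutoff.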

This is a great finding. Another advantage of chunking the source documents and retrieving chunked texts when building a RAG service is that you end up with a much smaller number of tokens in your LLM prompt, which saves you both cost and latency.
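The chunking step itself can be as simple as a fixed-size sliding window. A minimal sketch, using whitespace-separated words as a rough stand-in for tokenizer tokens (a real pipeline would count tokens with the embedding model’s own tokenizer; the sizes here mirror the 512-token chunks from the experiment):

```python
def chunk_text(text, max_tokens=512, overlap=64):
    """Split text into overlapping fixed-size chunks.

    Words approximate tokens here; swap in a real tokenizer for production.
    """
    words = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break
    return chunks

# A 1200-word document becomes three overlapping ~512-word chunks.
doc = ("word " * 1200).strip()
print(len(chunk_text(doc)))  # 3
```

Retrieving one 512-token chunk instead of stuffing the whole document into the prompt is where the token savings come from.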
