In the previous lesson we discovered how we can leverage the TokenTextSplitter (or CharacterTextSplitter) to divide our long documents into chunks of similar length.
That’s good because it gives us smaller documents that an embedding model can easily consume, but at the same time we can lose a lot of information, because fixed-size cuts ignore where sentences and sections begin and end.
On top of that, especially because we are working on a document-related application, it is reasonable to assume that these documents have some sort of structural separation: a new line, a title, a code block and so on.
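To make the problem concrete, here is a minimal sketch of what a fixed-size cut does to a small Markdown snippet. The `fixedSizeChunks` helper and the sample text are hypothetical, written only for this illustration; they are not the LangChain splitters from the previous lesson, which work on the same principle but with more options.

```typescript
// Hypothetical helper: cuts the text every `size` characters and
// pays no attention to headings, paragraphs or sentences.
function fixedSizeChunks(text: string, size: number): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += size) {
    chunks.push(text.slice(i, i + size));
  }
  return chunks;
}

// Made-up sample with two short sections.
const sample = "## Setup\nInstall the tools.\n## Usage\nRun the build.";
const fixedChunks = fixedSizeChunks(sample, 20);

// The "## Usage" heading lands in the middle of a chunk, glued to the
// tail of the previous section and cut off from most of its own body.
console.log(fixedChunks);
```

No text is lost, but the section boundaries are: a chunk can start mid-sentence and contain the end of one section plus the heading of the next.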
And that’s why in this lesson we are introduced to the concept of structural chunking. We can achieve this technique with a LangChain class called RecursiveCharacterTextSplitter, which accepts an array of separators that tell the algorithm where it should prefer to create chunks.
This class is a powerful tool because it lets you set your own separators, so it can adapt to any text.
Since in this exercise we’re working with a Markdown file, Matt’s TypeScript book, we are tasked with marking specific breakpoints (or separators, to keep the analogy) that are Markdown-related, like:
- headings (all levels except the first)
- code blocks
- custom chapter separator introduced for the book
- horizontal lines
Since the exercise provides ready-to-paste examples, let me just paste the entire instance of our splitter.
// The import path may differ by LangChain version
// (older releases expose it from 'langchain/text_splitter').
import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters';

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 800,
  chunkOverlap: 150,
  separators: [
    // Separators for chapter markers (e.g., "--- CHAPTER ---")
    '\n--- CHAPTER ---\n',
    // Separators for horizontal lines
    '\n\n***\n\n',
    '\n\n---\n\n',
    '\n\n___\n\n',
    // Separators for headings (not including h1's)
    '\n## ',
    '\n### ',
    '\n#### ',
    '\n##### ',
    '\n###### ',
    // Separators for code blocks
    '```\n',
    '```\n\n',
    // Separators for paragraphs
    '\n\n',
    // Fallback separators
    '\n',
    '',
  ],
});
Please also note that the order of the separators is really important here!
The RecursiveCharacterTextSplitter tries the separators in array order: it splits on the first one it finds in the text, and only the pieces that are still larger than chunkSize get split again with the separators that come after it.
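To see why the order matters, here is a simplified sketch of the recursive idea. It is not LangChain's actual implementation: the `recursiveSplit` function is a hypothetical stand-in that drops the separators, ignores chunkOverlap and does not merge small neighbouring pieces. It only shows how the first matching separator decides where the text gets cut.

```typescript
// Simplified sketch (an assumption for illustration, not LangChain's code).
function recursiveSplit(text: string, separators: string[], chunkSize: number): string[] {
  if (text.length === 0) return [];
  // Small enough already: keep the piece as one chunk.
  if (text.length <= chunkSize) return [text];

  const chunks: string[] = [];
  let position = 0;
  for (const separator of separators) {
    position += 1;
    // Skip the empty fallback and separators that do not occur in the text.
    if (separator === "" || text.indexOf(separator) === -1) continue;
    // Split on the first separator found, then recurse with the later ones.
    const remaining = separators.slice(position);
    for (const piece of text.split(separator)) {
      for (const chunk of recursiveSplit(piece, remaining, chunkSize)) {
        chunks.push(chunk);
      }
    }
    return chunks;
  }
  // No usable separator left: fall back to a hard cut by size.
  for (let i = 0; i < text.length; i += chunkSize) {
    chunks.push(text.slice(i, i + chunkSize));
  }
  return chunks;
}

// Made-up sample document with an intro and two sections.
const doc = "Intro text.\n## First\nAlpha beta.\n\nGamma delta.\n## Second\nShort.";

// Same text, same chunk size, different separator order.
const headingsFirst = recursiveSplit(doc, ["\n## ", "\n\n", "\n"], 35);
const paragraphsFirst = recursiveSplit(doc, ["\n\n", "\n## ", "\n"], 35);

console.log(headingsFirst);
console.log(paragraphsFirst);
```

With headings first, every section stays in its own chunk. With paragraphs first, the cut falls on the blank line inside the first section, so one chunk mixes the intro with the start of the first section and another mixes the rest of it with the second section.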
If you are sure that your documents are written in Markdown and contain no custom markers (like --- CHAPTER ---), you can simply use the MarkdownTextSplitter, as it comes pre-configured with the standard separators used in this markup language. But be aware that you cannot specify your own custom separators!