How to split by character
Prerequisites
This guide assumes familiarity with the following concepts:
This is the simplest method for splitting text. This splits based on a
given character sequence, which defaults to "\n\n"
. Chunk length is
measured by number of characters.
- How the text is split: by single character separator.
- How the chunk size is measured: by number of characters.
To obtain the string content directly, use .splitText()
.
To create LangChain
Document
objects (e.g., for use in downstream tasks), use .createDocuments()
.
import { CharacterTextSplitter } from "@langchain/textsplitters";
import * as fs from "node:fs";
// Load an example document
const rawData = await fs.readFileSync(
"../../../../examples/state_of_the_union.txt"
);
const stateOfTheUnion = rawData.toString();
const textSplitter = new CharacterTextSplitter({
separator: "\n\n",
chunkSize: 1000,
chunkOverlap: 200,
});
const texts = await textSplitter.createDocuments([stateOfTheUnion]);
console.log(texts[0]);
Document {
pageContent: "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and th"... 839 more characters,
metadata: { loc: { lines: { from: 1, to: 17 } } }
}
You can also propagate metadata associated with each document to the output chunks:
const metadatas = [{ document: 1 }, { document: 2 }];
const documents = await textSplitter.createDocuments(
[stateOfTheUnion, stateOfTheUnion],
metadatas
);
console.log(documents[0]);
Document {
pageContent: "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and th"... 839 more characters,
metadata: { document: 1, loc: { lines: { from: 1, to: 17 } } }
}
To obtain the string content directly, use .splitText()
:
const chunks = await textSplitter.splitText(stateOfTheUnion);
chunks[0];
"Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and th"... 839 more characters
Next stepsβ
Youβve now learned a method for splitting text by character.
Next, check out a more advanced way of splitting by character, or the full tutorial on retrieval-augmented generation.