What it is
When building RAG systems, you usually don’t embed an entire document at once. Instead, you split it into smaller chunks, embed each chunk, and store them for retrieval.
The textsplitter package provides utilities to split text into size-bounded chunks with an optional overlap.
Chunk overlap helps preserve context across boundaries.
Key ideas
- Chunk size: the maximum size of each chunk (the examples measure this in bytes)
- Overlap: how much content is repeated between adjacent chunks to avoid hard cutoffs
- Downstream usage: chunks are fed into embeddings and stored in a vector store
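To make the first two knobs concrete, here is a minimal sketch of size-bounded chunking with overlap. This is not the textsplitter implementation (the recursive splitter breaks on separators rather than at fixed byte offsets); it only illustrates how chunk size and overlap interact:

```go
package main

import "fmt"

// splitWithOverlap is an illustrative sketch, not the textsplitter
// implementation: each chunk is at most size bytes, and adjacent
// chunks share overlap bytes.
func splitWithOverlap(s string, size, overlap int) []string {
	var chunks []string
	step := size - overlap // how far the window advances each chunk
	for start := 0; start < len(s); start += step {
		end := start + size
		if end > len(s) {
			end = len(s)
		}
		chunks = append(chunks, s[start:end])
		if end == len(s) {
			break
		}
	}
	return chunks
}

func main() {
	// prints "abcd" "cdef" "efgh" "ghij", one per line:
	// each chunk repeats the last 2 bytes of the previous one
	for _, c := range splitWithOverlap("abcdefghij", 4, 2) {
		fmt.Printf("%q\n", c)
	}
}
```

Note that a fact straddling a chunk boundary (say, spanning bytes 3–5 above) appears whole in at least one chunk precisely because of the overlap.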
Example: splitting a file into chunks
The examples/rag-chatbot program loads a local text file and splits it before ingestion.
```go
// From examples/rag-chatbot (edited for brevity)
b, err := os.ReadFile(filePath)
if err != nil {
	// handle error
}

splitter := textsplitter.NewRecursiveCharacterTextSplitter(chunkSize, chunkOverlap)
chunks := compactStrings(splitter.SplitText(string(b)))
if len(chunks) == 0 {
	// handle error
}

// chunks are then embedded and ingested into a vector store
```
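The example's `compactStrings` helper isn't shown above. One plausible implementation (an assumption; the actual helper may differ) drops chunks that are empty or whitespace-only so they aren't embedded:

```go
package main

import (
	"fmt"
	"strings"
)

// compactStrings is an assumed sketch of the example's helper:
// it trims surrounding whitespace and drops chunks that end up empty,
// so no blank chunks reach the embedder.
func compactStrings(in []string) []string {
	out := make([]string, 0, len(in))
	for _, s := range in {
		if t := strings.TrimSpace(s); t != "" {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	fmt.Println(compactStrings([]string{"  a  ", "", "b", "   "}))
}
```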
In that example, chunkSize and chunkOverlap are CLI flags so you can tune them per dataset.
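If you want the same knobs in your own ingestion tool, the standard library's flag package is enough. A sketch (the flag names match the example's `-chunk-size` / `-chunk-overlap`; the defaults here are illustrative, not the example's actual defaults):

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// parseChunkFlags exposes the splitter's tuning knobs as CLI flags.
// Defaults are illustrative only.
func parseChunkFlags(args []string) (chunkSize, chunkOverlap int) {
	fs := flag.NewFlagSet("ingest", flag.ExitOnError)
	fs.IntVar(&chunkSize, "chunk-size", 512, "maximum chunk size in bytes")
	fs.IntVar(&chunkOverlap, "chunk-overlap", 64, "bytes repeated between adjacent chunks")
	fs.Parse(args)
	return chunkSize, chunkOverlap
}

func main() {
	size, overlap := parseChunkFlags(os.Args[1:])
	fmt.Println("chunk size:", size, "overlap:", overlap)
}
```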
Run the example
The RAG Chatbot example exercises the splitter as part of a full ingestion + retrieval flow. Provider setup depends on your chosen LLM/embedder; follow the example README.
```shell
# from repo root
go run ./examples/rag-chatbot -file /path/to/your/file.txt

# tuning knobs
# -chunk-size / -chunk-overlap
```
Practical tuning tips
- If answers miss context: increase overlap so related facts aren't split across chunk boundaries
- If retrieval feels noisy: decrease chunk size so each chunk covers one specific topic
- If ingestion is slow: increase chunk size (fewer chunks to embed) and keep overlap modest
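To reason about the ingestion-cost trade-off above: if the splitter advances roughly (size − overlap) bytes per chunk, a document of n bytes yields about ceil((n − overlap) / (size − overlap)) chunks. The recursive splitter breaks on separators, so real counts will differ, but this estimate shows why large overlap with small chunks multiplies embedding calls:

```go
package main

import "fmt"

// estimateChunks gives a rough chunk count for a document of n bytes
// with chunk size s and overlap o, assuming the splitter advances by
// s-o bytes per chunk. Real counts depend on separator boundaries.
func estimateChunks(n, s, o int) int {
	if n <= s {
		return 1
	}
	step := s - o
	return (n - o + step - 1) / step // ceil((n-o)/step)
}

func main() {
	// Same 10 KB document, same overlap, two chunk sizes:
	fmt.Println(estimateChunks(10000, 512, 64)) // larger chunks, fewer embedding calls
	fmt.Println(estimateChunks(10000, 256, 64)) // smaller chunks, roughly twice as many
}
```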
Related packages
- rag: ingests and retrieves chunks
- embedding: turns chunks/queries into vectors
- vectorstore: stores vectors for similarity search
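The flow across these packages can be sketched with hypothetical interfaces. The type and method names below are illustrative only, not the actual rag/embedding/vectorstore APIs; the point is the shape of the pipeline: embed each chunk, then store the vector alongside its text.

```go
package main

import "fmt"

// Hypothetical interfaces sketching the ingestion flow;
// the real package APIs will differ.
type Embedder interface {
	Embed(text string) ([]float32, error)
}

type VectorStore interface {
	Add(text string, vec []float32) error
}

// ingest embeds each chunk and stores the resulting vector.
func ingest(chunks []string, e Embedder, vs VectorStore) error {
	for _, c := range chunks {
		vec, err := e.Embed(c)
		if err != nil {
			return fmt.Errorf("embed: %w", err)
		}
		if err := vs.Add(c, vec); err != nil {
			return fmt.Errorf("store: %w", err)
		}
	}
	return nil
}

// Toy implementations so the sketch runs end to end.
type fakeEmbedder struct{}

func (fakeEmbedder) Embed(text string) ([]float32, error) {
	return []float32{float32(len(text))}, nil // stand-in for a real model
}

type memStore struct{ n int }

func (m *memStore) Add(text string, vec []float32) error {
	m.n++
	return nil
}

func main() {
	store := &memStore{}
	if err := ingest([]string{"chunk one", "chunk two"}, fakeEmbedder{}, store); err != nil {
		panic(err)
	}
	fmt.Println("stored", store.n, "vectors") // prints "stored 2 vectors"
}
```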