What it is
When building RAG systems, you usually don’t embed an entire document at once. Instead, you split it into smaller chunks, embed each chunk, and store them for retrieval.
The textsplitter package defines the Splitter interface; concrete implementations
live in sub-packages. All splitters read a source file and emit document.Document values
via a lazy Go iterator (iter.Seq2).
```go
// Splitter is implemented by all strategies:
type Splitter interface {
	Split(ctx context.Context) iter.Seq2[document.Document, error]
}
```
Available splitters
- `textsplitter/recursive`: recursively splits on an ordered list of separators (`"\n\n"`, `"\n"`, `" "`, `""`). Good for generic plain-text documents.
- `textsplitter/markdown`: same strategy, but the default separator list prioritises Markdown heading levels (`##`, `###`, …) before falling back to paragraph and line boundaries.
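The fallback cascade can be sketched in a few lines. This is an illustration only, not the package's implementation: real splitters also merge adjacent pieces back up toward the chunk size and apply overlap, which this sketch omits:

```go
package main

import (
	"fmt"
	"strings"
)

// splitRecursive splits text so no piece exceeds max bytes,
// preferring earlier separators in seps and falling back to
// later ones only for pieces that are still too large.
func splitRecursive(text string, seps []string, max int) []string {
	if len(text) <= max {
		return []string{text}
	}
	if len(seps) == 0 {
		// last resort: hard cut at max bytes
		return append([]string{text[:max]}, splitRecursive(text[max:], seps, max)...)
	}
	var out []string
	for _, part := range strings.Split(text, seps[0]) {
		out = append(out, splitRecursive(part, seps[1:], max)...)
	}
	return out
}

func main() {
	seps := []string{"\n\n", "\n", " "}
	fmt.Println(splitRecursive("para one\n\npara two is a bit longer", seps, 12))
}
```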
Key ideas
- Chunk size: the maximum size of each chunk (measured by a configurable length function, default is byte length)
- Overlap: how much content is repeated between adjacent chunks to avoid hard cutoffs
- Metadata: each `document.Document` carries `source`, `chunk_index`, `start_offset`, and `end_offset` metadata keys automatically
- Downstream usage: chunks are fed into embeddings and stored in a vector store
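To make the size/overlap arithmetic concrete, here is a standalone sketch (byte-length measurement, no separator awareness, so not what the library does) showing how each chunk starts size - overlap bytes after the previous one:

```go
package main

import "fmt"

// naiveChunks cuts text into byte-length chunks of at most size bytes,
// where adjacent chunks share overlap bytes of content.
func naiveChunks(text string, size, overlap int) []string {
	step := size - overlap // each chunk starts step bytes after the previous one
	if step <= 0 {
		panic("overlap must be smaller than chunk size")
	}
	var chunks []string
	for start := 0; start < len(text); start += step {
		end := start + size
		if end > len(text) {
			end = len(text)
		}
		chunks = append(chunks, text[start:end])
		if end == len(text) {
			break
		}
	}
	return chunks
}

func main() {
	fmt.Println(naiveChunks("abcdefghij", 4, 2)) // [abcd cdef efgh ghij]
}
```

Note how "cd", "ef", and "gh" each appear in two adjacent chunks: that repetition is what keeps a fact intact when it happens to straddle a chunk boundary.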
Example: splitting a file into chunks
Splitters read from a file path given at construction time. The iterator yields
document.Document values one at a time so large files are not held in memory all at once.
```go
import (
	"context"

	"github.com/henomis/phero/textsplitter/recursive"
)

const (
	chunkSize    = 1000
	chunkOverlap = 200
)

splitter := recursive.New("./my-document.txt", chunkSize, chunkOverlap)

var chunks []string
for doc, err := range splitter.Split(context.Background()) {
	if err != nil {
		// handle the error; stop draining the iterator
		break
	}
	chunks = append(chunks, doc.Content)
}
// chunks are then embedded and ingested into a vector store
```
For Markdown files, use `markdown.New` from `textsplitter/markdown`:

```go
import "github.com/henomis/phero/textsplitter/markdown"

splitter := markdown.New("./README.md", chunkSize, chunkOverlap)
for doc, err := range splitter.Split(context.Background()) {
	if err != nil {
		// handle error
		break
	}
	// doc.Metadata["source"], doc.Metadata["chunk_index"], ...
}
```
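Those metadata keys let you trace a retrieved chunk back to its source span, for example to build citations. A minimal sketch; the `citation` helper is hypothetical, only the key names come from this package:

```go
package main

import "fmt"

// citation renders a human-readable source reference from the
// metadata keys attached to every chunk (source, chunk_index,
// start_offset, end_offset).
func citation(meta map[string]any) string {
	return fmt.Sprintf("%s [chunk %v, bytes %v-%v]",
		meta["source"], meta["chunk_index"], meta["start_offset"], meta["end_offset"])
}

func main() {
	meta := map[string]any{
		"source":       "./README.md",
		"chunk_index":  3,
		"start_offset": 2400,
		"end_offset":   3400,
	}
	fmt.Println(citation(meta)) // ./README.md [chunk 3, bytes 2400-3400]
}
```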
Run the example
The RAG Chatbot example exercises the splitter as part of a full ingestion + retrieval flow. Provider setup depends on your chosen LLM/embedder; follow the example README.
```shell
# from repo root
go run ./examples/rag-chatbot -file /path/to/your/file.txt

# tuning knobs: -chunk-size / -chunk-overlap
```
Practical tuning tips
- If answers miss context: increase overlap so facts aren’t split apart
- If retrieval feels noisy: decrease chunk size so chunks are more specific
- If ingestion is slow: increase chunk size (fewer chunks) and keep overlap modest
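These trade-offs are easier to reason about with a rough chunk-count estimate. Assuming byte-length measurement and ignoring separator boundaries, each chunk after the first advances size - overlap bytes, so (a hypothetical helper, not part of the package):

```go
package main

import "fmt"

// estimateChunks gives a rough chunk count for a document of n bytes:
// the first chunk covers size bytes, and each further chunk advances
// size-overlap bytes past the previous one.
func estimateChunks(n, size, overlap int) int {
	if n <= size {
		return 1
	}
	step := size - overlap
	return 1 + (n-size+step-1)/step // 1 + ceil((n-size)/step)
}

func main() {
	fmt.Println(estimateChunks(100_000, 1000, 200)) // 125 chunks for a 100 KB file
}
```

Doubling the overlap from 200 to 400 on the same file pushes the count toward 167 chunks, which is the ingestion cost you pay for the extra context.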
Related packages
- rag: ingests and retrieves chunks
- embedding: turns chunks/queries into vectors
- vectorstore: stores vectors for similarity search