Textsplitter

Chunk documents into size-bounded pieces for RAG.

What it is

When building RAG systems, you usually don’t embed an entire document at once. Instead, you split it into smaller chunks, embed each chunk, and store them for retrieval.

The textsplitter package defines the Splitter interface; concrete implementations live in sub-packages. All splitters read a source file and emit document.Document values via a lazy Go iterator (iter.Seq2).

// Splitter is implemented by all strategies:
type Splitter interface {
    Split(ctx context.Context) iter.Seq2[document.Document, error]
}

Available splitters

recursive (textsplitter/recursive) - general-purpose splitting for plain-text files.
markdown (textsplitter/markdown) - splitting for Markdown files.

Key ideas

Splitters are constructed with a file path plus chunkSize and chunkOverlap parameters. Splitting is lazy: Split returns an iter.Seq2 that yields one document.Document at a time, so a large file is never fully materialized in memory. Consecutive chunks overlap by chunkOverlap, so text that straddles a chunk boundary appears in both neighboring chunks.

Example: splitting a file into chunks

Splitters read from a file path given at construction time. The iterator yields document.Document values one at a time so large files are not held in memory all at once.

import (
    "context"

    "github.com/henomis/phero/textsplitter/recursive"
)

const (
    chunkSize    = 1000
    chunkOverlap = 200
)

splitter := recursive.New("./my-document.txt", chunkSize, chunkOverlap)

var chunks []string
for doc, err := range splitter.Split(context.Background()) {
    if err != nil {
        // handle the error; stop consuming the iterator
        break
    }
    chunks = append(chunks, doc.Content)
}

// chunks are then embedded and ingested into a vector store

For Markdown files, use markdown.New from textsplitter/markdown:

import "github.com/henomis/phero/textsplitter/markdown"

splitter := markdown.New("./README.md", chunkSize, chunkOverlap)

for doc, err := range splitter.Split(ctx) {
    // doc.Metadata["source"], doc.Metadata["chunk_index"], ...
}

Run the example

The RAG Chatbot example exercises the splitter as part of a full ingestion + retrieval flow. Provider setup depends on your chosen LLM/embedder; follow the example README.

# from repo root

go run ./examples/rag-chatbot -file /path/to/your/file.txt

# tuning knobs
# -chunk-size / -chunk-overlap

Practical tuning tips

Good values depend on your corpus and embedder; treat the defaults above as starting points. Larger chunks carry more context per retrieved hit but dilute the embedding; smaller chunks match queries more precisely but may lack surrounding context. An overlap of roughly 10-20% of the chunk size usually keeps sentences that straddle a boundary retrievable. Measure retrieval quality on your own queries before settling on values.

Related packages